pepmatch


- Name: pepmatch
- Version: 1.0.3
- Home page: https://github.com/IEDB/PEPMatch
- Summary: Search tool for peptides and epitopes within a proteome, while considering potential residue substitutions.
- Upload time: 2024-02-25 03:03:46
- Author: Daniel Marrama
- Requirements: biopython, numpy, pandas, openpyxl, pre-commit

<p align="center">
  <img src="docs/logo.png">
</p>

--------------------------------------------------------------------

![Unit Tests](https://github.com/IEDB/PEPMatch/actions/workflows/tests.yml/badge.svg)


#### Author: Daniel Marrama

Peptide search against a reference proteome, or sets of proteins, with residue substitutions.

PEPMatch works in two steps: preprocessing and matching.

Preprocessed data is stored in SQLite or pickle format, and preprocessing only has to be performed once.

To encourage competition in improving tool performance, we created a benchmarking framework; instructions are [here](./benchmarking).


### Requirements

- Python 3.7+
- [Pandas](https://pandas.pydata.org/)
- [NumPy](https://numpy.org/)
- [Biopython](https://biopython.org/)


### Installation

```bash
pip install pepmatch
```


### Inputs


#### Preprocessor

```proteome``` - Path to proteome file to search against.\
```k``` - k-mer size to break up proteome into.\
```preprocessed_format``` - SQLite ("sql") or "pickle".\
```preprocessed_files_path``` - (optional) Directory where you want preprocessed files to go. Default is current directory.\
```gene_priority_proteome``` - (optional) Subset of ```proteome``` with prioritized protein IDs.
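
To illustrate what breaking a proteome into k-mers means, here is a minimal sketch in plain Python. It is not PEPMatch's internal code: the function name `kmer_index`, the 0-based positions, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict

def kmer_index(proteins, k):
    """Map every k-mer to the (protein_id, 0-based position) pairs where it occurs."""
    index = defaultdict(list)
    for protein_id, sequence in proteins.items():
        # Slide a window of width k across the sequence, one residue at a time.
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].append((protein_id, i))
    return dict(index)

# Toy proteins (hypothetical, for illustration only).
proteins = {"P1": "MKTAYIAK", "P2": "TAYIQ"}
index = kmer_index(proteins, k=5)
print(index["TAYIA"])  # [('P1', 2)]
```

Because the index depends only on the proteome and on k, it can be built once and reused for many queries, which is the idea behind a separate preprocessing step.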


#### Matcher

```query``` - Peptides to search for, given either as a .fasta file or as a Python list.\
```proteome_file``` - Name of preprocessed proteome to search against.\
```max_mismatches``` - Maximum number of mismatches (substitutions) for query.\
```k``` - (optional) k-mer size of the preprocessed proteome. If no k is given, a best k is calculated and the proteome is preprocessed automatically.\
```preprocessed_files_path``` - (optional) Directory where preprocessed files are. Default is current directory.\
```best_match``` - (optional) Returns only the best match for each query peptide.\
```output_format``` - (optional) Outputs results into a file (CSV, TSV, XLSX, JSON, HTML) or just as a dataframe.\
```output_name``` - (optional) Specify name of file for output. Leaving blank will generate a name.
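
As a rough illustration of what `max_mismatches` means, the sketch below counts substitutions between a query peptide and every same-length window of a protein. It is a naive reference search, not how PEPMatch searches; the function names and the toy sequences are hypothetical.

```python
def count_mismatches(peptide, window):
    """Number of positions at which two equal-length sequences differ."""
    if len(peptide) != len(window):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(peptide, window))

def naive_search(peptide, protein, max_mismatches):
    """Return (0-based position, matched window, mismatches) for every hit."""
    hits = []
    n = len(peptide)
    for i in range(len(protein) - n + 1):
        window = protein[i:i + n]
        mismatches = count_mismatches(peptide, window)
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits

# Toy protein (hypothetical) containing one exact copy and one copy with two substitutions.
hits = naive_search("YLLDLHSYL", "MAAYLLDLHSYLGGYLKDLHAYL", max_mismatches=2)
print(hits)  # [(3, 'YLLDLHSYL', 0), (14, 'YLKDLHAYL', 2)]
```

Setting `max_mismatches=0` in this sketch keeps only the exact hit at position 3.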

Note: For now, due to performance, SQLite is used for exact matching and pickle is used for mismatching.

Note: PEPMatch can also search for discontinuous epitopes written as a comma-separated list of residues, each given as the one-letter amino acid code followed by its index. Example:

"R377, Q408, Q432, H433, F436, V441, S442, S464, K467, K489, I491, S492, N497"

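A discontinuous epitope string like the one above can be read as a list of (residue, index) pairs. The sketch below parses such a string and counts how many listed residues disagree with a sequence. The function names are hypothetical, and treating the index as a 1-based position in the protein is an assumption, not something stated here.

```python
import re

def parse_discontinuous_epitope(text):
    """Turn 'R377, Q408' into [('R', 377), ('Q', 408)]."""
    residues = []
    for token in text.split(","):
        token = token.strip()
        match = re.fullmatch(r"([A-Za-z])(\d+)", token)
        if match is None:
            raise ValueError(f"cannot parse residue {token!r}")
        residues.append((match.group(1).upper(), int(match.group(2))))
    return residues

def count_residue_mismatches(residues, sequence):
    """Count listed residues that differ from the sequence (1-based positions assumed)."""
    mismatches = 0
    for amino_acid, position in residues:
        # A position outside the sequence is counted as a mismatch in this sketch.
        out_of_range = position < 1 or position > len(sequence)
        if out_of_range or sequence[position - 1] != amino_acid:
            mismatches += 1
    return mismatches

epitope = parse_discontinuous_epitope("R377, Q408, Q432")
print(epitope)  # [('R', 377), ('Q', 408), ('Q', 432)]

# Toy sequence (hypothetical): K at position 2, A at 4, K at 8.
toy = parse_discontinuous_epitope("K2, A4, K8")
print(count_residue_mismatches(toy, "MKTAYIAK"))  # 0
```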

### Command Line Example

```bash
# exact matching example
pepmatch-preprocess -p human.fasta -k 5 -f sql
pepmatch-match -q peptides.fasta -p human.fasta -m 0 -k 5

# mismatching example
pepmatch-preprocess -p human.fasta -k 3 -f pickle
pepmatch-match -q neoepitopes.fasta -p human.fasta -m 3 -k 3
```


### Exact Matching Example

```python
from pepmatch import Preprocessor, Matcher

Preprocessor('proteomes/human.fasta').sql_proteome(k = 5) 

Matcher( # 0 mismatches, k = 5
  'queries/mhc-ligands-test.fasta', 'proteomes/human.fasta', 0, 5
).match()
```


### Mismatching Example 

```python
from pepmatch import Preprocessor, Matcher

Preprocessor('proteomes/human.fasta').pickle_proteome(k = 3)

Matcher( # 3 mismatches, k = 3
  'queries/neoepitopes-test.fasta', 'proteomes/human.fasta', 3, 3
).match()
```
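
Why pair a small k with a larger mismatch allowance? One standard argument, offered here as background and not as a description of PEPMatch internals, is the pigeonhole principle: if a peptide of length L is cut into m + 1 non-overlapping k-mers and at most m residues are substituted, at least one of those k-mers is left intact. That bounds k by floor(L / (m + 1)). The helper names and the toy sequences below are hypothetical.

```python
def largest_safe_k(peptide_length, max_mismatches):
    """Largest k such that max_mismatches + 1 non-overlapping k-mers fit in the peptide."""
    k = peptide_length // (max_mismatches + 1)
    if k < 1:
        raise ValueError("peptide is too short for this many mismatches")
    return k

def non_overlapping_kmers(peptide, k):
    """Cut the peptide into consecutive k-mers, dropping any leftover tail."""
    return [(i, peptide[i:i + k]) for i in range(0, len(peptide) - k + 1, k)]

# Hypothetical 12-residue query and a target window that differs at 3 positions.
query = "ACDEFGHIKLMN"
target = "ACDEYGHIKWMA"

k = largest_safe_k(len(query), max_mismatches=3)
# Seeds that survive the substitutions mark where a full comparison is worthwhile.
intact = [i for i, kmer in non_overlapping_kmers(query, k) if target[i:i + k] == kmer]
print(k, intact)  # 3 [0, 6]
```

In this toy case two of the four 3-mers survive, so an exact k-mer lookup would still locate the window even though the full peptide does not match exactly.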


### Parallel Matcher Example

To run a job on multiple cores, use the `ParallelMatcher` class. The `n_jobs` parameter specifies the number of cores to use.

```python
from pepmatch import Preprocessor, ParallelMatcher 

Preprocessor('proteomes/betacoronaviruses.fasta').pickle_proteome(k = 3)

ParallelMatcher(
  query='queries/coronavirus-test.fasta',
  proteome_file='proteomes/betacoronaviruses.fasta',
  max_mismatches=3,
  k=3,
  n_jobs=2
).match()
```
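
How `ParallelMatcher` divides the work internally is not described here. One common pattern, sketched below purely as an assumption, is to split the query peptides into `n_jobs` roughly equal batches so that each core can search its batch independently. The function name `split_into_batches` and the placeholder peptide names are hypothetical.

```python
def split_into_batches(peptides, n_jobs):
    """Split peptides into n_jobs contiguous batches whose sizes differ by at most one."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    base, extra = divmod(len(peptides), n_jobs)
    batches, start = [], 0
    for job in range(n_jobs):
        # The first `extra` batches each take one additional peptide.
        size = base + (1 if job < extra else 0)
        batches.append(peptides[start:start + size])
        start += size
    return batches

batches = split_into_batches(["PEP1", "PEP2", "PEP3", "PEP4", "PEP5"], n_jobs=2)
print(batches)  # [['PEP1', 'PEP2', 'PEP3'], ['PEP4', 'PEP5']]
```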


### Best Match Example

```python
from pepmatch import Matcher
Matcher(
  'queries/milk-peptides-test.fasta', 'proteomes/human.fasta', best_match=True
).match()
```

When `best_match` is set without `k` or `max_mismatches`, PEPMatch returns the best match for each peptide in the query: the match with the fewest mismatches, the best protein existence level, and, where available, presence in the gene priority proteome. No preprocessing is required beforehand, as the `Matcher` class handles it while searching for the best match.
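
The ranking criteria named above can be pictured as a sort key. The sketch below is an illustration only: the dictionary fields, the candidate values, and the order in which ties are broken are assumptions, not PEPMatch's actual implementation. It relies on the UniProt convention that protein existence levels run from 1 (strongest evidence) to 5.

```python
def pick_best_match(candidates):
    """Return the candidate that ranks first under the assumed tie-breaking order."""
    return min(
        candidates,
        key=lambda c: (
            c["mismatches"],         # fewer mismatches first
            c["protein_existence"],  # lower level = stronger evidence (1 is best)
            not c["gene_priority"],  # False sorts before True, so priority hits win
        ),
    )

# Hypothetical candidate matches for a single query peptide.
candidates = [
    {"protein_id": "P1", "mismatches": 1, "protein_existence": 1, "gene_priority": True},
    {"protein_id": "P2", "mismatches": 0, "protein_existence": 3, "gene_priority": False},
    {"protein_id": "P3", "mismatches": 0, "protein_existence": 1, "gene_priority": False},
    {"protein_id": "P4", "mismatches": 0, "protein_existence": 1, "gene_priority": True},
]

best = pick_best_match(candidates)
print(best["protein_id"])  # P4
```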


### Outputs

As mentioned above, outputs can be specified with the ```output_format``` parameter in the ```Matcher``` class. The following formats are allowed: `dataframe`, `tsv`, `csv`, `xlsx`, `json`, and `html`.

If specifying `dataframe`, the ```match()``` method will return a pandas dataframe which can be stored as a variable:

```python
df = Matcher(
  'queries/neoepitopes-test.fasta', 'proteomes/human.fasta', 3, 3, output_format='dataframe'
).match()
```


### Citation

If you use PEPMatch in your research, please cite the following paper:

Marrama D, Chronister WD, Westernberg L, et al. PEPMatch: a tool to identify short peptide sequence matches in large sets of proteins. BMC Bioinformatics. 2023;24(1):485. Published 2023 Dec 18. doi:10.1186/s12859-023-05606-4

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/IEDB/PEPMatch",
    "name": "pepmatch",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "Daniel Marrama",
    "author_email": "dmarrama@lji.org",
    "download_url": "https://files.pythonhosted.org/packages/cf/7a/3c1e553d485cfd32ccb4b4234db65f71e6f7b3b71e9f7a00074226bef920/pepmatch-1.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Search tool for peptides and epitopes within a proteome, while considering potential residue substitutions.",
    "version": "1.0.3",
    "project_urls": {
        "Homepage": "https://github.com/IEDB/PEPMatch"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cf7a3c1e553d485cfd32ccb4b4234db65f71e6f7b3b71e9f7a00074226bef920",
                "md5": "8c928db995e5164c5e945e9e9ca0c970",
                "sha256": "c3c158a8144ff93f3406d583b4d6f6ee0f12cb0faf67937f6dd9708ede405424"
            },
            "downloads": -1,
            "filename": "pepmatch-1.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "8c928db995e5164c5e945e9e9ca0c970",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 22234,
            "upload_time": "2024-02-25T03:03:46",
            "upload_time_iso_8601": "2024-02-25T03:03:46.474429Z",
            "url": "https://files.pythonhosted.org/packages/cf/7a/3c1e553d485cfd32ccb4b4234db65f71e6f7b3b71e9f7a00074226bef920/pepmatch-1.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-25 03:03:46",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "IEDB",
    "github_project": "PEPMatch",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "biopython",
            "specs": [
                [
                    ">=",
                    "1.78"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.21.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "openpyxl",
            "specs": [
                [
                    ">=",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "pre-commit",
            "specs": [
                [
                    ">=",
                    "3.3.2"
                ]
            ]
        }
    ],
    "lcname": "pepmatch"
}
        