instanovo


- Name: instanovo
- Version: 0.1.7
- Home page: https://github.com/instadeepai/InstaNovo
- Summary: De novo sequencing with InstaNovo
- Upload time: 2024-03-05 22:52:42
- Author: InstaDeep
- Requirements: aiobotocore, boto3, botocore, click, cloudpathlib, datasets, depthcharge-ms, fastprogress, jiwer, matchms, matplotlib, numba, numpy, omegaconf, pandas, polars, protobuf, pyarrow, pyopenms, python-dotenv, pytorch_lightning, scikit-learn, seaborn, spectrum_utils, tensorboard, tensorboardX, torch, tqdm, transfusion-asr

# _De novo_ peptide sequencing with InstaNovo

[![PyPI version](https://badge.fury.io/py/instanovo.svg)](https://badge.fury.io/py/instanovo)
<a target="_blank" href="https://colab.research.google.com/github/instadeepai/InstaNovo/blob/main/notebooks/getting_started_with_instanovo.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>

The official code repository for InstaNovo. This repo contains the code for training and inference
of InstaNovo and InstaNovo+. InstaNovo is a transformer neural network with the ability to translate
fragment ion peaks into the sequence of amino acids that make up the studied peptide(s). InstaNovo+,
inspired by human intuition, is a multinomial diffusion model that further improves performance by
iterative refinement of predicted sequences.

![Graphical Abstract](https://raw.githubusercontent.com/instadeepai/InstaNovo/main/graphical_abstract.jpeg)

**Links:**

- bioRxiv: https://www.biorxiv.org/content/10.1101/2023.08.30.555055v3

**Developed by:**

- [InstaDeep](https://www.instadeep.com/)
- [The Department of Biotechnology and Biomedicine](https://orbit.dtu.dk/en/organisations/department-of-biotechnology-and-biomedicine) -
  [Technical University of Denmark](https://www.dtu.dk/)

## Usage

### Installation

To use InstaNovo, we need to install the module via `pip`:

```bash
pip install instanovo
```

It is recommended to install InstaNovo in a fresh environment, such as Conda or PyEnv. For example,
if you have
[conda](https://docs.conda.io/en/latest/)/[miniconda](https://docs.conda.io/projects/miniconda/en/latest/)
installed:

```bash
conda env create -f environment.yml
conda activate instanovo
```

Note: InstaNovo is built for Python >= 3.8

### Training

To train auto-regressive InstaNovo:

```bash
usage: python -m instanovo.transformer.train train_path valid_path [-h] [--config CONFIG] [--n_gpu N_GPU] [--n_workers N_WORKERS]

required arguments:
  train_path        Training data path
  valid_path        Validation data path

optional arguments:
  --config CONFIG   file in configs folder
  --n_workers N_WORKERS
```

Note: data is expected to be saved as Polars `.ipc` format. See section on data conversion.

To update the InstaNovo model config, modify the config file under
[configs/instanovo/base.yaml](https://github.com/instadeepai/InstaNovo/blob/main/configs/instanovo/base.yaml)
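
For scripted runs, the training command can also be assembled from Python. The sketch below only mirrors the flags shown in the usage text; `build_train_command` and the file names are hypothetical and not part of InstaNovo.

```python
import sys
from typing import List, Optional


def build_train_command(
    train_path: str,
    valid_path: str,
    config: Optional[str] = None,
    n_gpu: Optional[int] = None,
    n_workers: Optional[int] = None,
) -> List[str]:
    """Assemble the argv list for `python -m instanovo.transformer.train`.

    Hypothetical helper: it only mirrors the flags shown in the usage text.
    """
    cmd = [sys.executable, "-m", "instanovo.transformer.train", train_path, valid_path]
    # Optional flags are appended only when a value was supplied.
    for flag, value in (("--config", config), ("--n_gpu", n_gpu), ("--n_workers", n_workers)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Placeholder paths; substitute your own `.ipc` files.
cmd = build_train_command("train.ipc", "valid.ipc", n_workers=4)
print(" ".join(cmd))
# To launch training: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.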

### Prediction

To evaluate InstaNovo:

```bash
usage: python -m instanovo.transformer.predict data_path model_path [-h] [--denovo] [--config CONFIG] [--output_path OUTPUT_PATH] [--subset SUBSET] [--knapsack_path KNAPSACK_PATH] [--n_workers N_WORKERS]

required arguments:
  data_path         Evaluation data path
  model_path        Model checkpoint path

optional arguments:
  --denovo          evaluate in de novo mode, will not try to compute metrics
  --output_path OUTPUT_PATH
                    Save predictions to a csv file (required in de novo mode)
  --subset SUBSET
                    portion of set to evaluate
  --knapsack_path KNAPSACK_PATH
                    path to pre-computed knapsack
  --n_workers N_WORKERS
```
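
The usage text notes that `--output_path` is required in de novo mode. A small wrapper can enforce that rule before the command is launched. This is a minimal sketch: `build_predict_command`, `spectra.ipc`, `model.ckpt` and `predictions.csv` are hypothetical names chosen for illustration.

```python
import sys
from typing import List, Optional


def build_predict_command(
    data_path: str,
    model_path: str,
    denovo: bool = False,
    output_path: Optional[str] = None,
    knapsack_path: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper mirroring the documented prediction flags."""
    if denovo and output_path is None:
        # The usage text states that --output_path is required in de novo mode.
        raise ValueError("--output_path is required in de novo mode")
    cmd = [sys.executable, "-m", "instanovo.transformer.predict", data_path, model_path]
    if denovo:
        cmd.append("--denovo")
    for flag, value in (("--output_path", output_path), ("--knapsack_path", knapsack_path)):
        if value is not None:
            cmd += [flag, value]
    return cmd


# Placeholder file names; substitute your own data and checkpoint.
cmd = build_predict_command(
    "spectra.ipc", "model.ckpt", denovo=True, output_path="predictions.csv"
)
print(" ".join(cmd))
```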

### Using your own datasets

To use your own datasets, tabulate your data in either
[Pandas](https://pandas.pydata.org/) or [Polars](https://www.pola.rs/). Each row corresponds to one
labelled MS2 spectrum and uses the following schema:

- `sequence (string) [Optional]` \
  The target peptide sequence excluding post-translational modifications
- `modified_sequence (string)` \
  The target peptide sequence including post-translational modifications
- `precursor_mz (float64)` \
  The mass-to-charge of the precursor (from MS1)
- `precursor_charge (int64)` \
  The charge of the precursor (from MS1)
- `mz_array (list[float64])` \
  The mass-to-charge values of the MS2 spectrum
- `intensity_array (list[float32])` \
  The intensity values of the MS2 spectrum

For example, the DataFrame for the
[nine species benchmark](https://huggingface.co/datasets/InstaDeepAI/ms_ninespecies_benchmark)
dataset (introduced in [Tran _et al._ 2017](https://www.pnas.org/doi/full/10.1073/pnas.1705691114))
looks as follows:

|     | sequence             | modified_sequence          | precursor_mz | precursor_charge | mz_array                             | intensity_array                     |
| --: | :------------------- | :------------------------- | -----------: | ---------------: | :----------------------------------- | :---------------------------------- |
|   0 | GRVEGMEAR            | GRVEGMEAR                  |      335.502 |                3 | [102.05527 104.052956 113.07079 ...] | [ 767.38837 2324.8787 598.8512 ...] |
|   1 | IGEYK                | IGEYK                      |      305.165 |                2 | [107.07023 110.071236 111.11693 ...] | [ 1055.4957 2251.3171 35508.96 ...] |
|   2 | GVSREEIQR            | GVSREEIQR                  |      358.528 |                3 | [103.039444 109.59844 112.08704 ...] | [801.19995 460.65268 808.3431 ...]  |
|   3 | SSYHADEQVNEASK       | SSYHADEQVNEASK             |      522.234 |                3 | [101.07095 102.0552 110.07163 ...]   | [ 989.45154 2332.653 1170.6191 ...] |
|   4 | DTFNTSSTSNSTSSSSSNSK | DTFNTSSTSN(+.98)STSSSSSNSK |      676.282 |                3 | [119.82458 120.08073 120.2038 ...]   | [ 487.86942 4806.1377 516.8846 ...] |
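
The `precursor_mz` values in the table above can be related to `sequence` and `precursor_charge` through the usual relation m/z = (M + z × proton mass) / z, where M is the neutral monoisotopic peptide mass. The sketch below checks this for the first two (unmodified) rows using standard monoisotopic residue masses. It is an illustration only, not code from InstaNovo, and `precursor_mz` here is a hypothetical helper.

```python
# Monoisotopic residue masses (Da) for the amino acids used below.
RESIDUE_MASS = {
    "A": 71.03711, "E": 129.04259, "G": 57.02146, "I": 113.08406,
    "K": 128.09496, "M": 131.04049, "R": 156.10111, "V": 99.06841,
    "Y": 163.06333,
}
WATER = 18.01056   # H2O added for the peptide termini
PROTON = 1.007276  # mass of one charge-carrying proton


def precursor_mz(sequence: str, charge: int) -> float:
    """Theoretical m/z of an unmodified peptide at the given charge."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral_mass + charge * PROTON) / charge


# Rows 0 and 1 of the table: sequence, charge, tabulated precursor m/z.
for seq, charge, observed in [("GRVEGMEAR", 3, 335.502), ("IGEYK", 2, 305.165)]:
    print(seq, round(precursor_mz(seq, charge), 3), observed)
```

Modified peptides such as row 4 would additionally need the mass shift of each modification, which this sketch does not handle.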

For _de novo_ prediction, the `modified_sequence` column is not required.
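
Before converting a dataset, it can help to check each row against the schema. The sketch below is a hypothetical, standard-library check and not part of InstaNovo. It uses the column names from the header of the example table and assumes that every peak contributes one m/z value and one intensity value, so both arrays should have the same length. The example row holds only the first three peaks shown for row 1 of the table.

```python
from typing import Any, Dict, List

# Column names follow the header of the example table above.
SPECTRUM_COLUMNS = ("precursor_mz", "precursor_charge", "mz_array", "intensity_array")


def check_row(row: Dict[str, Any], denovo: bool = False) -> List[str]:
    """Return a list of problems found in one spectrum row (empty if none)."""
    problems = []
    # The label column is only needed when metrics are computed.
    required = SPECTRUM_COLUMNS if denovo else SPECTRUM_COLUMNS + ("modified_sequence",)
    for name in required:
        if name not in row:
            problems.append(f"missing column: {name}")
    mz = row.get("mz_array", [])
    intensity = row.get("intensity_array", [])
    # Assumption: every peak has one m/z value and one intensity value.
    if len(mz) != len(intensity):
        problems.append("mz_array and intensity_array differ in length")
    return problems


row = {
    "precursor_mz": 305.165,
    "precursor_charge": 2,
    "mz_array": [107.07023, 110.071236, 111.11693],
    "intensity_array": [1055.4957, 2251.3171, 35508.96],
}
print(check_row(row, denovo=True))   # no label needed for de novo prediction
print(check_row(row, denovo=False))  # reports the missing label column
```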

We also provide a conversion script for converting to Polars IPC binary (`.ipc`):

```bash
usage: python -m instanovo.utils.convert_to_ipc source target [-h] [--source_type {mgf,mzml,csv}] [--max_charge MAX_CHARGE] [--verbose]

positional arguments:
  source                source file or folder
  target                target ipc file to be saved

optional arguments:
  -h, --help            show this help message and exit
  --source_type {mgf,mzml,csv}
                        type of input data
  --max_charge MAX_CHARGE
                        maximum charge to filter out
```

_Note: we currently only support `mzml`, `mgf` and `csv` conversions._
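
To give a sense of what such a conversion involves, the sketch below maps a simplified MGF file onto the schema columns using only the standard library. It is not the InstaNovo converter: `read_mgf` is a hypothetical helper, it handles only single-valued `CHARGE=` lines, and it assumes that `max_charge` means spectra with a higher precursor charge are dropped.

```python
from typing import Dict, Iterator, List, Optional


def read_mgf(lines: List[str], max_charge: Optional[int] = None) -> Iterator[Dict]:
    """Yield one schema-style row per BEGIN IONS/END IONS block."""
    row = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            row = {"precursor_mz": None, "precursor_charge": None,
                   "mz_array": [], "intensity_array": []}
        elif row is None or not line:
            continue  # skip blank lines and anything outside a spectrum block
        elif line == "END IONS":
            charge = row["precursor_charge"]
            # Assumption: spectra above max_charge are dropped.
            if max_charge is None or charge is None or charge <= max_charge:
                yield row
            row = None
        elif line.startswith("PEPMASS="):
            # PEPMASS may carry an optional intensity after the m/z value.
            row["precursor_mz"] = float(line.split("=", 1)[1].split()[0])
        elif line.startswith("CHARGE="):
            # Charges are written like "2+"; keep the leading digits only.
            row["precursor_charge"] = int(line.split("=", 1)[1].rstrip("+-"))
        elif "=" not in line:
            # Peak line: m/z followed by intensity.
            mz, intensity = line.split()[:2]
            row["mz_array"].append(float(mz))
            row["intensity_array"].append(float(intensity))


example = """BEGIN IONS
TITLE=spectrum 1
PEPMASS=305.165
CHARGE=2+
107.07023 1055.4957
110.071236 2251.3171
END IONS
BEGIN IONS
TITLE=spectrum 2
PEPMASS=250.1
CHARGE=5+
120.0 10.0
END IONS
""".splitlines()

rows = list(read_mgf(example, max_charge=4))
print(len(rows), rows[0]["precursor_charge"], rows[0]["mz_array"])
```

The resulting rows could then be loaded into a Pandas or Polars DataFrame and given a `modified_sequence` column where labels are available.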

If you want to use InstaNovo for evaluating metrics, you will need to manually set the
`modified_sequence` column after conversion.

## Roadmap

This code repo is currently under construction.

**ToDo:**

- Add data preprocessing pipeline
- Multi-GPU support

## License

Code is licensed under the Apache License, Version 2.0 (see [LICENSE](LICENSE.md)).

The model checkpoints are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license
([CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)).

## BibTeX entry and citation info

```bibtex
@article{eloff_kalogeropoulos_2023_instanovo,
	title = {De novo peptide sequencing with InstaNovo: Accurate, database-free peptide identification for large scale proteomics experiments},
	author = {Kevin Eloff and Konstantinos Kalogeropoulos and Oliver Morell and Amandla Mabona and Jakob Berg Jespersen and Wesley Williams and Sam van Beljouw and Marcin Skwark and Andreas Hougaard Laustsen and Stan J. J. Brouns and Anne Ljungars and Erwin Marten Schoof and Jeroen Van Goey and Ulrich auf dem Keller and Karim Beguir and Nicolas Lopez Carranza and Timothy Patrick Jenkins},
	year = {2023},
	doi = {10.1101/2023.08.30.555055},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/10.1101/2023.08.30.555055v3},
	journal = {bioRxiv}
}
```

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/instadeepai/InstaNovo",
    "name": "instanovo",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "InstaDeep",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/43/44/0ff9298f682a9b3f261aa8158fa59b77f7829a711fb8dd2e8139c7f2c371/instanovo-0.1.7.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "De novo sequencing with InstaNovo",
    "version": "0.1.7",
    "project_urls": {
        "Homepage": "https://github.com/instadeepai/InstaNovo"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ee5b6342af473e15358a3bb450b4dbdfdea1b6dc97fd907bb392f7ec29c10316",
                "md5": "cae8c8c9433e8db6612c6062a36aad67",
                "sha256": "2355fbea63d93e05aaa86f9a1792c40fe1450b49ae82471ef0f8c8a46fadbebe"
            },
            "downloads": -1,
            "filename": "instanovo-0.1.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "cae8c8c9433e8db6612c6062a36aad67",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 61170,
            "upload_time": "2024-03-05T22:52:40",
            "upload_time_iso_8601": "2024-03-05T22:52:40.845465Z",
            "url": "https://files.pythonhosted.org/packages/ee/5b/6342af473e15358a3bb450b4dbdfdea1b6dc97fd907bb392f7ec29c10316/instanovo-0.1.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "43440ff9298f682a9b3f261aa8158fa59b77f7829a711fb8dd2e8139c7f2c371",
                "md5": "a73226eb260b1a727de6617db55b44df",
                "sha256": "38525ea002d29c9e08133edb1bbd34f1814694067576a87092e625a7747fc66a"
            },
            "downloads": -1,
            "filename": "instanovo-0.1.7.tar.gz",
            "has_sig": false,
            "md5_digest": "a73226eb260b1a727de6617db55b44df",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 53513,
            "upload_time": "2024-03-05T22:52:42",
            "upload_time_iso_8601": "2024-03-05T22:52:42.044616Z",
            "url": "https://files.pythonhosted.org/packages/43/44/0ff9298f682a9b3f261aa8158fa59b77f7829a711fb8dd2e8139c7f2c371/instanovo-0.1.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-05 22:52:42",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "instadeepai",
    "github_project": "InstaNovo",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "requirements": [
        {
            "name": "aiobotocore",
            "specs": [
                [
                    "==",
                    "2.4.1"
                ]
            ]
        },
        {
            "name": "boto3",
            "specs": [
                [
                    "==",
                    "1.24.59"
                ]
            ]
        },
        {
            "name": "botocore",
            "specs": [
                [
                    "==",
                    "1.27.59"
                ]
            ]
        },
        {
            "name": "click",
            "specs": [
                [
                    "==",
                    "8.1.7"
                ]
            ]
        },
        {
            "name": "cloudpathlib",
            "specs": [
                [
                    "==",
                    "0.10.0"
                ]
            ]
        },
        {
            "name": "datasets",
            "specs": [
                [
                    "==",
                    "2.14.5"
                ]
            ]
        },
        {
            "name": "depthcharge-ms",
            "specs": [
                [
                    "==",
                    "0.1.0"
                ]
            ]
        },
        {
            "name": "fastprogress",
            "specs": [
                [
                    "==",
                    "1.0.3"
                ]
            ]
        },
        {
            "name": "jiwer",
            "specs": [
                [
                    "==",
                    "3.0.3"
                ]
            ]
        },
        {
            "name": "matchms",
            "specs": [
                [
                    "==",
                    "0.22.0"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    "==",
                    "3.7.2"
                ]
            ]
        },
        {
            "name": "numba",
            "specs": [
                [
                    "==",
                    "0.57.1"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "1.23.3"
                ]
            ]
        },
        {
            "name": "omegaconf",
            "specs": [
                [
                    "==",
                    "2.2.3"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.0.3"
                ]
            ]
        },
        {
            "name": "polars",
            "specs": [
                [
                    "==",
                    "0.19.7"
                ]
            ]
        },
        {
            "name": "protobuf",
            "specs": [
                [
                    "==",
                    "3.19.6"
                ]
            ]
        },
        {
            "name": "pyarrow",
            "specs": [
                [
                    "==",
                    "15.0.0"
                ]
            ]
        },
        {
            "name": "pyopenms",
            "specs": [
                [
                    "==",
                    "2.7.0"
                ]
            ]
        },
        {
            "name": "pyopenms",
            "specs": [
                [
                    "==",
                    "3.1.0"
                ]
            ]
        },
        {
            "name": "python-dotenv",
            "specs": [
                [
                    "==",
                    "0.21.0"
                ]
            ]
        },
        {
            "name": "pytorch_lightning",
            "specs": [
                [
                    "==",
                    "1.8.6"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    "==",
                    "1.1.2"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    "==",
                    "1.4.1.post1"
                ]
            ]
        },
        {
            "name": "seaborn",
            "specs": [
                [
                    "==",
                    "0.12.0"
                ]
            ]
        },
        {
            "name": "spectrum_utils",
            "specs": [
                [
                    "==",
                    "0.4.1"
                ]
            ]
        },
        {
            "name": "tensorboard",
            "specs": [
                [
                    "==",
                    "2.10.1"
                ]
            ]
        },
        {
            "name": "tensorboardX",
            "specs": [
                [
                    "==",
                    "2.5.1"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    "==",
                    "2.2.1"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    "==",
                    "4.65.0"
                ]
            ]
        },
        {
            "name": "transfusion-asr",
            "specs": [
                [
                    "==",
                    "0.1.0"
                ]
            ]
        }
    ],
    "lcname": "instanovo"
}
        