flimsay

- Name: flimsay
- Version: 0.4.0
- Summary: A super simple fast IMS predictor
- Upload time: 2023-07-25 21:18:28
- Requires Python: <3.12,>=3.9
- License: Apache 2.0
- Keywords: proteomics, dia, ion mobility

# Flimsay: Fun/Fast Simple IMS Anyone like You can use.
Sebastian Paez

version = 0.4.0

This repository implements a very simple LightGBM (LGBM) model that
predicts ion mobility for peptides.

## Usage

There are two main ways to interact with `flimsay`: one is the command
line interface, and the other is the Python API used directly.

### CLI

``` shell
$ pip install flimsay
```

``` shell
$ flimsay fill_blib mylibrary.blib mylibrary_ims.blib # This will add ion mobility data to a .blib file and write the result to the output file.
```

``` shell
$ flimsay fill_blib --help
```

                                                                                    
     Usage: flimsay fill_blib [OPTIONS] BLIB OUT_BLIB                               
                                                                                    
     Add ion mobility prediction to a .blib file.                                   
                                                                                    
    ╭─ Options ────────────────────────────────────────────────────────────────────╮
    │ --overwrite      Whether to overwrite output file, if it exists              │
    │ --help           Show this message and exit.                                 │
    ╰──────────────────────────────────────────────────────────────────────────────╯

### Python

#### Single peptide

``` python
from flimsay.model import FlimsayModel

model_instance = FlimsayModel()
model_instance.predict_peptide("MYPEPTIDEK", charge=2)
```

    {'ccs': array([363.36245907]), 'one_over_k0': array([0.92423264])}

#### Many peptides at once

``` python
import pandas as pd
from flimsay.features import add_features, FEATURE_COLUMNS

df = pd.DataFrame({
    "Stripped_Seqs": ["LESLIEK", "LESLIE", "LESLKIE"]
})
df = add_features(
    df,
    stripped_sequence_name="Stripped_Seqs",
    calc_masses=True,
    default_charge=2,
)
df
```

    2023-07-25 21:14:45.792 | WARNING  | flimsay.features:add_features:163 - Charge not provided, using default charge of 2

|     | Stripped_Seqs | StrippedPeptide | PepLength | NumBulky | NumTiny | NumProlines | NumGlycines | NumSerines | NumPos | PosIndexL | PosIndexR | NumNeg | NegIndexL | NegIndexR | Mass       | PrecursorCharge | PrecursorMz |
|-----|---------------|-----------------|-----------|----------|---------|-------------|-------------|------------|--------|-----------|-----------|--------|-----------|-----------|------------|-----------------|-------------|
| 0   | LESLIEK       | LESLIEK         | 7         | 3        | 1       | 0           | 0           | 1          | 1      | 0.857143  | 0.000000  | 2      | 0.142857  | 0.142857  | 830.474934 | 2               | 416.245292  |
| 1   | LESLIE        | LESLIE          | 6         | 3        | 1       | 0           | 0           | 1          | 0      | 1.000000  | 1.000000  | 2      | 0.166667  | 0.000000  | 702.379971 | 2               | 352.197811  |
| 2   | LESLKIE       | LESLKIE         | 7         | 3        | 1       | 0           | 0           | 1          | 1      | 0.571429  | 0.285714  | 2      | 0.142857  | 0.000000  | 830.474934 | 2               | 416.245292  |

``` python
model_instance.predict(df[FEATURE_COLUMNS])
```

    {'ccs': array([315.32424627, 306.70134752, 314.87268797]),
     'one_over_k0': array([0.78718781, 0.72658194, 0.78525451])}
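
The `Mass` and `PrecursorMz` columns in the table above are related by the usual neutral-mass-to-m/z conversion. The following is a minimal sketch of that arithmetic, not flimsay's own `mass_to_mz`. The function name is hypothetical, and the per-charge offset of 1.007825 Da is an assumption inferred from the values in the table.

```python
# Assumed per-charge mass offset in Da, inferred from the table above
# (it matches the monoisotopic mass of a hydrogen atom).
CHARGE_CARRIER_MASS = 1.007825


def neutral_mass_to_mz(mass: float, charge: int) -> float:
    """Convert a neutral monoisotopic mass to m/z for a positive charge state."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return (mass + charge * CHARGE_CARRIER_MASS) / charge


# Mass values and charge taken from rows 0 and 1 of the feature table.
print(round(neutral_mass_to_mz(830.474934, 2), 4))  # 416.2453
print(round(neutral_mass_to_mz(702.379971, 2), 4))  # 352.1978
```

Both results agree with the `PrecursorMz` column to the precision shown, which is why the offset above was chosen. The library's actual constant may differ.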

## Performance

### Prediction Performance

![Predicted vs. true 1/K0](https://github.com/TalusBio/flimsay/blob/main/train/plots/one_over_k0_model_ims_pred_vs_true.png)

![Predicted vs. real CCS](https://github.com/TalusBio/flimsay/blob/main/train/plots/ccs_predicted_vs_real.png)

### Prediction Speed

#### Single peptide prediction

``` python
from flimsay.model import FlimsayModel

model_instance = FlimsayModel()

%timeit model_instance.predict_peptide("MYPEPTIDEK", charge=3)
```

    174 µs ± 942 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

On my laptop that takes about 174 microseconds per peptide, or roughly 5,700
peptides per second.

#### Batch Prediction

``` python
# Let's make a dataset of 1M peptides to test
import random
import pandas as pd
from flimsay.features import calc_mass, mass_to_mz, add_features

random.seed(42)
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
CHARGE_OPTIONS = [2, 3, 4]

seqs = [random.sample(AMINO_ACIDS, 10) for _ in range(1_000_000)]
charges = [random.sample(CHARGE_OPTIONS, 1)[0] for _ in range(1_000_000)]
seqs = ["".join(seq) for seq in seqs]
masses = [calc_mass(x) for x in seqs]
mzs = [mass_to_mz(m, c) for m, c in zip(masses, charges)]

df = pd.DataFrame({
    "Stripped_Seqs": seqs,
    "PrecursorCharge": charges,
    "Mass": masses,
    "PrecursorMz": mzs})
df = add_features(df, stripped_sequence_name="Stripped_Seqs")


# Now we get to run the prediction!
%timeit model_instance.predict(df)
```

    20.6 s ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On my system, one million peptides are predicted in about 20.6 seconds, that is
roughly 48,500 per second.
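
Throughput figures like the ones in this section are simply the number of peptides per loop divided by the per-loop time. Here is a minimal sketch of that conversion, applied to the two `%timeit` readings shown above. The helper name is hypothetical, and the numbers will differ from machine to machine.

```python
def peptides_per_second(seconds_per_loop: float, peptides_per_loop: int = 1) -> float:
    """Convert a per-loop timing into a throughput in peptides per second."""
    if seconds_per_loop <= 0:
        raise ValueError("seconds_per_loop must be positive")
    return peptides_per_loop / seconds_per_loop


# Mean timings copied from the %timeit outputs in this section.
single = peptides_per_second(174e-6)          # one peptide per call
batch = peptides_per_second(20.6, 1_000_000)  # one million peptides per call

print(f"single: {single:,.0f} peptides/s")  # single: 5,747 peptides/s
print(f"batch:  {batch:,.0f} peptides/s")   # batch:  48,544 peptides/s
```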

## Motivation

There is a fair amount of data on CCS and ion mobility of peptides but
only very few models actually use features that are directly
interpretable.

In addition, having a simpler model allows faster predictions on systems
that are not equipped with GPUs.

Therefore, this project aims to create a freely available, easy to use,
interpretable and fast model to predict ion mobility and collisional
cross-section for peptides.

## Features

The features used for prediction are meant to be simple, and their
implementation can be found here:
[flimsay/features.py](flimsay/features.py)

``` python
from flimsay.features import FEATURE_COLUMN_DESCRIPTIONS
for k,v in FEATURE_COLUMN_DESCRIPTIONS.items():
    print(f">>> The Feature '{k}' is defined as: \n\t{v}")
```

    >>> The Feature 'PrecursorMz' is defined as: 
        Measured precursor m/z
    >>> The Feature 'Mass' is defined as: 
        Measured precursor mass (Da)
    >>> The Feature 'PrecursorCharge' is defined as: 
        Measured precursor charge, from the isotope envelope
    >>> The Feature 'PepLength' is defined as: 
        Length of the peptide sequence in amino acids
    >>> The Feature 'NumBulky' is defined as: 
        Number of bulky amino acids (LVIFWY)
    >>> The Feature 'NumTiny' is defined as: 
        Number of tiny amino acids (AS)
    >>> The Feature 'NumProlines' is defined as: 
        Number of proline residues
    >>> The Feature 'NumGlycines' is defined as: 
        Number of glycine residues
    >>> The Feature 'NumSerines' is defined as: 
        Number of serine residues
    >>> The Feature 'NumPos' is defined as: 
        Number of positive amino acids (KRH)
    >>> The Feature 'PosIndexL' is defined as: 
        Relative position of the first positive amino acid (KRH)
    >>> The Feature 'PosIndexR' is defined as: 
        Relative position of the last positive amino acid (KRH)
    >>> The Feature 'NumNeg' is defined as: 
        Number of negative amino acids (DE)
    >>> The Feature 'NegIndexL' is defined as: 
        Relative position of the first negative amino acid (DE)
    >>> The Feature 'NegIndexR' is defined as: 
        Relative position of the last negative amino acid (DE)
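
To make the sequence-derived descriptions concrete, here is a minimal pure-Python sketch that recomputes them for one peptide. It is not the code in `flimsay.features`. The function names are hypothetical, and the position-index conventions (left index = first hit divided by length, right index = distance from the C-terminal end divided by length, 1.0 when no such residue exists) are assumptions inferred from the example table in the Usage section.

```python
BULKY = set("LVIFWY")
TINY = set("AS")
POSITIVE = set("KRH")
NEGATIVE = set("DE")


def relative_positions(seq: str, residues: set) -> tuple:
    """Return (left, right) relative positions of the first and last match.

    Assumed convention: (1.0, 1.0) when no residue from the set is present.
    """
    hits = [i for i, aa in enumerate(seq) if aa in residues]
    if not hits:
        return 1.0, 1.0
    n = len(seq)
    return hits[0] / n, (n - 1 - hits[-1]) / n


def sketch_features(seq: str) -> dict:
    """Recompute the sequence-only features for a stripped peptide sequence."""
    pos_l, pos_r = relative_positions(seq, POSITIVE)
    neg_l, neg_r = relative_positions(seq, NEGATIVE)
    return {
        "PepLength": len(seq),
        "NumBulky": sum(aa in BULKY for aa in seq),
        "NumTiny": sum(aa in TINY for aa in seq),
        "NumProlines": seq.count("P"),
        "NumGlycines": seq.count("G"),
        "NumSerines": seq.count("S"),
        "NumPos": sum(aa in POSITIVE for aa in seq),
        "PosIndexL": pos_l,
        "PosIndexR": pos_r,
        "NumNeg": sum(aa in NEGATIVE for aa in seq),
        "NegIndexL": neg_l,
        "NegIndexR": neg_r,
    }


for name, value in sketch_features("LESLIEK").items():
    print(f"{name}: {value}")
```

For `LESLIEK`, `LESLIE`, and `LESLKIE` this sketch gives the same counts and position indices as the table in the Usage section. The mass, charge, and m/z features are left out because they do not depend on residue composition alone.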

## Training

Currently the training logic is handled using DVC (https://dvc.org).

``` shell
git clone {this repo}
cd flimsay/train
dvc repro
```

Running this should automatically download the data, train the models,
and calculate and update the metrics.

The current version of this repo predominantly uses the data from:

- Meier, F., Köhler, N.D., Brunner, AD. et al. Deep learning the
  collisional cross sections of the peptide universe from a million
  experimental values. Nat Commun 12, 1185 (2021).
  https://doi.org/10.1038/s41467-021-21352-8

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "flimsay",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "<3.12,>=3.9",
    "maintainer_email": "",
    "keywords": "proteomics,dia,ion mobility",
    "author": "",
    "author_email": "Sebastian Paez <spaez@talus.bio>",
    "download_url": "https://files.pythonhosted.org/packages/aa/83/ebabf023889f5f8c73546178947251ff60975bd87a9ae32fec2ea4a7f32a/flimsay-0.4.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "A super simple fast IMS predictor",
    "version": "0.4.0",
    "project_urls": {
        "Documentation": "https://TalusBio.github.io/flimsay",
        "Homepage": "https://github.com/TalusBio/flimsay"
    },
    "split_keywords": [
        "proteomics",
        "dia",
        "ion mobility"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "be9de5045fdf3e323eba27f0c01a9daea230fb5bfe8c86bfde6046f7e19cf5e7",
                "md5": "7d7d3fe8192e74c07b00bde35db5444c",
                "sha256": "47a17d8aa91e0ae09bdcdd09599cef2dd90ddead3ed8e01bd3e8d22b326b737d"
            },
            "downloads": -1,
            "filename": "flimsay-0.4.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7d7d3fe8192e74c07b00bde35db5444c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.12,>=3.9",
            "size": 1488619,
            "upload_time": "2023-07-25T21:18:26",
            "upload_time_iso_8601": "2023-07-25T21:18:26.225647Z",
            "url": "https://files.pythonhosted.org/packages/be/9d/e5045fdf3e323eba27f0c01a9daea230fb5bfe8c86bfde6046f7e19cf5e7/flimsay-0.4.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "aa83ebabf023889f5f8c73546178947251ff60975bd87a9ae32fec2ea4a7f32a",
                "md5": "70e516549ed80544c99bc1e57f6a6a00",
                "sha256": "dc3a7afdc8821b4aa15a86ef14d3be44892dbd6b22190eac4a74349132bf5bfc"
            },
            "downloads": -1,
            "filename": "flimsay-0.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "70e516549ed80544c99bc1e57f6a6a00",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.12,>=3.9",
            "size": 1481452,
            "upload_time": "2023-07-25T21:18:28",
            "upload_time_iso_8601": "2023-07-25T21:18:28.403406Z",
            "url": "https://files.pythonhosted.org/packages/aa/83/ebabf023889f5f8c73546178947251ff60975bd87a9ae32fec2ea4a7f32a/flimsay-0.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-25 21:18:28",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "TalusBio",
    "github_project": "flimsay",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "flimsay"
}
        