Name | flimsay |
Version | 0.4.0 |
home_page | |
Summary | A super simple fast IMS predictor |
upload_time | 2023-07-25 21:18:28 |
maintainer | |
docs_url | None |
author | |
requires_python | <3.12,>=3.9 |
license | Apache 2.0 |
keywords | proteomics, dia, ion mobility |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# Flimsay: Fun/Fast Simple IMS Anyone like You can use.
Sebastian Paez
version = 0.4.0
This repository implements a very simple LightGBM (LGBM) model to
predict the ion mobility of peptides.
## Usage
There are two main ways to interact with `flimsay`: one is using the
command line interface (CLI) and the other is using the Python API directly.
### CLI
``` shell
$ pip install flimsay
```
``` shell
$ flimsay fill_blib mylibrary.blib mylibrary_ims.blib # Writes a .blib file with ion mobility data added (the output name is just an example).
```
``` python
! flimsay fill_blib --help
```
    Usage: flimsay fill_blib [OPTIONS] BLIB OUT_BLIB
    Add ion mobility prediction to a .blib file.
    ╭─ Options ────────────────────────────────────────────────────────────────────╮
    │ --overwrite    Whether to overwrite output file, if it exists                │
    │ --help         Show this message and exit.                                   │
    ╰──────────────────────────────────────────────────────────────────────────────╯
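When the CLI is driven from a script, the call can be assembled as an argument list first. The following is only a sketch: `build_fill_blib_command` is a hypothetical helper (not part of `flimsay`), the file names are placeholders, and the only option used is the `--overwrite` flag listed in the help text above.

```python
import shlex
from pathlib import Path


def build_fill_blib_command(blib, out_blib, overwrite=False):
    """Build the argument list for `flimsay fill_blib` (hypothetical helper)."""
    command = ["flimsay", "fill_blib"]
    if overwrite:
        # --overwrite is the flag shown in the CLI help output.
        command.append("--overwrite")
    # Positional arguments follow the usage line: BLIB, then OUT_BLIB.
    command += [str(Path(blib)), str(Path(out_blib))]
    return command


command = build_fill_blib_command(
    "mylibrary.blib", "mylibrary_ims.blib", overwrite=True
)
print(shlex.join(command))

# To actually run it (requires flimsay to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Building the list separately keeps the quoting out of the shell and makes the command easy to log before it runs.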
### Python
#### Single peptide
``` python
from flimsay.model import FlimsayModel
model_instance = FlimsayModel()
model_instance.predict_peptide("MYPEPTIDEK", charge=2)
```
    {'ccs': array([363.36245907]), 'one_over_k0': array([0.92423264])}
#### Many peptides at once
``` python
import pandas as pd
from flimsay.features import add_features, FEATURE_COLUMNS
df = pd.DataFrame({
"Stripped_Seqs": ["LESLIEK", "LESLIE", "LESLKIE"]
})
df = add_features(
df,
stripped_sequence_name="Stripped_Seqs",
calc_masses=True,
default_charge=2,
)
df
```
    2023-07-25 21:14:45.792 | WARNING | flimsay.features:add_features:163 - Charge not provided, using default charge of 2
<div>
| | Stripped_Seqs | StrippedPeptide | PepLength | NumBulky | NumTiny | NumProlines | NumGlycines | NumSerines | NumPos | PosIndexL | PosIndexR | NumNeg | NegIndexL | NegIndexR | Mass | PrecursorCharge | PrecursorMz |
|-----|---------------|-----------------|-----------|----------|---------|-------------|-------------|------------|--------|-----------|-----------|--------|-----------|-----------|------------|-----------------|-------------|
| 0 | LESLIEK | LESLIEK | 7 | 3 | 1 | 0 | 0 | 1 | 1 | 0.857143 | 0.000000 | 2 | 0.142857 | 0.142857 | 830.474934 | 2 | 416.245292 |
| 1 | LESLIE | LESLIE | 6 | 3 | 1 | 0 | 0 | 1 | 0 | 1.000000 | 1.000000 | 2 | 0.166667 | 0.000000 | 702.379971 | 2 | 352.197811 |
| 2 | LESLKIE | LESLKIE | 7 | 3 | 1 | 0 | 0 | 1 | 1 | 0.571429 | 0.285714 | 2 | 0.142857 | 0.000000 | 830.474934 | 2 | 416.245292 |
</div>
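The `Mass` and `PrecursorMz` columns in the table are related by simple arithmetic. The sketch below reproduces the tabulated m/z values from the tabulated masses; the charge-carrier mass of about 1.007825 Da is an assumption inferred from those numbers, not taken from the `flimsay` source, and `neutral_mass_to_mz` is an illustrative name rather than the library's `mass_to_mz`.

```python
# Assumption: charge-carrier mass inferred from the example table,
# not read from flimsay's implementation.
CHARGE_CARRIER_MASS = 1.007825  # Da


def neutral_mass_to_mz(mass, charge):
    """Convert a neutral monoisotopic mass (Da) to m/z at a given charge."""
    return (mass + charge * CHARGE_CARRIER_MASS) / charge


# (sequence, Mass, PrecursorCharge, PrecursorMz) copied from the table.
rows = [
    ("LESLIEK", 830.474934, 2, 416.245292),
    ("LESLIE", 702.379971, 2, 352.197811),
]

for sequence, mass, charge, table_mz in rows:
    mz = neutral_mass_to_mz(mass, charge)
    print(f"{sequence}: computed {mz:.6f}, table {table_mz:.6f}")
```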
``` python
model_instance.predict(df[FEATURE_COLUMNS])
```
    {'ccs': array([315.32424627, 306.70134752, 314.87268797]),
     'one_over_k0': array([0.78718781, 0.72658194, 0.78525451])}
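The returned dictionary holds one value per input row, in row order, so predictions can be paired back with their peptides. A minimal sketch of that pairing, using plain lists as stand-ins for the NumPy arrays and the values printed above (the `records` layout is just one possible choice):

```python
sequences = ["LESLIEK", "LESLIE", "LESLKIE"]

# Stand-in for the dictionary returned by model_instance.predict(...);
# values copied from the output above, with lists in place of arrays.
predictions = {
    "ccs": [315.32424627, 306.70134752, 314.87268797],
    "one_over_k0": [0.78718781, 0.72658194, 0.78525451],
}

# Pair each peptide with its predicted CCS and 1/K0 by position.
records = [
    {"sequence": seq, "ccs": ccs, "one_over_k0": ook0}
    for seq, ccs, ook0 in zip(
        sequences, predictions["ccs"], predictions["one_over_k0"]
    )
]

for record in records:
    print(
        f"{record['sequence']}: CCS={record['ccs']:.1f}, "
        f"1/K0={record['one_over_k0']:.3f}"
    )
```

With a pandas DataFrame, assigning `df["ccs"] = predictions["ccs"]` achieves the same pairing, since the order of predictions matches the order of rows.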
## Performance
### Prediction Performance
![Predicted vs. true 1/K0](https://github.com/TalusBio/flimsay/blob/main/train/plots/one_over_k0_model_ims_pred_vs_true.png)
![Predicted vs. real CCS](https://github.com/TalusBio/flimsay/blob/main/train/plots/ccs_predicted_vs_real.png)
### Prediction Speed
#### Single peptide prediction
``` python
from flimsay.model import FlimsayModel
model_instance = FlimsayModel()
%timeit model_instance.predict_peptide("MYPEPTIDEK", charge=3)
```
    174 µs ± 942 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
On my laptop that takes about 174 microseconds per peptide, or roughly
5,700 peptides per second.
#### Batch Prediction
``` python
# Let's make a dataset of 1M peptides to test
import random
import pandas as pd
from flimsay.features import calc_mass, mass_to_mz, add_features
random.seed(42)
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
charge_states = [2, 3, 4]
seqs = [random.sample(AMINO_ACIDS, 10) for _ in range(1_000_000)]
charges = [random.choice(charge_states) for _ in range(1_000_000)]
seqs = ["".join(seq) for seq in seqs]
masses = [calc_mass(x) for x in seqs]
mzs = [mass_to_mz(m, c) for m, c in zip(masses, charges)]
df = pd.DataFrame({
"Stripped_Seqs": seqs,
"PrecursorCharge": charges,
"Mass": masses,
"PrecursorMz": mzs})
df = add_features(df, stripped_sequence_name="Stripped_Seqs")
# Now we get to run the prediction!
%timeit model_instance.predict(df)
```
    20.6 s ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
On my system a million peptides are predicted in about 20.6 seconds, that
is roughly 48,500 per second.
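Throughput follows from the `%timeit` readings as peptides divided by seconds. A minimal sketch using the two readings printed above (174 µs per single call and 20.6 s per million-row batch); the helper name is illustrative.

```python
def peptides_per_second(n_peptides, seconds):
    """Throughput implied by a timing measurement."""
    return n_peptides / seconds


# Single-peptide call: 174 microseconds per loop.
single = peptides_per_second(1, 174e-6)

# Batch call: 20.6 seconds for one million peptides.
batch = peptides_per_second(1_000_000, 20.6)

print(f"single: {single:,.0f} peptides/s")
print(f"batch:  {batch:,.0f} peptides/s")
print(f"batch speed-up: {batch / single:.1f}x")
```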
## Motivation
There is a fair amount of data on CCS and ion mobility of peptides but
only very few models actually use features that are directly
interpretable.
In addition, having a simpler model allows faster predictions on systems
that are not equipped with GPUs.
Therefore, this project aims to create a freely available, easy to use,
interpretable and fast model to predict ion mobility and collisional
cross-section for peptides.
## Features
The features used for prediction are meant to be simple, and their
implementation can be found here:
[flimsay/features.py](flimsay/features.py)
``` python
from flimsay.features import FEATURE_COLUMN_DESCRIPTIONS
for k, v in FEATURE_COLUMN_DESCRIPTIONS.items():
    print(f">>> The Feature '{k}' is defined as: \n\t{v}")
```
    >>> The Feature 'PrecursorMz' is defined as:
        Measured precursor m/z
    >>> The Feature 'Mass' is defined as:
        Measured precursor mass (Da)
    >>> The Feature 'PrecursorCharge' is defined as:
        Measured precursor charge, from the isotope envelope
    >>> The Feature 'PepLength' is defined as:
        Length of the peptide sequence in amino acids
    >>> The Feature 'NumBulky' is defined as:
        Number of bulky amino acids (LVIFWY)
    >>> The Feature 'NumTiny' is defined as:
        Number of tiny amino acids (AS)
    >>> The Feature 'NumProlines' is defined as:
        Number of proline residues
    >>> The Feature 'NumGlycines' is defined as:
        Number of glycine residues
    >>> The Feature 'NumSerines' is defined as:
        Number of serine residues
    >>> The Feature 'NumPos' is defined as:
        Number of positive amino acids (KRH)
    >>> The Feature 'PosIndexL' is defined as:
        Relative position of the first positive amino acid (KRH)
    >>> The Feature 'PosIndexR' is defined as:
        Relative position of the last positive amino acid (KRH)
    >>> The Feature 'NumNeg' is defined as:
        Number of negative amino acids (DE)
    >>> The Feature 'NegIndexL' is defined as:
        Relative position of the first negative amino acid (DE)
    >>> The Feature 'NegIndexR' is defined as:
        Relative position of the last negative amino acid (DE)
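The sequence-derived features can be illustrated with a few lines of plain Python. This is a re-implementation sketched from the descriptions above, not the code in `flimsay/features.py`; two details are assumptions inferred from the example table in the usage section: the right-hand index is counted from the C-terminus, and both indices default to 1.0 when no matching residue is present.

```python
BULKY = set("LVIFWY")
TINY = set("AS")
POSITIVE = set("KRH")
NEGATIVE = set("DE")


def relative_positions(sequence, residues):
    """Relative position of the first match (from the left) and the last
    match (counted from the right end). Returns (1.0, 1.0) if none match;
    this default is an assumption based on the example table."""
    hits = [i for i, aa in enumerate(sequence) if aa in residues]
    if not hits:
        return 1.0, 1.0
    n = len(sequence)
    return hits[0] / n, (n - 1 - hits[-1]) / n


def sequence_features(sequence):
    """Compute the sequence-only features (no mass, charge or m/z)."""
    pos_l, pos_r = relative_positions(sequence, POSITIVE)
    neg_l, neg_r = relative_positions(sequence, NEGATIVE)
    return {
        "PepLength": len(sequence),
        "NumBulky": sum(aa in BULKY for aa in sequence),
        "NumTiny": sum(aa in TINY for aa in sequence),
        "NumProlines": sequence.count("P"),
        "NumGlycines": sequence.count("G"),
        "NumSerines": sequence.count("S"),
        "NumPos": sum(aa in POSITIVE for aa in sequence),
        "PosIndexL": pos_l,
        "PosIndexR": pos_r,
        "NumNeg": sum(aa in NEGATIVE for aa in sequence),
        "NegIndexL": neg_l,
        "NegIndexR": neg_r,
    }


for peptide in ["LESLIEK", "LESLIE", "LESLKIE"]:
    print(peptide, sequence_features(peptide))
```

Every feature is a count or a normalized position, which is what keeps the model inputs directly interpretable.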
## Training
Currently the training logic is handled using DVC (https://dvc.org).
``` shell
git clone {this repo}
cd flimsay/train
dvc repro
```
Running this should automatically download the data, train the models,
and calculate and update the metrics.
The current version of this repo predominantly uses the data from:
- Meier, F., Köhler, N.D., Brunner, AD. et al. Deep learning the
  collisional cross sections of the peptide universe from a million
  experimental values. Nat Commun 12, 1185 (2021).
  https://doi.org/10.1038/s41467-021-21352-8
Raw data
{
"_id": null,
"home_page": "",
"name": "flimsay",
"maintainer": "",
"docs_url": null,
"requires_python": "<3.12,>=3.9",
"maintainer_email": "",
"keywords": "proteomics,dia,ion mobility",
"author": "",
"author_email": "Sebastian Paez <spaez@talus.bio>",
"download_url": "https://files.pythonhosted.org/packages/aa/83/ebabf023889f5f8c73546178947251ff60975bd87a9ae32fec2ea4a7f32a/flimsay-0.4.0.tar.gz",
"platform": null,
"description": "# Flimsay: Fun/Fast Simple IMS Anyone like You can use.\nSebastian Paez\n\nversion = 0.4.0\n\nThis repository implements a very simple LGBM model to predict ion\nmobility from peptides.\n\n## Usage\n\nThere are two main ways to interact with `flimsay`, one is using python\nand the other one is using the python api directly.\n\n### CLI\n\n``` shell\n$ pip install flimsay\n```\n\n``` shell\n$ flimsay fill_blib mylibrary.blib # This will add ion mobility data to a .blib file.\n```\n\n``` python\n! flimsay fill_blib --help\n```\n\n \n Usage: flimsay fill_blib [OPTIONS] BLIB OUT_BLIB \n \n Add ion mobility prediction to a .blib file. \n \n \u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n \u2502 --overwrite Whether to overwrite output file, if it exists \u2502\n \u2502 --help Show this message and exit. 
\u2502\n \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n\n### Python\n\n#### Single peptide\n\n``` python\nfrom flimsay.model import FlimsayModel\n\nmodel_instance = FlimsayModel()\nmodel_instance.predict_peptide(\"MYPEPTIDEK\", charge=2)\n```\n\n {'ccs': array([363.36245907]), 'one_over_k0': array([0.92423264])}\n\n#### Many peptides at once\n\n``` python\nimport pandas as pd\nfrom flimsay.features import add_features, FEATURE_COLUMNS\n\ndf = pd.DataFrame({\n \"Stripped_Seqs\": [\"LESLIEK\", \"LESLIE\", \"LESLKIE\"]\n})\ndf = add_features(\n df,\n stripped_sequence_name=\"Stripped_Seqs\",\n calc_masses=True,\n default_charge=2,\n)\ndf\n```\n\n 2023-07-25 21:14:45.792 | WARNING | flimsay.features:add_features:163 - Charge not provided, using default charge of 2\n\n<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n .dataframe tbody tr th {\n vertical-align: top;\n }\n .dataframe thead th {\n text-align: right;\n }\n</style>\n\n| | Stripped_Seqs | StrippedPeptide | PepLength | NumBulky | NumTiny | NumProlines | NumGlycines | NumSerines | NumPos | PosIndexL | PosIndexR | NumNeg | NegIndexL | NegIndexR | Mass | PrecursorCharge | PrecursorMz |\n|-----|---------------|-----------------|-----------|----------|---------|-------------|-------------|------------|--------|-----------|-----------|--------|-----------|-----------|------------|-----------------|-------------|\n| 0 | LESLIEK | LESLIEK | 7 | 3 | 1 | 0 | 0 | 1 | 1 | 0.857143 | 0.000000 | 2 | 0.142857 | 0.142857 | 830.474934 | 2 | 416.245292 |\n| 
1 | LESLIE | LESLIE | 6 | 3 | 1 | 0 | 0 | 1 | 0 | 1.000000 | 1.000000 | 2 | 0.166667 | 0.000000 | 702.379971 | 2 | 352.197811 |\n| 2 | LESLKIE | LESLKIE | 7 | 3 | 1 | 0 | 0 | 1 | 1 | 0.571429 | 0.285714 | 2 | 0.142857 | 0.000000 | 830.474934 | 2 | 416.245292 |\n\n</div>\n\n``` python\nmodel_instance.predict(df[FEATURE_COLUMNS])\n```\n\n {'ccs': array([315.32424627, 306.70134752, 314.87268797]),\n 'one_over_k0': array([0.78718781, 0.72658194, 0.78525451])}\n\n## Performance\n\n### Prediction Performance\n\n![](https://github.com/TalusBio/flimsay/blob/main/train/plots/one_over_k0_model_ims_pred_vs_true.png)\n\n![](https://github.com/TalusBio/flimsay/blob/main/train/plots/ccs_predicted_vs_real.png)\n\n### Prediction Speed\n\n#### Single peptide prediction\n\n``` python\nfrom flimsay.model import FlimsayModel\n\nmodel_instance = FlimsayModel()\n\n%timeit model_instance.predict_peptide(\"MYPEPTIDEK\", charge=3)\n```\n\n 174 \u00b5s \u00b1 942 ns per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\nIn my laptop that takes 133 microseconds per peptide, or roughly 7,500\npeptides per second.\n\n#### Batch Prediction\n\n``` python\n# Lets make a dataset of 1M peptides to test\nimport random\nimport pandas as pd\nfrom flimsay.features import calc_mass, mass_to_mz, add_features\n\nrandom.seed(42)\nAMINO_ACIDS = list(\"ACDEFGHIKLMNPQRSTVWY\")\ncharges = [2,3,4]\n\nseqs = [random.sample(AMINO_ACIDS, 10) for _ in range(1_000_000)]\ncharges = [random.sample(charges, 1)[0] for _ in range(1_000_000)]\nseqs = [\"\".join(seq) for seq in seqs]\nmasses = [calc_mass(x) for x in seqs]\nmzs = [mass_to_mz(m, c) for m, c in zip(masses, charges)]\n\ndf = pd.DataFrame({\n \"Stripped_Seqs\": seqs,\n \"PrecursorCharge\": charges,\n \"Mass\": masses,\n \"PrecursorMz\": mzs})\ndf = add_features(df, stripped_sequence_name=\"Stripped_Seqs\")\n\n\n# Now we get to run the prediction!\n%timeit model_instance.predict(df)\n```\n\n 20.6 s \u00b1 76.1 ms per loop (mean \u00b1 std. dev. 
of 7 runs, 1 loop each)\n\nIn my system every million peptides is predicted in 8.86 seconds, that is \n113,000 per second.\n\n## Motivation\n\nThere is a fair amount of data on CCS and ion mobility of peptides but\nonly very few models actually use features that are directly\ninterpretable.\n\nIn addition, having a simpler model allows faster predictions in systems\nthat are not equiped with GPUs.\n\nTherefore, this project aims to create a freely available, easy to use,\ninterpretable and fast model to predict ion mobility and collisional\ncross-section for peptides.\n\n## Features\n\nThe features used for prediction are meant to be simple and their\nimplementation can be found here:\n[flimsy/features.py](flimsy/features.py)\n\n``` python\nfrom flimsay.features import FEATURE_COLUMN_DESCRIPTIONS\nfor k,v in FEATURE_COLUMN_DESCRIPTIONS.items():\n print(f\">>> The Feature '{k}' is defined as: \\n\\t{v}\")\n```\n\n >>> The Feature 'PrecursorMz' is defined as: \n Measured precursor m/z\n >>> The Feature 'Mass' is defined as: \n Measured precursor mass (Da)\n >>> The Feature 'PrecursorCharge' is defined as: \n Measured precursor charge, from the isotope envelope\n >>> The Feature 'PepLength' is defined as: \n Length of the peptide sequence in amino acids\n >>> The Feature 'NumBulky' is defined as: \n Number of bulky amino acids (LVIFWY)\n >>> The Feature 'NumTiny' is defined as: \n Number of tiny amino acids (AS)\n >>> The Feature 'NumProlines' is defined as: \n Number of proline residues\n >>> The Feature 'NumGlycines' is defined as: \n Number of glycine residues\n >>> The Feature 'NumSerines' is defined as: \n Number of serine residues\n >>> The Feature 'NumPos' is defined as: \n Number of positive amino acids (KRH)\n >>> The Feature 'PosIndexL' is defined as: \n Relative position of the first positive amino acid (KRH)\n >>> The Feature 'PosIndexR' is defined as: \n Relative position of the last positive amino acid (KRH)\n >>> The Feature 'NumNeg' is defined as: \n 
Number of negative amino acids (DE)\n >>> The Feature 'NegIndexL' is defined as: \n Relative position of the first negative amino acid (DE)\n >>> The Feature 'NegIndexR' is defined as: \n Relative position of the last negative amino acid (DE)\n\n## Training\n\nCurrently the training logic is handled using DVC (https://dvc.org).\n\n``` shell\ngit clone {this repo}\ncd flimsay/train\ndvc repro\n```\n\nRunning this should automatically download the data, trian the models,\ncalculate and update the metrics.\n\nThe current version of this repo uses predominantly the data from: -\nMeier, F., K\u00f6hler, N.D., Brunner, AD. et al.\u00a0Deep learning the\ncollisional cross sections of the peptide universe from a million\nexperimental values. Nat Commun 12, 1185 (2021).\nhttps://doi.org/10.1038/s41467-021-21352-8\n",
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "A super simple fast IMS predictor",
"version": "0.4.0",
"project_urls": {
"Documentation": "https://TalusBio.github.io/flimsay",
"Homepage": "https://github.com/TalusBio/flimsay"
},
"split_keywords": [
"proteomics",
"dia",
"ion mobility"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "be9de5045fdf3e323eba27f0c01a9daea230fb5bfe8c86bfde6046f7e19cf5e7",
"md5": "7d7d3fe8192e74c07b00bde35db5444c",
"sha256": "47a17d8aa91e0ae09bdcdd09599cef2dd90ddead3ed8e01bd3e8d22b326b737d"
},
"downloads": -1,
"filename": "flimsay-0.4.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7d7d3fe8192e74c07b00bde35db5444c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.12,>=3.9",
"size": 1488619,
"upload_time": "2023-07-25T21:18:26",
"upload_time_iso_8601": "2023-07-25T21:18:26.225647Z",
"url": "https://files.pythonhosted.org/packages/be/9d/e5045fdf3e323eba27f0c01a9daea230fb5bfe8c86bfde6046f7e19cf5e7/flimsay-0.4.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "aa83ebabf023889f5f8c73546178947251ff60975bd87a9ae32fec2ea4a7f32a",
"md5": "70e516549ed80544c99bc1e57f6a6a00",
"sha256": "dc3a7afdc8821b4aa15a86ef14d3be44892dbd6b22190eac4a74349132bf5bfc"
},
"downloads": -1,
"filename": "flimsay-0.4.0.tar.gz",
"has_sig": false,
"md5_digest": "70e516549ed80544c99bc1e57f6a6a00",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.12,>=3.9",
"size": 1481452,
"upload_time": "2023-07-25T21:18:28",
"upload_time_iso_8601": "2023-07-25T21:18:28.403406Z",
"url": "https://files.pythonhosted.org/packages/aa/83/ebabf023889f5f8c73546178947251ff60975bd87a9ae32fec2ea4a7f32a/flimsay-0.4.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-25 21:18:28",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "TalusBio",
"github_project": "flimsay",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "flimsay"
}