# ProDEC
A package to easily calculate descriptors of protein sequences and their common transforms.
## Installation

    pip install prodec
## Getting started
ProDEC is organised in four classes:
1. **ProteinDescriptors** - loads all available descriptors and allows you to instantiate them
2. **Descriptor** - instantiated from **ProteinDescriptors**, allows retrieval of raw descriptor values
3. **Transform** - calculates domain averages, auto-cross covariances (ACC), physicochemical distance transformations (PDT) and fast Fourier transforms (FFT)
4. **TransformType** - identifies the transform to be performed
Let us get the largest protein sequence from UniProt (*as of May 29th, 2020*).

    import urllib.request

    url = 'https://www.uniprot.org/uniprot/A0A5A9P0L4.fasta'
    with urllib.request.urlopen(url) as data:
        # Skip the FASTA header (first line) and join the remaining lines
        sequence = ''.join([line.decode('ascii').strip() for line in data][1:])
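The download step above boils down to dropping the FASTA header and joining the sequence lines. A minimal offline sketch of that parsing step follows; the helper name `fasta_to_sequence` and the toy record are hypothetical and not part of ProDEC.

```python
def fasta_to_sequence(fasta_text: str) -> str:
    """Return the residues of a single-record FASTA string as one line."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    # Drop the '>' header line and any blank lines, then join the rest.
    return ''.join(line for line in lines if line and not line.startswith('>'))


# Toy record (made up for illustration, not a real UniProt entry).
example = ">sp|TOY|EXAMPLE toy protein\nMKTAYIAK\nQRQISFVK\n"
print(fasta_to_sequence(example))  # MKTAYIAKQRQISFVK
```

Filtering on the `>` prefix rather than slicing off the first line also tolerates blank lines in the record.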
First load the available descriptors:

    from prodec import ProteinDescriptors, Transform, TransformType
    pdescs = ProteinDescriptors()
and print their IDs:

    print(pdescs.available_descriptors)
Instantiate the descriptor whose ID corresponds to Zscales (Hellberg *et al.* 1987).

    zscales = pdescs.get_descriptor('Zscale Hellberg')
Get information about the descriptor as defined in the original article

    print(zscales.summary)
and the values defined for each amino acid.

    print(zscales.definition)

Now, obtain the descriptor values for the protein sequence.

    raw_values = zscales.get(sequence)
To transform raw values, first identify the available transforms (static method).

    print(Transform.available_transforms())
Let us instantiate the desired transform (here domain averages)

    avg_zscale = Transform(TransformType.AVG, zscales)
and obtain 50 domain averages (defaults to 2 if not specified).

    avg_values = avg_zscale.get(sequence, domains=50)
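To illustrate what a domain average is, here is a minimal pure-Python sketch that splits one dimension of per-residue values into contiguous, near-equal chunks and averages each. The function `domain_averages` and its splitting rule are assumptions for illustration; ProDEC's own implementation may split the sequence differently.

```python
from statistics import mean


def domain_averages(values, domains=2):
    """Average `values` over `domains` contiguous, near-equal chunks."""
    n = len(values)
    if not 1 <= domains <= n:
        raise ValueError("domains must be between 1 and len(values)")
    # Chunk k spans [k*n//domains, (k+1)*n//domains): sizes differ by at most 1.
    bounds = [k * n // domains for k in range(domains + 1)]
    return [mean(values[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]


toy = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0]  # made-up values for one dimension
print(domain_averages(toy, domains=3))  # [2.0, 6.0, 3.0]
```

The output length depends only on the number of domains, so sequences of different lengths yield vectors of the same size.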
One can get information about the transform.

    print(avg_zscale.summary)
Similarly, ACC, PDT and FFT can be obtained with

    acc_zscale = Transform(TransformType.ACC, zscales)
    # or Transform('ACC', zscales)
    acc_values = acc_zscale.get(sequence, lag=10) # default lag=1
    pdt_zscale = Transform(TransformType.PDT, zscales)
    # or Transform('PDT', zscales)
    pdt_values = pdt_zscale.get(sequence, lag=100) # default lag=1
    fft_zscale = Transform(TransformType.FFT, zscales)
    # or Transform('FFT', zscales)
    fft_values = fft_zscale.get(sequence)
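For readers unfamiliar with the role of the `lag` argument, the sketch below shows commonly used textbook forms of a lagged cross-covariance and of a distance transformation on toy data. The function names and the normalisation by `n - lag` are assumptions; ProDEC may centre or scale the values differently.

```python
def cross_covariance(a, b, lag=1):
    """Mean of a[i] * b[i + lag] over all valid positions i."""
    n = len(a)
    if len(b) != n or not 0 < lag < n:
        raise ValueError("need equal lengths and 0 < lag < len(a)")
    return sum(a[i] * b[i + lag] for i in range(n - lag)) / (n - lag)


def distance_transform(a, lag=1):
    """Mean squared difference between values `lag` positions apart."""
    n = len(a)
    if not 0 < lag < n:
        raise ValueError("need 0 < lag < len(a)")
    return sum((a[i] - a[i + lag]) ** 2 for i in range(n - lag)) / (n - lag)


z1 = [1.0, 2.0, 3.0, 4.0]  # made-up values of a first dimension
z2 = [0.0, 1.0, 0.0, 1.0]  # made-up values of a second dimension
print(cross_covariance(z1, z1, lag=1))  # auto-covariance: (2 + 6 + 12) / 3
print(cross_covariance(z1, z2, lag=1))  # cross-covariance: (1 + 0 + 3) / 3
print(distance_transform(z1, lag=2))    # ((1-3)**2 + (2-4)**2) / 2 = 4.0
```

A larger lag relates residues that are further apart in the sequence, and it must stay smaller than the sequence length.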
## Advanced usage
### Descriptors
- ***Flattening raw values***
When several values are defined per amino acid, the resulting sequence descriptors are flattened by default: one gets a single list in which the values of each amino acid are contiguous.
Flattening can be turned off, resulting in a list of lists with one sub-list per dimension (e.g. for Zscales Hellberg, a list of 3 sub-lists, the first holding the values of the first dimension for the whole sequence).

    zscales.get(sequence, flatten=False)
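The two layouts can be pictured with a toy three-dimensional table. This is only a sketch of the ordering described above: `TOY_TABLE` holds made-up numbers (not real Z-scale values) and `encode` is a hypothetical helper, not a ProDEC function.

```python
# Toy 3-dimensional descriptor table (made-up numbers, not real Z-scales).
TOY_TABLE = {'A': [0.1, 0.2, 0.3], 'C': [1.1, 1.2, 1.3], 'D': [2.1, 2.2, 2.3]}


def encode(sequence, flatten=True):
    per_residue = [TOY_TABLE[aa] for aa in sequence]
    if flatten:
        # Residue-major order: all values of residue 1, then residue 2, ...
        return [v for values in per_residue for v in values]
    # Dimension-major order: one sub-list per descriptor dimension.
    return [list(dim) for dim in zip(*per_residue)]


print(encode('ACD'))
# [0.1, 0.2, 0.3, 1.1, 1.2, 1.3, 2.1, 2.2, 2.3]
print(encode('ACD', flatten=False))
# [[0.1, 1.1, 2.1], [0.2, 1.2, 2.2], [0.3, 1.3, 2.3]]
```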
- ***Dealing with gaps***
In the case of aligned sequences, one may want to omit gaps. By default, gaps are kept and given a value of 0.0. Gaps can either be omitted like so:

    zscales.get(sequence, gaps='omit')
or given any arbitrary value

    zscales.get(sequence, gaps=-1)
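The three behaviours (default value, omission, arbitrary value) can be sketched on a toy single-value scale. `TOY_SCALE`, the helper `encode_with_gaps` and the use of `-` as the gap character are assumptions made for this illustration only.

```python
TOY_SCALE = {'A': 1.5, 'C': -0.5, 'D': 2.0}  # made-up single-value scale


def encode_with_gaps(sequence, gaps=0.0, gap_char='-'):
    values = []
    for aa in sequence:
        if aa == gap_char:
            if gaps == 'omit':
                continue  # skip the gap entirely
            values.append(float(gaps))  # substitute the chosen gap value
        else:
            values.append(TOY_SCALE[aa])
    return values


print(encode_with_gaps('A-CD'))               # [1.5, 0.0, -0.5, 2.0]
print(encode_with_gaps('A-CD', gaps='omit'))  # [1.5, -0.5, 2.0]
print(encode_with_gaps('A-CD', gaps=-1))      # [1.5, -1.0, -0.5, 2.0]
```

Keeping gaps preserves alignment positions across sequences, whereas omitting them shortens the output.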
- ***Non-standard amino acids***
If working with an alphabet other than the 20 standard amino acids, one can provide the amino acids in use through the *dictionary* argument. This is only possible if the user defines their own descriptor supporting these amino acids.

    pdescs = ProteinDescriptors()
    mydesc = pdescs.get_descriptor('Descriptor supporting Selenocysteine and Pyrrolysine')
    mydesc.get(sequence, dictionary=list('ACDEFGHIKLMNOPQRSTUVWY'))
- ***Raychaudhury's descriptor***
Raychaudhury *et al.*'s values can be weighted by different powers (*default: -4*).

    pdescs = ProteinDescriptors()
    raych = pdescs.get_descriptor('Raychaudhury')
    raych.get(sequence, power=-3)

The exact calculation of Raychaudhury's values is ***O(n²)***. To speed it up, a sliding-window optimization is applied, resulting in an ***O(n)*** algorithm. By default the window width is set to 120, giving accuracy to the third decimal place. One may change the width by specifying the precision *prec* (half of the window size).

    raych.get(sequence, prec=80) # Window size = 160
To turn the optimization off and get full precision:

    raych.get(sequence, prec=0)
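Why a window loses so little accuracy can be seen with a generic distance-weighted sum. The sketch below is not Raychaudhury's actual formula, which is not reproduced here: `weighted_sums` is a hypothetical stand-in that weights each neighbour by its sequence distance raised to `power`, and reuses the names `power` and `prec` only to mirror the arguments above.

```python
def weighted_sums(values, power=-4, prec=0):
    """For each position i, sum values[j] * |i - j|**power over j != i.

    prec > 0 restricts j to the window [i - prec, i + prec];
    prec == 0 uses every position (the exact, quadratic computation).
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = 0 if prec == 0 else max(0, i - prec)
        hi = n if prec == 0 else min(n, i + prec + 1)
        out.append(sum(values[j] * abs(i - j) ** power
                       for j in range(lo, hi) if j != i))
    return out


values = [1.0] * 500                      # made-up per-residue values
exact = weighted_sums(values)             # every pair: O(n^2)
approx = weighted_sums(values, prec=60)   # 120-wide window: O(n * window)
max_error = max(abs(e - a) for e, a in zip(exact, approx))
print(max_error < 1e-3)  # True
```

With a power of -4 the contribution of distant residues decays quickly, so truncating the sum changes the result only marginally on this toy input.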
### Transforms
- ***Compatibility***
Some transforms cannot be calculated for binary descriptors, while others can only be calculated with binary descriptors. One can check for compatibility between a transform and a descriptor.

    psm = pdescs.get_descriptor('PSM')
    Transform.is_compatible('AVG', 'PSM')
- ***Transforms and advanced descriptor arguments***
All arguments a *Descriptor* accepts can be supplied to a transform's *get* method.

    pdt_zscale.get(sequence, lag=10, average=False, flatten=False)
    raych = pdescs.get_descriptor('Raychaudhury')
    acc_raych = Transform('ACC', raych)
    acc_raych.get(sequence, power=-3, gaps='omit', prec=100, flatten=False, lag=12)
### Adding new descriptors
Supplied descriptors are described in the file named *data.json* under the *src* folder.
The list of available descriptors is loaded from the *data.json* file when **ProteinDescriptors** is instantiated.
Add your favorite descriptor to the list, respecting the format of the file and giving it a unique ID, for it to be available.
### Checking a descriptor for amino acid support
One can check the compatibility of a custom descriptor with any sequence.

    vstv = pdescs.get_descriptor('VSTV')
    vstv.is_sequence_valid('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
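Conceptually, such a check verifies that every letter of the sequence has a value in the descriptor. A minimal sketch follows, assuming a descriptor that covers only the 20 standard amino acids; `supports_sequence` is a hypothetical helper and does not reflect how ProDEC implements `is_sequence_valid`.

```python
STANDARD = 'ACDEFGHIKLMNPQRSTVWY'  # the 20 standard amino acids


def supports_sequence(sequence, supported=STANDARD):
    """True if every character of `sequence` is in `supported`."""
    return set(sequence) <= set(supported)


print(supports_sequence('ACDW'))                        # True
print(supports_sequence('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))  # False
```

The full alphabet fails the check because letters such as B, J, O, U, X and Z are not among the 20 standard amino acids.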