# lXtractor
[![Coverage Status](https://coveralls.io/repos/github/edikedik/lXtractor/badge.svg?branch=master)](https://coveralls.io/github/edikedik/lXtractor?branch=master)
[![Documentation status](https://readthedocs.org/projects/lxtractor/badge/?version=latest)](https://lxtractor.readthedocs.io/en/latest/?badge=latest)
[![PyPi status](https://img.shields.io/pypi/v/lXtractor.svg)](https://pypi.org/project/lXtractor)
[![Python version](https://img.shields.io/pypi/pyversions/lXtractor.svg)](https://pypi.org/project/lXtractor)
[![Hatch project](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://github.com/pypa/hatch)
<img src="./fig/lXt_diagram.png" alt="lXt_diagram" width="300"/>
## Introduction
`lXtractor` is a toolbox devoted to feature extraction from macromolecular
sequences and structures.
It's tailored towards creating shareable local data collections anchored to
a reference sequence-based object: a single sequence, an MSA, or an HMM.
Currently, it doesn't define any unique algorithms, aiming at simplicity and
transparency.
It simply provides a (hopefully) convenient interface simplifying mundane tasks,
such as fetching the data, extracting domains, mapping sequences, and computing
sequence and structure variables.
Sequences and structures anchored to a single reference object have the benefit
of interpretability in downstream applications, such as fitting interpretable
ML models.
## Installation
`lXtractor` requires Python >= 3.10 on a Unix system and is
installable via pip:
```bash
pip install lXtractor
```
We encourage users to first create a virtual environment via `conda` or `mamba`.
## Usage
`lXtractor` is designed to be flexible and its usage is defined by the initial
hypothesis or a reference object that one wants to extrapolate towards the
existing sequences or structures.
Below, we'll provide a very abstract description of what this package is
intended for.
In creating data collections, one could define the following steps:
1. Assemble the data.
2. Map the reference object to the sequences of the assembled entries.
3. Filter hits.
4. Define and calculate variables -- sequence or structure descriptors.
5. Save the data for later usage or modifications.
`lXtractor` defines objects and routines helpful throughout this process.
Namely, `PDB`, `SIFTS`, `AlphaFold`, and `fetch_uniprot()`
can aid in the first step.
Then, `Alignment` and `PyHMMer` can facilitate step 2.
At the end of step 2, one gets a collection of `Chain*`-type objects.
If working with sequence-only collections, these are going to be
`ChainSequence` objects.
For structure-only data, these are going to be `ChainStructure` containers,
embedding `ChainSequence` and `GenericStructure` objects.
Finally, mapping a canonical sequence to a group of structures associated with
it results in `Chain` objects.
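To make the idea of mapping in step 2 more concrete, here is a minimal toy sketch. It does not use `Alignment` or `PyHMMer` and is not how `lXtractor` performs the mapping: it relies on exact matching blocks found by the standard library's `difflib.SequenceMatcher`, and the function name `map_reference` and both sequences are made up for illustration.

```python
from difflib import SequenceMatcher


def map_reference(reference: str, target: str) -> dict[int, int]:
    """Map 1-based reference positions to 1-based target positions.

    Toy stand-in for a real alignment: only residues inside identical
    matching blocks are mapped; everything else is left unmapped.
    """
    matcher = SequenceMatcher(a=reference, b=target, autojunk=False)
    mapping: dict[int, int] = {}
    for block in matcher.get_matching_blocks():
        # The final block is a zero-size sentinel, so the loop body is skipped.
        for offset in range(block.size):
            mapping[block.a + offset + 1] = block.b + offset + 1
    return mapping


# Hypothetical reference motif and a longer target sequence containing it.
reference = "GAGSFG"
target = "MKGAGSFGTV"
mapping = map_reference(reference, target)
print(mapping)
```

Each target residue that received a reference position can then be compared across entries, which is what makes a collection anchored to one reference interpretable.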
`ChainList` wraps `Chain*`-type objects into a list-like collection with
useful operations that allow one to quickly filter and bulk-modify `Chain*`-type
objects.
Thus, filtering typically comes down to using the `ChainList.filter()` method,
which accepts a `Callable[Chain*, bool]` and returns a filtered `ChainList`.
One can save/load the collected objects using `ChainIO` and proceed
with the feature extraction.
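The predicate-based filtering described above can be pictured with a small stand-in. The classes `ToySequence` and `ToyChainList` below are hypothetical and only mimic the pattern (a list-like container whose `filter` takes a callable and returns a new container of the same type); they are not `lXtractor`'s `ChainSequence` or `ChainList`, whose actual signatures should be checked in the documentation.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySequence:
    """Hypothetical stand-in for a sequence-only chain object."""

    name: str
    seq: str


class ToyChainList(list):
    """Hypothetical list-like container; not lXtractor's ChainList."""

    def filter(self, predicate: Callable[[ToySequence], bool]) -> "ToyChainList":
        # Keep only the items for which the predicate returns True.
        return ToyChainList(item for item in self if predicate(item))


chains = ToyChainList(
    [
        ToySequence("a", "MKGAGSFGTV"),
        ToySequence("b", "MK"),
        ToySequence("c", "GAGSFGAAAA"),
    ]
)

# Step 3 ("filter hits") expressed as a predicate on each object.
long_enough = chains.filter(lambda c: len(c.seq) >= 5)
names = [c.name for c in long_enough]
print(names)
```

Because `filter` returns the same container type, filters can be chained one after another without converting back and forth.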
`lXtractor` defines various sequence and structure variables.
Variable-related operations are handled by `GenericCalculator` and
`Manager` classes. The former defines the calculation strategy and how
the calculations are parallelized, while the latter handles the calculations
and aggregates the results into a pandas `DataFrame`.
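The division of labour between a calculation strategy and a component that aggregates results can be sketched as follows. This is a toy illustration built on the standard library only; it does not use `GenericCalculator` or `Manager`, and the variable functions, the `VARIABLES` registry, and `calculate_table` are all assumptions made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical residue set used by a made-up sequence variable.
HYDROPHOBIC = set("AVILMFWY")


def seq_length(seq: str) -> int:
    return len(seq)


def hydrophobic_fraction(seq: str) -> float:
    return sum(res in HYDROPHOBIC for res in seq) / len(seq)


# Hypothetical registry of variables: name -> function of a sequence.
VARIABLES = {"length": seq_length, "hydrophobic_fraction": hydrophobic_fraction}


def calculate_row(item: tuple[str, str]) -> dict:
    """Compute every registered variable for one (name, sequence) pair."""
    name, seq = item
    row: dict = {"name": name}
    for var_name, func in VARIABLES.items():
        row[var_name] = func(seq)
    return row


def calculate_table(sequences: dict[str, str], workers: int = 2) -> list[dict]:
    """Run the calculations in a thread pool and collect one row per object."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of the sequences.
        return list(pool.map(calculate_row, sequences.items()))


table = calculate_table({"a": "MKVA", "b": "GG"})
print(table)
```

The resulting list of row dictionaries has the shape that `pandas.DataFrame(table)` accepts, which is one way to arrive at a table with one row per object and one column per variable.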
As a result, one is left with a collection of `Chain*`-type objects and a
table with calculated variables. In addition, one can store the calculated
variables within the objects themselves, although we currently do not encourage
this practice.
`lXtractor` is in the experimental stage and under active development.
Thus, objects' interfaces may change.
For the time being, one can check the examples of
1. [finding sequence determinants](https://eboruta.readthedocs.io/en/latest/notebooks/sequence_determinants_tutorial.html)
of tyrosine and serine-threonine kinases and
2. [a protocol](https://github.com/edikedik/kinactive/blob/abae9c8a1fca0754d02e3f117dee210b587e666b/kinactive/db.py#L142)
to build a complete structural collection of protein kinase domains.
More examples are to come in the future, so stay tuned. If you know a good use case for `lXtractor`, feel free to raise an issue or reach out at [ivan.reveguk@gmail.com](mailto:ivan.reveguk@gmail.com).