# evopython
`evopython` is an object-oriented Python package designed for genome-scale
feature resolution from whole-genome alignment data.
- [Installation](#installation)
- [Usage](#usage)
- [Documentation](#documentation)
- [Testing](#testing)
---
## Installation
`evopython` depends only on
[`biopython`](https://github.com/biopython/biopython) and can be installed with
```commandline
pip install evopython
```
## Usage
`evopython` parses genome annotation data and whole-genome alignment data,
providing an interface for accessing the former in the context of the latter.
*Ensembl* is a great resource for both; whole-genome alignment
[MAF](https://genome.ucsc.edu/FAQ/FAQformat.html#format5) files and gene
annotation [GTF](https://genome.ucsc.edu/FAQ/FAQformat.html#format4) files can
be downloaded from their FTP site, indexed
[here](https://useast.ensembl.org/info/data/ftp/index.html).
`evopython` was designed to support a linear, progressive form of analysis,
where
1. `GTF` or `BED` instances are initialized with their respective files;
2. these dictionary-like data structures are queried for instances of the
`Feature` class; and
3. these `Feature` instances are resolved from a pairwise or multiple
whole-genome alignment, represented with the `MAF` class.
In general, we have analyses of the form:
```python
from evopython import GTF, MAF

genes = GTF("path/to/genes")
wga = MAF("path/to/wga", aligned_on="species_name")

for gene_name in genes:
    feat = genes[gene_name]['feat']
    alignments = wga.get(feat)

    if len(alignments) == 1:
        # The alignment is contiguous; do something.
        pass
    else:
        # The alignment is discontiguous; do something else.
        pass
```
We can parse any feature type in the GTF (see the `GTF` description
below) and further generate derived, secondary features with the `Feature`
class's `pad` method (see the `Feature` description below).
For specific usage examples, see the Jupyter notebooks in the **examples**
directory.
## Documentation
### class `evopython.GTF(gtf: str, types: tuple = tuple())`
> A nested `dict` mapping gene name to feature name to a list of `Feature`
> instances; each high-level gene `dict` has two additional keys, `attr` and
> `feat`, with the former mapping to a `dict` with annotated information such
> as "gene_biotype" indicating whether the gene is protein-coding and the
> latter mapping to the gene's `Feature` instance.
>
> *Arguments*:
> - `gtf`: The GTF file path.
> - `types`: The feature types to parse; gene features are parsed by default.
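
To make the nesting concrete, the sketch below builds a plain `dict` by hand that mirrors the layout described above and filters it on `gene_biotype`. The gene name, the `exon` key (assuming feature names are feature types), the attribute values, and the placeholder strings standing in for `Feature` instances are all illustrative assumptions, not output produced by `evopython`.

```python
# Hypothetical stand-in for the mapping a GTF instance exposes; strings
# replace real Feature instances so the sketch runs without evopython.
genes = {
    "GENE_A": {
        "attr": {"gene_biotype": "protein_coding"},
        "feat": "<Feature for the whole gene>",
        "exon": ["<Feature for exon 1>", "<Feature for exon 2>"],
    },
    "GENE_B": {
        "attr": {"gene_biotype": "lncRNA"},
        "feat": "<Feature for the whole gene>",
    },
}


def protein_coding(genes):
    """Return the names of genes annotated as protein-coding."""
    return [
        name
        for name, gene in genes.items()
        if gene["attr"].get("gene_biotype") == "protein_coding"
    ]


coding = protein_coding(genes)
exon_count = len(genes["GENE_A"]["exon"])
```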
----
### class `evopython.BED(bed: str, on_name: bool = False)`
> A `dict` mapping locus `tuple` or name value to `Feature` instance; in the
> former case, loci have the form `(seqname, start, end, strand)`.
>
> *Arguments*:
> - `bed`: The BED file path.
> - `on_name`: A `bool` expressing whether name field values should be
used as keys to the features.
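
The two keying schemes can be sketched without the package. The helper below, `parse_bed_lines`, is hypothetical and is not part of `evopython`; it assumes the standard six-column BED layout (chrom, start, end, name, score, strand) and stores plain locus tuples where the real class would store `Feature` instances. Keying on name is convenient for lookups, but only safe when the name field is unique.

```python
def parse_bed_lines(lines, on_name=False):
    """Key BED records by locus tuple or, if on_name is set, by name."""
    records = {}
    for line in lines:
        # Skip blank lines and the header lines BED files may carry.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name, _score, strand = line.split()[:6]
        locus = (chrom, int(start), int(end), strand)
        records[name if on_name else locus] = locus
    return records


bed_lines = [
    "chr1\t100\t200\tpeak_a\t0\t+",
    "chr2\t50\t75\tpeak_b\t0\t-",
]
by_locus = parse_bed_lines(bed_lines)
by_name = parse_bed_lines(bed_lines, on_name=True)
```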
----
### class `evopython.Feature`
> A stranded, genomic feature.
>
> ### Attributes:
> - `chrom`: The chromosome name.
> - `start`: The forward-mapped, 0-based, inclusive starting coordinate.
> - `end`: The forward-mapped, 0-based, exclusive ending coordinate.
> - `strand`: The strand, plus or minus for forward or reverse.
> ----
> ### Instance properties:
> - `is_forward`: A `bool` expressing forward strand orientation.
> - `is_reverse`: A `bool` expressing reverse strand orientation.
> ----
> ### Methods:
>
> `locus(self, base: int = 0, strand: bool = False)`
>
> Returns the locus in a generic genome browser format.
>
> *Arguments*:
> - `base`: The coordinate system to use, 0 or 1, where the former is
> half-open on the end and the latter fully closed.
> - `strand`: A `bool` expressing whether to include the strand at the end
> of the locus.
>
> *Raises*:
> - `ValueError`: An invalid base was given.
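
The relationship between the two coordinate systems can be sketched as follows. The function `format_locus` is a hypothetical stand-alone helper, and the `chrom:start-end` output with an optional `:strand` suffix is an assumed format; the string actually returned by `Feature.locus` may differ.

```python
def format_locus(chrom, start, end, strand, base=0, include_strand=False):
    """Format a 0-based, half-open interval as a browser-style locus string."""
    if base not in (0, 1):
        raise ValueError(f"base must be 0 or 1, not {base!r}")
    # Converting [start, end) to a 1-based, closed interval shifts only the
    # start; the exclusive 0-based end equals the inclusive 1-based end.
    shown_start = start + base
    locus = f"{chrom}:{shown_start}-{end}"
    return f"{locus}:{strand}" if include_strand else locus


zero_based = format_locus("chr1", 99, 200, "+")
one_based = format_locus("chr1", 99, 200, "+", base=1, include_strand=True)
```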
> ----
> `pad(self, pad5: int, pad3: int, center: int = 0)`
>
> Pads the feature.
>
> Positive padding is tantamount to feature extension and negative
> padding to feature shrinkage; with centering, both can be used to
> derive features that do not overlap the source feature.
>
> *Arguments*:
> - `pad5`: The number of bases to add to the 5'-end of the feature.
> - `pad3`: The number of bases to add to the 3'-end of the feature.
> - `center`: 5, 3, or 0, indicating how to, or to not, center the padding:
passing 5 prompts 5'-centering, such that padding is applied on the 5'
coordinate; 3 likewise prompts 3'-centering; and 0, the default, prompts no
centering, such that the whole feature is padded.
>
> *Returns:*
> - A new, padded `Feature` instance.
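
The sketch below shows one plausible reading of the padding rules; it is not the package's code. `Interval` and `pad_interval` are hypothetical stand-ins, and the centering behavior is an assumption: with `center=5` or `center=3` the feature is first collapsed onto that end's coordinate, then padded. Under that reading, `pad_interval(gene, 500, 0, center=5)` yields the 500 bases immediately upstream of the gene, which do not overlap it. Coordinates are not clamped at zero here.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical stand-in for Feature: forward-mapped, 0-based, half-open."""
    chrom: str
    start: int
    end: int
    strand: str


def pad_interval(iv, pad5, pad3, center=0):
    """Pad iv in a strand-aware way under the assumptions stated above."""
    if center not in (0, 3, 5):
        raise ValueError("center must be 5, 3, or 0")
    forward = iv.strand == "+"
    # On the reverse strand the 5'-end is the larger forward coordinate.
    five, three = (iv.start, iv.end) if forward else (iv.end, iv.start)
    if center == 5:
        three = five
    elif center == 3:
        five = three
    # Moving toward 5' lowers forward-strand coordinates, raises reverse ones.
    step = 1 if forward else -1
    five -= step * pad5
    three += step * pad3
    return Interval(iv.chrom, min(five, three), max(five, three), iv.strand)


gene = Interval("chr1", 1000, 2000, "+")
extended = pad_interval(gene, 100, 50)
upstream = pad_interval(gene, 500, 0, center=5)
```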
----
### class `evopython.MAF`
> A resolver for whole-genome alignment data in multiple alignment format (MAF).
>
> *Arguments*:
> - `maf_dir`: The path to a directory of MAF files, each following the
naming scheme *chromosome_name.maf*.
> - `aligned_on`: The species that the chromosome names correspond to.
>
> ### Methods:
>
> `get(self, feat: Feature, match_strand: bool = True)`
>
> Finds an alignment for a feature.
>
> *Arguments:*
> - `feat`: The feature to get an alignment for.
> - `match_strand`: A `bool` expressing whether the alignment should match the
> feature's strand; if `False`, the alignment is mapped to the forward strand.
>
> *Returns:*
> - A `list` of `dict`s mapping species to `tuple`, where `tuple[0]` is a
> `Feature` instance describing the alignment's position and `tuple[1]` the
> aligned sequence.
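
As an illustration of working with this return shape, the sketch below joins a discontiguous alignment and scores it. The data are written by hand rather than produced by `MAF.get`; `Interval`, `join_blocks`, and `identity` are hypothetical helpers, the species names are placeholders, and the use of `-` for gaps follows the usual MAF convention rather than anything stated above. It also assumes every block contains both species.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical stand-in for Feature."""
    chrom: str
    start: int
    end: int
    strand: str


# Hand-written stand-in for the list returned by MAF.get: two alignment
# blocks, so the alignment is discontiguous.
alignments = [
    {
        "species_a": (Interval("chr1", 100, 104, "+"), "ACGT"),
        "species_b": (Interval("chr4", 900, 903, "+"), "A-GT"),
    },
    {
        "species_a": (Interval("chr1", 110, 114, "+"), "TTGA"),
        "species_b": (Interval("chr4", 950, 954, "+"), "TTGC"),
    },
]


def join_blocks(alignments, species):
    """Concatenate one species' aligned sequence across all blocks."""
    return "".join(block[species][1] for block in alignments)


def identity(seq_a, seq_b):
    """Fraction of gap-free aligned columns in which the two bases match."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if "-" not in (a, b)]
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0


seq_a = join_blocks(alignments, "species_a")
seq_b = join_blocks(alignments, "species_b")
score = identity(seq_a, seq_b)
```

Here six of the seven gap-free columns match, so `score` is 6/7.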
## Testing
To test feature resolution,
1. clone the repository with
`git clone https://github.com/fiszbein-lab/evopython`;
2. download the MAF files into their respective directories using the
provided FTP links;
3. generate random test features using the command-line script,
`python features.py --maf path/to/maf --aligned-on species_name`, where
`aligned_on` is the name of the species the file is indexed on
(see `tests/data/meta_data.csv`); and
4. run the test from the command line with
`python -m unittest tests/test_resolution.py`.
Raw data
{
"_id": null,
"home_page": "https://github.com/fiszbein-lab/evopython",
"name": "evopython",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.10.8",
"maintainer_email": "",
"keywords": "comparative genomics,evopython",
"author": "Steven Mick",
"author_email": "smick@bu.edu",
"download_url": "https://files.pythonhosted.org/packages/1f/16/8e73c773803c3b9c6e51d010bd1b178f88cd1d7923cc932c7739dd9b18c5/evopython-0.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "bsd-2-clause",
"summary": "Feature resolution from whole-genome alignment data.",
"version": "0.0.1",
"project_urls": {
"Homepage": "https://github.com/fiszbein-lab/evopython"
},
"split_keywords": [
"comparative genomics",
"evopython"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "1f168e73c773803c3b9c6e51d010bd1b178f88cd1d7923cc932c7739dd9b18c5",
"md5": "af035663ecc6cd96b668733bcd619213",
"sha256": "7018088669c82361e911dfea20634049d7a14e55aa6d880e0ecdb448381e375e"
},
"downloads": -1,
"filename": "evopython-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "af035663ecc6cd96b668733bcd619213",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10.8",
"size": 11992,
"upload_time": "2023-07-15T18:24:34",
"upload_time_iso_8601": "2023-07-15T18:24:34.639736Z",
"url": "https://files.pythonhosted.org/packages/1f/16/8e73c773803c3b9c6e51d010bd1b178f88cd1d7923cc932c7739dd9b18c5/evopython-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-15 18:24:34",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "fiszbein-lab",
"github_project": "evopython",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "evopython"
}