evopython


Name: evopython
Version: 0.0.1
Home page: https://github.com/fiszbein-lab/evopython
Summary: Feature resolution from whole-genome alignment data.
Upload time: 2023-07-15 18:24:34
Author: Steven Mick
Requires Python: >=3.10.8
License: bsd-2-clause
Keywords: comparative genomics, evopython
Requirements: No requirements were recorded.

# evopython
`evopython` is an object-oriented Python package designed for genome-scale
feature resolution from whole-genome alignment data.
- [Installation](#installation)
- [Usage](#usage)
- [Documentation](#documentation)
- [Testing](#testing)
---

## Installation
`evopython` depends on just 
[`biopython`](https://github.com/biopython/biopython) and can be installed with
```commandline
pip install evopython
```

## Usage
`evopython` parses genome annotation data and whole-genome alignment data, 
providing an interface for accessing the former in the context of the latter.
*Ensembl* is a great resource for both; whole-genome alignment 
[MAF](https://genome.ucsc.edu/FAQ/FAQformat.html#format5) files and gene
annotation [GTF](https://genome.ucsc.edu/FAQ/FAQformat.html#format4) files can
be downloaded from their FTP site, indexed 
[here](https://useast.ensembl.org/info/data/ftp/index.html).

`evopython` was designed to support a linear, progressive form of analysis, 
where
1. `GTF` or `BED` instances are initialized with their respective files;
2. these dictionary-like data structures are queried for instances of the
`Feature` class; and
3. these `Feature` instances are resolved from a pairwise or multiple 
whole-genome alignment, represented with the `MAF` class.

In general, we have analyses of the form:
```python
from evopython import GTF, MAF

genes = GTF("path/to/genes")
wga = MAF("path/to/wga", aligned_on="species_name")

for gene_name in genes:
    feat = genes[gene_name]['feat']
    alignments = wga.get(feat)
    
    if len(alignments) == 1:
        # The alignment is contiguous; do something.
        pass
    else:
        # The alignment is discontiguous; do something else.
        pass
```
We can parse any feature type in the GTF (see the `GTF` description 
below) and further generate derived, secondary features with the `Feature` 
class's `pad` method (see the `Feature` description below).

For specific usage examples, see the Jupyter notebooks in the **examples** 
directory.

## Documentation
### class `evopython.GTF(gtf: str, types: tuple = tuple())`
> A nested `dict` mapping gene name to feature name to a list of `Feature`
> instances; each high-level gene `dict` has two additional keys, `attr` and 
> `feat`, with the former mapping to a `dict` with annotated information such 
> as "gene_biotype" indicating whether the gene is protein-coding and the 
> latter mapping to the gene's `Feature` instance.
> 
> *Arguments*:
> - `gtf`: The GTF file path.
> - `types`: The feature types to parse; gene features are parsed by default.
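
The nested layout described above can be pictured with plain dictionaries. This is only a sketch of the assumed shape, not output from `evopython`: the gene name, coordinates, attribute values, the `"exon"` key, and the `FeatureStub` tuple are hypothetical stand-ins for a real parse and real `Feature` instances.

```python
from collections import namedtuple

# Hypothetical stand-in for evopython's Feature class (illustration only).
FeatureStub = namedtuple("FeatureStub", ["chrom", "start", "end", "strand"])

# Assumed shape after parsing with types=("exon",):
# gene name -> feature key -> list of features, plus "attr" and "feat".
genes = {
    "GENE_A": {
        "attr": {"gene_biotype": "protein_coding"},
        "feat": FeatureStub("1", 1000, 5000, "+"),
        "exon": [
            FeatureStub("1", 1000, 1200, "+"),
            FeatureStub("1", 4800, 5000, "+"),
        ],
    },
}

# Keep only protein-coding genes and collect their gene-level features.
coding = [
    entry["feat"]
    for entry in genes.values()
    if entry["attr"].get("gene_biotype") == "protein_coding"
]
```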
----
### class `evopython.BED(bed: str, on_name: bool = False)`
> A `dict` mapping locus `tuple` or name value to `Feature` instance; in the 
> former case, loci have the form `(seqname, start, end, strand)`.
> 
> *Arguments*:
> - `bed`: The BED file path.
> - `on_name`: A `bool` expressing whether name field values should be 
> used as keys to the features.
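
To make the two keying modes concrete, here is a minimal, standalone sketch of how BED lines could be turned into such a mapping. It is not `evopython`'s implementation: `parse_bed_lines` is a hypothetical helper, the values are plain tuples rather than `Feature` instances, and defaulting a missing strand to `+` is an assumption.

```python
def parse_bed_lines(lines, on_name=False):
    """Map locus tuples (or name values) to (chrom, start, end, strand)."""
    features = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional header lines BED files may carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else None
        # Assumption: treat a missing strand column as forward.
        strand = fields[5] if len(fields) > 5 else "+"
        locus = (chrom, start, end, strand)
        # Fall back to the locus key when no name is available.
        key = name if on_name and name is not None else locus
        features[key] = locus
    return features


bed_lines = [
    "chr1\t100\t200\tpeak_a\t0\t+",
    "chr2\t500\t650\tpeak_b\t0\t-",
]
by_locus = parse_bed_lines(bed_lines)
by_name = parse_bed_lines(bed_lines, on_name=True)
```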
----
### class `evopython.Feature`
> A stranded, genomic feature.
>
> ### Attributes:
> - `chrom`: The chromosome name.
> - `start`: The forward-mapped, 0-based, inclusive starting coordinate.
> - `end`: The forward-mapped, 0-based, exclusive ending coordinate.
> - `strand`: The strand, plus or minus for forward or reverse.
> ----
> ### Instance properties:
> - `is_forward`: A `bool` expressing forward strand orientation.
> - `is_reverse`: A `bool` expressing reverse strand orientation.
> ----
> ### Methods:
> 
> `locus(self, base: int = 0, strand: bool = False)`
> 
> Returns the locus in a generic genome browser format.
> 
> *Arguments*:
> - `base`: The coordinate system to use, 0 or 1, where the former is
> half-open on the end and the latter fully closed.
> - `strand`: A `bool` expressing whether to include the strand at the end
> of the locus.
>
> *Raises*:
> - `ValueError`: An invalid base was given.
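
The difference between the two coordinate systems is easy to get wrong, so here is a small standalone sketch of the conversion. The function `format_locus`, the `chrom:start-end` layout, and the `:strand` suffix are assumptions for illustration; the actual string returned by `locus` may differ.

```python
def format_locus(chrom, start, end, strand="+", base=0, with_strand=False):
    """Format a 0-based, half-open interval as 'chrom:start-end'."""
    if base not in (0, 1):
        raise ValueError(f"base must be 0 or 1, got {base!r}")
    # Moving from 0-based half-open to 1-based closed shifts only the start;
    # the exclusive 0-based end equals the inclusive 1-based end.
    shown_start = start + base
    text = f"{chrom}:{shown_start}-{end}"
    return f"{text}:{strand}" if with_strand else text


zero_based = format_locus("chr1", 99, 200)
one_based = format_locus("chr1", 99, 200, base=1, with_strand=True)
```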
> ----
> `pad(self, pad5: int, pad3: int, center: int = 0)`
> 
> Pads the feature.
> 
> Positive padding is tantamount to feature extension and negative 
> padding to feature shrinkage; with centering, both can be used to 
> derive features that do not overlap the source feature.
> 
> *Arguments*:
> - `pad5`: The number of bases to add to the 5'-end of the feature.
> - `pad3`: The number of bases to add to the 3'-end of the feature.
> - `center`: 5, 3, or 0, indicating whether and how to center the padding: 
> passing 5 prompts 5'-centering, such that padding is applied on the 5' 
> coordinate; 3 likewise prompts 3'-centering; and 0, the default, prompts no 
> centering, such that the whole feature is padded.
> 
> *Returns:*
> - A new, padded `Feature` instance.
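
One way to read the padding and centering rules is sketched below on bare coordinates. This is an interpretation of the description above, not `evopython`'s code: `pad_interval` is a hypothetical helper, it assumes 0-based half-open coordinates, and it assumes that centering anchors both pads on a single boundary of the feature.

```python
def pad_interval(start, end, strand, pad5, pad3, center=0):
    """Return the padded (start, end) of a 0-based, half-open interval."""
    forward = strand == "+"
    if center == 0:
        # No centering: extend (or shrink) both ends of the whole feature.
        left, right = (pad5, pad3) if forward else (pad3, pad5)
        return start - left, end + right
    if center == 5:
        # Anchor on the 5' boundary: start if forward, end if reverse.
        anchor = start if forward else end
    elif center == 3:
        # Anchor on the 3' boundary: end if forward, start if reverse.
        anchor = end if forward else start
    else:
        raise ValueError("center must be 0, 5, or 3")
    # pad5 reaches upstream of the anchor, pad3 downstream, along the strand.
    if forward:
        return anchor - pad5, anchor + pad3
    return anchor - pad3, anchor + pad5


whole = pad_interval(1000, 2000, "+", 100, 50)
upstream = pad_interval(1000, 2000, "+", 500, -100, center=5)
reverse_upstream = pad_interval(1000, 2000, "-", 500, -100, center=5)
```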
----
### class `evopython.MAF`
> A resolver for multiple alignment formatted whole-genome alignment data.
>
> *Arguments*:
> - `maf_dir`: The path to a directory of MAF files, each following the 
> naming scheme *chromosome_name.maf*.
> - `aligned_on`: The species that the chromosome names correspond to.
>
> ### Methods:
> 
> `get(self, feat: Feature, match_strand: bool = True)`
> 
> Finds an alignment for a feature.
> 
> *Arguments:*
> - `feat`: The feature to get an alignment for.
> - `match_strand`: A `bool` expressing whether the alignment should match the 
> feature's strand; if `False`, the alignment is mapped to the forward strand.
>
> *Returns:*
> - A `list` of `dict`s mapping species to `tuple`, where `tuple[0]` is a 
> `Feature` instance describing the alignment's position and `tuple[1]` is the 
> aligned sequence.
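
As an illustration of consuming this return shape, the sketch below computes percent identity between two species across all returned blocks. The helper `percent_identity` and the sample data are hypothetical; the positions are plain tuples standing in for `Feature` instances, and the sketch assumes each aligned sequence behaves like a string with `-` marking gaps.

```python
def percent_identity(alignments, species_a, species_b):
    """Percent of gap-free aligned columns where the two species agree."""
    matches = compared = 0
    for block in alignments:
        # A species may be missing from some blocks; skip those.
        if species_a not in block or species_b not in block:
            continue
        seq_a, seq_b = block[species_a][1], block[species_b][1]
        for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
            if base_a == "-" or base_b == "-":
                continue
            compared += 1
            matches += base_a == base_b
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical stand-in data: positions are plain tuples, not Feature objects.
alignments = [
    {"human": (("1", 0, 4, "+"), "ACGT"), "mouse": (("3", 10, 14, "+"), "ACGA")},
    {"human": (("1", 9, 12, "+"), "GG-C"), "mouse": (("3", 20, 24, "+"), "GGTC")},
]
identity = percent_identity(alignments, "human", "mouse")
```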

## Testing

To test feature resolution,
1. clone the repository with
`git clone https://github.com/fiszbein-lab/evopython`;
2. download the MAF files into their respective directories using the 
provided FTP links;
3. generate random test features using the command-line script, 
`python features.py --maf path/to/maf --aligned-on species_name`, where 
the `--aligned-on` value is the name of the species the file is indexed on 
(see `tests/data/meta_data.csv`); and
4. run the test from the command line with 
`python -m unittest tests/test_resolution.py`.

            
