[![Tests](https://github.com/openvax/pyensembl/actions/workflows/tests.yml/badge.svg)](https://github.com/openvax/pyensembl/actions/workflows/tests.yml)
[![Coverage Status](https://coveralls.io/repos/github/openvax/pyensembl/badge.svg?branch=main)](https://coveralls.io/github/openvax/pyensembl?branch=main)
<a href="https://pypi.python.org/pypi/pyensembl/">
<img src="https://img.shields.io/pypi/v/pyensembl.svg?maxAge=1000" alt="PyPI" />
</a>
# PyEnsembl
PyEnsembl is a Python interface to [Ensembl](http://www.ensembl.org) reference genome metadata such as exons and transcripts. PyEnsembl downloads [GTF](https://en.wikipedia.org/wiki/Gene_transfer_format) and [FASTA](https://en.wikipedia.org/wiki/FASTA_format) files from the [Ensembl FTP server](ftp://ftp.ensembl.org) and loads them into a local database. PyEnsembl can also work with custom reference data specified using user-supplied GTF and FASTA files.
# Example Usage
```python
from pyensembl import EnsemblRelease
# release 77 uses human reference genome GRCh38
data = EnsemblRelease(77)
# will return ['HLA-A']
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
# get all exons associated with HLA-A
exon_ids = data.exon_ids_of_gene_name('HLA-A')
```
# Installation
You can install PyEnsembl using [pip](https://pip.pypa.io/en/latest/quickstart.html):
```sh
pip install pyensembl
```
This should also install any required packages such as [datacache](https://github.com/openvax/datacache).
Before using PyEnsembl, run the following command to download and install
Ensembl data:
```
pyensembl install --release <list of Ensembl release numbers> --species <species-name>
```
For example, `pyensembl install --release 75 76 --species human` will download and install all
human reference data from Ensembl releases 75 and 76.
Alternatively, you can create the `EnsemblRelease` object from inside a Python
process and call `ensembl_object.download()` followed by `ensembl_object.index()`.
## Cache Location
By default, PyEnsembl uses the platform-specific `Cache` folder
and caches the files into the `pyensembl` sub-directory.
You can override this default by setting the environment key `PYENSEMBL_CACHE_DIR`
as your preferred location for caching:
```sh
export PYENSEMBL_CACHE_DIR=/custom/cache/dir
```
or
```python
import os
os.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'
# ... PyEnsembl API usage
```
# Usage tips
## List installed genomes
To see the genomes for which PyEnsembl has already downloaded and indexed metadata you can run:
```sh
pyensembl list
```
Or equivalently do this in Python:
```python
from pyensembl.shell import collect_all_installed_ensembl_releases
collect_all_installed_ensembl_releases()
```
## Load genome in Python
Here's an example Python snippet that loads fly genome data from Ensembl release v100:
```python
from pyensembl import EnsemblRelease
data = EnsemblRelease(release=100, species='drosophila_melanogaster')
```
## Data structures
### Gene
```python
gene = genome.gene_by_id(gene_id='FBgn0011747')
```
### Transcript
```python
transcript = gene.transcripts[0]
```
### Protein information
```python
transcript.protein_id
transcript.protein_sequence
```
# Non-Ensembl Data
PyEnsembl also allows arbitrary genomes via the specification
of local file paths or remote URLs to both Ensembl and non-Ensembl GTF
and FASTA files. (Warning: GTF formats can vary, and handling of
non-Ensembl data is still very much in development.)
For example:
```python
from pyensembl import Genome
data = Genome(
reference_name='GRCh38',
annotation_name='my_genome_features',
# annotation_version=None,
gtf_path_or_url='/My/local/gtf/path_to_my_genome_features.gtf', # Path or URL of GTF file
# transcript_fasta_paths_or_urls=None, # List of paths or URLs of FASTA files containing transcript sequences
# protein_fasta_paths_or_urls=None, # List of paths or URLs of FASTA files containing protein sequences
# cache_directory_path=None, # Where to place downloaded and cached files for this genome
)
# parse GTF and construct database of genomic features
data.index()
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
```
# API
The `EnsemblRelease` object has methods to let you access all possible
combinations of the annotation features _gene_name_, _gene_id_,
_transcript_name_, _transcript_id_, _exon_id_ as well as the location of
these genomic elements (contig, start position, end position, strand).
## Genes
<dl>
<dt>genes(contig=None, strand=None)</dt>
<dd>Returns a list of Gene objects, optionally restricted to a particular contig
or strand.</dd>
<dt>genes_at_locus(contig, position, end=None, strand=None)</dt>
<dd>Returns a list of Gene objects overlapping a particular position on a contig,
optionally extend into a range with the end parameter and restrict to
forward or backward strand by passing strand='+' or strand='-'.</dd>
<dt>gene_by_id(gene_id)</dt>
<dd>Return a Gene object for given Ensembl gene ID (e.g. "ENSG00000068793").</dd>
<dt>gene_names(contig=None, strand=None)</dt>
<dd>Returns all gene names in the annotation database, optionally restricted
to a particular contig or strand.</dd>
<dt>genes_by_name(gene_name)</dt>
<dd>Get all the unqiue genes with the given name (there might be multiple
due to copies in the genome), return a list containing a Gene object for each
distinct ID.</dd>
<dt>gene_by_protein_id(protein_id)</dt>
<dd>Find Gene associated with the given Ensembl protein ID (e.g. "ENSP00000350283")</dd>
<dt>gene_names_at_locus(contig, position, end=None, strand=None)
</dt>
<dd>Names of genes overlapping with the given locus, optionally restricted by strand.
(returns a list to account for overlapping genes)</dd>
<dt>gene_name_of_gene_id(gene_id)
</dt>
<dd>Returns name of gene with given genen ID.</dd>
<dt>gene_name_of_transcript_id(transcript_id)
</dt><dd>Returns name of gene associated with given transcript ID.</dd>
<dt>gene_name_of_transcript_name(transcript_name)
</dt>
<dd>Returns name of gene associated with given transcript name.</dd>
<dt>gene_name_of_exon_id(exon_id)
</dt><dd>Returns name of gene associated with given exon ID.</dd>
<dt>gene_ids(contig=None, strand=None)
</dt>
<dd>Return all gene IDs in the annotation database, optionally restricted by
chromosome name or strand.</dd>
<dt>gene_ids_of_gene_name(gene_name)
</dt>
<dd>Returns all Ensembl gene IDs with the given name.</dd>
</dl>
## Transcripts
<dl>
<dt>transcripts(contig=None, strand=None)</dt>
<dd>Returns a list of Transcript objects for all transcript entries in the
Ensembl database, optionally restricted to a particular contig or strand.</dd>
<dt>transcript_by_id(transcript_id)</dt>
<dd>Construct a Transcript object for given Ensembl transcript ID (e.g. "ENST00000369985")</dd>
<dt>transcripts_by_name(transcript_name)</dt>
<dd>Returns a list of Transcript objects for every transcript matching the given name.</dd>
<dt>transcript_names(contig=None, strand=None)</dt>
<dd>Returns all transcript names in the annotation database.</dd>
<dt>transcript_ids(contig=None, strand=None)</dt>
<dd>Returns all transcript IDs in the annotation database.</dd>
<dt>transcript_ids_of_gene_id(gene_id)</dt>
<dd>Return IDs of all transcripts associated with given gene ID.</dd>
<dt>transcript_ids_of_gene_name(gene_name)</dt>
<dd>Return IDs of all transcripts associated with given gene name.</dd>
<dt>transcript_ids_of_transcript_name(transcript_name)</dt>
<dd>Find all Ensembl transcript IDs with the given name.</dd>
<dt>transcript_ids_of_exon_id(exon_id)</dt>
<dd>Return IDs of all transcripts associatd with given exon ID.</dd>
</dl>
## Exons
<dl>
<dt>exon_ids(contig=None, strand=None)</dt>
<dd>Returns a list of exons IDs in the annotation database, optionally restricted
by the given chromosome and strand.</dd>
<dt>exon_by_id(exon_id)</dt>
<dd>Construct an Exon object for given Ensembl exon ID (e.g. "ENSE00001209410")</dd>
<dt>exon_ids_of_gene_id(gene_id)</dt>
<dd>Returns a list of exon IDs associated with a given gene ID.</dd>
<dt>exon_ids_of_gene_name(gene_name)</dt>
<dd>Returns a list of exon IDs associated with a given gene name.</dd>
<dt>exon_ids_of_transcript_id(transcript_id)</dt>
<dd>Returns a list of exon IDs associated with a given transcript ID.</dd>
<dt>exon_ids_of_transcript_name(transcript_name)</dt>
<dd>Returns a list of exon IDs associated with a given transcript name.</dd>
</dl>
Raw data
{
"_id": null,
"home_page": "https://github.com/openvax/pyensembl",
"name": "pyensembl",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "Alex Rubinsteyn",
"author_email": "alex.rubinsteyn@unc.edu",
"download_url": "https://files.pythonhosted.org/packages/85/94/253fa59f1f9e7fe3b7d34818fd4fe08f857da241234bf18c6ee1b9496afe/pyensembl-2.3.13.tar.gz",
"platform": null,
"description": "[![Tests](https://github.com/openvax/pyensembl/actions/workflows/tests.yml/badge.svg)](https://github.com/openvax/pyensembl/actions/workflows/tests.yml)\n[![Coverage Status](https://coveralls.io/repos/github/openvax/pyensembl/badge.svg?branch=main)](https://coveralls.io/github/openvax/pyensembl?branch=main)\n<a href=\"https://pypi.python.org/pypi/pyensembl/\">\n<img src=\"https://img.shields.io/pypi/v/pyensembl.svg?maxAge=1000\" alt=\"PyPI\" />\n</a>\n\n# PyEnsembl\n\nPyEnsembl is a Python interface to [Ensembl](http://www.ensembl.org) reference genome metadata such as exons and transcripts. PyEnsembl downloads [GTF](https://en.wikipedia.org/wiki/Gene_transfer_format) and [FASTA](https://en.wikipedia.org/wiki/FASTA_format) files from the [Ensembl FTP server](ftp://ftp.ensembl.org) and loads them into a local database. PyEnsembl can also work with custom reference data specified using user-supplied GTF and FASTA files.\n\n# Example Usage\n\n```python\nfrom pyensembl import EnsemblRelease\n\n# release 77 uses human reference genome GRCh38\ndata = EnsemblRelease(77)\n\n# will return ['HLA-A']\ngene_names = data.gene_names_at_locus(contig=6, position=29945884)\n\n# get all exons associated with HLA-A\nexon_ids = data.exon_ids_of_gene_name('HLA-A')\n```\n\n# Installation\n\nYou can install PyEnsembl using [pip](https://pip.pypa.io/en/latest/quickstart.html):\n\n```sh\npip install pyensembl\n```\n\nThis should also install any required packages such as [datacache](https://github.com/openvax/datacache).\n\nBefore using PyEnsembl, run the following command to download and install\nEnsembl data:\n\n```\npyensembl install --release <list of Ensembl release numbers> --species <species-name>\n```\n\nFor example, `pyensembl install --release 75 76 --species human` will download and install all\nhuman reference data from Ensembl releases 75 and 76.\n\nAlternatively, you can create the `EnsemblRelease` object from inside a Python\nprocess and call `ensembl_object.download()` followed by `ensembl_object.index()`.\n\n## Cache Location\n\nBy default, PyEnsembl uses the platform-specific `Cache` folder\nand caches the files into the `pyensembl` sub-directory.\nYou can override this default by setting the environment key `PYENSEMBL_CACHE_DIR`\nas your preferred location for caching:\n\n```sh\nexport PYENSEMBL_CACHE_DIR=/custom/cache/dir\n```\n\nor\n\n```python\nimport os\n\nos.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'\n# ... PyEnsembl API usage\n```\n\n# Usage tips\n\n## List installed genomes\n\nTo see the genomes for which PyEnsembl has already downloaded and indexed metadata you can run:\n\n```sh\npyensembl list\n```\n\nOr equivalently do this in Python:\n\n```python\nfrom pyensembl.shell import collect_all_installed_ensembl_releases\ncollect_all_installed_ensembl_releases()\n```\n\n## Load genome in Python\n\nHere's an example Python snippet that loads fly genome data from Ensembl release v100:\n\n```python\nfrom pyensembl import EnsemblRelease\ndata = EnsemblRelease(release=100, species='drosophila_melanogaster')\n```\n\n## Data structures\n\n### Gene\n\n```python\ngene = genome.gene_by_id(gene_id='FBgn0011747')\n```\n\n### Transcript\n\n```python\ntranscript = gene.transcripts[0]\n```\n\n### Protein information\n\n```python\ntranscript.protein_id\ntranscript.protein_sequence\n```\n\n# Non-Ensembl Data\n\nPyEnsembl also allows arbitrary genomes via the specification\nof local file paths or remote URLs to both Ensembl and non-Ensembl GTF\nand FASTA files. (Warning: GTF formats can vary, and handling of\nnon-Ensembl data is still very much in development.)\n\nFor example:\n\n```python\nfrom pyensembl import Genome\ndata = Genome(\n reference_name='GRCh38',\n annotation_name='my_genome_features',\n # annotation_version=None,\n gtf_path_or_url='/My/local/gtf/path_to_my_genome_features.gtf', # Path or URL of GTF file\n # transcript_fasta_paths_or_urls=None, # List of paths or URLs of FASTA files containing transcript sequences\n # protein_fasta_paths_or_urls=None, # List of paths or URLs of FASTA files containing protein sequences\n # cache_directory_path=None, # Where to place downloaded and cached files for this genome\n)\n# parse GTF and construct database of genomic features\ndata.index()\ngene_names = data.gene_names_at_locus(contig=6, position=29945884)\n```\n\n# API\n\nThe `EnsemblRelease` object has methods to let you access all possible\ncombinations of the annotation features _gene_name_, _gene_id_,\n_transcript_name_, _transcript_id_, _exon_id_ as well as the location of\nthese genomic elements (contig, start position, end position, strand).\n\n## Genes\n\n<dl>\n<dt>genes(contig=None, strand=None)</dt>\n<dd>Returns a list of Gene objects, optionally restricted to a particular contig\nor strand.</dd>\n\n<dt>genes_at_locus(contig, position, end=None, strand=None)</dt>\n<dd>Returns a list of Gene objects overlapping a particular position on a contig,\noptionally extend into a range with the end parameter and restrict to\nforward or backward strand by passing strand='+' or strand='-'.</dd>\n\n<dt>gene_by_id(gene_id)</dt>\n<dd>Return a Gene object for given Ensembl gene ID (e.g. \"ENSG00000068793\").</dd>\n\n<dt>gene_names(contig=None, strand=None)</dt>\n<dd>Returns all gene names in the annotation database, optionally restricted\nto a particular contig or strand.</dd>\n\n<dt>genes_by_name(gene_name)</dt>\n<dd>Get all the unqiue genes with the given name (there might be multiple\ndue to copies in the genome), return a list containing a Gene object for each\ndistinct ID.</dd>\n\n<dt>gene_by_protein_id(protein_id)</dt>\n<dd>Find Gene associated with the given Ensembl protein ID (e.g. \"ENSP00000350283\")</dd>\n\n<dt>gene_names_at_locus(contig, position, end=None, strand=None)\n</dt>\n<dd>Names of genes overlapping with the given locus, optionally restricted by strand.\n(returns a list to account for overlapping genes)</dd>\n\n<dt>gene_name_of_gene_id(gene_id)\n</dt>\n<dd>Returns name of gene with given genen ID.</dd>\n\n<dt>gene_name_of_transcript_id(transcript_id)\n</dt><dd>Returns name of gene associated with given transcript ID.</dd>\n\n<dt>gene_name_of_transcript_name(transcript_name)\n</dt>\n<dd>Returns name of gene associated with given transcript name.</dd>\n\n<dt>gene_name_of_exon_id(exon_id)\n</dt><dd>Returns name of gene associated with given exon ID.</dd>\n\n<dt>gene_ids(contig=None, strand=None)\n</dt>\n<dd>Return all gene IDs in the annotation database, optionally restricted by\nchromosome name or strand.</dd>\n\n<dt>gene_ids_of_gene_name(gene_name)\n</dt>\n<dd>Returns all Ensembl gene IDs with the given name.</dd>\n\n</dl>\n\n## Transcripts\n\n<dl>\n<dt>transcripts(contig=None, strand=None)</dt>\n<dd>Returns a list of Transcript objects for all transcript entries in the\nEnsembl database, optionally restricted to a particular contig or strand.</dd>\n\n<dt>transcript_by_id(transcript_id)</dt>\n<dd>Construct a Transcript object for given Ensembl transcript ID (e.g. \"ENST00000369985\")</dd>\n\n<dt>transcripts_by_name(transcript_name)</dt>\n<dd>Returns a list of Transcript objects for every transcript matching the given name.</dd>\n\n<dt>transcript_names(contig=None, strand=None)</dt>\n<dd>Returns all transcript names in the annotation database.</dd>\n\n<dt>transcript_ids(contig=None, strand=None)</dt>\n<dd>Returns all transcript IDs in the annotation database.</dd>\n\n<dt>transcript_ids_of_gene_id(gene_id)</dt>\n<dd>Return IDs of all transcripts associated with given gene ID.</dd>\n\n<dt>transcript_ids_of_gene_name(gene_name)</dt>\n<dd>Return IDs of all transcripts associated with given gene name.</dd>\n\n<dt>transcript_ids_of_transcript_name(transcript_name)</dt>\n<dd>Find all Ensembl transcript IDs with the given name.</dd>\n\n<dt>transcript_ids_of_exon_id(exon_id)</dt>\n<dd>Return IDs of all transcripts associatd with given exon ID.</dd>\n</dl>\n\n## Exons\n\n<dl>\n<dt>exon_ids(contig=None, strand=None)</dt>\n<dd>Returns a list of exons IDs in the annotation database, optionally restricted\nby the given chromosome and strand.</dd>\n\n<dt>exon_by_id(exon_id)</dt>\n<dd>Construct an Exon object for given Ensembl exon ID (e.g. \"ENSE00001209410\")</dd>\n\n<dt>exon_ids_of_gene_id(gene_id)</dt>\n<dd>Returns a list of exon IDs associated with a given gene ID.</dd>\n\n<dt>exon_ids_of_gene_name(gene_name)</dt>\n<dd>Returns a list of exon IDs associated with a given gene name.</dd>\n\n<dt>exon_ids_of_transcript_id(transcript_id)</dt>\n<dd>Returns a list of exon IDs associated with a given transcript ID.</dd>\n\n<dt>exon_ids_of_transcript_name(transcript_name)</dt>\n<dd>Returns a list of exon IDs associated with a given transcript name.</dd>\n</dl>\n",
"bugtrack_url": null,
"license": "http://www.apache.org/licenses/LICENSE-2.0.html",
"summary": "Python interface to Ensembl reference genome metadata",
"version": "2.3.13",
"project_urls": {
"Homepage": "https://github.com/openvax/pyensembl"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "774fd8ea7378ffd03a7e71f6e750cb5fa7e8721a0e30f04af8ae60e704d59e96",
"md5": "de7c2aa968be5061ec60c0755ffadc60",
"sha256": "46989e4eb3c3436ac2a8b02f17e999439c04ca2db0926ce6d5535a9a0a00bfce"
},
"downloads": -1,
"filename": "pyensembl-2.3.13-py3-none-any.whl",
"has_sig": false,
"md5_digest": "de7c2aa968be5061ec60c0755ffadc60",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 55952,
"upload_time": "2024-04-25T13:45:24",
"upload_time_iso_8601": "2024-04-25T13:45:24.508350Z",
"url": "https://files.pythonhosted.org/packages/77/4f/d8ea7378ffd03a7e71f6e750cb5fa7e8721a0e30f04af8ae60e704d59e96/pyensembl-2.3.13-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "8594253fa59f1f9e7fe3b7d34818fd4fe08f857da241234bf18c6ee1b9496afe",
"md5": "2aa1e29534735cabbb68ec0409e8684f",
"sha256": "c70ce760f68fe2a6be871db44e53ce1d4d1227f2ce0578c6b291d5a89f5d1832"
},
"downloads": -1,
"filename": "pyensembl-2.3.13.tar.gz",
"has_sig": false,
"md5_digest": "2aa1e29534735cabbb68ec0409e8684f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 60784,
"upload_time": "2024-04-25T13:45:31",
"upload_time_iso_8601": "2024-04-25T13:45:31.165099Z",
"url": "https://files.pythonhosted.org/packages/85/94/253fa59f1f9e7fe3b7d34818fd4fe08f857da241234bf18c6ee1b9496afe/pyensembl-2.3.13.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-25 13:45:31",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "openvax",
"github_project": "pyensembl",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "pyensembl"
}