pyensembl


| Field | Value |
| --- | --- |
| Name | pyensembl |
| Version | 2.3.13 |
| Home page | https://github.com/openvax/pyensembl |
| Summary | Python interface to Ensembl reference genome metadata |
| Upload time | 2024-04-25 13:45:31 |
| Author | Alex Rubinsteyn |
| License | http://www.apache.org/licenses/LICENSE-2.0.html |
| Requirements | No requirements were recorded. |

[![Tests](https://github.com/openvax/pyensembl/actions/workflows/tests.yml/badge.svg)](https://github.com/openvax/pyensembl/actions/workflows/tests.yml)
[![Coverage Status](https://coveralls.io/repos/github/openvax/pyensembl/badge.svg?branch=main)](https://coveralls.io/github/openvax/pyensembl?branch=main)
<a href="https://pypi.python.org/pypi/pyensembl/">
<img src="https://img.shields.io/pypi/v/pyensembl.svg?maxAge=1000" alt="PyPI" />
</a>

# PyEnsembl

PyEnsembl is a Python interface to [Ensembl](http://www.ensembl.org) reference genome metadata such as exons and transcripts. PyEnsembl downloads [GTF](https://en.wikipedia.org/wiki/Gene_transfer_format) and [FASTA](https://en.wikipedia.org/wiki/FASTA_format) files from the [Ensembl FTP server](ftp://ftp.ensembl.org) and loads them into a local database. PyEnsembl can also work with custom reference data specified using user-supplied GTF and FASTA files.

# Example Usage

```python
from pyensembl import EnsemblRelease

# release 77 uses human reference genome GRCh38
data = EnsemblRelease(77)

# will return ['HLA-A']
gene_names = data.gene_names_at_locus(contig=6, position=29945884)

# get all exons associated with HLA-A
exon_ids = data.exon_ids_of_gene_name('HLA-A')
```

# Installation

You can install PyEnsembl using [pip](https://pip.pypa.io/en/latest/quickstart.html):

```sh
pip install pyensembl
```

This should also install any required packages such as [datacache](https://github.com/openvax/datacache).

Before using PyEnsembl, run the following command to download and install
Ensembl data:

```sh
pyensembl install --release <list of Ensembl release numbers> --species <species-name>
```

For example, `pyensembl install --release 75 76 --species human` will download and install all
human reference data from Ensembl releases 75 and 76.
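
If the install step has to be driven from a script, the command line above can be assembled programmatically. The following is a minimal sketch that only builds the argument list; `build_install_command` is a hypothetical helper name, not part of PyEnsembl, and it uses only the `--release` and `--species` flags shown above.

```python
import shlex


def build_install_command(releases, species):
    """Return the argv list for a `pyensembl install` call (hypothetical helper)."""
    if not releases:
        raise ValueError("at least one Ensembl release number is required")
    args = ["pyensembl", "install", "--release"]
    # Release numbers are passed as separate, space-delimited arguments.
    args += [str(int(release)) for release in releases]
    args += ["--species", species]
    return args


command = build_install_command([75, 76], "human")
# Show the command as it would be typed in a shell.
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once PyEnsembl is installed and a network connection is available.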

Alternatively, you can create the `EnsemblRelease` object from inside a Python
process and call `ensembl_object.download()` followed by `ensembl_object.index()`.

## Cache Location

By default, PyEnsembl stores downloaded files in a `pyensembl` sub-directory
of the platform-specific `Cache` folder.
You can override this default by setting the `PYENSEMBL_CACHE_DIR` environment
variable to your preferred cache location:

```sh
export PYENSEMBL_CACHE_DIR=/custom/cache/dir
```

or

```python
import os

os.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'
# ... PyEnsembl API usage
```
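
The general pattern behind such an override, an environment variable with a fallback, can be sketched with the standard library alone. This is an illustration under stated assumptions, not PyEnsembl's own lookup code: `resolve_cache_dir` and its `fallback_base` argument are hypothetical, and whether PyEnsembl creates further sub-directories beneath the override is not covered here.

```python
import os
from pathlib import Path


def resolve_cache_dir(fallback_base, environ=os.environ):
    """Pick a cache directory: the override if set, else <fallback_base>/pyensembl.

    Hypothetical helper for illustration; PyEnsembl's actual layout may differ.
    """
    override = environ.get("PYENSEMBL_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path(fallback_base).expanduser() / "pyensembl"


# With the variable set, the override wins.
print(resolve_cache_dir("/tmp/example-cache", {"PYENSEMBL_CACHE_DIR": "/custom/cache/dir"}))
# Without it, the fallback base plus the `pyensembl` sub-directory is used.
print(resolve_cache_dir("/tmp/example-cache", {}))
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.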

# Usage tips

## List installed genomes

To see the genomes for which PyEnsembl has already downloaded and indexed metadata, run:

```sh
pyensembl list
```

Or equivalently do this in Python:

```python
from pyensembl.shell import collect_all_installed_ensembl_releases
collect_all_installed_ensembl_releases()
```

## Load genome in Python

Here's an example Python snippet that loads fly genome data from Ensembl release 100:

```python
from pyensembl import EnsemblRelease
data = EnsemblRelease(release=100, species='drosophila_melanogaster')
```

## Data structures

### Gene

```python
# `data` is the EnsemblRelease object created in "Load genome in Python"
gene = data.gene_by_id(gene_id='FBgn0011747')
```

### Transcript

```python
transcript = gene.transcripts[0]
```

### Protein information

```python
transcript.protein_id
transcript.protein_sequence
```

# Non-Ensembl Data

PyEnsembl can also load arbitrary genomes from local file paths or remote
URLs that point to Ensembl or non-Ensembl GTF and FASTA files.
(Warning: GTF formats can vary, and handling of
non-Ensembl data is still very much in development.)

For example:

```python
from pyensembl import Genome
data = Genome(
    reference_name='GRCh38',
    annotation_name='my_genome_features',
    # annotation_version=None,
    gtf_path_or_url='/My/local/gtf/path_to_my_genome_features.gtf', # Path or URL of GTF file
    # transcript_fasta_paths_or_urls=None, # List of paths or URLs of FASTA files containing transcript sequences
    # protein_fasta_paths_or_urls=None, # List of paths or URLs of FASTA files containing protein sequences
    # cache_directory_path=None, # Where to place downloaded and cached files for this genome
)
# parse GTF and construct database of genomic features
data.index()
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
```
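
To make the warning about varying GTF formats concrete, here is a minimal sketch of reading one GTF record with the standard library. It is not PyEnsembl's parser: `parse_gtf_line` is a hypothetical helper, the sample line uses made-up identifiers, and the attribute handling assumes values never contain a semicolon.

```python
import re

# The nine tab-separated GTF columns, in order.
GTF_COLUMNS = (
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
)
# One attribute looks like: key "value"   (quotes are optional in some files)
_ATTRIBUTE = re.compile(r'\s*(\S+)\s+"?([^"]*?)"?\s*')


def parse_gtf_line(line):
    """Parse a single GTF data line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    # Attributes are separated by semicolons; empty chunks are skipped.
    for chunk in record["attribute"].split(";"):
        match = _ATTRIBUTE.fullmatch(chunk)
        if match:
            attributes[match.group(1)] = match.group(2)
    record["attribute"] = attributes
    return record


# Made-up example record, for illustration only.
sample = '6\texample\tgene\t100\t200\t.\t+\t.\tgene_id "GENE0001"; gene_name "EXAMPLE";'
print(parse_gtf_line(sample)["attribute"])
```

Files from different sources may name or quote their attributes differently, which is one reason a feature that loads from one GTF can be missing when loaded from another.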

# API

The `EnsemblRelease` object has methods to let you access all possible
combinations of the annotation features _gene_name_, _gene_id_,
_transcript_name_, _transcript_id_, _exon_id_ as well as the location of
these genomic elements (contig, start position, end position, strand).

## Genes

<dl>
<dt>genes(contig=None, strand=None)</dt>
<dd>Returns a list of Gene objects, optionally restricted to a particular contig
or strand.</dd>

<dt>genes_at_locus(contig, position, end=None, strand=None)</dt>
<dd>Returns a list of Gene objects overlapping a particular position on a contig.
Optionally, pass end to extend the query into a range, and pass strand='+' or
strand='-' to restrict results to the forward or backward strand.</dd>

<dt>gene_by_id(gene_id)</dt>
<dd>Returns a Gene object for the given Ensembl gene ID (e.g. "ENSG00000068793").</dd>

<dt>gene_names(contig=None, strand=None)</dt>
<dd>Returns all gene names in the annotation database, optionally restricted
to a particular contig or strand.</dd>

<dt>genes_by_name(gene_name)</dt>
<dd>Returns a list containing one Gene object for each distinct gene ID with
the given name (there may be several because of copies in the genome).</dd>

<dt>gene_by_protein_id(protein_id)</dt>
<dd>Finds the Gene associated with the given Ensembl protein ID (e.g. "ENSP00000350283").</dd>

<dt>gene_names_at_locus(contig, position, end=None, strand=None)
</dt>
<dd>Returns the names of genes overlapping the given locus, optionally restricted by strand.
The result is a list because several genes can overlap the same locus.</dd>

<dt>gene_name_of_gene_id(gene_id)
</dt>
<dd>Returns name of gene with given gene ID.</dd>

<dt>gene_name_of_transcript_id(transcript_id)
</dt><dd>Returns name of gene associated with given transcript ID.</dd>

<dt>gene_name_of_transcript_name(transcript_name)
</dt>
<dd>Returns name of gene associated with given transcript name.</dd>

<dt>gene_name_of_exon_id(exon_id)
</dt><dd>Returns name of gene associated with given exon ID.</dd>

<dt>gene_ids(contig=None, strand=None)
</dt>
<dd>Return all gene IDs in the annotation database, optionally restricted by
chromosome name or strand.</dd>

<dt>gene_ids_of_gene_name(gene_name)
</dt>
<dd>Returns all Ensembl gene IDs with the given name.</dd>

</dl>
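
The locus queries above (`genes_at_locus`, `gene_names_at_locus`) boil down to an interval-overlap test with optional filters. The sketch below illustrates that logic on a toy in-memory list; it is not how PyEnsembl answers these queries. `ToyGene`, `TOY_GENES` and `toy_gene_names_at_locus` are hypothetical names, the records are made up, and inclusive start and end coordinates are an assumption.

```python
from collections import namedtuple

# Toy stand-in for a gene record; all values below are made up.
ToyGene = namedtuple("ToyGene", "name contig start end strand")

TOY_GENES = [
    ToyGene("GENE_A", "6", 100, 200, "+"),
    ToyGene("GENE_B", "6", 150, 300, "-"),
    ToyGene("GENE_C", "7", 100, 200, "+"),
]


def toy_gene_names_at_locus(genes, contig, position, end=None, strand=None):
    """Return names of toy genes overlapping [position, end] on a contig."""
    if end is None:
        end = position  # a single position is a range of length one
    contig = str(contig)  # accept contig=6 as well as contig="6"
    names = []
    for gene in genes:
        if gene.contig != contig:
            continue
        if strand is not None and gene.strand != strand:
            continue
        # Two inclusive intervals overlap when each starts before the other ends.
        if gene.start <= end and gene.end >= position:
            names.append(gene.name)
    return names


print(toy_gene_names_at_locus(TOY_GENES, contig=6, position=175))
print(toy_gene_names_at_locus(TOY_GENES, contig=6, position=175, strand="-"))
```

The first call returns both overlapping genes, which shows why these methods return a list rather than a single value.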

## Transcripts

<dl>
<dt>transcripts(contig=None, strand=None)</dt>
<dd>Returns a list of Transcript objects for all transcript entries in the
Ensembl database, optionally restricted to a particular contig or strand.</dd>

<dt>transcript_by_id(transcript_id)</dt>
<dd>Constructs a Transcript object for the given Ensembl transcript ID (e.g. "ENST00000369985").</dd>

<dt>transcripts_by_name(transcript_name)</dt>
<dd>Returns a list of Transcript objects for every transcript matching the given name.</dd>

<dt>transcript_names(contig=None, strand=None)</dt>
<dd>Returns all transcript names in the annotation database.</dd>

<dt>transcript_ids(contig=None, strand=None)</dt>
<dd>Returns all transcript IDs in the annotation database.</dd>

<dt>transcript_ids_of_gene_id(gene_id)</dt>
<dd>Return IDs of all transcripts associated with given gene ID.</dd>

<dt>transcript_ids_of_gene_name(gene_name)</dt>
<dd>Return IDs of all transcripts associated with given gene name.</dd>

<dt>transcript_ids_of_transcript_name(transcript_name)</dt>
<dd>Find all Ensembl transcript IDs with the given name.</dd>

<dt>transcript_ids_of_exon_id(exon_id)</dt>
<dd>Return IDs of all transcripts associated with given exon ID.</dd>
</dl>

## Exons

<dl>
<dt>exon_ids(contig=None, strand=None)</dt>
<dd>Returns a list of exon IDs in the annotation database, optionally restricted
by the given chromosome and strand.</dd>

<dt>exon_by_id(exon_id)</dt>
<dd>Constructs an Exon object for the given Ensembl exon ID (e.g. "ENSE00001209410").</dd>

<dt>exon_ids_of_gene_id(gene_id)</dt>
<dd>Returns a list of exon IDs associated with a given gene ID.</dd>

<dt>exon_ids_of_gene_name(gene_name)</dt>
<dd>Returns a list of exon IDs associated with a given gene name.</dd>

<dt>exon_ids_of_transcript_id(transcript_id)</dt>
<dd>Returns a list of exon IDs associated with a given transcript ID.</dd>

<dt>exon_ids_of_transcript_name(transcript_name)</dt>
<dd>Returns a list of exon IDs associated with a given transcript name.</dd>
</dl>
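
The methods in this API section follow a `<result>_of_<key>` naming pattern that walks the gene, transcript and exon relationships. As a rough mental model only, the toy sketch below composes such a lookup from plain dictionaries; PyEnsembl itself answers these queries from its local database, and every name and identifier here is hypothetical.

```python
# Made-up relationships: gene name -> gene IDs -> transcript IDs -> exon IDs.
GENE_IDS_OF_GENE_NAME = {"EXAMPLE": ["GENE0001"]}
TRANSCRIPT_IDS_OF_GENE_ID = {"GENE0001": ["TX0001", "TX0002"]}
EXON_IDS_OF_TRANSCRIPT_ID = {
    "TX0001": ["EX0001", "EX0002"],
    "TX0002": ["EX0002", "EX0003"],
}


def toy_exon_ids_of_gene_name(gene_name):
    """Collect exon IDs reachable from a gene name, without duplicates."""
    exon_ids = []
    for gene_id in GENE_IDS_OF_GENE_NAME.get(gene_name, []):
        for transcript_id in TRANSCRIPT_IDS_OF_GENE_ID.get(gene_id, []):
            for exon_id in EXON_IDS_OF_TRANSCRIPT_ID.get(transcript_id, []):
                # Exons shared between transcripts are reported once.
                if exon_id not in exon_ids:
                    exon_ids.append(exon_id)
    return exon_ids


print(toy_exon_ids_of_gene_name("EXAMPLE"))
```

An unknown gene name yields an empty list in this toy model; how PyEnsembl reports a missing name is not covered here.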

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/openvax/pyensembl",
    "name": "pyensembl",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": null,
    "author": "Alex Rubinsteyn",
    "author_email": "alex.rubinsteyn@unc.edu",
    "download_url": "https://files.pythonhosted.org/packages/85/94/253fa59f1f9e7fe3b7d34818fd4fe08f857da241234bf18c6ee1b9496afe/pyensembl-2.3.13.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "http://www.apache.org/licenses/LICENSE-2.0.html",
    "summary": "Python interface to Ensembl reference genome metadata",
    "version": "2.3.13",
    "project_urls": {
        "Homepage": "https://github.com/openvax/pyensembl"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "774fd8ea7378ffd03a7e71f6e750cb5fa7e8721a0e30f04af8ae60e704d59e96",
                "md5": "de7c2aa968be5061ec60c0755ffadc60",
                "sha256": "46989e4eb3c3436ac2a8b02f17e999439c04ca2db0926ce6d5535a9a0a00bfce"
            },
            "downloads": -1,
            "filename": "pyensembl-2.3.13-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "de7c2aa968be5061ec60c0755ffadc60",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 55952,
            "upload_time": "2024-04-25T13:45:24",
            "upload_time_iso_8601": "2024-04-25T13:45:24.508350Z",
            "url": "https://files.pythonhosted.org/packages/77/4f/d8ea7378ffd03a7e71f6e750cb5fa7e8721a0e30f04af8ae60e704d59e96/pyensembl-2.3.13-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8594253fa59f1f9e7fe3b7d34818fd4fe08f857da241234bf18c6ee1b9496afe",
                "md5": "2aa1e29534735cabbb68ec0409e8684f",
                "sha256": "c70ce760f68fe2a6be871db44e53ce1d4d1227f2ce0578c6b291d5a89f5d1832"
            },
            "downloads": -1,
            "filename": "pyensembl-2.3.13.tar.gz",
            "has_sig": false,
            "md5_digest": "2aa1e29534735cabbb68ec0409e8684f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 60784,
            "upload_time": "2024-04-25T13:45:31",
            "upload_time_iso_8601": "2024-04-25T13:45:31.165099Z",
            "url": "https://files.pythonhosted.org/packages/85/94/253fa59f1f9e7fe3b7d34818fd4fe08f857da241234bf18c6ee1b9496afe/pyensembl-2.3.13.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-25 13:45:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "openvax",
    "github_project": "pyensembl",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "pyensembl"
}
        