pyensembl


| Field | Value |
| --- | --- |
| Name | pyensembl |
| Version | 2.3.12 |
| Home page | https://github.com/openvax/pyensembl |
| Summary | Python interface to Ensembl reference genome metadata |
| Upload time | 2024-03-28 17:18:10 |
| Author | Alex Rubinsteyn |
| License | http://www.apache.org/licenses/LICENSE-2.0.html |
| Requirements | typechecks (`>=0.0.2,<1.0.0`), datacache (`>=1.4.0,<2.0.0`), memoized-property (`>=1.0.2`), tinytimer (`>=0.0.0,<1.0.0`), gtfparse (`>=2.5.0,<3.0.0`), serializable (`>=0.2.1,<1.0.0`), pylint (`>=2.17.2,<3.0.0`) |

[![Tests](https://github.com/openvax/pyensembl/actions/workflows/tests.yml/badge.svg)](https://github.com/openvax/pyensembl/actions/workflows/tests.yml)
[![Coverage Status](https://coveralls.io/repos/github/openvax/pyensembl/badge.svg?branch=main)](https://coveralls.io/github/openvax/pyensembl?branch=main)
<a href="https://pypi.python.org/pypi/pyensembl/">
<img src="https://img.shields.io/pypi/v/pyensembl.svg?maxAge=1000" alt="PyPI" />
</a>

# PyEnsembl

PyEnsembl is a Python interface to [Ensembl](http://www.ensembl.org) reference genome metadata such as exons and transcripts. PyEnsembl downloads [GTF](https://en.wikipedia.org/wiki/Gene_transfer_format) and [FASTA](https://en.wikipedia.org/wiki/FASTA_format) files from the [Ensembl FTP server](ftp://ftp.ensembl.org) and loads them into a local database. PyEnsembl can also work with custom reference data specified using user-supplied GTF and FASTA files.

# Example Usage

```python
from pyensembl import EnsemblRelease

# release 77 uses human reference genome GRCh38
data = EnsemblRelease(77)

# will return ['HLA-A']
gene_names = data.gene_names_at_locus(contig=6, position=29945884)

# get all exons associated with HLA-A
exon_ids = data.exon_ids_of_gene_name('HLA-A')
```

# Installation

You can install PyEnsembl using [pip](https://pip.pypa.io/en/latest/quickstart.html):

```sh
pip install pyensembl
```

This should also install any required packages such as [datacache](https://github.com/openvax/datacache).

Before using PyEnsembl, run the following command to download and install
Ensembl data:

```sh
pyensembl install --release <list of Ensembl release numbers> --species <species-name>
```

For example, `pyensembl install --release 75 76 --species human` will download and install all
human reference data from Ensembl releases 75 and 76.

Alternatively, you can create the `EnsemblRelease` object from inside a Python
process and call `ensembl_object.download()` followed by `ensembl_object.index()`.
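
The in-process route can be wrapped in a small helper. This is a minimal sketch: `install_genome` is a hypothetical name, not part of PyEnsembl, and it assumes only the `download()` and `index()` methods mentioned above.

```python
def install_genome(genome):
    """Download and index annotation data for a PyEnsembl genome object.

    `install_genome` is a hypothetical helper, not part of PyEnsembl.
    It relies only on the `download()` and `index()` methods named above.
    """
    genome.download()  # fetch the GTF and FASTA files into the local cache
    genome.index()     # parse the GTF and build the local database
    return genome


# Usage (needs pyensembl installed and network access):
#   from pyensembl import EnsemblRelease
#   data = install_genome(EnsemblRelease(75))
```

Returning the same object lets the call sit on one line with the constructor, as in the commented usage.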

## Cache Location

By default, PyEnsembl caches files in the `pyensembl` sub-directory of the
platform-specific `Cache` folder.
You can override this default by setting the environment variable
`PYENSEMBL_CACHE_DIR` to your preferred cache location:

```sh
export PYENSEMBL_CACHE_DIR=/custom/cache/dir
```

or

```python
import os

os.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'
# ... PyEnsembl API usage
```
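
If the cache directory may not exist yet, it can be created before the variable is set. The following is a sketch only: `use_cache_dir` is a hypothetical helper, and it assumes the variable is read when genome objects are set up, so calling it before constructing any `EnsemblRelease` or `Genome` is the safe ordering.

```python
import os
from pathlib import Path


def use_cache_dir(path):
    """Point PyEnsembl at `path` via PYENSEMBL_CACHE_DIR.

    `use_cache_dir` is a hypothetical helper, not part of PyEnsembl.
    Returns the resolved directory as a pathlib.Path.
    """
    cache_dir = Path(path).expanduser().resolve()
    cache_dir.mkdir(parents=True, exist_ok=True)  # create the directory up front
    os.environ["PYENSEMBL_CACHE_DIR"] = str(cache_dir)
    return cache_dir


# Usage: call this first, then use the PyEnsembl API as usual.
#   use_cache_dir("/custom/cache/dir")
```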

# Usage tips

## List installed genomes

To see the genomes for which PyEnsembl has already downloaded and indexed metadata, run:

```sh
pyensembl list
```

Or equivalently do this in Python:

```python
from pyensembl.shell import collect_all_installed_ensembl_releases
collect_all_installed_ensembl_releases()
```

## Load genome in Python

Here's an example Python snippet that loads fly genome data from Ensembl release v100:

```python
from pyensembl import EnsemblRelease
data = EnsemblRelease(release=100, species='drosophila_melanogaster')
```

## Data structures

### Gene

```python
# `data` is the EnsemblRelease object loaded in the previous section
gene = data.gene_by_id(gene_id='FBgn0011747')
```

### Transcript

```python
transcript = gene.transcripts[0]
```

### Protein information

```python
transcript.protein_id
transcript.protein_sequence
```
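
The attributes above can be collected into a small summary. This is a sketch under stated assumptions: `summarize_transcript` is a hypothetical helper, and it assumes that `protein_sequence` may be empty or `None` for transcripts without a protein product.

```python
def summarize_transcript(transcript):
    """Return a small dict describing a transcript's protein, if any.

    Hypothetical helper, not part of PyEnsembl. It reads only the
    `protein_id` and `protein_sequence` attributes shown above.
    """
    sequence = transcript.protein_sequence
    return {
        "protein_id": transcript.protein_id,
        # Assumption: the sequence can be missing for non-coding transcripts.
        "protein_length": len(sequence) if sequence else 0,
    }


# Usage, continuing from the snippets above:
#   summarize_transcript(gene.transcripts[0])
```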

# Non-Ensembl Data

PyEnsembl can also load arbitrary genomes: pass local file paths or remote
URLs for GTF and FASTA files, whether or not they come from Ensembl.
(Warning: GTF formats can vary, and handling of
non-Ensembl data is still very much in development.)

For example:

```python
from pyensembl import Genome
data = Genome(
    reference_name='GRCh38',
    annotation_name='my_genome_features',
    # annotation_version=None,
    gtf_path_or_url='/My/local/gtf/path_to_my_genome_features.gtf', # Path or URL of GTF file
    # transcript_fasta_paths_or_urls=None, # List of paths or URLs of FASTA files containing transcript sequences
    # protein_fasta_paths_or_urls=None, # List of paths or URLs of FASTA files containing protein sequences
    # cache_directory_path=None, # Where to place downloaded and cached files for this genome
)
# parse GTF and construct database of genomic features
data.index()
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
```

# API

The `EnsemblRelease` object has methods to let you access all possible
combinations of the annotation features _gene_name_, _gene_id_,
_transcript_name_, _transcript_id_, _exon_id_ as well as the location of
these genomic elements (contig, start position, end position, strand).

## Genes

<dl>
<dt>genes(contig=None, strand=None)</dt>
<dd>Returns a list of Gene objects, optionally restricted to a particular contig
or strand.</dd>

<dt>genes_at_locus(contig, position, end=None, strand=None)</dt>
<dd>Returns a list of Gene objects overlapping a particular position on a contig.
Pass end to extend the query to a range, and pass strand='+' or strand='-'
to restrict results to the forward or reverse strand.</dd>

<dt>gene_by_id(gene_id)</dt>
<dd>Return a Gene object for given Ensembl gene ID (e.g. "ENSG00000068793").</dd>

<dt>gene_names(contig=None, strand=None)</dt>
<dd>Returns all gene names in the annotation database, optionally restricted
to a particular contig or strand.</dd>

<dt>genes_by_name(gene_name)</dt>
<dd>Returns a list containing a Gene object for each distinct gene ID with
the given name (there might be multiple due to copies in the genome).</dd>

<dt>gene_by_protein_id(protein_id)</dt>
<dd>Find Gene associated with the given Ensembl protein ID (e.g. "ENSP00000350283")</dd>

<dt>gene_names_at_locus(contig, position, end=None, strand=None)
</dt>
<dd>Names of genes overlapping with the given locus, optionally restricted by strand.
(returns a list to account for overlapping genes)</dd>

<dt>gene_name_of_gene_id(gene_id)
</dt>
<dd>Returns name of gene with given gene ID.</dd>

<dt>gene_name_of_transcript_id(transcript_id)
</dt><dd>Returns name of gene associated with given transcript ID.</dd>

<dt>gene_name_of_transcript_name(transcript_name)
</dt>
<dd>Returns name of gene associated with given transcript name.</dd>

<dt>gene_name_of_exon_id(exon_id)
</dt><dd>Returns name of gene associated with given exon ID.</dd>

<dt>gene_ids(contig=None, strand=None)
</dt>
<dd>Return all gene IDs in the annotation database, optionally restricted by
chromosome name or strand.</dd>

<dt>gene_ids_of_gene_name(gene_name)
</dt>
<dd>Returns all Ensembl gene IDs with the given name.</dd>

</dl>
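
As an illustration of the locus queries listed above, the sketch below looks up gene names across a range rather than a single position. `gene_names_in_range` is a hypothetical helper; it assumes only the `gene_names_at_locus(contig, position, end=None, strand=None)` signature documented here.

```python
def gene_names_in_range(genome, contig, start, end, strand=None):
    """Sorted, de-duplicated gene names overlapping start..end on a contig.

    Hypothetical helper, not part of PyEnsembl. `genome` is any object
    exposing `gene_names_at_locus` as documented above.
    """
    names = genome.gene_names_at_locus(contig, start, end=end, strand=strand)
    return sorted(set(names))


# Usage (needs pyensembl and an installed release; coordinates are examples only):
#   from pyensembl import EnsemblRelease
#   gene_names_in_range(EnsemblRelease(77), 6, 29940000, 29950000)
```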

## Transcripts

<dl>
<dt>transcripts(contig=None, strand=None)</dt>
<dd>Returns a list of Transcript objects for all transcript entries in the
Ensembl database, optionally restricted to a particular contig or strand.</dd>

<dt>transcript_by_id(transcript_id)</dt>
<dd>Construct a Transcript object for given Ensembl transcript ID (e.g. "ENST00000369985")</dd>

<dt>transcripts_by_name(transcript_name)</dt>
<dd>Returns a list of Transcript objects for every transcript matching the given name.</dd>

<dt>transcript_names(contig=None, strand=None)</dt>
<dd>Returns all transcript names in the annotation database, optionally
restricted to a particular contig or strand.</dd>

<dt>transcript_ids(contig=None, strand=None)</dt>
<dd>Returns all transcript IDs in the annotation database, optionally
restricted to a particular contig or strand.</dd>

<dt>transcript_ids_of_gene_id(gene_id)</dt>
<dd>Return IDs of all transcripts associated with given gene ID.</dd>

<dt>transcript_ids_of_gene_name(gene_name)</dt>
<dd>Return IDs of all transcripts associated with given gene name.</dd>

<dt>transcript_ids_of_transcript_name(transcript_name)</dt>
<dd>Find all Ensembl transcript IDs with the given name.</dd>

<dt>transcript_ids_of_exon_id(exon_id)</dt>
<dd>Return IDs of all transcripts associated with given exon ID.</dd>
</dl>
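
The transcript lookups above compose naturally over several genes at once. A minimal sketch, assuming only the documented `transcript_ids_of_gene_name(gene_name)` method; `transcript_ids_by_gene` is a hypothetical name.

```python
def transcript_ids_by_gene(genome, gene_names):
    """Map each gene name to the list of its transcript IDs.

    Hypothetical helper, not part of PyEnsembl.
    """
    return {
        name: list(genome.transcript_ids_of_gene_name(name))
        for name in gene_names
    }


# Usage (needs pyensembl and an installed release):
#   transcript_ids_by_gene(data, ["HLA-A", "HLA-B"])
```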

## Exons

<dl>
<dt>exon_ids(contig=None, strand=None)</dt>
<dd>Returns a list of exon IDs in the annotation database, optionally restricted
by the given chromosome and strand.</dd>

<dt>exon_by_id(exon_id)</dt>
<dd>Construct an Exon object for given Ensembl exon ID (e.g. "ENSE00001209410")</dd>

<dt>exon_ids_of_gene_id(gene_id)</dt>
<dd>Returns a list of exon IDs associated with a given gene ID.</dd>

<dt>exon_ids_of_gene_name(gene_name)</dt>
<dd>Returns a list of exon IDs associated with a given gene name.</dd>

<dt>exon_ids_of_transcript_id(transcript_id)</dt>
<dd>Returns a list of exon IDs associated with a given transcript ID.</dd>

<dt>exon_ids_of_transcript_name(transcript_name)</dt>
<dd>Returns a list of exon IDs associated with a given transcript name.</dd>
</dl>
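
To see which exons belong to which transcript of a gene, the transcript and exon lookups can be chained. This sketch assumes only the documented `transcript_ids_of_gene_name` and `exon_ids_of_transcript_id` methods; `exon_ids_per_transcript` is a hypothetical helper.

```python
def exon_ids_per_transcript(genome, gene_name):
    """Map each transcript ID of a gene to the list of its exon IDs.

    Hypothetical helper, not part of PyEnsembl.
    """
    return {
        transcript_id: list(genome.exon_ids_of_transcript_id(transcript_id))
        for transcript_id in genome.transcript_ids_of_gene_name(gene_name)
    }


# Usage (needs pyensembl and an installed release):
#   exon_ids_per_transcript(data, "HLA-A")
```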

            
