varcode


Field | Value
-----: | :-----
Name | varcode
Version | 1.2.0
Home page | https://github.com/openvax/varcode
Summary | Variant annotation in Python
Upload time | 2024-01-30 17:45:32
Author | Alex Rubinsteyn
License | http://www.apache.org/licenses/LICENSE-2.0.html

[![Tests](https://github.com/openvax/varcode/actions/workflows/tests.yml/badge.svg)](https://github.com/openvax/varcode/actions/workflows/tests.yml)
<a href="https://coveralls.io/github/openvax/varcode?branch=master">
    <img src="https://coveralls.io/repos/openvax/varcode/badge.svg?branch=master&service=github" alt="Coverage Status" />
</a>
<a href="https://pypi.python.org/pypi/varcode/">
    <img src="https://img.shields.io/pypi/v/varcode.svg?maxAge=1000" alt="PyPI" />
</a>

Varcode
=======

Varcode is a library for working with genomic variant data in Python and predicting the impact of those variants on protein sequences.

Installation
------------

You can install varcode using [pip](https://pip.pypa.io/en/latest/quickstart.html):

```bash
pip install varcode
```

You can install required reference genome data through [PyEnsembl](https://github.com/openvax/pyensembl) as follows:

```bash
# Downloads and installs the Ensembl releases (75 and 76)
pyensembl install --release 75 76
```


Example
-------


```python
import varcode

# Load a TCGA MAF file containing ovarian cancer variants
variants = varcode.load_maf("tcga-ovarian-cancer-variants.maf")

print(variants)
### <VariantCollection from 'tcga-ovarian-cancer-variants.maf' with 6428 elements>
###  -- Variant(contig=1, start=69538, ref=G, alt=A, genome=GRCh37)
###  -- Variant(contig=1, start=881892, ref=T, alt=G, genome=GRCh37)
###  -- Variant(contig=1, start=3389714, ref=G, alt=A, genome=GRCh37)
###  -- Variant(contig=1, start=3624325, ref=G, alt=T, genome=GRCh37)
###  ...

# you can index into a VariantCollection and get back a Variant object
variant = variants[0]

# groupby_gene_name returns a dictionary whose keys are gene names
# and whose values are themselves VariantCollections
gene_groups = variants.groupby_gene_name()

# get variants which affect the TP53 gene
TP53_variants = gene_groups["TP53"]

# predict protein coding effect of every TP53 variant on
# each transcript of the TP53 gene
TP53_effects = TP53_variants.effects()

print(TP53_effects)
### <EffectCollection with 789 elements>
### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R342*)
### -- ThreePrimeUTR(variant=chr17 g.7574003G>A, transcript_name=TP53-005, transcript_id=ENST00000420246)
### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-002, transcript_id=ENST00000445888, effect_description=p.R342*)
### -- FrameShift(variant=chr17 g.7574030_7574030delG, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R333fs)
### ...

premature_stop_effect = TP53_effects[0]

print(str(premature_stop_effect.mutant_protein_sequence))
### 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMF'

print(premature_stop_effect.aa_mutation_start_offset)
### 341

print(premature_stop_effect.transcript)
### Transcript(id=ENST00000269305, name=TP53-001, gene_name=TP53, biotype=protein_coding, location=17:7571720-7590856)

print(premature_stop_effect.gene.name)
### 'TP53'
```
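
The `effect_description` and `aa_mutation_start_offset` values printed above describe the same position in two conventions: `p.R342*` counts residues from 1, while the offset `341` counts from 0. The sketch below parses a simple description string to make that relationship explicit. `parse_simple_protein_change` is a hypothetical helper written for illustration, not part of Varcode, and it only handles the single-residue forms that appear in the example output.

```python
import re

# Hypothetical helper, not part of the Varcode API.
# Matches simple forms such as "p.R342*" (stop gained) or "p.R333fs" (frameshift).
_SIMPLE_CHANGE = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*]|fs)$")


def parse_simple_protein_change(description):
    """Split a simple protein change string into its parts.

    Returns a dict with the reference residue, the one-based position,
    the equivalent zero-based offset, and the alternate residue or "fs".
    """
    match = _SIMPLE_CHANGE.match(description)
    if match is None:
        raise ValueError("unsupported description: %r" % description)
    ref_aa, position, alt = match.groups()
    position = int(position)
    return {
        "ref": ref_aa,
        "position": position,    # one-based, as written in the description
        "offset": position - 1,  # zero-based, comparable to the offset printed above
        "alt": alt,
    }


stop_gained = parse_simple_protein_change("p.R342*")
frameshift = parse_simple_protein_change("p.R333fs")
print(stop_gained["position"], stop_gained["offset"], stop_gained["alt"])
print(frameshift["position"], frameshift["offset"], frameshift["alt"])
```

For real analyses, prefer the attributes on the effect objects themselves over parsing their string form.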

If you are looking for a quick start guide, see [this IPython notebook](./examples/varcode-quick_start.ipynb), which demonstrates simple use cases of Varcode.

Effect Types
------------

Effect type  | Description
-----------: | :-----------
*AlternateStartCodon* | Replace annotated start codon with alternative start codon (*e.g.* "ATG>CAG").
*ComplexSubstitution* | Insertion and deletion of multiple amino acids.
*Deletion* | Coding mutation which causes deletion of amino acid(s).
*ExonLoss* | Deletion of entire exon, significantly disrupts protein.
*ExonicSpliceSite* | Mutation at the beginning or end of an exon, may affect splicing.
*FivePrimeUTR* | Variant affects 5' untranslated region before start codon.
*FrameShiftTruncation* | A frameshift which leads immediately to a stop codon (no novel amino acids created).
*FrameShift* | Out-of-frame insertion or deletion of nucleotides, causes novel protein sequence and often premature stop codon.
*IncompleteTranscript* | Can't determine effect since transcript annotation is incomplete (often missing either the start or stop codon).
*Insertion* | Coding mutation which causes insertion of amino acid(s).
*Intergenic* | Occurs outside of any annotated gene.
*Intragenic* | Within the annotated boundaries of a gene but not in a region that's transcribed into pre-mRNA.
*IntronicSpliceSite* | Mutation near the beginning or end of an intron but less likely to affect splicing than donor/acceptor mutations.
*Intronic* | Variant occurs between exons and is unlikely to affect splicing.
*NoncodingTranscript* | Transcript doesn't code for a protein.
*PrematureStop* | Insertion of stop codon, truncates protein.
*Silent* | Mutation in coding sequence which does not change the amino acid sequence of the translated protein.
*SpliceAcceptor* | Mutation in the last two nucleotides of an intron, likely to affect splicing.
*SpliceDonor* | Mutation in the first two nucleotides of an intron, likely to affect splicing.
*StartLoss* | Mutation causes loss of start codon; the likely result is that an alternate start codon will be used downstream (possibly in a different frame).
*StopLoss* | Loss of stop codon, causes extension of protein by translation of nucleotides from 3' UTR.
*Substitution* | Coding mutation which causes simple substitution of one amino acid for another.
*ThreePrimeUTR* | Variant affects 3' untranslated region after stop codon of mRNA.
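
To make a few of the coding categories above concrete, here is a toy sketch that distinguishes *Silent*, *Substitution*, *PrematureStop*, and *StopLoss* for a single codon change, and checks whether an indel shifts the reading frame. It uses the standard genetic code and two hypothetical functions, `classify_codon_change` and `is_frameshift`. It is not Varcode's implementation, which also has to handle transcripts, splice sites, and UTRs.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Toy classifier for one in-frame codon change (hypothetical helper)."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "Silent"
    if alt_aa == "*":
        return "PrematureStop"
    if ref_aa == "*":
        return "StopLoss"
    return "Substitution"


def is_frameshift(ref, alt):
    """An indel shifts the frame when its net length is not a multiple of 3."""
    return (len(alt) - len(ref)) % 3 != 0


print(classify_codon_change("CGA", "TGA"))  # arginine codon becomes a stop codon
print(classify_codon_change("CGA", "CGG"))  # both codons encode arginine
print(classify_codon_change("CGA", "CAA"))  # arginine becomes glutamine
print(classify_codon_change("TGA", "CGA"))  # stop codon becomes arginine
print(is_frameshift("G", ""))               # single-base deletion
print(is_frameshift("", "ACG"))             # three-base insertion
```

The ordering of the checks matters: comparing amino acids first means a change from one stop codon to another is reported as *Silent* rather than as a stop gain or loss.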


Coordinate System
-----------------
Varcode currently uses a "base counted, one start" genomic coordinate system, to match the Ensembl annotation database. We are planning to switch over to "space counted, zero start" (interbase) coordinates, since that system allows for more uniform logic (no special cases for insertions). To learn more about genomic coordinate systems, read this [blog post](http://alternateallele.blogspot.com/2012/03/genome-coordinate-conventions.html).
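
The difference between the two conventions can be sketched in a few lines. The functions below are hypothetical helpers for illustration, not part of Varcode. They assume that a "base counted, one start" interval is inclusive at both ends and that an interbase interval is half-open, which is the usual reading of those terms.

```python
def one_based_to_interbase(start, end):
    """Convert a one-based inclusive interval to zero-based interbase.

    Bases 5..7 (three bases) become the interbase interval (4, 7),
    whose length is simply end - start.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return (start - 1, end)


def interbase_to_one_based(start, end):
    """Convert a zero-based interbase interval back to one-based inclusive.

    An empty interbase interval (start == end) marks the gap between two
    bases, which is where an insertion happens. It has no one-based
    inclusive equivalent, which is the special case mentioned above.
    """
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    if start == end:
        raise ValueError("empty interval (insertion point) has no one-based form")
    return (start + 1, end)


# A single-base variant at one-based position 7574003
print(one_based_to_interbase(7574003, 7574003))
# A three-base deletion covering one-based positions 5 through 7
print(one_based_to_interbase(5, 7))
# Round trip back to one-based inclusive coordinates
print(interbase_to_one_based(4, 7))
```

In interbase coordinates an insertion between the fourth and fifth bases is just the empty interval `(4, 4)`, so the same length and overlap arithmetic works for substitutions, deletions, and insertions alike.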

            
