# PyHGVS2
[PyHGVS](https://github.com/counsyl/hgvs) is good but abandoned;
[the last commit](https://github.com/counsyl/hgvs/commit/ab9b95f21466fbb3265b5bd818ccab0c926ca59f)
was over 5 years ago, and as the parent company Counsyl
[was bought out](https://www.genomeweb.com/business-news/myriad-genetics-acquire-counsyl-375m)
at a similar time, it's unlikely to be updated ever again.
The original README is below. The last three headings were updated for accuracy.
# HGVS variant name parsing and generation
The Human Genome Variation Society (HGVS) promotes the discovery and
sharing of genetic variation in the human population. As part of facilitating
variant sharing, the society has produced a series of recommendations for
how to name and refer to variants within research publications and clinical
settings. A compilation of these recommendations is available on their
[website](http://www.hgvs.org/mutnomen/recs.html).

This library provides a simple Python API for parsing, formatting, and
normalizing HGVS names. Handling HGVS names involves a surprising number of
non-trivial steps, so there is a need for well-tested libraries that
encapsulate them.
## HGVS name example
In most next-generation sequencing applications, variants are first
discovered and described in terms of their genomic coordinates such as
chromosome 7, position 117,199,563 with reference allele `G` and
alternative allele `T`. According to the HGVS standard, we can
describe this variant as `NC_000007.13:g.117199563G>T`. The first
part of the name is a RefSeq ID `NC_000007.13` for chromosome 7
version 13. The `g.` denotes that this is a variant described in
genomic (i.e. chromosomal) coordinates. Lastly, the chromosomal position,
reference allele, and alternative allele are indicated. For simple
single nucleotide changes the `>` character is used.

More commonly, a variant will be described using a cDNA- or protein-style
HGVS name. In the example above, the variant in cDNA style is
named `NM_000492.3:c.1438G>T`. Here again, the first part of the name
refers to a RefSeq sequence, this time mRNA transcript `NM_000492`
version `3`. Optionally, the gene name can also be given as
`NM_000492.3(CFTR)`. The `c.` indicates that this is a cDNA name, and
the coordinate indicates that this mutation occurs at position 1438
along the coding portion of the spliced transcript (i.e. position 1 is
the first base of `ATG` translation start codon). Briefly, the
protein style of the variant name is `NP_000483.3:p.Gly480Cys` which
indicates the change in amino-acid coordinates (`480`) along an
amino-acid sequence (`NP_000483.3`) and gives the reference and
alternative amino-acid alleles (`Gly` and `Cys`, respectively).
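
The anatomy described above (reference sequence, optional gene symbol, coordinate type, and description) can be sketched with a small regular expression. This is only an illustration of the name structure, not how the library parses names: the pattern `HGVS_NAME` and the helper `split_hgvs_name` are hypothetical, and they cover only the simple names shown above rather than the full HGVS grammar.

```python
import re

# Hypothetical pattern for illustration; it is not part of the library.
HGVS_NAME = re.compile(
    r"^(?P<reference>[^:()]+)"    # sequence accession, e.g. NM_000492.3
    r"(?:\((?P<gene>[^)]+)\))?"   # optional gene symbol, e.g. (CFTR)
    r":(?P<kind>[gcp])\."         # g = genomic, c = cDNA, p = protein
    r"(?P<description>.+)$"       # position and alleles, e.g. 1438G>T
)


def split_hgvs_name(name):
    """Split a simple HGVS name into its labelled parts."""
    match = HGVS_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised HGVS name: {name!r}")
    return match.groupdict()


for example in (
    "NC_000007.13:g.117199563G>T",
    "NM_000492.3(CFTR):c.1438G>T",
    "NP_000483.3:p.Gly480Cys",
):
    print(split_hgvs_name(example))
```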
The standard also specifies custom name formats for many mutation
categories such as insertions (`NM_000492.3:c.1438_1439insA`),
deletions (`NM_000492.3:c.1438_1440delGGT`),
duplications (`NM_000492.3:c.1438_1440dupGGT`), and several
other more complex genomic rearrangements.

While many of these names appear simple to parse or generate,
there are many corner cases, especially with cDNA HGVS names. For
example, variants before the start codon should have negative cDNA
coordinates (`NM_000492.3:c.-4G>C`), and variants after the stop codon
also have their own format (`NM_000492.3:c.*33C>T`). Variants within
introns are indicated by the closest exonic base with an additional
genomic offset such as `NM_000492.3:c.4243-20A>G` (the variant is 20
bases in the 5' direction of the cDNA coordinate 4243). Lastly, all
coordinates and alleles are specified on the strand of the
transcript. This library properly handles all logic necessary to
convert genomic coordinates to and from HGVS cDNA coordinates.
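
To make the three coordinate forms above concrete, here is a minimal sketch that splits the coordinate part of a cDNA name (`-4`, `*33`, `4243-20`) into a landmark, a coding position, and an intronic offset. The names `CDNAPosition` and `parse_cdna_coord` are hypothetical and are not the library's own classes; the sketch handles only the coordinate text and does no genomic conversion.

```python
import re
from typing import NamedTuple


class CDNAPosition(NamedTuple):
    """Hypothetical container for one cDNA coordinate."""
    landmark: str  # "start": counted from the ATG; "stop": counted from the stop codon
    coord: int     # position relative to the landmark (negative = upstream of the ATG)
    offset: int    # intronic offset from the nearest exonic base (0 = exonic)


# Optional "*" prefix, a signed position, and an optional signed intronic offset.
CDNA_COORD = re.compile(r"^(?P<star>\*)?(?P<coord>-?\d+)(?P<offset>[+-]\d+)?$")


def parse_cdna_coord(text):
    """Parse coordinate text such as '-4', '*33' or '4243-20'."""
    match = CDNA_COORD.match(text)
    if match is None:
        raise ValueError(f"not a recognised cDNA coordinate: {text!r}")
    return CDNAPosition(
        landmark="stop" if match["star"] else "start",
        coord=int(match["coord"]),
        offset=int(match["offset"] or 0),
    )


for text in ("1438", "-4", "*33", "4243-20"):
    print(text, parse_cdna_coord(text))
```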

Another important consideration for any library that handles HGVS names
is variant normalization. The HGVS standard aims to provide a "uniform
and unequivocal" description of variants. Namely, two people
discovering a variant should be able to arrive at the same name for
it. Such a property is very useful for checking whether a variant has
been seen before and connecting all known relevant information. For
SNPs, this property is fairly easy to achieve. However, for
insertions and deletions (indels) near repetitive regions, many indels
are equivalent (e.g. it doesn't matter which `AT` in a run of
`ATATATAT` was deleted). The VCF file format has chosen to uniquely
specify such indels by using the most left-aligned genomic coordinate.
Therefore, compliant variant callers that output VCF will have applied
this normalization. The HGVS standard also specifies a normalization
for such indels. However, it states that indels should use the most 3'
position in a transcript. For genes on the positive strand, this is
the opposite of the direction specified by VCF. This library properly
implements both kinds of variant normalization and allows easy
conversion between HGVS and VCF style variants. It also handles
many other cases of normalization (e.g. the HGVS standard recommends
indicating an insertion with the `dup` notation instead of `ins`
if it can be represented as a tandem duplication).
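
The difference between the two conventions can be illustrated with a toy sequence. The sketch below slides a deletion through a repeat as far left or as far right as possible without changing the resulting sequence. It is a simplified illustration, not the library's implementation: `shift_deletion` is a hypothetical helper, coordinates are 0-based, and the anchor base that VCF adds in front of an indel is ignored.

```python
def shift_deletion(sequence, start, length, direction):
    """Return the shifted 0-based start of a deletion of `length` bases.

    The deletion is moved one base at a time for as long as the edited
    sequence stays identical.
    """
    if direction == "left":
        # Moving left swaps the last deleted base for the base just before
        # the window, so the two must match.
        while start > 0 and sequence[start - 1] == sequence[start + length - 1]:
            start -= 1
    elif direction == "right":
        # Moving right swaps the first deleted base for the base just after
        # the window, so the two must match.
        while start + length < len(sequence) and sequence[start] == sequence[start + length]:
            start += 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    return start


def apply_deletion(sequence, start, length):
    """Return the sequence with `length` bases removed at `start`."""
    return sequence[:start] + sequence[start + length:]


sequence = "GGATATATATCC"  # an AT repeat flanked by unique bases
called_at = 4              # a caller reports the second AT as deleted

left = shift_deletion(sequence, called_at, 2, "left")    # VCF-style, left-aligned
right = shift_deletion(sequence, called_at, 2, "right")  # most 3' on a positive-strand transcript

print(left, right)
print(apply_deletion(sequence, left, 2) == apply_deletion(sequence, right, 2))
```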
## Example usage
Below is a minimal example of parsing and formatting HGVS names. In
addition to the name itself, two other pieces of information are
needed: the genome sequence (needed for normalization), and the
transcript model or a callback for fetching the transcript model
(needed for transcript coordinate calculations). This library makes
as few assumptions as possible about how this external data is stored.
In this example, the genome sequence is read using the `pyfaidx` library
and transcripts are read from a RefSeqGenes flat-file using methods
provided by `hgvs`.
```python
import pyhgvs2 as hgvs
import pyhgvs2.utils as hgvs_utils
from pyfaidx import Fasta

# Read genome sequence using pyfaidx.
genome = Fasta('hg19.fa')

# Read RefSeq transcripts into a python dict.
with open('hgvs/data/genes.refGene') as infile:
    transcripts = hgvs_utils.read_transcripts(infile)

# Provide a callback for fetching a transcript by its name.
def get_transcript(name):
    return transcripts.get(name)

# Parse the HGVS name into genomic coordinates and alleles.
chrom, offset, ref, alt = hgvs.parse_hgvs_name(
    'NM_000352.3:c.215A>G', genome, get_transcript=get_transcript)
# Returns variant in VCF style: ('chr11', 17496508, 'T', 'C')
# Notice that since the transcript is on the negative strand, the alleles
# are reverse complemented during conversion.

# Format an HGVS name.
chrom, offset, ref, alt = ('chr11', 17496508, 'T', 'C')
transcript = get_transcript('NM_000352.3')
hgvs_name = hgvs.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript)
# Returns 'NM_000352.3(ABCC8):c.215A>G'
```
The `hgvs` library can also perform just the parsing step and provide
a parse tree of the HGVS name.
```python
import pyhgvs2 as hgvs
hgvs_name = hgvs.HGVSName('NM_000352.3:c.215-10A>G')
# fields of the HGVS name are available as attributes:
#
# hgvs_name.transcript = 'NM_000352.3'
# hgvs_name.kind = 'c'
# hgvs_name.mutation_type = '>'
# hgvs_name.cdna_start = hgvs.CDNACoord(215, -10)
# hgvs_name.cdna_end = hgvs.CDNACoord(215, -10)
# hgvs_name.ref_allele = 'A'
# hgvs_name.alt_allele = 'G'
```
## Install
This library can be installed using `uv` as follows:
```sh
uv sync
```
## Tests
Run the test cases with:
```sh
pytest
```
## Requirements
This library requires at least Python 3.9, but otherwise has no
external dependencies.

The library does assume that the genome sequence is available through a
`pyfaidx`-compatible `Fasta` object. For an example of writing a wrapper for
a different genome sequence back-end, see
[pyhgvs2.tests.genome.MockGenome](https://github.com/g-b-f/pyhgvs2/blob/master/pyhgvs2/tests/genome.py).
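
As a rough idea of what such a wrapper might look like, the sketch below serves sequences from an in-memory dict. It assumes the library only needs the subset of the `pyfaidx` interface in which `genome[chrom][start:end]` returns something whose `str()` is the sequence, with 0-based, end-exclusive slicing. That assumption should be checked against the linked `MockGenome`. The class names `InMemoryGenome` and `InMemoryChromosome` are hypothetical.

```python
class InMemoryChromosome:
    """One chromosome's sequence, sliceable like a pyfaidx record."""

    def __init__(self, sequence):
        self._sequence = sequence

    def __getitem__(self, key):
        # Plain strings already support str(), so a str slice is returned.
        return self._sequence[key]

    def __len__(self):
        return len(self._sequence)


class InMemoryGenome:
    """Dict-backed stand-in for a pyfaidx Fasta object (assumed interface)."""

    def __init__(self, chromosomes):
        self._chromosomes = {
            name: InMemoryChromosome(sequence)
            for name, sequence in chromosomes.items()
        }

    def __getitem__(self, chrom):
        return self._chromosomes[chrom]


genome = InMemoryGenome({"chr1": "ACGTACGTAC"})
print(str(genome["chr1"][2:6]))
```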