mhcgnomes


* Name: mhcgnomes
* Version: 1.8.6
* Home page: https://github.com/til-unc/mhcgnomes
* Summary: Python library for parsing MHC nomenclature in the wild
* Upload time: 2023-06-01 20:17:26
* Author: Alex Rubinsteyn
* License: http://www.apache.org/licenses/LICENSE-2.0.html
* Requirements: No requirements were recorded.

<a href="https://app.travis-ci.com/github/pirl-unc/mhcgnomes">
    <img src="https://travis-ci.com/pirl-unc/mhcgnomes.svg?branch=main" alt="Build Status" />
</a> 
<a href="https://coveralls.io/github/til-unc/mhcgnomes">
    <img src="https://coveralls.io/repos/github/til-unc/mhcgnomes/badge.svg?branch=main" alt="Coverage Status">
</a>
<a href="https://pypi.python.org/pypi/mhcgnomes/">
    <img src="https://img.shields.io/pypi/v/mhcgnomes.svg?maxAge=1000" alt="PyPI" />
</a>


![](https://raw.githubusercontent.com/til-unc/mhcgnomes/main/gnome-red-text.png) 

# mhcgnomes: Parsing MHC nomenclature in the wild

MHCgnomes is a parsing library for multi-species MHC nomenclature which
aims to correctly parse every name in [IEDB](http://www.iedb.org/), [IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/), [IPD/MHC](https://www.ebi.ac.uk/ipd/mhc/), and the allele lists for both [NetMHCpan](https://services.healthtech.dtu.dk/service.php?NetMHCpan-4.1) and [NetMHCIIpan](https://services.healthtech.dtu.dk/service.php?NetMHCIIpan-4.0) predictors. This allows for standardization between immune databases and tools, which often use different naming conventions.

## Usage example

```python

In [1]: import mhcgnomes

In [2]: mhcgnomes.parse("HLA-A0201")
Out[2]: Allele(
    gene=Gene(
        species=Species(name="Homo sapiens", prefix="HLA"),
        name="A"),
    allele_fields=("02", "01"),
    annotations=(),
    mutations=())

In [3]: mhcgnomes.parse("HLA-A0201").to_string()
Out[3]: 'HLA-A*02:01'

In [4]: mhcgnomes.parse("HLA-A0201").compact_string()
Out[4]: 'A0201'

```

## The problem: MHC nomenclature is nuts

Despite the valiant efforts of groups such as the [Comparative MHC Nomenclature Committee](https://www.ebi.ac.uk/ipd/mhc/committee/), the names of MHC alleles you might encounter in different datasets (or that immunoinformatics tools accept) are frustratingly ill-specified. It's not uncommon to see dozens of different forms for the same allele.

For example, these all refer to the same MHC protein sequence:

* "HLA-A\*02:01"
* "HLA-A02:01"
* "HLA-A:02:01"
* "HLA-A0201"


Additionally, for human alleles, the species prefix is often omitted:

* "A\*02:01"
* "A\*0201"
* "A02:01"
* "A:02:01"
* "A0201"
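
To make the equivalence concrete, here is a minimal sketch of collapsing the spellings above into one canonical form. It is a toy regular expression written for this illustration (the names `normalize_hla` and `_HLA_PATTERN` are hypothetical), not the parser MHCgnomes uses, and it assumes a human allele with exactly two two-digit fields.

```python
import re

# Toy pattern, NOT the mhcgnomes parser: an optional "HLA-" prefix, a gene
# name, an optional "*" or ":" separator, then two two-digit fields that may
# or may not be separated by ":".
_HLA_PATTERN = re.compile(r"^(?:HLA-)?([A-Z]+)[*:]?(\d{2}):?(\d{2})$")


def normalize_hla(name):
    """Return 'HLA-GENE*FF:FF', or None if the toy pattern does not match."""
    match = _HLA_PATTERN.match(name.strip().upper())
    if match is None:
        return None
    gene, first, second = match.groups()
    return f"HLA-{gene}*{first}:{second}"


spellings = [
    "HLA-A*02:01", "HLA-A02:01", "HLA-A:02:01", "HLA-A0201",
    "A*02:01", "A*0201", "A02:01", "A:02:01", "A0201",
]
normalized = {normalize_hla(s) for s in spellings}
print(normalized)  # {'HLA-A*02:01'}
```

A single pattern like this breaks down quickly once annotations, extra fields, or non-human species appear, which is the motivation for a dedicated parsing library.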


### Annotations

Sometimes, alleles are bundled with modifier suffixes which specify 
the functionality or abundance of the MHC. Here's an example with an allele
which is secreted instead of membrane-bound:

* "HLA-A\*02:01:01S"

These are collected in the `annotations` field of an 
[`Allele`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/allele.py)
result.
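
As a rough illustration of what separating an annotation from the rest of the name involves, the sketch below peels a trailing expression suffix off an allele string. The helper name `split_annotation` is hypothetical and this is not how MHCgnomes implements it; the suffix letters are the expression suffixes used in HLA nomenclature.

```python
import re

# Expression suffixes from HLA nomenclature and their usual meaning.
EXPRESSION_SUFFIXES = {
    "N": "null (not expressed)",
    "L": "low cell-surface expression",
    "S": "secreted",
    "C": "cytoplasmic",
    "A": "aberrant expression",
    "Q": "questionable expression",
}

# The name must end in a digit, optionally followed by one suffix letter.
_SUFFIX_PATTERN = re.compile(r"(.*\d)([NLSCAQ])?")


def split_annotation(allele_name):
    """Return (name_without_suffix, suffix_or_None)."""
    match = _SUFFIX_PATTERN.fullmatch(allele_name)
    if match is None:
        # Names that do not end in digits (plus optional suffix) are left as-is.
        return allele_name, None
    return match.group(1), match.group(2)


base, suffix = split_annotation("HLA-A*02:01:01S")
print(base, suffix, EXPRESSION_SUFFIXES[suffix])
```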

### Mutations

MHC proteins are sometimes described in terms of mutations to a known allele. 

* "HLA-B\*08:01 N80I mutant"

These mutations are collected in the `mutations` field of an 
[`Allele`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/allele.py) result.
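
A mutation such as "N80I" encodes the original amino acid, the position, and the replacement. The following sketch shows one way to pull that triple out of a description string. `PointMutation` and `extract_mutations` are names invented for this example and do not reflect the structure MHCgnomes stores in `mutations`.

```python
import re
from typing import NamedTuple


class PointMutation(NamedTuple):
    original: str      # amino acid in the reference allele
    position: int      # residue position
    replacement: str   # amino acid in the mutant


# One-letter codes of the 20 standard amino acids.
_AA = "ACDEFGHIKLMNPQRSTVWY"
_MUTATION_PATTERN = re.compile(rf"\b([{_AA}])(\d+)([{_AA}])\b")


def extract_mutations(description):
    """Find tokens shaped like 'N80I' in a free-text allele description."""
    return [
        PointMutation(original, int(position), replacement)
        for original, position, replacement in _MUTATION_PATTERN.findall(description)
    ]


print(extract_mutations("HLA-B*08:01 N80I mutant"))
```

This toy pattern only looks at token shape, so it would also match unrelated tokens that happen to look like a mutation.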

### Beyond humans

To make things worse, several model organisms (like mice and rats) use archaic
naming systems, where there is no notion of allele groups or four/six/eight
digit alleles but every allele is simply given a name, such as:

* "H2-Kk"
* "RT1-9.5f"


In the above examples, "H2"/"RT1" correspond to species, "K"/"9.5" are
the gene names, and "k"/"f" are the allele names.

To make things even worse, the name of a species is subject to variation (e.g. "H2" vs. "H-2") as well as drift over time (e.g. ChLA -> MhcPatr -> Patr).  
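
The decomposition into species, gene, and allele name can be sketched with a small lookup table. Everything below is a hypothetical miniature written for illustration: the `SPECIES_ALIASES` and `GENES` tables contain only a few entries, and the real curated ontology in MHCgnomes is far larger and structured differently.

```python
# Hypothetical miniature ontology: species prefix aliases and known gene names.
SPECIES_ALIASES = {"H2": "H2", "H-2": "H2", "RT1": "RT1"}
GENES = {"H2": ["K", "D"], "RT1": ["9.5"]}


def split_named_allele(name):
    """Split e.g. 'H2-Kk' into (species, gene, allele_name), or return None."""
    upper = name.upper()
    # Try longer aliases first so "H-2" is not shadowed by a shorter prefix.
    for alias in sorted(SPECIES_ALIASES, key=len, reverse=True):
        prefix = alias + "-"
        if not upper.startswith(prefix):
            continue
        species = SPECIES_ALIASES[alias]
        rest = name[len(prefix):]
        # Prefer the longest gene name that leaves a non-empty allele name.
        for gene in sorted(GENES[species], key=len, reverse=True):
            if rest.upper().startswith(gene.upper()) and len(rest) > len(gene):
                return species, gene, rest[len(gene):]
    return None


print(split_named_allele("H2-Kk"))     # ('H2', 'K', 'k')
print(split_named_allele("RT1-9.5f"))  # ('RT1', '9.5', 'f')
```

Matching against known gene names is what makes "9.5f" splittable at all: without the table there is no way to tell where the gene name ends and the allele name begins.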

### Serotypes, haplotypes, and other named entities

Besides alleles, there are other named MHC-related entities you'll encounter in immunological data. Closely related to alleles are serotypes, which effectively denote a grouping of alleles that are all recognized by the same antibody:

* "HLA-A2"
* "A2"

In many datasets the exact allele is not known but an experiment might note the genetic background of a model animal, resulting in loose haplotype restrictions such as: 

* "H2-k class I"

Yes, good luck disambiguating "H2-k" the haplotype from "H2-K" the gene, especially since capitalization is not stable enough to be relied on for parsing. 

In some cases immunological data comes only with a denoted species (e.g. "mouse"), a gene (e.g. "HLA-A"), or an MHC class ("human class I"). MHCgnomes has a structured representation for all of these cases and more. 

## Parsing strategy

It is a fool's errand to curate *all* possible MHC allele names since that list grows daily as the MHC loci of more people (and non-human animals) are sequenced. Instead, MHCgnomes contains an ontology of curated species and genes and then attempts to parse any given string into multiple candidates of the following types:

* [`Species`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/species.py)
* [`Gene`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/gene.py)
* [`Allele`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/allele.py)
* [`AlleleWithoutGene`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/allele_without_gene.py)
* [`Class2Pair`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/class2_pair.py)
* [`Class2Locus`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/class2_locus.py)
* [`MhcClass`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/mhc_class.py)
* [`Haplotype`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/haplotype.py)
* [`Serotype`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/serotype.py)


The set of candidate interpretations for each string is then 
ranked according to heuristic rules. For example, a string will be 
preferentially interpreted as an [`Allele`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/allele.py) rather 
than a [`Serotype`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/serotype.py)
or [`Haplotype`](https://github.com/til-unc/mhcgnomes/blob/main/mhcgnomes/haplotype.py). 
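
The idea of ranking candidate interpretations can be sketched as follows. The priority table and the candidate tuples are assumptions made up for this example; the only rule taken from the text above is that an allele interpretation wins over a serotype or haplotype interpretation. The actual heuristics in MHCgnomes are more involved.

```python
# Hypothetical priorities: lower number wins. Only "Allele before Serotype or
# Haplotype" comes from the description above; the rest is illustrative.
CANDIDATE_PRIORITY = {"Allele": 0, "Serotype": 1, "Haplotype": 1}


def best_candidate(candidates):
    """Pick the preferred (type_name, value) pair, or None if there are none.

    Ties keep the earliest candidate, because min() returns the first minimum.
    """
    if not candidates:
        return None
    return min(candidates, key=lambda candidate: CANDIDATE_PRIORITY[candidate[0]])


# Made-up candidate list for a single ambiguous input string.
candidates = [
    ("Haplotype", "interpretation-1"),
    ("Allele", "interpretation-2"),
    ("Serotype", "interpretation-3"),
]
print(best_candidate(candidates))  # ('Allele', 'interpretation-2')
```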


## How many digits per field?

Originally alleles for many genes were numbered with two digits:

* "HLA-MICB\*01"

But as the number of identified alleles increased, the number of
fields specifying a distinct protein increased to two. This became 
conventionally called a "four digit" format, since each field has two
digits. As the number of identified alleles continued to grow, 
the number of digits per field has often increased from two to three: 

* "MICB\*002:01"
* "HLA-A00201"
* "A:002:01"
* "A\*00201"

MHCgnomes does not yet always treat these as equivalent to allele strings with two digits in their first field, but that feature is in the works.

However, if databases such as [IPD-MHC](https://www.ebi.ac.uk/ipd/mhc/) or [IMGT-HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) recorded an older form of an allele, then MHCgnomes can optionally map it onto the modern version (including capturing differences in numbers of digits per field). 
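
The digit-count problem is easiest to see when the separators are missing. The sketch below splits the numeric part of a name into fields under a stated assumption: a compact run of four digits is two two-digit fields, and a run of five digits is a three-digit field followed by a two-digit field. The function `split_fields` is hypothetical and handles only these cases; longer runs are ambiguous and are rejected rather than guessed.

```python
def split_fields(field_text):
    """Split the numeric part of an allele name into a tuple of fields.

    Returns None when the input is not purely numeric fields or when a
    compact digit run cannot be split unambiguously under the assumptions
    stated above.
    """
    if ":" in field_text:
        # Explicit separators: trust them.
        fields = tuple(field_text.split(":"))
        return fields if all(f.isdigit() for f in fields) else None
    if not field_text.isdigit():
        return None
    if len(field_text) == 4:
        return field_text[:2], field_text[2:]
    if len(field_text) == 5:
        return field_text[:3], field_text[3:]
    return None


print(split_fields("0201"))    # ('02', '01')
print(split_fields("00201"))   # ('002', '01')
print(split_fields("002:01"))  # ('002', '01')
```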

## References

* [IPD-MHC: nomenclature requirements for the non-human major histocompatibility complex in the next-generation sequencing era](https://link.springer.com/article/10.1007%2Fs00251-018-1072-4)
* Comparative MHC nomenclature: report from the ISAG/IUIS-VIC committee 2018
* [ISAG/IUIS-VIC Comparative MHC Nomenclature Committee report, 2005](https://link.springer.com/content/pdf/10.1007%2Fs00251-005-0071-4.pdf)
* Marsupial MHC Class II β Genes Are Not Orthologous to the Eutherian β Gene Families
* [Nomenclature for factors of the SLA system, update 2008](https://www.ncbi.nlm.nih.gov/pubmed/19317739)

            
