| Field | Value |
| --- | --- |
| Name | taxopy |
| Version | 0.13.0 |
| home_page | None |
| Summary | A Python package for obtaining complete lineages and the lowest common ancestor (LCA) from a set of taxonomic identifiers. |
| upload_time | 2024-08-03 02:06:55 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.5 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# taxopy
[DOI: 10.5281/zenodo.6993581](https://doi.org/10.5281/zenodo.6993581)
A Python package for manipulating NCBI-formatted taxonomic databases. It lets you obtain complete lineages, determine lowest common ancestors (LCAs), get taxon names from their taxids, and more!
## Installation
There are two ways to install taxopy:
- Using pip:
```
pip install taxopy
```
- Using conda:
```
conda install -c conda-forge -c bioconda taxopy
```
> [!NOTE]
> To enable fuzzy matching of taxonomic names, you need to install the `fuzzy-matching` extra. This can be done by adding `[fuzzy-matching]` to the installation command in pip:
> ```
> pip install taxopy[fuzzy-matching]
> ```
> Alternatively, you can install the `rapidfuzz` library:
> ```
> # Using pip
> pip install rapidfuzz
> # Using conda
> conda install -c conda-forge rapidfuzz
> ```
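
Because fuzzy matching relies on an optional dependency, a script can check whether `rapidfuzz` is importable before asking for fuzzy matches. This is a minimal sketch that uses only the standard library; the `HAS_RAPIDFUZZ` name is illustrative and is not part of taxopy:

```python
import importlib.util

# True when the optional `rapidfuzz` package can be imported in this environment.
# `HAS_RAPIDFUZZ` is a hypothetical name chosen for this example.
HAS_RAPIDFUZZ = importlib.util.find_spec("rapidfuzz") is not None

if not HAS_RAPIDFUZZ:
    print("rapidfuzz is not installed; fuzzy name matching will be unavailable.")
```
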
## Usage
```python
import taxopy
```
First, you need to download taxonomic information from NCBI's servers and load it into a `TaxDb` object:
```python
taxdb = taxopy.TaxDb()
# You can also use your own set of taxonomy files:
taxdb = taxopy.TaxDb(nodes_dmp="taxdb/nodes.dmp", names_dmp="taxdb/names.dmp")
# To support legacy taxonomic identifiers (those that were merged into other identifiers), also provide a `merged.dmp` file.
# This is not necessary if the data is being downloaded from NCBI.
taxdb = taxopy.TaxDb(nodes_dmp="taxdb/nodes.dmp", names_dmp="taxdb/names.dmp", merged_dmp="taxdb/merged.dmp")
```
The `TaxDb` object stores the name, rank and parent-child relationships of each taxonomic identifier:
```python
print(taxdb.taxid2name[2])
print(taxdb.taxid2parent[2])
print(taxdb.taxid2rank[2])
```
    Bacteria
    131567
    superkingdom
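
To make the three mappings concrete, the sketch below builds equivalent dictionaries from tiny in-memory excerpts laid out like NCBI's `nodes.dmp` and `names.dmp` (fields separated by `\t|\t`, rows ending in `\t|`). It is an illustration only, not taxopy's parser: the `parse_dmp` helper and the toy rows are assumptions, and keeping only the "scientific name" entries is a simplification made here.

```python
import io

def parse_dmp(handle):
    """Yield the list of fields for each row of an NCBI-style .dmp file."""
    for line in handle:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the row terminator
        if line:
            yield line.split("\t|\t")

# Toy excerpts (hypothetical, heavily truncated; real files have more columns and rows).
nodes_dmp = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "131567\t|\t1\t|\tno rank\t|\n"
    "2\t|\t131567\t|\tsuperkingdom\t|\n"
)
names_dmp = io.StringIO(
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n"
    "2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n"
)

taxid2parent, taxid2rank = {}, {}
for taxid, parent, rank, *_ in parse_dmp(nodes_dmp):
    taxid2parent[int(taxid)] = int(parent)
    taxid2rank[int(taxid)] = rank

taxid2name = {}
for taxid, name, _unique_name, name_class in parse_dmp(names_dmp):
    if name_class == "scientific name":  # skip synonyms and common names
        taxid2name[int(taxid)] = name

print(taxid2name[2], taxid2parent[2], taxid2rank[2])
```
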
To retrieve the current taxonomic identifier that replaced a legacy identifier, use the `oldtaxid2newtaxid` attribute:
```python
print(taxdb.oldtaxid2newtaxid[260])
```
    143224
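
A common pattern is to normalise identifiers before looking them up, so that legacy taxids do not raise a `KeyError`. The sketch below uses plain dictionaries standing in for the `TaxDb` attributes; `resolve_taxid` is a hypothetical helper, not a taxopy function, and the taxon name in the toy data is a placeholder:

```python
def resolve_taxid(taxid, taxid2name, oldtaxid2newtaxid):
    """Return a current taxid, following a merge record when `taxid` is a legacy one."""
    if taxid in taxid2name:
        return taxid
    if taxid in oldtaxid2newtaxid:
        return oldtaxid2newtaxid[taxid]
    raise KeyError(f"taxid {taxid} is neither current nor merged")

# Toy stand-ins for `taxdb.taxid2name` and `taxdb.oldtaxid2newtaxid`.
taxid2name = {2: "Bacteria", 143224: "(placeholder name)"}
oldtaxid2newtaxid = {260: 143224}

print(resolve_taxid(260, taxid2name, oldtaxid2newtaxid))
print(resolve_taxid(2, taxid2name, oldtaxid2newtaxid))
```
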
To get information about a given taxon, you can create a `Taxon` object using its taxonomic identifier:
```python
saccharomyces = taxopy.Taxon(4930, taxdb)
human = taxopy.Taxon(9606, taxdb)
gorilla = taxopy.Taxon(9593, taxdb)
lagomorpha = taxopy.Taxon(9975, taxdb)
```
Each `Taxon` object stores a variety of information, such as the rank, identifier and name of the input taxon, and the identifiers and names of all the parent taxa:
```python
print(lagomorpha.rank)
print(lagomorpha.name)
print(lagomorpha.name_lineage)
print(lagomorpha.ranked_name_lineage)
print(lagomorpha.rank_name_dictionary)
```
    order
    Lagomorpha
    ['Lagomorpha', 'Glires', 'Euarchontoglires', 'Boreoeutheria', 'Eutheria', 'Theria', 'Mammalia', 'Amniota', 'Tetrapoda', 'Dipnotetrapodomorpha', 'Sarcopterygii', 'Euteleostomi', 'Teleostomi', 'Gnathostomata', 'Vertebrata', 'Craniata', 'Chordata', 'Deuterostomia', 'Bilateria', 'Eumetazoa', 'Metazoa', 'Opisthokonta', 'Eukaryota', 'cellular organisms', 'root']
    [('order', 'Lagomorpha'), ('clade', 'Glires'), ('superorder', 'Euarchontoglires'), ('clade', 'Boreoeutheria'), ('clade', 'Eutheria'), ('clade', 'Theria'), ('class', 'Mammalia'), ('clade', 'Amniota'), ('clade', 'Tetrapoda'), ('clade', 'Dipnotetrapodomorpha'), ('superclass', 'Sarcopterygii'), ('clade', 'Euteleostomi'), ('clade', 'Teleostomi'), ('clade', 'Gnathostomata'), ('clade', 'Vertebrata'), ('subphylum', 'Craniata'), ('phylum', 'Chordata'), ('clade', 'Deuterostomia'), ('clade', 'Bilateria'), ('clade', 'Eumetazoa'), ('kingdom', 'Metazoa'), ('clade', 'Opisthokonta'), ('superkingdom', 'Eukaryota'), ('no rank', 'cellular organisms'), ('no rank', 'root')]
    {'order': 'Lagomorpha', 'clade': 'Opisthokonta', 'superorder': 'Euarchontoglires', 'class': 'Mammalia', 'superclass': 'Sarcopterygii', 'subphylum': 'Craniata', 'phylum': 'Chordata', 'kingdom': 'Metazoa', 'superkingdom': 'Eukaryota'}
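
Note that the dictionary holds a single `clade` entry even though the lineage contains many clades. The plain-Python sketch below reproduces the dictionary shown above from the ranked lineage shown above; it illustrates why repeated ranks collapse to the one closest to the root, and it is not necessarily how taxopy builds the attribute internally:

```python
# The ranked lineage printed above, ordered from the taxon itself to the root.
ranked_name_lineage = [
    ("order", "Lagomorpha"), ("clade", "Glires"), ("superorder", "Euarchontoglires"),
    ("clade", "Boreoeutheria"), ("clade", "Eutheria"), ("clade", "Theria"),
    ("class", "Mammalia"), ("clade", "Amniota"), ("clade", "Tetrapoda"),
    ("clade", "Dipnotetrapodomorpha"), ("superclass", "Sarcopterygii"),
    ("clade", "Euteleostomi"), ("clade", "Teleostomi"), ("clade", "Gnathostomata"),
    ("clade", "Vertebrata"), ("subphylum", "Craniata"), ("phylum", "Chordata"),
    ("clade", "Deuterostomia"), ("clade", "Bilateria"), ("clade", "Eumetazoa"),
    ("kingdom", "Metazoa"), ("clade", "Opisthokonta"), ("superkingdom", "Eukaryota"),
    ("no rank", "cellular organisms"), ("no rank", "root"),
]

# A dict keeps one value per key, so later 'clade' entries overwrite earlier ones.
rank_name = {rank: name for rank, name in ranked_name_lineage if rank != "no rank"}
print(rank_name["clade"])

# To keep every clade, collect them from the ranked lineage instead.
clades = [name for rank, name in ranked_name_lineage if rank == "clade"]
print(clades[0], clades[-1])
```

If you need a specific clade, read it from `ranked_name_lineage` rather than from the dictionary.
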
You can use the `parent` method to get a `Taxon` object for the parent node of a given taxon:
```python
lagomorpha_parent = lagomorpha.parent(taxdb)
print(lagomorpha_parent.rank)
print(lagomorpha_parent.name)
```
    clade
    Glires
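
Following parent links repeatedly is what produces a full lineage. As a rough sketch of the idea, the toy map below is keyed by name for readability (the `TaxDb` mappings are keyed by taxid) and omits most intermediate taxa; the `lineage` helper is hypothetical:

```python
# Toy parent map; intermediate taxa between Euarchontoglires and root are omitted.
parent_of = {
    "Lagomorpha": "Glires",
    "Glires": "Euarchontoglires",
    "Euarchontoglires": "root",
    "root": "root",  # the root is recorded as its own parent in NCBI's nodes.dmp
}

def lineage(name, parent_of):
    """Walk parent links from `name` up to the root, returning the visited names."""
    path = [name]
    while parent_of[path[-1]] != path[-1]:
        path.append(parent_of[path[-1]])
    return path

print(lineage("Lagomorpha", parent_of))
```
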
### LCA and majority vote
You can get the lowest common ancestor of a list of taxa using the `find_lca` function:
```python
human_lagomorpha_lca = taxopy.find_lca([human, lagomorpha], taxdb)
print(human_lagomorpha_lca.name)
```
    Euarchontoglires
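
Conceptually, the LCA is the last taxon that all lineages share when they are read from the root downwards. The sketch below shows that idea on abbreviated, hand-written lineages (most intermediate taxa are left out); `lowest_common_ancestor` is an illustrative helper, not taxopy's implementation:

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages (each ordered root -> tip)."""
    lca = None
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break  # the lineages diverge at this level
        lca = level[0]
    return lca

# Abbreviated lineages, ordered from the root to the taxon itself.
human = ["root", "Eukaryota", "Opisthokonta", "Metazoa", "Euarchontoglires",
         "Primates", "Homininae", "Homo sapiens"]
lagomorpha = ["root", "Eukaryota", "Opisthokonta", "Metazoa", "Euarchontoglires",
              "Glires", "Lagomorpha"]

print(lowest_common_ancestor([human, lagomorpha]))
```

Note that the lineage attributes of a `Taxon` are ordered from the taxon to the root, so they would need to be reversed before being used this way.
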
You can also use the `find_majority_vote` function to find the most specific taxon that is shared by more than half of the lineages of a list of taxa:
```python
majority_vote = taxopy.find_majority_vote([human, gorilla, lagomorpha], taxdb)
print(majority_vote.name)
```
    Homininae
The `find_majority_vote` function allows you to control its stringency via the `fraction` parameter. For instance, if you set `fraction` to 0.75, the resulting taxon will be shared by more than 75% of the input lineages. By default, `fraction` is 0.5.
```python
majority_vote = taxopy.find_majority_vote([human, gorilla, lagomorpha], taxdb, fraction=0.75)
print(majority_vote.name)
```
    Euarchontoglires
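
The arithmetic behind these two results can be checked by hand: Homininae is present in two of the three lineages (about 67%), which clears 0.5 but not 0.75, whereas Euarchontoglires is present in all three. The sketch below counts this on abbreviated, hand-written lineages; it only illustrates the threshold and does not call taxopy:

```python
from collections import Counter

# Abbreviated lineages (most intermediate taxa omitted).
lineages = {
    "human": ["root", "Eukaryota", "Opisthokonta", "Metazoa", "Euarchontoglires",
              "Primates", "Homininae", "Homo sapiens"],
    "gorilla": ["root", "Eukaryota", "Opisthokonta", "Metazoa", "Euarchontoglires",
                "Primates", "Homininae", "Gorilla gorilla"],
    "lagomorpha": ["root", "Eukaryota", "Opisthokonta", "Metazoa", "Euarchontoglires",
                   "Glires", "Lagomorpha"],
}

# Number of input lineages that contain each taxon.
support = Counter(taxon for lineage in lineages.values() for taxon in lineage)
n_lineages = len(lineages)

for taxon in ["Homininae", "Euarchontoglires"]:
    share = support[taxon] / n_lineages
    print(f"{taxon}: {share:.2f} (> 0.5: {share > 0.5}, > 0.75: {share > 0.75})")
```
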
You can also assign weights to each input lineage:
```python
majority_vote = taxopy.find_majority_vote([saccharomyces, human, gorilla, lagomorpha], taxdb)
weighted_majority_vote = taxopy.find_majority_vote([saccharomyces, human, gorilla, lagomorpha], taxdb, weights=[3, 1, 1, 1])
print(majority_vote.name)
print(weighted_majority_vote.name)
```
    Euarchontoglires
    Opisthokonta
To see how well the taxa aggregated by `find_majority_vote` agree with the output taxon, inspect its `agreement` attribute:
```python
print(majority_vote.agreement)
print(weighted_majority_vote.agreement)
```
    0.75
    1.0
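
These numbers follow from the weights. With equal weights, three of four lineages share Euarchontoglires (0.75). With weights `[3, 1, 1, 1]`, the fungal lineage holds exactly half of the total weight, so neither side of the fungi/animal split exceeds 0.5 and the vote falls back to Opisthokonta, which every lineage shares (1.0). The sketch below is a simplified re-implementation on abbreviated lineages, written under the assumption that `fraction` is at least 0.5; the `majority_vote` helper is hypothetical and is not taxopy's code:

```python
def majority_vote(lineages, weights=None, fraction=0.5):
    """Return (taxon, agreement) for the deepest taxon whose weighted support
    exceeds `fraction` of the total weight. Lineages are ordered root -> tip."""
    weights = weights or [1] * len(lineages)
    total = sum(weights)
    support, depth = {}, {}
    for lineage, weight in zip(lineages, weights):
        for level, taxon in enumerate(lineage):
            support[taxon] = support.get(taxon, 0) + weight
            depth[taxon] = level
    supported = [taxon for taxon in support if support[taxon] > fraction * total]
    best = max(supported, key=depth.get)  # most specific supported taxon
    return best, support[best] / total

# Abbreviated lineages (most intermediate taxa omitted).
saccharomyces = ["root", "Eukaryota", "Opisthokonta", "Fungi", "Saccharomyces"]
human = ["root", "Eukaryota", "Opisthokonta", "Metazoa", "Euarchontoglires",
         "Primates", "Homininae", "Homo sapiens"]
gorilla = ["root", "Eukaryota", "Opisthokonta", "Metazoa", "Euarchontoglires",
           "Primates", "Homininae", "Gorilla gorilla"]
lagomorpha = ["root", "Eukaryota", "Opisthokonta", "Metazoa", "Euarchontoglires",
              "Glires", "Lagomorpha"]
taxa = [saccharomyces, human, gorilla, lagomorpha]

print(majority_vote(taxa))
print(majority_vote(taxa, weights=[3, 1, 1, 1]))
```
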
### Taxid from name
If you only have the name of a taxon, you can get its corresponding taxid using the `taxid_from_name` function:
```python
taxid = taxopy.taxid_from_name('Homininae', taxdb)
print(taxid)
```
    [207598]
This function returns a list of all taxonomic identifiers associated with the input name. In the case of homonyms, the list will contain multiple taxonomic identifiers:
```python
taxid = taxopy.taxid_from_name('Aotus', taxdb)
print(taxid)
```
    [9504, 114498]
If a list of names is provided as input, the function returns a list of lists:
```python
taxid = taxopy.taxid_from_name(['Homininae', 'Aotus'], taxdb)
print(taxid)
```
    [[207598], [9504, 114498]]
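
The return shape makes sense once the lookup is seen as an inverted index from names to taxids, where homonyms map to several identifiers. A minimal sketch with a toy `taxid2name` dictionary follows; `taxids_from_names` is a hypothetical helper that mimics the behaviour described above, not the library function:

```python
from collections import defaultdict

# Toy stand-in for `taxdb.taxid2name`, using the identifiers shown above.
taxid2name = {207598: "Homininae", 9504: "Aotus", 114498: "Aotus"}

# Invert the mapping: one name can point to several taxids (homonyms).
name2taxids = defaultdict(list)
for taxid, name in taxid2name.items():
    name2taxids[name].append(taxid)

def taxids_from_names(names):
    """Return a list of taxids for one name, or a list of lists for several names."""
    if isinstance(names, str):
        return list(name2taxids.get(names, []))
    return [list(name2taxids.get(name, [])) for name in names]

print(taxids_from_names("Aotus"))
print(taxids_from_names(["Homininae", "Aotus"]))
```
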
In certain situations, it is useful to allow slight variations between the input name and the matches in the database. This can be accomplished by setting the `fuzzy` parameter of `taxid_from_name` to `True`. For instance, in GTDB, some taxa have suffixes (such as `_A`). The `fuzzy` parameter lets you find the taxonomic identifiers of all taxa with a name similar to the input, so you don't need to know the exact name beforehand.
```python
# The `taxdump_url` parameter of the `TaxDb` class can be used to retrieve a custom taxdump from a URL.
# In this case, we will use a GTDB taxdump provided by Wei Shen (https://github.com/shenwei356/gtdb-taxdump)
gtdb_taxdb = taxopy.TaxDb(taxdump_url="https://github.com/shenwei356/gtdb-taxdump/releases/download/v0.5.0/gtdb-taxdump-R220.tar.gz")
for t in taxopy.taxid_from_name("Myxococcota", gtdb_taxdb, fuzzy=True):
    print(taxopy.Taxon(t, gtdb_taxdb).name)
```
    Myxococcota_A
    Myxococcota
You can control the minimum similarity between the input name and the matches in the database by setting the `score_cutoff` parameter (default is 0.9).
```python
for t in taxopy.taxid_from_name(
    "Myxococcota", gtdb_taxdb, fuzzy=True, score_cutoff=0.7
):
    print(taxopy.Taxon(t, gtdb_taxdb).name)
```
    Myxococcales
    Myxococcota_A
    Myxococcus
    Myxococcia
    Myxococcota
    Myxococcaceae
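
To build intuition for how a similarity cutoff widens or narrows the set of matches, the sketch below scores a few names with the standard library's `difflib.SequenceMatcher`. This is a stand-in chosen so the example runs without extra dependencies: taxopy uses `rapidfuzz`, whose scores may differ, so the sketch is not expected to reproduce the outputs above exactly. The `fuzzy_matches` helper and the candidate list are assumptions made for illustration:

```python
from difflib import SequenceMatcher

def fuzzy_matches(query, names, score_cutoff=0.9):
    """Return the names whose similarity ratio to `query` is at least `score_cutoff`."""
    return [
        name
        for name in names
        if SequenceMatcher(None, query, name).ratio() >= score_cutoff
    ]

# Hypothetical candidate names; a real database would supply these.
candidates = ["Myxococcota", "Myxococcota_A", "Myxococcales", "Myxococcus", "Escherichia"]

strict = fuzzy_matches("Myxococcota", candidates)
loose = fuzzy_matches("Myxococcota", candidates, score_cutoff=0.7)
print(strict)
print(loose)
```

Lowering the cutoff admits names that share only the `Myxococc` stem, which mirrors the broader result list shown above.
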
## Acknowledgements
Some of the code used in taxopy was taken from the [CAT/BAT tool for taxonomic classification of contigs and metagenome-assembled genomes](https://github.com/dutilh/CAT).
Raw data
{
"_id": null,
"home_page": null,
"name": "taxopy",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.5",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": "Antonio Camargo <antoniop.camargo@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/f9/12/01d5c7b29d2ba6fd2bcb6cfcb649cc677fd3e586cad3662f212f50a58910/taxopy-0.13.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A Python package for obtaining complete lineages and the lowest common ancestor (LCA) from a set of taxonomic identifiers.",
"version": "0.13.0",
"project_urls": {
"Home": "https://github.com/apcamargo/taxopy"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "d3f53618edd3eb2117c04f1865da12dca4d190436f42938779f4d79e362267c3",
"md5": "8df2cea8c21729a3ec645bafb64a874e",
"sha256": "296f2dda519b2c85cbbb893c2a9240e44d6c669f5e30c3f98a8690ac7e41db3a"
},
"downloads": -1,
"filename": "taxopy-0.13.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "8df2cea8c21729a3ec645bafb64a874e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.5",
"size": 24432,
"upload_time": "2024-08-03T02:06:53",
"upload_time_iso_8601": "2024-08-03T02:06:53.853840Z",
"url": "https://files.pythonhosted.org/packages/d3/f5/3618edd3eb2117c04f1865da12dca4d190436f42938779f4d79e362267c3/taxopy-0.13.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f91201d5c7b29d2ba6fd2bcb6cfcb649cc677fd3e586cad3662f212f50a58910",
"md5": "5be7787d661f1f74bf13ece7832d2151",
"sha256": "0f56b8470864bc8c44cda1edb00dd210a0105e5aec25dfb9f6bb725b0f753f5c"
},
"downloads": -1,
"filename": "taxopy-0.13.0.tar.gz",
"has_sig": false,
"md5_digest": "5be7787d661f1f74bf13ece7832d2151",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5",
"size": 25008,
"upload_time": "2024-08-03T02:06:55",
"upload_time_iso_8601": "2024-08-03T02:06:55.413118Z",
"url": "https://files.pythonhosted.org/packages/f9/12/01d5c7b29d2ba6fd2bcb6cfcb649cc677fd3e586cad3662f212f50a58910/taxopy-0.13.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-03 02:06:55",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "apcamargo",
"github_project": "taxopy",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "taxopy"
}