taxopy


- Name: taxopy
- Version: 0.13.0
- Summary: A Python package for obtaining complete lineages and the lowest common ancestor (LCA) from a set of taxonomic identifiers.
- Upload time: 2024-08-03 02:06:55
- Requires Python: >=3.5

# taxopy

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6993581.svg)](https://doi.org/10.5281/zenodo.6993581)

A Python package for manipulating NCBI-formatted taxonomic databases. It lets you obtain complete lineages, determine lowest common ancestors (LCAs), look up taxon names from their taxids, and more.

## Installation

There are two ways to install taxopy:

  - Using pip:

```
pip install taxopy
```

  - Using conda:

```
conda install -c conda-forge -c bioconda taxopy
```

> [!NOTE]
> To enable fuzzy matching of taxonomic names, you need to install the `fuzzy-matching` extra. This can be done by adding `[fuzzy-matching]` to the installation command in pip:
> ```
> pip install taxopy[fuzzy-matching]
> ```
> Alternatively, you can install the `rapidfuzz` library:
> ```
> # Using pip
> pip install rapidfuzz
> # Using conda
> conda install -c conda-forge rapidfuzz
> ```
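
Before relying on fuzzy matching, it can help to confirm that the optional dependency is importable. The following is a minimal sketch using only the standard library; the helper name `has_module` is hypothetical and not part of taxopy:

```python
import importlib.util


def has_module(module_name: str) -> bool:
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# `rapidfuzz` is the library named in the note above.
fuzzy_available = has_module("rapidfuzz")
print(f"Fuzzy matching available: {fuzzy_available}")
```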

## Usage

```python
import taxopy
```

First, download the taxonomic information from NCBI's servers and load it into a `TaxDb` object:

```python
taxdb = taxopy.TaxDb()
# You can also use your own set of taxonomy files:
taxdb = taxopy.TaxDb(nodes_dmp="taxdb/nodes.dmp", names_dmp="taxdb/names.dmp")
# To support legacy taxonomic identifiers (those that were merged into other identifiers), also provide a `merged.dmp` file. This is not necessary when the data is downloaded from NCBI.
taxdb = taxopy.TaxDb(nodes_dmp="taxdb/nodes.dmp", names_dmp="taxdb/names.dmp", merged_dmp="taxdb/merged.dmp")
```
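
When working with local files, the optional `merged.dmp` can be passed only when it is present. The sketch below is one possible way to assemble the keyword arguments; the helper `taxdb_kwargs` is hypothetical, and only the parameter names `nodes_dmp`, `names_dmp` and `merged_dmp` come from the example above:

```python
from pathlib import Path


def taxdb_kwargs(directory):
    """Collect keyword arguments for `taxopy.TaxDb` from a taxdump directory."""
    directory = Path(directory)
    kwargs = {
        "nodes_dmp": str(directory / "nodes.dmp"),
        "names_dmp": str(directory / "names.dmp"),
    }
    # Fail early with a clear message if a required file is missing.
    for path in kwargs.values():
        if not Path(path).is_file():
            raise FileNotFoundError(f"Required taxonomy file not found: {path}")
    # `merged.dmp` is optional; it is only needed for legacy identifiers.
    merged = directory / "merged.dmp"
    if merged.is_file():
        kwargs["merged_dmp"] = str(merged)
    return kwargs
```

With the files in place, the result could be passed on as `taxopy.TaxDb(**taxdb_kwargs("taxdb"))`.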

The `TaxDb` object stores the name, rank and parent-child relationships of each taxonomic identifier:

```python
print(taxdb.taxid2name[2])
print(taxdb.taxid2parent[2])
print(taxdb.taxid2rank[2])
```

    Bacteria
    131567
    superkingdom
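
Because the parent of each identifier is stored, a lineage can be recovered by following parent links until the root is reached. The sketch below illustrates the idea on a toy dictionary standing in for `taxdb.taxid2parent`. It assumes the root (taxid 1) is recorded as its own parent, as in NCBI's `nodes.dmp`; the function name is hypothetical:

```python
def lineage_from_parents(taxid, taxid2parent):
    """Return the taxids from `taxid` up to the root by following parent links."""
    lineage = [taxid]
    # The walk stops at the node that is its own parent (assumed to be the root).
    while taxid2parent[taxid] != taxid:
        taxid = taxid2parent[taxid]
        lineage.append(taxid)
    return lineage


# Toy stand-in for `taxdb.taxid2parent`; the 2 -> 131567 link is the one printed above.
toy_taxid2parent = {2: 131567, 131567: 1, 1: 1}
print(lineage_from_parents(2, toy_taxid2parent))  # [2, 131567, 1]
```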

To retrieve the new taxonomic identifier of a legacy identifier, use the `oldtaxid2newtaxid` attribute:

```python
print(taxdb.oldtaxid2newtaxid[260])
```

    143224
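
In practice, most identifiers were never merged, so a lookup with a fallback is often convenient. This is a minimal sketch on a toy dictionary; it assumes `oldtaxid2newtaxid` behaves like a regular mapping, and `resolve_taxid` is a hypothetical helper:

```python
def resolve_taxid(taxid, old2new):
    """Return the current identifier for `taxid`, or `taxid` itself if it was never merged."""
    return old2new.get(taxid, taxid)


# Toy stand-in for `taxdb.oldtaxid2newtaxid`, using the pair shown above.
toy_old2new = {260: 143224}
print(resolve_taxid(260, toy_old2new))   # 143224
print(resolve_taxid(9606, toy_old2new))  # 9606
```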

To get information about a given taxon, you can create a `Taxon` object from its taxonomic identifier:

```python
saccharomyces = taxopy.Taxon(4930, taxdb)
human = taxopy.Taxon(9606, taxdb)
gorilla = taxopy.Taxon(9593, taxdb)
lagomorpha = taxopy.Taxon(9975, taxdb)
```

Each `Taxon` object stores a variety of information, such as the rank, identifier and name of the input taxon, and the identifiers and names of all the parent taxa:

```python
print(lagomorpha.rank)
print(lagomorpha.name)
print(lagomorpha.name_lineage)
print(lagomorpha.ranked_name_lineage)
print(lagomorpha.rank_name_dictionary)
```

    order
    Lagomorpha
    ['Lagomorpha', 'Glires', 'Euarchontoglires', 'Boreoeutheria', 'Eutheria', 'Theria', 'Mammalia', 'Amniota', 'Tetrapoda', 'Dipnotetrapodomorpha', 'Sarcopterygii', 'Euteleostomi', 'Teleostomi', 'Gnathostomata', 'Vertebrata', 'Craniata', 'Chordata', 'Deuterostomia', 'Bilateria', 'Eumetazoa', 'Metazoa', 'Opisthokonta', 'Eukaryota', 'cellular organisms', 'root']
    [('order', 'Lagomorpha'), ('clade', 'Glires'), ('superorder', 'Euarchontoglires'), ('clade', 'Boreoeutheria'), ('clade', 'Eutheria'), ('clade', 'Theria'), ('class', 'Mammalia'), ('clade', 'Amniota'), ('clade', 'Tetrapoda'), ('clade', 'Dipnotetrapodomorpha'), ('superclass', 'Sarcopterygii'), ('clade', 'Euteleostomi'), ('clade', 'Teleostomi'), ('clade', 'Gnathostomata'), ('clade', 'Vertebrata'), ('subphylum', 'Craniata'), ('phylum', 'Chordata'), ('clade', 'Deuterostomia'), ('clade', 'Bilateria'), ('clade', 'Eumetazoa'), ('kingdom', 'Metazoa'), ('clade', 'Opisthokonta'), ('superkingdom', 'Eukaryota'), ('no rank', 'cellular organisms'), ('no rank', 'root')]
    {'order': 'Lagomorpha', 'clade': 'Opisthokonta', 'superorder': 'Euarchontoglires', 'class': 'Mammalia', 'superclass': 'Sarcopterygii', 'subphylum': 'Craniata', 'phylum': 'Chordata', 'kingdom': 'Metazoa', 'superkingdom': 'Eukaryota'}
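
Note that the dictionary above holds a single `clade` entry even though the lineage contains many clades. That is what one would expect from a plain dictionary keyed by rank, where later pairs overwrite earlier ones; whether taxopy builds it exactly this way is an assumption. The sketch below reproduces the effect on the first few pairs of the ranked lineage and shows one way to keep every name:

```python
from collections import defaultdict

# First entries of the ranked lineage printed above, ordered from taxon to root.
ranked_lineage = [
    ("order", "Lagomorpha"),
    ("clade", "Glires"),
    ("superorder", "Euarchontoglires"),
    ("clade", "Boreoeutheria"),
]

# A plain dict keeps one name per rank: later pairs overwrite earlier ones.
one_per_rank = dict(ranked_lineage)
print(one_per_rank["clade"])  # Boreoeutheria

# Grouping into lists keeps every name for a repeated rank.
names_by_rank = defaultdict(list)
for rank, name in ranked_lineage:
    names_by_rank[rank].append(name)
print(names_by_rank["clade"])  # ['Glires', 'Boreoeutheria']
```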

You can use the `parent` method to get a `Taxon` object for the parent node of a given taxon:

```python
lagomorpha_parent = lagomorpha.parent(taxdb)
print(lagomorpha_parent.rank)
print(lagomorpha_parent.name)
```

    clade
    Glires

### LCA and majority vote

You can get the lowest common ancestor of a list of taxa using the `find_lca` function:

```python
human_lagomorpha_lca = taxopy.find_lca([human, lagomorpha], taxdb)
print(human_lagomorpha_lca.name)
```

    Euarchontoglires
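
Conceptually, the lowest common ancestor is the most specific node present in every lineage. The following sketch illustrates that idea on plain lists of names; it is not taxopy's implementation, the function name is hypothetical, and the lineages are abbreviated for illustration:

```python
def lca_of_lineages(lineages):
    """Return the most specific name shared by all lineages (each ordered taxon -> root)."""
    first, *rest = lineages
    shared = set(first).intersection(*rest)
    # Scanning from the taxon towards the root, the first shared name is the LCA.
    for name in first:
        if name in shared:
            return name
    return None


# Abbreviated, illustrative lineages (not the full NCBI ones).
toy_lagomorpha = ["Lagomorpha", "Glires", "Euarchontoglires", "Mammalia", "root"]
toy_human = ["Homo sapiens", "Homininae", "Primates", "Euarchontoglires", "Mammalia", "root"]
print(lca_of_lineages([toy_lagomorpha, toy_human]))  # Euarchontoglires
```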

You can also use the `find_majority_vote` function to find the most specific taxon that is shared by more than half of the lineages in a list of taxa:

```python
majority_vote = taxopy.find_majority_vote([human, gorilla, lagomorpha], taxdb)
print(majority_vote.name)
```

    Homininae

The `find_majority_vote` function lets you control its stringency via the `fraction` parameter. For instance, if you set `fraction` to 0.75, the resulting taxon will be shared by more than 75% of the input lineages. By default, `fraction` is 0.5.

```python
majority_vote = taxopy.find_majority_vote([human, gorilla, lagomorpha], taxdb, fraction=0.75)
print(majority_vote.name)
```

    Euarchontoglires

You can also assign weights to each input lineage:

```python
majority_vote = taxopy.find_majority_vote([saccharomyces, human, gorilla, lagomorpha], taxdb)
weighted_majority_vote = taxopy.find_majority_vote([saccharomyces, human, gorilla, lagomorpha], taxdb, weights=[3, 1, 1, 1])
print(majority_vote.name)
print(weighted_majority_vote.name)
```

    Euarchontoglires
    Opisthokonta

To check the level of agreement between the output taxon and the taxa that were aggregated by `find_majority_vote`, use the `agreement` attribute:

```python
print(majority_vote.agreement)
print(weighted_majority_vote.agreement)
```

    0.75
    1.0
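
The interplay of `fraction`, `weights` and `agreement` can be illustrated with a small sketch on plain lists of names. This is not taxopy's implementation: the function name and the toy lineages are hypothetical, and agreement is assumed to be the weighted share of input lineages that contain the selected taxon, which is consistent with the outputs above. The sketch is intended for `fraction` values of at least 0.5, where the qualifying taxa lie on a single path.

```python
def weighted_vote_sketch(lineages, weights=None, fraction=0.5):
    """Return (name, agreement) for the most specific name whose weighted support exceeds `fraction`."""
    if weights is None:
        weights = [1] * len(lineages)
    total = sum(weights)
    # Weighted support of every name that appears in any lineage.
    support = {}
    for lineage, weight in zip(lineages, weights):
        for name in set(lineage):
            support[name] = support.get(name, 0) + weight
    # Among names above the threshold, keep the one farthest from the root.
    best = None
    for lineage in lineages:
        for depth_from_root, name in enumerate(reversed(lineage)):
            if support[name] / total > fraction:
                if best is None or depth_from_root > best[0]:
                    best = (depth_from_root, name)
    if best is None:
        return None, 0.0
    return best[1], support[best[1]] / total


# Abbreviated, illustrative lineages ordered from taxon to root.
toy_yeast = ["Saccharomyces", "Opisthokonta", "root"]
toy_human = ["Homo sapiens", "Homininae", "Euarchontoglires", "Opisthokonta", "root"]
toy_gorilla = ["Gorilla gorilla", "Homininae", "Euarchontoglires", "Opisthokonta", "root"]
toy_lagomorpha = ["Lagomorpha", "Euarchontoglires", "Opisthokonta", "root"]
toy_lineages = [toy_yeast, toy_human, toy_gorilla, toy_lagomorpha]

print(weighted_vote_sketch(toy_lineages))                        # ('Euarchontoglires', 0.75)
print(weighted_vote_sketch(toy_lineages, weights=[3, 1, 1, 1]))  # ('Opisthokonta', 1.0)
```

With the weight of 3, the yeast lineage and the three mammal lineages each hold exactly half of the total weight, so neither side exceeds the default `fraction` and the vote falls back to the taxon shared by all four.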

### Taxid from name

If you only have the name of a taxon, you can get its corresponding taxid using the `taxid_from_name` function:

```python
taxid = taxopy.taxid_from_name('Homininae', taxdb)
print(taxid)
```

    [207598]

This function returns a list of all taxonomic identifiers associated with the input name. In the case of homonyms, the list will contain multiple taxonomic identifiers:

```python
taxid = taxopy.taxid_from_name('Aotus', taxdb)
print(taxid)
```

    [9504, 114498]

If a list of names is provided as input, the function returns a list of lists:

```python
taxid = taxopy.taxid_from_name(['Homininae', 'Aotus'], taxdb)
print(taxid)
```

    [[207598], [9504, 114498]]
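
Since homonyms yield several identifiers, downstream code usually needs to tell unambiguous names apart from ambiguous ones. The sketch below works on the result shown above; `split_by_ambiguity` is a hypothetical helper, and it assumes the inner lists follow the order of the input names:

```python
def split_by_ambiguity(names, taxid_lists):
    """Separate names with exactly one taxid from names with none or several."""
    unique, ambiguous = {}, {}
    for name, taxids in zip(names, taxid_lists):
        if len(taxids) == 1:
            unique[name] = taxids[0]
        else:
            ambiguous[name] = taxids
    return unique, ambiguous


names = ["Homininae", "Aotus"]
# Result copied from the output above.
taxid_lists = [[207598], [9504, 114498]]
unique, ambiguous = split_by_ambiguity(names, taxid_lists)
print(unique)     # {'Homininae': 207598}
print(ambiguous)  # {'Aotus': [9504, 114498]}
```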

In certain situations, it is useful to allow slight variations between the input name and the matches in the database. This can be accomplished by setting the `fuzzy` parameter of `taxid_from_name` to `True`. For instance, in GTDB, some taxa have suffixes. The `fuzzy` parameter enables you to locate the taxonomic identifiers of all taxa with a name similar to the input, so you don't need to know the exact name beforehand.

```python
# The `taxdump_url` parameter of the `TaxDb` class can be used to retrieve a custom taxdump from a URL. Here we use a GTDB taxdump provided by Wei Shen (https://github.com/shenwei356/gtdb-taxdump).
gtdb_taxdb = taxopy.TaxDb(taxdump_url="https://github.com/shenwei356/gtdb-taxdump/releases/download/v0.5.0/gtdb-taxdump-R220.tar.gz")
for t in taxopy.taxid_from_name("Myxococcota", gtdb_taxdb, fuzzy=True):
    print(taxopy.Taxon(t, gtdb_taxdb).name)
```

    Myxococcota_A
    Myxococcota

You can control the minimum similarity between the input name and the matches in the database by setting the `score_cutoff` parameter (default is 0.9).

```python
for t in taxopy.taxid_from_name(
    "Myxococcota", gtdb_taxdb, fuzzy=True, score_cutoff=0.7
):
    print(taxopy.Taxon(t, gtdb_taxdb).name)
```

    Myxococcales
    Myxococcota_A
    Myxococcus
    Myxococcia
    Myxococcota
    Myxococcaceae

## Acknowledgements

Some of the code used in taxopy was taken from the [CAT/BAT tool for taxonomic classification of contigs and metagenome-assembled genomes](https://github.com/dutilh/CAT).


            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "taxopy",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.5",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": "Antonio Camargo <antoniop.camargo@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/f9/12/01d5c7b29d2ba6fd2bcb6cfcb649cc677fd3e586cad3662f212f50a58910/taxopy-0.13.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A Python package for obtaining complete lineages and the lowest common ancestor (LCA) from a set of taxonomic identifiers.",
    "version": "0.13.0",
    "project_urls": {
        "Home": "https://github.com/apcamargo/taxopy"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d3f53618edd3eb2117c04f1865da12dca4d190436f42938779f4d79e362267c3",
                "md5": "8df2cea8c21729a3ec645bafb64a874e",
                "sha256": "296f2dda519b2c85cbbb893c2a9240e44d6c669f5e30c3f98a8690ac7e41db3a"
            },
            "downloads": -1,
            "filename": "taxopy-0.13.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8df2cea8c21729a3ec645bafb64a874e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.5",
            "size": 24432,
            "upload_time": "2024-08-03T02:06:53",
            "upload_time_iso_8601": "2024-08-03T02:06:53.853840Z",
            "url": "https://files.pythonhosted.org/packages/d3/f5/3618edd3eb2117c04f1865da12dca4d190436f42938779f4d79e362267c3/taxopy-0.13.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f91201d5c7b29d2ba6fd2bcb6cfcb649cc677fd3e586cad3662f212f50a58910",
                "md5": "5be7787d661f1f74bf13ece7832d2151",
                "sha256": "0f56b8470864bc8c44cda1edb00dd210a0105e5aec25dfb9f6bb725b0f753f5c"
            },
            "downloads": -1,
            "filename": "taxopy-0.13.0.tar.gz",
            "has_sig": false,
            "md5_digest": "5be7787d661f1f74bf13ece7832d2151",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.5",
            "size": 25008,
            "upload_time": "2024-08-03T02:06:55",
            "upload_time_iso_8601": "2024-08-03T02:06:55.413118Z",
            "url": "https://files.pythonhosted.org/packages/f9/12/01d5c7b29d2ba6fd2bcb6cfcb649cc677fd3e586cad3662f212f50a58910/taxopy-0.13.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-03 02:06:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "apcamargo",
    "github_project": "taxopy",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "taxopy"
}
        