taxotagger


- Name: taxotagger
- Version: 0.0.1a6
- Summary: Fungi DNA barcoder based on semantic searching
- Upload time: 2024-10-21 14:13:44
- Requires Python: >=3.10
- License: Apache-2.0 license
- Keywords: fungi, taxonomy, semantic search, vector database, machine learning
# TaxoTagger

[![pypi badge](https://img.shields.io/pypi/v/taxotagger.svg?color=blue)](https://pypi.python.org/project/taxotagger/)
[![Static Badge](https://img.shields.io/badge/🍄_Docs_🍄-826644)](https://mycoai.github.io/taxotagger)

TaxoTagger is a Python library for DNA barcode identification, powered by semantic searching.

Features:
- 🚀 Effortlessly build vector databases from DNA sequences (FASTA files)
- ⚡ Achieve highly efficient and accurate semantic searching
- 🔥 Easily extend support for various embedding models


## Installation

TaxoTagger requires Python 3.10 or later.

```bash
# create a virtual environment
conda create -n venv-3.10 python=3.10
conda activate venv-3.10

# install the `taxotagger` package
pip install --pre taxotagger
```


## Usage

### Build a vector database from a FASTA file

```python
from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# creating the database will take ~30s
tt.create_db('data/database.fasta')
```

By default, the `~/.cache/mycoai` folder stores both the vector database and the embedding model. If the [`MycoAI-CNN.pt`](https://zenodo.org/records/10904344) model is not already in this folder, it is downloaded automatically; the vector database is then created and named after the model.
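
To see what ended up in that folder after building the database, a small standard-library sketch like the one below can help. It only lists directory entries and does not use any TaxoTagger API; the helper name `list_cache_files` is hypothetical, and the path is the default mentioned above, so adjust it if you configured a different location.

```python
from pathlib import Path

# Default location named in the text above; change it if you use another folder.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "mycoai"


def list_cache_files(cache_dir: Path = DEFAULT_CACHE_DIR) -> list:
    """Return the sorted names of entries in the cache folder (empty if it is missing)."""
    if not cache_dir.is_dir():
        return []
    return sorted(entry.name for entry in cache_dir.iterdir())


if __name__ == "__main__":
    # Print whatever is currently stored in the default cache folder.
    for name in list_cache_files():
        print(name)
```

An empty result simply means the folder does not exist yet or nothing has been downloaded or created so far.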


### Conduct a semantic search with a FASTA file
```python
from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# run a semantic search and return the top result for each query sequence
res = tt.search('data/query.fasta', limit=1)
```

The [`data/query.fasta` file](data/query.fasta) contains two query sequences: `KY106088` and `KY106087`. 
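
If you want to check which sequence IDs a FASTA file contains before searching, a minimal pure-Python sketch could look like this. It is not part of TaxoTagger; the helper `fasta_ids` is hypothetical, and it assumes the ID is the first whitespace-delimited token of each header line, which may not match how every FASTA file (or the library itself) formats headers.

```python
from typing import Iterable, List


def fasta_ids(lines: Iterable[str]) -> List[str]:
    """Collect sequence IDs from FASTA header lines (those starting with '>')."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited token of the header.
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids


# Placeholder records: the IDs come from the text above, the bases are made up.
example = [
    ">KY106088 placeholder description",
    "ACGTACGT",
    ">KY106087",
    "TTGACCGA",
]
print(fasta_ids(example))  # ['KY106088', 'KY106087']
```

Because an open file is an iterable of lines, the same function can be applied to a real file with `with open('data/query.fasta') as fh: print(fasta_ids(fh))`.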

The search result `res` is a dictionary with taxonomic level names as keys. Each value holds the matched results for each of the two query sequences. For example, `res['phylum']` looks like:

```python
[
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}]
]
```

The first inner list contains the top results for the first query sequence, and the second inner list contains the top results for the second query sequence.
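
The nested layout described above can be walked with a few lines of plain Python. The sketch below uses a hard-coded sample copied from the `res['phylum']` listing rather than a live search, and the helper `best_hits` is hypothetical. It assumes that every taxonomic level in `res` follows the same layout and that the first entry of each inner list is the best match.

```python
# Sample shaped like the `res['phylum']` listing above; a real `res` from
# `tt.search` is assumed to use the same layout for each taxonomic level.
res = {
    "phylum": [
        [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
        [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}],
    ]
}


def best_hits(res, level):
    """Return (matched ID, taxon, score) for the first hit of each query, or None."""
    hits = []
    for matches in res[level]:  # one inner list per query sequence
        if not matches:
            hits.append(None)  # no match recorded for this query
            continue
        top = matches[0]  # assumption: hits are ordered best first
        hits.append((top["id"], top["entity"][level], top["distance"]))
    return hits


for hit in best_hits(res, "phylum"):
    print(hit)
```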

The `id` field is the sequence ID of the matched sequence. The `distance` field is the cosine similarity between the query sequence and the matched sequence. Its value lies between 0 and 1, and the closer it is to 1, the more similar the two sequences are. The `entity` field holds the taxonomic information of the matched sequence.

The top result for each query sequence is the query sequence itself, because both query sequences are also in the database. Try different query sequences to see other search results.
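
To illustrate why a sequence matched against itself scores about 1, here is a minimal cosine similarity sketch in plain Python. It is not TaxoTagger's implementation, and the vectors are made-up stand-ins for embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [0.2, 0.5, 0.1]  # made-up embedding, for illustration only

# A vector compared with itself scores 1, up to floating-point rounding.
print(cosine_similarity(query, query))

# Vectors at right angles score 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

A score marginally below 1 for an identical sequence, such as the second value in the listing above, is consistent with floating-point rounding.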


## Docs
Please visit the [official documentation](https://mycoai.github.io/taxotagger) for more details.

## Questions and feedback
Please submit [an issue](https://github.com/MycoAI/taxotagger/issues) if you have any questions or feedback.

## Citation
If you use TaxoTagger in your work, please cite it by clicking the `Cite this repository` button at the top right of the [GitHub repository](https://github.com/MycoAI/taxotagger) page.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "taxotagger",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "Fungi, Taxonomy, Semantic search, Vector database, Machine learning",
    "author": null,
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/04/1c/a2eb2ebd783a2ceb944fefdf84b96838633675ae22fa4dd3a2156082552a/taxotagger-0.0.1a6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0 license",
    "summary": "Fungi DNA barcoder based on semantic searching",
    "version": "0.0.1a6",
    "project_urls": {
        "Repository": "https://github.com/MycoAI/taxotagger"
    },
    "split_keywords": [
        "fungi",
        " taxonomy",
        " semantic search",
        " vector database",
        " machine learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8b6889c94b2807618d8e8d370830db9dc09d08cd80d5251399a4e7fba776c3f3",
                "md5": "01ebd229213b39e2378ecd2c69bf8533",
                "sha256": "87d4ce0b315c57a3f59ad1ef530665ca744315b40e1370b0ee7aeb28d90b043a"
            },
            "downloads": -1,
            "filename": "taxotagger-0.0.1a6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "01ebd229213b39e2378ecd2c69bf8533",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 19946,
            "upload_time": "2024-10-21T14:13:43",
            "upload_time_iso_8601": "2024-10-21T14:13:43.623444Z",
            "url": "https://files.pythonhosted.org/packages/8b/68/89c94b2807618d8e8d370830db9dc09d08cd80d5251399a4e7fba776c3f3/taxotagger-0.0.1a6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "041ca2eb2ebd783a2ceb944fefdf84b96838633675ae22fa4dd3a2156082552a",
                "md5": "012085af0ea999cb505606faa329c3b3",
                "sha256": "a6473304c774abe126b11630157b6d1f24295ad2ba5a6c786b6375186274cd5f"
            },
            "downloads": -1,
            "filename": "taxotagger-0.0.1a6.tar.gz",
            "has_sig": false,
            "md5_digest": "012085af0ea999cb505606faa329c3b3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 19348,
            "upload_time": "2024-10-21T14:13:44",
            "upload_time_iso_8601": "2024-10-21T14:13:44.905216Z",
            "url": "https://files.pythonhosted.org/packages/04/1c/a2eb2ebd783a2ceb944fefdf84b96838633675ae22fa4dd3a2156082552a/taxotagger-0.0.1a6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-21 14:13:44",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "MycoAI",
    "github_project": "taxotagger",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "taxotagger"
}
        