ncbi-tree

Name	ncbi-tree
Version	1.0.1
home_page	https://github.com/phylobridge/ncbi-tree
Summary	ncbi-tree is an open source, cross-platform command-line tool for downloading the latest NCBI taxonomy database and converting it to Newick tree format (.tre), with optional plain-text visualization (.txt)
upload_time	2025-10-21 21:51:16
maintainer	None
docs_url	None
author	NCBI-Tree Contributors
requires_python	>=3.8
license	CC BY-NC 4.0
keywords	ncbi taxonomy phylogeny bioinformatics newick tree phylogenetic ncbi-taxonomy taxdump taxonomic-tree species-tree tree-of-life phylogenetic-tree taxonomy-database taxonomy-parser taxonomy-analysis biology genomics evolution evolutionary-biology systematics species-classification organism-classification taxonomic-hierarchy computational-biology metagenomics biodiversity clade-analysis taxon-names taxonomy-ids merged-taxa rank-distribution phylogenetics phylogenomics comparative-genomics tree-builder ncbi-tools bio-tools data-pipeline text-tree tsv-mapping

# ncbi-tree

[![PyPI version](https://badge.fury.io/py/ncbi-tree.svg)](https://badge.fury.io/py/ncbi-tree)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)

**ncbi-tree** is an open source, cross-platform command-line tool for downloading the latest NCBI taxonomy database and converting it to Newick tree format (.tre), with optional plain-text visualization (.txt).

## Quick Start

```bash
pip install ncbi-tree
ncbi-tree ./output
```

That's it! The tool will download the latest NCBI taxonomy, generate phylogenetic trees, and create detailed reports.

## Features

- [x] **Automatic Download**: Fetches the latest taxonomy data from NCBI FTP servers  
- [x] **Version Tracking**: Automatically detects and records the exact server version  
- [x] **Smart Caching**: Skips re-download and re-extraction when files already exist  
- [x] **Progress Bars**: Visual feedback for downloads and extraction using tqdm  
- [x] **Multiple Output Formats**: Newick with IDs only, Newick with names, text tree, TSV mapping  
- [x] **Comprehensive Reports**: Detailed taxonomy analysis with rank distribution and depth statistics  
- [x] **Name Sanitization**: By default, the initial letter of each word is capitalized and spaces are replaced with `-`; disable this formatting with the `--no-sanitize` option  
- [x] **Interactive Mode**: Optional files generated on demand without re-reading data  
- [x] **Merged Taxa Support**: Handles merged taxonomy IDs from merged.dmp  
- [x] **Cross-Platform**: Works on Linux, macOS, and Windows  
- [x] **Memory Efficient**: Reuses data in memory for optional file generation  
- [x] **Error Handling**: Comprehensive error catching with user-friendly messages  
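
The caching behaviour listed above can be pictured with a minimal sketch. It is not the tool's actual implementation: `fetch_if_missing` and `DEFAULT_URL` are hypothetical names, and it swaps in the standard library's `urllib` where the package itself depends on `requests` and `tqdm`.

```python
import urllib.request
from pathlib import Path

# Assumed default location of the archive on the NCBI server.
DEFAULT_URL = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz"


def fetch_if_missing(dest, url=DEFAULT_URL, download=urllib.request.urlretrieve):
    """Download `url` to `dest` unless a non-empty file is already there.

    Returns True when a download happened and False when the cached
    file was reused. `download` is injectable so the logic can be
    exercised without network access.
    """
    dest = Path(dest)
    if dest.is_file() and dest.stat().st_size > 0:
        return False  # cache hit: skip the download
    dest.parent.mkdir(parents=True, exist_ok=True)
    download(url, str(dest))
    return True
```

Calling `fetch_if_missing("./output/taxdump.tar.gz")` twice in a row would hit the network only the first time.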

## Installation

```bash
pip install ncbi-tree
```

## Usage

### Basic Usage

```bash
# Download and build taxonomy tree with default settings
ncbi-tree ./output

# Clean up intermediate files after processing
ncbi-tree ./output --no-cache

# Disable name sanitization (keep original spaces)
ncbi-tree ./output --no-sanitize

# Use custom download URL
ncbi-tree ./output --url https://custom-mirror.org/taxdump.tar.gz

# Combined options
ncbi-tree ./output --no-cache --no-sanitize
```

### Help

```bash
ncbi-tree --help
ncbi-tree --version
```

## Output Files

### Core Files (Generated Automatically)

1. **`output.NCBI.tree.tre`** - Newick tree with NCBI taxonomy IDs only
2. **`output.NCBI.report.txt`** - Exploratory taxonomy analysis and statistics
3. **`version.txt`** - Server-side timestamp identifying the version of the downloaded `taxdump.tar.gz`

### Optional Files (User Prompted)

After core files are generated, you will be prompted:
```
Would you like to generate optional files (output.NCBI.tree.txt, output.NCBI.named.tree.tre, output.NCBI.ID.to.name.tsv)? [y/N]:
```

If you answer `y`, additional files will be generated **without re-reading data**:

4. **`output.NCBI.tree.txt`** - Plain-text tree with Unicode box-drawing
5. **`output.NCBI.named.tree.tre`** - Newick tree with rank:id:name labels
6. **`output.NCBI.ID.to.name.tsv`** - TSV mapping of IDs to names (TaxID, Name, Rank)
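
As a rough illustration of how the TSV mapping might be consumed downstream, the sketch below loads it into a dictionary. The column order (TaxID, Name, Rank) comes from the description above; whether the file starts with a header row is an assumption, so the hypothetical `load_id_map` helper simply skips any row whose first field is not numeric. The sample rows are made up for the example.

```python
import csv
import io


def load_id_map(handle):
    """Return {tax_id: (name, rank)} from a TaxID/Name/Rank TSV stream."""
    mapping = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and a possible header row (assumption).
        if len(row) < 3 or not row[0].strip().isdigit():
            continue
        mapping[int(row[0])] = (row[1], row[2])
    return mapping


# Illustrative rows only; real files are read with open(path, newline="").
sample = io.StringIO(
    "TaxID\tName\tRank\n"
    "9606\tHuman;Homo-Sapiens\tspecies\n"
    "10116\tNorway-Rat;Rattus-Norvegicus\tspecies\n"
)
id_map = load_id_map(sample)
print(id_map[9606])  # ('Human;Homo-Sapiens', 'species')
```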

## Name Sanitization

By default, taxon names are sanitized for consistent display:
- Spaces replaced with `-`
- Existing `-` escaped as `<->`
- Title case applied
- Special characters removed

**Default (sanitized):**
```
"Human;Homo-Sapiens"
"Norway-Rat;Rattus-Norvegicus"
```

**With `--no-sanitize` flag:**
```
"human; Homo sapiens"
"Norway rat; Rattus norvegicus"
```
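
The four rules above can be approximated in a few lines. This is a sketch under stated assumptions, not the package's code: `sanitize_name` is a hypothetical helper, "special characters" is taken to mean anything other than ASCII letters, digits, spaces, and hyphens, and title case is taken to mean capitalizing the first letter of each word.

```python
import re


def sanitize_name(name):
    """Apply the documented sanitization rules to a single taxon name."""
    # Assumption: drop everything except ASCII letters, digits, spaces, hyphens.
    cleaned = re.sub(r"[^A-Za-z0-9 \-]", "", name)
    words = cleaned.split()
    # Capitalize the first letter of each word, leaving the rest untouched.
    titled = [word[:1].upper() + word[1:] for word in words]
    # Escape hyphens that were already in the name before using '-'
    # as the word separator, so the two stay distinguishable.
    escaped = [word.replace("-", "<->") for word in titled]
    return "-".join(escaped)


print(sanitize_name("Norway rat"))             # Norway-Rat
print(sanitize_name("Escherichia coli K-12"))  # Escherichia-Coli-K<->12
```

Escaping existing hyphens first matters: if spaces were replaced before the escape step, the separator hyphens would be escaped as well.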

## Advanced Configuration

### Custom Name Display

To customize which name types are displayed, edit `NAME_PRIORITIES` in `ncbi_tree/core.py`:

```python
# Default: both common and scientific names
NAME_PRIORITIES = {"genbank common name": 0, "scientific name": 1}
# Result: "Human; Homo sapiens"

# Scientific name only (disable common name)
NAME_PRIORITIES = {"genbank common name": -1, "scientific name": 0}
# Result: "Homo sapiens"

# Common name only (disable scientific name)
NAME_PRIORITIES = {"genbank common name": 0, "scientific name": -1}
# Result: "Human"
```

**Note:** A priority of `-1` disables that name type; any value `>= 0` enables it (lower number = higher priority).
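
One way to read the priority rule is sketched below. The `build_label` function is hypothetical and only illustrates the documented semantics (drop name types set to `-1`, order the rest by ascending priority); the real logic in `ncbi_tree/core.py` may differ.

```python
def build_label(names, priorities):
    """Join the enabled name types of one taxon in priority order.

    `names` maps a name type (e.g. "scientific name") to its text;
    `priorities` has the same shape as NAME_PRIORITIES.
    """
    enabled = [
        (priority, kind)
        for kind, priority in priorities.items()
        if priority >= 0 and kind in names
    ]
    # Lower number sorts first, i.e. higher priority.
    return "; ".join(names[kind] for _, kind in sorted(enabled))


names = {"scientific name": "Homo sapiens", "genbank common name": "human"}
print(build_label(names, {"genbank common name": 0, "scientific name": 1}))
# human; Homo sapiens
print(build_label(names, {"genbank common name": -1, "scientific name": 0}))
# Homo sapiens
```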

## Example

```bash
$ ncbi-tree ./ncbi_output

Output files:
  - ./ncbi_output/output.NCBI.tree.tre
  - ./ncbi_output/output.NCBI.tree.txt
  - ./ncbi_output/version.txt
```

## Requirements

- Python 3.8 or higher
- requests >= 2.25.0
- tqdm >= 4.50.0

## Technical Details

### Data Source
- **Primary**: NCBI Taxonomy Database (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/)
- **Updates**: Automatic detection of latest version with timestamp tracking
- **Size**: ~70-100 MB compressed, ~2.7M+ taxonomy entries at the time of writing (October 2025)
- **Format**: NCBI taxdump format (nodes.dmp, names.dmp, merged.dmp)
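
For readers unfamiliar with the taxdump layout, the sketch below parses rows of `nodes.dmp` and `merged.dmp`. Fields in these files are separated by `\t|\t` and each row ends with `\t|`. The helper names (`split_dmp_line`, `read_nodes`, `read_merged`) are hypothetical, and the sample rows are illustrative rather than copied from a specific release.

```python
def split_dmp_line(line):
    """Split one taxdump row on '|' and trim the tab padding of each field."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    # The trailing '|' of every row leaves one empty field at the end.
    if fields and fields[-1] == "":
        fields.pop()
    return fields


def read_nodes(lines):
    """Return {tax_id: (parent_tax_id, rank)} from nodes.dmp rows."""
    nodes = {}
    for line in lines:
        fields = split_dmp_line(line)
        if len(fields) >= 3:
            nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def read_merged(lines):
    """Return {old_tax_id: new_tax_id} from merged.dmp rows."""
    merged = {}
    for line in lines:
        fields = split_dmp_line(line)
        if len(fields) >= 2:
            merged[int(fields[0])] = int(fields[1])
    return merged


# Illustrative rows; the root node (1) lists itself as its parent.
nodes = read_nodes(["1\t|\t1\t|\tno rank\t|\n", "9606\t|\t9605\t|\tspecies\t|\n"])
merged = read_merged(["12\t|\t74109\t|\n"])
print(nodes[9606])  # (9605, 'species')
print(merged[12])   # 74109
```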

### Output Formats
1. **Newick (`.tre`)**: Standard phylogenetic tree format readable by common tree viewers
2. **Text Tree (`.txt`)**: Unicode-based visualization for terminal/text viewing
3. **TSV Mapping (`.tsv`)**: Tabular format for database integration and lookups
4. **Report (`.txt`)**: Statistical analysis with rank distribution and depth metrics
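
To make the Newick output concrete, here is a minimal sketch that turns a child-to-parent map into an IDs-only Newick string. The `to_newick` function is hypothetical, and the ordering of siblings (sorted by ID here) is an assumption that may not match the tool's output. The traversal is iterative because the NCBI taxonomy is deep enough that a recursive version could hit Python's recursion limit.

```python
from collections import defaultdict


def to_newick(parent_of):
    """Build an IDs-only Newick string from a {tax_id: parent_id} map.

    The root is the node that lists itself as its own parent.
    """
    children = defaultdict(list)
    root = None
    for tax_id, parent_id in parent_of.items():
        if tax_id == parent_id:
            root = tax_id
        else:
            children[parent_id].append(tax_id)

    rendered = {}  # tax_id -> Newick text of the subtree rooted there
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = sorted(children.get(node, []))
        if not kids:
            rendered[node] = str(node)  # leaf
        elif expanded:
            # All children are rendered by now; wrap them and label the node.
            inner = ",".join(rendered.pop(kid) for kid in kids)
            rendered[node] = "(" + inner + ")" + str(node)
        else:
            # Revisit this node after its children have been processed.
            stack.append((node, True))
            stack.extend((kid, False) for kid in kids)
    return rendered[root] + ";"


print(to_newick({1: 1, 2: 1, 9: 2, 10: 1}))  # ((9)2,10)1;
```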

## License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

## Acknowledgments

- NCBI for providing the taxonomy database

Raw data

{
    "_id": null,
    "home_page": "https://github.com/phylobridge/ncbi-tree",
    "name": "ncbi-tree",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "ncbi, taxonomy, phylogeny, bioinformatics, newick, tree, phylogenetic, ncbi-taxonomy, taxdump, taxonomic-tree, species-tree, tree-of-life, phylogenetic-tree, taxonomy-database, taxonomy-parser, taxonomy-analysis, biology, genomics, evolution, evolutionary-biology, systematics, species-classification, organism-classification, taxonomic-hierarchy, computational-biology, metagenomics, biodiversity, clade-analysis, taxon-names, taxonomy-ids, merged-taxa, rank-distribution, phylogenetics, phylogenomics, comparative-genomics, tree-builder, ncbi-tools, bio-tools, data-pipeline, text-tree, tsv-mapping",
    "author": "NCBI-Tree Contributors",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/73/4a/cfd48691acddd7b69cabbdb84b678301b2029f2f0f915a537c44d6a0703a/ncbi_tree-1.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "CC BY-NC 4.0",
    "summary": "ncbi-tree is an open source, cross-platform command-line tool for downloading the latest NCBI taxonomy database and converting it to Newick tree format (.tre), with optional plain-text visualization (.txt)",
    "version": "1.0.1",
    "project_urls": {
        "Bug Reports": "https://github.com/phylobridge/ncbi-tree/issues",
        "Documentation": "https://github.com/phylobridge/ncbi-tree#readme",
        "Homepage": "https://github.com/phylobridge/ncbi-tree",
        "NCBI Taxonomy": "https://www.ncbi.nlm.nih.gov/taxonomy",
        "Source": "https://github.com/phylobridge/ncbi-tree"
    },
    "split_keywords": [
        "ncbi",
        " taxonomy",
        " phylogeny",
        " bioinformatics",
        " newick",
        " tree",
        " phylogenetic",
        " ncbi-taxonomy",
        " taxdump",
        " taxonomic-tree",
        " species-tree",
        " tree-of-life",
        " phylogenetic-tree",
        " taxonomy-database",
        " taxonomy-parser",
        " taxonomy-analysis",
        " biology",
        " genomics",
        " evolution",
        " evolutionary-biology",
        " systematics",
        " species-classification",
        " organism-classification",
        " taxonomic-hierarchy",
        " computational-biology",
        " metagenomics",
        " biodiversity",
        " clade-analysis",
        " taxon-names",
        " taxonomy-ids",
        " merged-taxa",
        " rank-distribution",
        " phylogenetics",
        " phylogenomics",
        " comparative-genomics",
        " tree-builder",
        " ncbi-tools",
        " bio-tools",
        " data-pipeline",
        " text-tree",
        " tsv-mapping"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "734acfd48691acddd7b69cabbdb84b678301b2029f2f0f915a537c44d6a0703a",
                "md5": "389a1087fdffe673c078dc6d2e471d43",
                "sha256": "23fa76552cc26d059c3335a1eaeb9045ccc1ba316a9c597f96eba0fa35a51ef2"
            },
            "downloads": -1,
            "filename": "ncbi_tree-1.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "389a1087fdffe673c078dc6d2e471d43",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 21456,
            "upload_time": "2025-10-21T21:51:16",
            "upload_time_iso_8601": "2025-10-21T21:51:16.212020Z",
            "url": "https://files.pythonhosted.org/packages/73/4a/cfd48691acddd7b69cabbdb84b678301b2029f2f0f915a537c44d6a0703a/ncbi_tree-1.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-21 21:51:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "phylobridge",
    "github_project": "ncbi-tree",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "ncbi-tree"
}
