# ncbi-tree
**ncbi-tree** is an open-source, cross-platform command-line tool that downloads the latest NCBI taxonomy database and converts it to Newick tree format (`.tre`), with an optional plain-text visualization (`.txt`).
## Quick Start
```bash
pip install ncbi-tree
ncbi-tree ./output
```
That's it! The tool will download the latest NCBI taxonomy, generate phylogenetic trees, and create detailed reports.
## Features
- [x] **Automatic Download**: Fetches the latest taxonomy data from NCBI FTP servers
- [x] **Version Tracking**: Automatically detects and records the exact server version
- [x] **Smart Caching**: Skips re-download and re-extraction when files already exist
- [x] **Progress Bars**: Visual feedback for downloads and extraction using tqdm
- [x] **Multiple Output Formats**: Newick with IDs only, Newick with names, text tree, TSV mapping
- [x] **Comprehensive Reports**: Detailed taxonomy analysis with rank distribution and depth statistics
- [x] **Name Sanitization**: By default, the initial letter of each word is capitalized and spaces are replaced with `-`; pass `--no-sanitize` to keep the original names
- [x] **Interactive Mode**: Optional files generated on demand without re-reading data
- [x] **Merged Taxa Support**: Handles merged taxonomy IDs from merged.dmp
- [x] **Cross-Platform**: Works on Linux, macOS, and Windows
- [x] **Memory Efficient**: Reuses data in memory for optional file generation
- [x] **Error Handling**: Comprehensive error catching with user-friendly messages
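
As a rough illustration of the merged-taxa item above, the sketch below reads `merged.dmp`-style rows (old TaxID, new TaxID, fields separated by `\t|\t`) and redirects a retired ID to its replacement. The names `load_merged` and `resolve_taxid` are hypothetical and are not taken from ncbi-tree's source; how the tool itself applies these redirects is an assumption here.

```python
def load_merged(lines):
    """Map each retired TaxID to the TaxID it was merged into."""
    merged = {}
    for line in lines:
        line = line.rstrip("\n")
        # Each row ends with a "\t|" terminator; drop it before splitting.
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        if len(fields) >= 2:
            merged[fields[0].strip()] = fields[1].strip()
    return merged


def resolve_taxid(taxid, merged):
    """Follow merge redirects until a current TaxID is reached."""
    seen = set()
    # The 'seen' set guards against an accidental redirect cycle.
    while taxid in merged and taxid not in seen:
        seen.add(taxid)
        taxid = merged[taxid]
    return taxid


merged = load_merged(["12\t|\t74109\t|\n"])
print(resolve_taxid("12", merged))    # retired ID, redirected
print(resolve_taxid("9606", merged))  # current ID, returned unchanged
```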
## Installation
```bash
pip install ncbi-tree
```
## Usage
### Basic Usage
```bash
# Download and build taxonomy tree with default settings
ncbi-tree ./output
# Clean up intermediate files after processing
ncbi-tree ./output --no-cache
# Disable name sanitization (keep original spaces)
ncbi-tree ./output --no-sanitize
# Use custom download URL
ncbi-tree ./output --url https://custom-mirror.org/taxdump.tar.gz
# Combined options
ncbi-tree ./output --no-cache --no-sanitize
```
### Help
```bash
ncbi-tree --help
ncbi-tree --version
```
## Output Files
### Core Files (Generated Automatically)
1. **`output.NCBI.tree.tre`** - Newick tree with NCBI taxonomy IDs only
2. **`output.NCBI.report.txt`** - Exploratory taxonomy analysis and statistics
3. **`version.txt`** - Server timestamp identifying the version of the downloaded `taxdump.tar.gz`
### Optional Files (User Prompted)
After core files are generated, you will be prompted:
```
Would you like to generate optional files (output.NCBI.tree.txt, output.NCBI.named.tree.tre, output.NCBI.ID.to.name.tsv)? [y/N]:
```
If you answer `y`, additional files will be generated **without re-reading data**:
4. **`output.NCBI.tree.txt`** - Plain-text tree with Unicode box-drawing
5. **`output.NCBI.named.tree.tre`** - Newick tree with rank:id:name labels
6. **`output.NCBI.ID.to.name.tsv`** - TSV mapping of IDs to names (TaxID, Name, Rank)
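
For downstream lookups, the TSV mapping can be loaded with the standard library. This is a minimal sketch that assumes the three columns appear in the order listed above (TaxID, Name, Rank) and that any header row starts with a non-numeric cell; `load_id_to_name` is an illustrative name, not part of ncbi-tree.

```python
import csv
import io


def load_id_to_name(handle):
    """Return {taxid: (name, rank)} from a tab-separated file object."""
    mapping = {}
    # QUOTE_NONE keeps quote characters inside taxon names untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip a header row or any malformed line (assumed layout).
        if len(row) < 3 or not row[0].isdigit():
            continue
        mapping[int(row[0])] = (row[1], row[2])
    return mapping


# In-memory sample standing in for output.NCBI.ID.to.name.tsv
sample = io.StringIO("TaxID\tName\tRank\n9606\tHuman;Homo-Sapiens\tspecies\n")
id_to_name = load_id_to_name(sample)
print(id_to_name[9606])
```

With a real file, pass `open(path, encoding="utf-8", newline="")` instead of the `StringIO` sample.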
## Name Sanitization
By default, taxon names are sanitized for consistent display:
- Spaces replaced with `-`
- Existing `-` escaped as `<->`
- Title case applied
- Special characters removed
**Default (sanitized):**
```
"Human;Homo-Sapiens"
"Norway-Rat;Rattus-Norvegicus"
```
**With `--no-sanitize` flag:**
```
"human; Homo sapiens"
"Norway rat; Rattus norvegicus"
```
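
The rules above can be sketched in a few lines of Python. This is not ncbi-tree's implementation: the function name `sanitize_name`, the order of the steps, and the definition of "special characters" (anything other than ASCII letters, digits, spaces, and hyphens) are all assumptions made for illustration.

```python
import re


def sanitize_name(name):
    """Apply the documented sanitization rules to a single name."""
    # Assumed definition of "special": drop everything except
    # ASCII letters, digits, spaces, and hyphens.
    cleaned = re.sub(r"[^A-Za-z0-9 \-]", "", name)
    # Capitalize the initial letter of each space-separated word.
    words = [word[:1].upper() + word[1:] for word in cleaned.split()]
    # Escape pre-existing hyphens, then use "-" as the word separator.
    return "-".join(word.replace("-", "<->") for word in words)


label = ";".join(sanitize_name(n) for n in ["human", "Homo sapiens"])
print(label)
print(sanitize_name("Saint-Paul virus"))
```

Escaping existing hyphens as `<->` before joining keeps the transformation unambiguous: a plain `-` in the output always stands for an original space.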
## Advanced Configuration
### Custom Name Display
To customize which name types are displayed, edit `NAME_PRIORITIES` in `ncbi_tree/core.py`:
```python
# Default: both common and scientific names
NAME_PRIORITIES = {"genbank common name": 0, "scientific name": 1}
# Result: "Human; Homo sapiens"
# Scientific name only (disable common name)
NAME_PRIORITIES = {"genbank common name": -1, "scientific name": 0}
# Result: "Homo sapiens"
# Common name only (disable scientific name)
NAME_PRIORITIES = {"genbank common name": 0, "scientific name": -1}
# Result: "Human"
```
**Note:** Priority value `-1` disables that name type, `>= 0` enables it (lower number = higher priority).
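
To make the priority semantics concrete, here is a small sketch of how such a mapping could select and order names for one taxon. `build_label` and the `names` dictionary are hypothetical; the actual logic in `ncbi_tree/core.py` may differ, and capitalization is left to the sanitization step.

```python
def build_label(names, priorities, separator="; "):
    """Join the enabled name types of one taxon, highest priority first."""
    enabled = sorted(
        (priority, name_type)
        for name_type, priority in priorities.items()
        # A negative priority disables the name type entirely.
        if priority >= 0 and name_type in names
    )
    return separator.join(names[name_type] for _, name_type in enabled)


names = {"genbank common name": "human", "scientific name": "Homo sapiens"}

both = build_label(names, {"genbank common name": 0, "scientific name": 1})
scientific_only = build_label(names, {"genbank common name": -1, "scientific name": 0})
common_only = build_label(names, {"genbank common name": 0, "scientific name": -1})
print(both, scientific_only, common_only, sep="\n")
```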
## Example
```bash
$ ncbi-tree ./ncbi_output
Output files:
- ./ncbi_output/output.NCBI.tree.tre
- ./ncbi_output/output.NCBI.tree.txt
- ./ncbi_output/version.txt
```
## Requirements
- Python 3.8 or higher
- requests >= 2.25.0
- tqdm >= 4.50.0
## Technical Details
### Data Source
- **Primary**: NCBI Taxonomy Database (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/)
- **Updates**: Automatic detection of latest version with timestamp tracking
- **Size**: ~70-100 MB compressed, ~2.7M+ taxonomy entries at the time of writing (October 2025)
- **Format**: NCBI taxdump format (nodes.dmp, names.dmp, merged.dmp)
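
The `.dmp` files in the taxdump archive separate fields with `\t|\t` and end each row with `\t|`; in `nodes.dmp` the first three fields are the TaxID, the parent TaxID, and the rank. A minimal sketch of reading that layout follows. `parse_dmp_line` and `load_nodes` are illustrative names rather than ncbi-tree's API.

```python
def parse_dmp_line(line):
    """Split one taxdump row into its fields."""
    line = line.rstrip("\n")
    # Remove the trailing "\t|" row terminator before splitting.
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_nodes(lines):
    """Return {taxid: (parent_taxid, rank)} from nodes.dmp rows."""
    nodes = {}
    for line in lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 3:
            nodes[fields[0]] = (fields[1], fields[2])
    return nodes


sample_rows = [
    "1\t|\t1\t|\tno rank\t|\t\t|\n",        # the root is its own parent
    "9606\t|\t9605\t|\tspecies\t|\tHS\t|\n",
]
nodes = load_nodes(sample_rows)
print(nodes["9606"])
```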
### Output Formats
1. **Newick (`.tre`)**: Standard phylogenetic tree format that common tree viewers and libraries can read
2. **Text Tree (`.txt`)**: Unicode-based visualization for terminal/text viewing
3. **TSV Mapping (`.tsv`)**: Tabular format for database integration and lookups
4. **Report (`.txt`)**: Statistical analysis with rank distribution and depth metrics
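
To show what an ID-only Newick string looks like, the sketch below serializes a child-to-parent map with internal node labels. It is an assumption-laden illustration (the function name `to_newick`, numeric sorting of siblings, and the recursive traversal are choices made here), not the tool's own code.

```python
from collections import defaultdict


def to_newick(parents, root="1"):
    """Serialize a {child: parent} map as a Newick string of TaxIDs."""
    children = defaultdict(list)
    for child, parent in parents.items():
        # The root lists itself as parent; skip that self-loop.
        if child != parent:
            children[parent].append(child)

    def render(node):
        kids = sorted(children.get(node, []), key=int)
        if not kids:
            return node  # leaf: just the TaxID
        # Internal node: "(child,child,...)" followed by its own TaxID.
        return "(" + ",".join(render(kid) for kid in kids) + ")" + node

    return render(root) + ";"


parents = {"1": "1", "2": "1", "9606": "2", "10090": "2"}
print(to_newick(parents))  # ((9606,10090)2)1;
```

Recursion is fine for a toy tree; for the full taxonomy an explicit stack would avoid any dependence on Python's recursion limit.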
## License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
## Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
## Acknowledgments
- NCBI for providing the taxonomy database