# votuderep

A Python CLI tool for dereplicating and filtering viral contigs (vOTUs - viral Operational Taxonomic Units)
using the CheckV method.
## Features
- **Dereplicate vOTUs**: Remove redundant viral sequences using BLAST-based ANI clustering
- **Filter by CheckV metrics**: Filter viral contigs based on quality, completeness, and other metrics
- **Tabulate reads**: Generate CSV tables from paired-end sequencing read directories
- **Download training data**: Fetch viral assembly datasets for training purposes
## Requirements
- Python >= 3.10
- BLAST+ toolkit (specifically `blastn` and `makeblastdb`)
## Installation
### From source
```bash
# Clone the repository
git clone https://github.com/quadram-institute-bioscience/votuderep.git
cd votuderep
# Install in development mode
pip install -e .
# Or install normally
pip install .
```
### Installing BLAST+
votuderep requires BLAST+ to be installed and available in your PATH:
```bash
# Using conda (recommended)
conda install -c bioconda blast
# On Ubuntu/Debian
sudo apt-get install ncbi-blast+
# On macOS
brew install blast
```
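
If you want to confirm that both executables are visible before running a long job, a quick check along these lines may help. This is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and is not part of votuderep.

```python
import shutil


def missing_tools(tools=("blastn", "makeblastdb")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Hypothetical usage: report what still needs installing.
missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("BLAST+ executables found")
```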
## Usage
votuderep provides four main commands: `derep`, `filter`, `tabulate`, and `trainingdata`.
### Dereplicate vOTUs
Remove redundant sequences using BLAST and ANI clustering:
```bash
votuderep derep -i input.fasta -o dereplicated.fasta
```
**Options:**
- `-i, --input`: Input FASTA file [required]
- `-o, --output`: Output FASTA file [default: dereplicated_vOTUs.fasta]
- `-t, --threads`: Number of threads for BLAST [default: 2]
- `--tmp`: Temporary directory [default: $TEMP or /tmp]
- `--min-ani`: Minimum ANI threshold (0-100) [default: 95]
- `--min-tcov`: Minimum target coverage (0-100) [default: 85]
- `--keep`: Keep temporary directory with intermediate files
**Example:**
```bash
# Basic dereplication
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta
# With custom parameters
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta \
--min-ani 97 --min-tcov 90 -t 8
# Keep intermediate files for inspection
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta \
--keep --tmp ./temp_dir
```
**How it works:**
1. Creates a BLAST database from input sequences
2. Performs all-vs-all BLASTN comparison
3. Calculates ANI (Average Nucleotide Identity) and coverage
4. Clusters sequences using a greedy centroid-based algorithm
5. Outputs the longest sequence from each cluster (representative)
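
The clustering step can be illustrated with a small sketch. It is not votuderep's actual code: it assumes the ANI and target coverage of each query/target pair have already been computed from the BLAST hits, and the function name `greedy_cluster` and its input layout are hypothetical. Sequences are visited from longest to shortest, so each centroid is the longest member of its cluster.

```python
def greedy_cluster(lengths, hits, min_ani=95.0, min_tcov=85.0):
    """Greedy centroid clustering (illustrative sketch, not the tool's code).

    lengths: dict mapping sequence ID -> sequence length
    hits:    iterable of (query, target, ani, tcov) tuples, values in percent
    Returns a dict mapping each centroid ID to its members (centroid first).
    """
    # For every query, collect the targets that pass both thresholds.
    neighbours = {seq_id: set() for seq_id in lengths}
    for query, target, ani, tcov in hits:
        if query == target or query not in lengths or target not in lengths:
            continue
        if ani >= min_ani and tcov >= min_tcov:
            neighbours[query].add(target)

    clusters = {}
    assigned = set()
    # Longest first (ties broken by ID) so the centroid is the longest member.
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if seq_id in assigned:
            continue
        members = [seq_id]
        assigned.add(seq_id)
        for target in sorted(neighbours[seq_id]):
            if target not in assigned:
                members.append(target)
                assigned.add(target)
        clusters[seq_id] = members
    return clusters


# Toy data: "b" is covered by "a" above both thresholds, "c" is not.
lengths = {"a": 1000, "b": 900, "c": 500}
hits = [("a", "b", 98.0, 95.0), ("a", "c", 90.0, 99.0)]
print(greedy_cluster(lengths, hits))
```

The keys of the returned dictionary are the representatives that would be written to the output FASTA in this simplified model.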
### Filter by CheckV
Filter viral contigs based on CheckV quality metrics:
```bash
votuderep filter input.fasta checkv_output.tsv -o filtered.fasta
```
**Required Arguments:**
- `FASTA`: Input FASTA file with viral contigs
- `CHECKV_OUT`: TSV output file from CheckV
**Options:**
**Length filters:**
- `-m, --min-len`: Minimum contig length [default: 0]
- `--max-len`: Maximum contig length, 0 = unlimited [default: 0]
**Quality filters:**
- `--min-quality`: Minimum quality level: low, medium, or high [default: low]
- `--complete`: Only keep complete genomes
- `--exclude-undetermined`: Exclude contigs where quality is "Not-determined"
**Metrics filters:**
- `-c, --min-completeness`: Minimum completeness percentage (0-100)
- `--max-contam`: Maximum contamination percentage (0-100)
- `--no-warnings`: Only keep contigs with no warnings
**Other filters:**
- `--provirus`: Only select proviruses (provirus == "Yes")
- `-o, --output`: Output FASTA file [default: STDOUT]
**Examples:**
```bash
# Basic filtering - minimum quality
votuderep filter viral_contigs.fasta checkv_output.tsv -o filtered.fasta
# High-quality sequences only
votuderep filter viral_contigs.fasta checkv_output.tsv \
--min-quality high -o high_quality.fasta
# Complete genomes with minimum length
votuderep filter viral_contigs.fasta checkv_output.tsv \
--complete --min-len 5000 -o complete_genomes.fasta
# Complex filtering
votuderep filter viral_contigs.fasta checkv_output.tsv \
--min-quality medium \
--min-completeness 80 \
--max-contam 5 \
--no-warnings \
--min-len 3000 \
-o high_confidence.fasta
# Output to stdout (for piping)
votuderep filter viral_contigs.fasta checkv_output.tsv > filtered.fasta
```
**Quality Levels:**
CheckV assigns quality levels to viral contigs:
- **Complete**: Complete genomes (highest quality)
- **High-quality**: High confidence viral sequences
- **Medium-quality**: Moderate confidence sequences
- **Low-quality**: Lower confidence but valid sequences
- **Not-determined**: Quality could not be determined
The `--min-quality` option filters inclusively:
- `low`: Includes Low, Medium, High, and Complete (default)
- `medium`: Includes Medium, High, and Complete
- `high`: Includes High and Complete only
Note: "Not-determined" sequences are included by default unless `--exclude-undetermined` is used.
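
The inclusive behaviour of `--min-quality` can be pictured as a rank comparison. The sketch below is an assumption about the logic, not the tool's implementation: the names `QUALITY_RANK`, `MIN_RANK`, and `passes_quality` are hypothetical, and it assumes "Not-determined" contigs bypass the `--min-quality` threshold unless they are explicitly excluded.

```python
# Ranks for the CheckV quality labels listed above (higher is better).
QUALITY_RANK = {
    "Low-quality": 1,
    "Medium-quality": 2,
    "High-quality": 3,
    "Complete": 4,
}
# Minimum rank required by each --min-quality value.
MIN_RANK = {"low": 1, "medium": 2, "high": 3}


def passes_quality(checkv_quality, min_quality="low", exclude_undetermined=False):
    """Return True if a contig with this CheckV label would be kept."""
    if checkv_quality == "Not-determined":
        # Assumption: kept regardless of min_quality unless excluded.
        return not exclude_undetermined
    return QUALITY_RANK[checkv_quality] >= MIN_RANK[min_quality]


for label in ("Low-quality", "Medium-quality", "High-quality", "Complete"):
    print(label, passes_quality(label, min_quality="medium"))
```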
### Tabulate Reads
Generate a CSV table from a directory containing paired-end sequencing reads:
```bash
votuderep tabulate reads/ -o samples.csv
```
**Required Arguments:**
- `INPUT_DIR`: Directory containing sequencing read files
**Options:**
- `-o, --output`: Output CSV file [default: STDOUT]
- `-d, --delimiter`: Field separator [default: ,]
- `-1, --for-tag`: Forward read identifier [default: _R1]
- `-2, --rev-tag`: Reverse read identifier [default: _R2]
- `-s, --strip`: Remove string from sample names (can be used multiple times)
- `-e, --extension`: Only process files with this extension
- `-a, --absolute`: Use absolute paths in output
**Examples:**
```bash
# Basic usage - generate CSV table
votuderep tabulate reads/ -o samples.csv
# Custom read tags and extension
votuderep tabulate reads/ --for-tag _1 --rev-tag _2 --extension .fq.gz
# Strip patterns from sample names and use absolute paths
votuderep tabulate reads/ --strip "Sample_" --strip ".filtered" -a
```
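
To make the pairing idea concrete, here is a minimal sketch of how forward and reverse files could be matched by tag. It is an illustration under stated assumptions rather than the tool's code: the function name `pair_reads`, the `R1`/`R2` keys, and the choice to drop samples with only one mate are all hypothetical.

```python
from pathlib import Path


def pair_reads(paths, for_tag="_R1", rev_tag="_R2", strip=()):
    """Group read files into samples by forward/reverse tag (sketch).

    paths: iterable of file paths (strings or Path objects)
    Returns a dict mapping sample name -> {"R1": path, "R2": path}.
    """
    samples = {}
    for path in sorted(str(p) for p in paths):
        filename = Path(path).name
        for tag, column in ((for_tag, "R1"), (rev_tag, "R2")):
            if tag in filename:
                # The sample name is whatever precedes the read tag.
                sample = filename.split(tag)[0]
                for pattern in strip:
                    sample = sample.replace(pattern, "")
                samples.setdefault(sample, {})[column] = path
                break
    # Assumption: only samples with both mates are reported.
    return {name: files for name, files in samples.items() if len(files) == 2}


files = [
    "reads/Sample_A_R1.fastq.gz",
    "reads/Sample_A_R2.fastq.gz",
    "reads/B_R1.fastq.gz",
]
print(pair_reads(files, strip=("Sample_",)))
```

In practice the list of paths could come from `Path("reads").iterdir()`, optionally filtered by extension first.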
### Download Training Data
Download a viral assembly and sequencing reads for training purposes:
```bash
votuderep trainingdata -o ./ebame-virome/
```
**Options:**
- `-o, --outdir`: Output directory [default: ./ebame-virome/]
**Example:**
```bash
# Download to default directory
votuderep trainingdata
# Download to custom directory
votuderep trainingdata -o ./training_data/
```
## Global Options
- `-v, --verbose`: Enable verbose logging
- `--version`: Show version and exit
- `--help`: Show help message
## License
MIT License - See LICENSE file for details
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Authors
Andrea Telatin & QIB Core Bioinformatics
©️ Quadram Institute Bioscience 2025