votuderep

Name	votuderep
Version	0.4.0
home_page	None
Summary	A CLI tool for dereplicating and filtering viral contigs
upload_time	2025-10-17 14:07:37
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	MIT
keywords	bioinformatics viral contigs dereplication blast
requirements	click markdown-it-py mdurl numpy pandas pyfastx pygments python-dateutil pytz rich rich-click six typing-extensions tzdata

# votuderep

[![Test](https://github.com/quadram-institute-bioscience/votuderep/actions/workflows/test.yml/badge.svg)](https://github.com/quadram-institute-bioscience/votuderep/actions/workflows/test.yml)


![Logo](https://github.com/quadram-institute-bioscience/votuderep/raw/main/votuderep.png)

A Python CLI tool for dereplicating and filtering viral contigs (vOTUs - viral Operational Taxonomic Units)
using the CheckV method.

## Features

- **Dereplicate vOTUs**: Remove redundant viral sequences using BLAST-based ANI clustering
- **Filter by CheckV metrics**: Filter viral contigs based on quality, completeness, and other metrics
- **Tabulate reads**: Generate CSV tables from paired-end sequencing read directories
- **Download training data**: Fetch viral assembly datasets for training purposes

## Requirements

- Python >= 3.10
- BLAST+ toolkit (specifically `blastn` and `makeblastdb`)

## Installation

### From source

```bash
# Clone the repository
git clone https://github.com/quadram-institute-bioscience/votuderep.git
cd votuderep

# Install in development mode
pip install -e .

# Or install normally
pip install .
```

### Installing BLAST+

votuderep requires BLAST+ to be installed and available in your PATH:

```bash
# Using conda (recommended)
conda install -c bioconda blast

# On Ubuntu/Debian
sudo apt-get install ncbi-blast+

# On macOS
brew install blast
```
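As a quick sanity check before running `derep`, the following minimal sketch tests whether the two BLAST+ executables named above are visible on your PATH. The helper name `missing_tools` is hypothetical and is not part of votuderep; it relies only on Python's standard `shutil.which`.

```python
import shutil

# Executables listed under Requirements
REQUIRED_TOOLS = ("blastn", "makeblastdb")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("BLAST+ tools found")
```

If the script reports a missing tool, install BLAST+ with one of the commands above and open a new shell so the updated PATH is picked up.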


## Usage

votuderep provides four main commands: `derep`, `filter`, `tabulate`, and `trainingdata`.

### Dereplicate vOTUs

Remove redundant sequences using BLAST and ANI clustering:

```bash
votuderep derep -i input.fasta -o dereplicated.fasta
```

**Options:**

- `-i, --input`: Input FASTA file [required]
- `-o, --output`: Output FASTA file [default: dereplicated_vOTUs.fasta]
- `-t, --threads`: Number of threads for BLAST [default: 2]
- `--tmp`: Temporary directory [default: $TEMP or /tmp]
- `--min-ani`: Minimum ANI threshold (0-100) [default: 95]
- `--min-tcov`: Minimum target coverage (0-100) [default: 85]
- `--keep`: Keep temporary directory with intermediate files

**Examples:**

```bash
# Basic dereplication
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta

# With custom parameters
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta \
  --min-ani 97 --min-tcov 90 -t 8

# Keep intermediate files for inspection
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta \
  --keep --tmp ./temp_dir
```

**How it works:**

1. Creates a BLAST database from input sequences
2. Performs all-vs-all BLASTN comparison
3. Calculates ANI (Average Nucleotide Identity) and coverage
4. Clusters sequences using a greedy, centroid-based algorithm
5. Outputs the longest sequence from each cluster (representative)
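Steps 4 and 5 can be pictured with the short sketch below. It is an illustration of greedy, longest-first centroid clustering in general, not votuderep's actual code. The function name `greedy_cluster` and the input layout (a dictionary of precomputed ANI and target-coverage values per sequence pair) are assumptions made for the example. The default thresholds mirror `--min-ani 95` and `--min-tcov 85`.

```python
def greedy_cluster(lengths, hits, min_ani=95.0, min_tcov=85.0):
    """Cluster sequences greedily, visiting the longest sequences first.

    lengths: dict mapping sequence id -> sequence length
    hits:    dict mapping (centroid_id, target_id) -> (ani, target_coverage),
             both expressed as percentages (hypothetical input layout)
    Returns a dict mapping each representative id to its member ids
    (the representative itself is always the first member).
    """
    # Longest first, ties broken by id, so every centroid is the longest
    # sequence of its own cluster.
    order = sorted(lengths, key=lambda sid: (-lengths[sid], sid))
    clusters = {}
    assigned = set()
    for centroid in order:
        if centroid in assigned:
            continue
        clusters[centroid] = [centroid]
        assigned.add(centroid)
        for target in order:
            if target in assigned:
                continue
            # Pairs without a recorded hit are treated as unrelated
            ani, tcov = hits.get((centroid, target), (0.0, 0.0))
            if ani >= min_ani and tcov >= min_tcov:
                clusters[centroid].append(target)
                assigned.add(target)
    return clusters


lengths = {"a": 10000, "b": 9000, "c": 5000}
hits = {("a", "b"): (98.0, 92.0), ("a", "c"): (80.0, 50.0)}
print(greedy_cluster(lengths, hits))
```

In this toy input, `b` joins the cluster of `a` because both thresholds are met, while `c` becomes the representative of its own cluster.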

### Filter by CheckV

Filter viral contigs based on CheckV quality metrics:

```bash
votuderep filter input.fasta checkv_output.tsv -o filtered.fasta
```

**Required Arguments:**

- `FASTA`: Input FASTA file with viral contigs
- `CHECKV_OUT`: TSV output file from CheckV

**Options:**

**Length filters:**
- `-m, --min-len`: Minimum contig length [default: 0]
- `--max-len`: Maximum contig length, 0 = unlimited [default: 0]

**Quality filters:**
- `--min-quality`: Minimum quality level: low, medium, or high [default: low]
- `--complete`: Only keep complete genomes
- `--exclude-undetermined`: Exclude contigs where quality is "Not-determined"

**Metrics filters:**
- `-c, --min-completeness`: Minimum completeness percentage (0-100)
- `--max-contam`: Maximum contamination percentage (0-100)
- `--no-warnings`: Only keep contigs with no warnings

**Other filters:**
- `--provirus`: Only select proviruses (provirus == "Yes")
- `-o, --output`: Output FASTA file [default: STDOUT]

**Examples:**

```bash
# Basic filtering - minimum quality
votuderep filter viral_contigs.fasta checkv_output.tsv -o filtered.fasta

# High-quality sequences only
votuderep filter viral_contigs.fasta checkv_output.tsv \
  --min-quality high -o high_quality.fasta

# Complete genomes with minimum length
votuderep filter viral_contigs.fasta checkv_output.tsv \
  --complete --min-len 5000 -o complete_genomes.fasta

# Complex filtering
votuderep filter viral_contigs.fasta checkv_output.tsv \
  --min-quality medium \
  --min-completeness 80 \
  --max-contam 5 \
  --no-warnings \
  --min-len 3000 \
  -o high_confidence.fasta

# Output to stdout (for piping)
votuderep filter viral_contigs.fasta checkv_output.tsv > filtered.fasta
```
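The length and metric options can be thought of as row-level checks against the CheckV table. The sketch below is an illustration only. It assumes the column names used by CheckV's `quality_summary.tsv` (`contig_length`, `completeness`, `contamination`, `warnings`), and it assumes that a contig with a missing (`NA`) completeness fails a completeness threshold; votuderep may handle missing values differently. The function names are hypothetical.

```python
def to_float(value):
    """Parse a numeric field; return None for blank, missing, or 'NA' values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def passes_metrics(row, min_len=0, max_len=0, min_completeness=None,
                   max_contam=None, no_warnings=False):
    """Return True if one CheckV row (a dict of strings) passes all thresholds."""
    length = int(row["contig_length"])
    if length < min_len:
        return False
    # As with --max-len, 0 means "no upper limit"
    if max_len and length > max_len:
        return False
    if min_completeness is not None:
        completeness = to_float(row.get("completeness"))
        # Assumption: an undetermined completeness cannot satisfy a threshold
        if completeness is None or completeness < min_completeness:
            return False
    if max_contam is not None:
        contamination = to_float(row.get("contamination"))
        if contamination is not None and contamination > max_contam:
            return False
    if no_warnings and (row.get("warnings") or "").strip():
        return False
    return True


row = {"contig_length": "6000", "completeness": "92.5",
       "contamination": "0.0", "warnings": ""}
print(passes_metrics(row, min_len=3000, min_completeness=80,
                     max_contam=5, no_warnings=True))
```

The call at the end mirrors the thresholds of the "Complex filtering" example, apart from `--min-quality`.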

**Quality Levels:**

CheckV assigns quality levels to viral contigs:

- **Complete**: Complete genomes (highest quality)
- **High-quality**: High confidence viral sequences
- **Medium-quality**: Moderate confidence sequences
- **Low-quality**: Lower confidence but valid sequences
- **Not-determined**: Quality could not be determined

The `--min-quality` option filters inclusively:
- `low`: Includes Low, Medium, High, and Complete (default)
- `medium`: Includes Medium, High, and Complete
- `high`: Includes High and Complete only

Note: "Not-determined" sequences are included by default unless `--exclude-undetermined` is used.
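The inclusive behaviour described above can be expressed as a rank comparison. This is a minimal sketch rather than the tool's implementation. The names `QUALITY_RANK`, `MIN_RANK`, and `passes_quality` are hypothetical, and the sketch assumes that "Not-determined" contigs pass at every `--min-quality` level unless they are explicitly excluded.

```python
# CheckV quality labels ordered from lowest to highest
QUALITY_RANK = {
    "Low-quality": 1,
    "Medium-quality": 2,
    "High-quality": 3,
    "Complete": 4,
}

# Values accepted by --min-quality
MIN_RANK = {"low": 1, "medium": 2, "high": 3}


def passes_quality(checkv_quality, min_quality="low", exclude_undetermined=False):
    """Return True if a contig's CheckV quality label meets the minimum level."""
    if checkv_quality == "Not-determined":
        # Kept by default; dropped only when exclusion is requested
        return not exclude_undetermined
    return QUALITY_RANK[checkv_quality] >= MIN_RANK[min_quality]


for label in ("Low-quality", "Medium-quality", "High-quality", "Complete"):
    print(label, passes_quality(label, min_quality="medium"))
```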

### Tabulate Reads

Generate a CSV table from a directory containing paired-end sequencing reads:

```bash
votuderep tabulate reads/ -o samples.csv
```

**Required Arguments:**

- `INPUT_DIR`: Directory containing sequencing read files

**Options:**

- `-o, --output`: Output CSV file [default: STDOUT]
- `-d, --delimiter`: Field separator [default: ,]
- `-1, --for-tag`: Forward read identifier [default: _R1]
- `-2, --rev-tag`: Reverse read identifier [default: _R2]
- `-s, --strip`: Remove string from sample names (can be used multiple times)
- `-e, --extension`: Only process files with this extension
- `-a, --absolute`: Use absolute paths in output

**Examples:**

```bash
# Basic usage - generate CSV table
votuderep tabulate reads/ -o samples.csv

# Custom read tags and extension
votuderep tabulate reads/ --for-tag _1 --rev-tag _2 --extension .fq.gz

# Strip patterns from sample names and use absolute paths
votuderep tabulate reads/ --strip "Sample_" --strip ".filtered" -a
```
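To make the roles of `--for-tag`, `--rev-tag`, and `--strip` concrete, here is a minimal sketch of how file names could be grouped into samples. It is a simplified illustration and not votuderep's code. The function name `pair_reads` is hypothetical, and the rule "the sample name is everything before the tag" is an assumption.

```python
def pair_reads(filenames, for_tag="_R1", rev_tag="_R2", strip=()):
    """Group read files into {sample: (forward_file, reverse_file)}.

    The sample name is taken as the part of the file name before the tag,
    with every string in `strip` removed. Samples missing either mate
    are left out of the result.
    """
    samples = {}
    for name in sorted(filenames):
        for tag, mate in ((for_tag, "forward"), (rev_tag, "reverse")):
            if tag in name:
                sample = name.split(tag)[0]
                for pattern in strip:
                    sample = sample.replace(pattern, "")
                samples.setdefault(sample, {})[mate] = name
                break
    return {
        sample: (mates["forward"], mates["reverse"])
        for sample, mates in samples.items()
        if "forward" in mates and "reverse" in mates
    }


files = ["Sample_A_R1.fastq.gz", "Sample_A_R2.fastq.gz", "B_R1.fastq.gz"]
print(pair_reads(files, strip=("Sample_",)))
```

Here `B_R1.fastq.gz` is dropped because it has no reverse mate, and `Sample_A` is reported as `A` after stripping.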

### Download Training Data

Download viral assembly and sequencing reads for training purposes:

```bash
votuderep trainingdata -o ./ebame-virome/
```

**Options:**

- `-o, --outdir`: Output directory [default: ./ebame-virome/]

**Examples:**

```bash
# Download to default directory
votuderep trainingdata

# Download to custom directory
votuderep trainingdata -o ./training_data/
```

## Global Options

- `-v, --verbose`: Enable verbose logging
- `--version`: Show version and exit
- `--help`: Show help message

## License

MIT License - See LICENSE file for details

 
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Authors

Andrea Telatin & QIB Core Bioinformatics

©️ Quadram Institute Bioscience 2025
