# ssiamb — Ambiguous Sites Counter
**Author:** Povilas Matusevicius <pmat@ssi.dk>
**Repository:** [https://github.com/ssi-dk/ssiamb](https://github.com/ssi-dk/ssiamb)
**License:** MIT
**Minimum Python:** 3.12+
## Overview
`ssiamb` computes an "ambiguous sites" metric for bacterial whole genome sequencing (WGS) data as a measure of within-sample heterogeneity. The tool modernizes and standardizes the lab's prior definition and adds robust packaging, a command-line interface, and Galaxy integration.
### What are "Ambiguous Sites"?
An ambiguous site is a genomic position with:
- **Sufficient coverage**: Depth ≥ `dp_min` (default: 10)
- **Minor-allele signal**: Minor-allele fraction (MAF) ≥ `maf_min` (default: 0.10)
Ambiguous sites are identified from variant calls after normalization and atomization, and each locus is counted **once** (a multi-allelic site counts once if any ALT allele passes the thresholds).
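The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the `ssiamb` implementation: the `AltCall` record and the `count_ambiguous_sites` function are hypothetical names, and the sketch assumes depth and MAF have already been extracted for each atomized ALT allele.

```python
from typing import Iterable, NamedTuple


class AltCall(NamedTuple):
    """One atomized ALT allele at a locus (hypothetical record type)."""
    chrom: str
    pos: int
    depth: int
    maf: float


def count_ambiguous_sites(
    calls: Iterable[AltCall], dp_min: int = 10, maf_min: float = 0.10
) -> int:
    """Count loci where at least one ALT allele passes both thresholds."""
    ambiguous_loci = set()
    for call in calls:
        if call.depth >= dp_min and call.maf >= maf_min:
            # A set keyed by (chrom, pos) counts each locus only once.
            ambiguous_loci.add((call.chrom, call.pos))
    return len(ambiguous_loci)


calls = [
    AltCall("contig_1", 100, 30, 0.20),  # passes both thresholds
    AltCall("contig_1", 100, 30, 0.15),  # second ALT at the same locus
    AltCall("contig_1", 250, 8, 0.40),   # depth below dp_min
    AltCall("contig_2", 100, 50, 0.05),  # MAF below maf_min
]
print(count_ambiguous_sites(calls))  # prints 1
```

Keying the set on `(chrom, pos)` is what makes a multi-allelic site count once regardless of how many of its ALT alleles pass.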
### Variant Calling and Filtering Strategy
`ssiamb` uses a two-stage approach to ensure comprehensive and consistent variant detection:
#### 1. Variant Calling (Capture All Variants)
- **BBTools**: Configured with `minallelefraction=0.0` to capture all variant calls regardless of frequency
- **BCFtools**: Configured without MAF filtering during the calling stage to capture all variants
- **Rationale**: Ensures no potentially relevant variants are lost during the calling process
#### 2. Analysis-Time Filtering (Apply Thresholds)
- **MAF Threshold**: The `--maf-min` parameter (default: 0.10) is applied during analysis, not during variant calling
- **Post-calling Filter**: All filtering for ambiguous site detection happens after variant calls are made
- **Consistency**: Both callers use the same filtering approach, ensuring comparable results
- **Flexibility**: Allows reanalysis of the same variant calls with different thresholds using `ssiamb summarize`
This approach maximizes sensitivity during variant detection while maintaining analytical flexibility and ensuring reproducible results across different caller technologies.
### Supported Modes
#### Self-mapping Mode (`ssiamb self`)
- **Input**: Reads → Sample's own assembly
- **Use case**: Analyze heterogeneity against the sample's assembled genome
- **Mapping space**: Uses the assembly as reference
#### Reference-mapped Mode (`ssiamb ref`)
- **Input**: Reads → Species canonical reference
- **Use case**: Compare against standardized reference genomes
- **Reference selection**: Via admin directory, user override, or Bracken classification
#### Summarize Mode (`ssiamb summarize`)
- **Input**: Pre-computed VCF + BAM files
- **Use case**: Reanalyze existing variant calls with different thresholds
- **Speed**: Fast analysis without remapping or variant calling
### Key Features
- **🧬 Automatic Reference Management**: Download and index references from NCBI RefSeq
- **🔧 Flexible Mapping**: Support for minimap2 (default) and bwa-mem2
- **📊 Multiple Variant Callers**: BBTools (default) and bcftools
- **📋 Comprehensive Outputs**: Summary TSV (always) + optional VCF, BED, matrices, per-contig analysis
- **📏 Depth Analysis**: Using mosdepth (default) or samtools
- **♻️ Reusable Workflows**: Accept pre-computed BAM/VCF files
- **🧪 Galaxy Integration**: Designed for workflow environments
- **✅ Quality Control**: Configurable thresholds with sensible defaults
- **🧪 Robust Testing**: Comprehensive test suite with smoke testing
## Installation
### Development Installation (Recommended)
For development or to get the latest features:
```bash
# Clone the repository
git clone https://github.com/ssi-dk/ssiamb.git
cd ssiamb
# Create conda environment with dependencies
conda env create -f environment.yml
conda activate ssiamb
# Install in editable mode
pip install -e .
# Verify installation
ssiamb --help
```
### Conda Environment Setup
The project includes an `environment.yml` file with all required dependencies:
```bash
# Create environment
conda env create -f environment.yml
# Activate environment
conda activate ssiamb
# Install ssiamb
pip install -e .
```
### Future Distribution Methods
When stable releases are published, the package will also be available via:
```bash
# Future installation via pip (PyPI)
pip install ssiamb
# Future installation via conda (Bioconda)
conda install -c bioconda ssiamb
```
### External Tool Dependencies
`ssiamb` requires several external bioinformatics tools. These are included in the conda environment:
- **Mapping**: `minimap2` and/or `bwa-mem2`
- **Variant calling**: BBTools and `bcftools`
- **Depth analysis**: `mosdepth` and `samtools`
- **VCF processing**: `bcftools` (for normalization)
## Admin Reference Directory Setup
For reference-mapped mode, you need an admin reference directory containing indexed reference genomes. `ssiamb` includes a built-in downloader to automatically fetch and index references from NCBI RefSeq.
### Setting Up References
1. **Set environment variable** (recommended):
```bash
export SSIAMB_REF_DIR=/path/to/references
```
2. **Download common bacterial references**:
```bash
# Download single species
python -m ssiamb.refseq download --species "Escherichia coli" --output-dir $SSIAMB_REF_DIR
# Download multiple species
python -m ssiamb.refseq download --species "Salmonella enterica" --output-dir $SSIAMB_REF_DIR
python -m ssiamb.refseq download --species "Staphylococcus aureus" --output-dir $SSIAMB_REF_DIR
```
3. **Verify setup**:
```bash
ls $SSIAMB_REF_DIR
# Should show: Escherichia_coli.fna, Escherichia_coli.fna.mmi, Escherichia_coli.fna.bwa.*, etc.
```
### Reference Downloader Features
- **Automatic Selection**: Chooses best RefSeq reference genome (complete > chromosome > scaffold)
- **Index Generation**: Creates both minimap2 (`.mmi`) and bwa-mem2 (`.bwa.*`) indexes
- **Species Normalization**: Handles common name variations and aliases
- **Progress Reporting**: Shows download progress with rich progress bars
- **Fallback Logic**: Tries multiple genomes if primary choice fails
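To check a reference directory programmatically, something like the following sketch may help. It assumes the file layout shown in the "Verify setup" step (`<Genus_species>.fna`, `.fna.mmi`, `.fna.bwa.*`) and a simple space-to-underscore naming rule; both helper functions are hypothetical and not part of the `ssiamb` API.

```python
import os
from pathlib import Path


def species_to_stem(species: str) -> str:
    """Turn 'Escherichia coli' into 'Escherichia_coli' (assumed naming rule)."""
    return "_".join(species.split())


def missing_reference_files(ref_dir: Path, species: str) -> list[str]:
    """Return the expected reference files that are absent from ref_dir."""
    stem = species_to_stem(species)
    missing = []
    if not (ref_dir / f"{stem}.fna").is_file():
        missing.append(f"{stem}.fna")
    if not (ref_dir / f"{stem}.fna.mmi").is_file():
        missing.append(f"{stem}.fna.mmi")
    # bwa-mem2 writes several index files, so any match is accepted here.
    if not any(ref_dir.glob(f"{stem}.fna.bwa.*")):
        missing.append(f"{stem}.fna.bwa.*")
    return missing


ref_dir = Path(os.environ.get("SSIAMB_REF_DIR", "."))
print(missing_reference_files(ref_dir, "Escherichia coli"))
```

An empty list means the FASTA and both index types were found.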
### Common Species Examples
```bash
# Popular bacterial pathogens
python -m ssiamb.refseq download --species "Escherichia coli" --output-dir $SSIAMB_REF_DIR
python -m ssiamb.refseq download --species "Salmonella enterica" --output-dir $SSIAMB_REF_DIR
python -m ssiamb.refseq download --species "Staphylococcus aureus" --output-dir $SSIAMB_REF_DIR
python -m ssiamb.refseq download --species "Streptococcus pneumoniae" --output-dir $SSIAMB_REF_DIR
python -m ssiamb.refseq download --species "Klebsiella pneumoniae" --output-dir $SSIAMB_REF_DIR
```
```bash
# Install in editable/development mode (recommended for contributors)
pip install -e .
# After editable install you can run the CLI via the console script or module:
ssiamb --help
# or
python -m ssiamb --help
```
## Quick Start
### Basic Usage
```bash
# Check what would be done (dry run)
ssiamb self --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --assembly sample.fna --dry-run
# Self-mapping mode: analyze reads against sample's own assembly
ssiamb self --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --assembly sample.fna
# Reference-mapped mode: analyze against species reference
ssiamb ref --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --species "Escherichia coli"
# Summarize existing VCF and BAM files
ssiamb summarize --vcf sample.vcf.gz --bam sample.bam
```
### Comprehensive Examples
#### Self-Mapping Mode
```bash
# Basic self-mapping
ssiamb self \
--r1 reads_R1.fastq.gz \
--r2 reads_R2.fastq.gz \
--assembly assembly.fna \
--sample MySample \
--outdir results/
# With custom thresholds and optional outputs
ssiamb self \
--r1 reads_R1.fastq.gz \
--r2 reads_R2.fastq.gz \
--assembly assembly.fna \
--dp-min 15 \
--maf-min 0.05 \
--emit-vcf \
--emit-bed \
--emit-matrix \
--threads 8
# Using bwa-mem2 instead of minimap2
ssiamb self \
--r1 reads_R1.fastq.gz \
--r2 reads_R2.fastq.gz \
--assembly assembly.fna \
--mapper bwa-mem2 \
--caller bcftools
```
#### Reference-Mapping Mode
```bash
# Using species name (requires admin reference directory)
export SSIAMB_REF_DIR=/path/to/references
ssiamb ref \
--r1 reads_R1.fastq.gz \
--r2 reads_R2.fastq.gz \
--species "Escherichia coli" \
--sample MySample
# Using direct reference file
ssiamb ref \
--r1 reads_R1.fastq.gz \
--r2 reads_R2.fastq.gz \
--reference reference_genome.fna \
--sample MySample
# Using Bracken classification results
ssiamb ref \
--r1 reads_R1.fastq.gz \
--r2 reads_R2.fastq.gz \
--bracken sample.bracken \
--ref-dir /path/to/references \
--min-bracken-frac 0.8
```
#### Advanced Usage
```bash
# Output to stdout (no files written)
ssiamb self \
--r1 reads_R1.fastq.gz \
--r2 reads_R2.fastq.gz \
--assembly assembly.fna \
--stdout
# Reuse existing BAM file
ssiamb self \
--r1 reads_R1.fastq.gz \
--r2 reads_R2.fastq.gz \
--assembly assembly.fna \
--bam existing_alignment.bam
# Append to existing TSV instead of overwriting
ssiamb self \
--r1 reads_R1.fastq.gz \
--r2 reads_R2.fastq.gz \
--assembly assembly.fna \
--tsv-mode append
# Comprehensive output with provenance
ssiamb self \
--r1 reads_R1.fastq.gz \
--r2 reads_R2.fastq.gz \
--assembly assembly.fna \
--emit-vcf \
--emit-bed \
--emit-matrix \
--emit-per-contig \
--emit-provenance \
--emit-multiqc
```
### Testing Your Installation
Run the built-in smoke test to verify everything works:
```bash
# Quick test (skips downloads)
python smoke_test.py --skip-downloads
# Full test including reference downloads (slower)
python smoke_test.py
```
## Output Files and Formats
### Primary Output
#### `ambiguous_summary.tsv`
Always generated. Single-row summary with comprehensive metrics:
| Column | Description | Example |
|--------|-------------|---------|
| `sample` | Sample identifier | `MySample` |
| `mode` | Analysis mode | `self`, `ref` |
| `mapper` | Alignment tool used | `minimap2`, `bwa-mem2` |
| `caller` | Variant caller used | `bbtools`, `bcftools` |
| `dp_min` | Minimum depth threshold | `10` |
| `maf_min` | Minimum MAF threshold | `0.1` |
| `ambiguous_snv_count` | Number of ambiguous SNVs | `42` |
| `ambiguous_snv_per_mb` | SNVs per megabase | `15.23` |
| `callable_bases` | Bases with adequate coverage | `4651234` |
| `genome_length` | Total genome length | `4800000` |
| `breadth_10x` | Fraction covered at 10x+ | `0.9691` |
| `ref_label` | Reference identifier | `Escherichia_coli\|GCF_000005825.2` |
| `runtime_sec` | Analysis runtime | `245.67` |
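The summary is a plain TSV, so it can be read with the standard library. The sketch below recomputes two derived values from the raw columns. It assumes that the per-megabase density is normalized by `callable_bases` and that breadth is `callable_bases / genome_length`; the README does not state these formulas, so treat them as assumptions. The numbers in `example_tsv` are invented for illustration.

```python
import csv
import io

# Illustrative content only; a real file comes from an ssiamb run.
example_tsv = (
    "sample\tambiguous_snv_count\tcallable_bases\tgenome_length\n"
    "MySample\t50\t4000000\t5000000\n"
)


def derived_metrics(row: dict[str, str]) -> dict[str, float]:
    """Recompute density and breadth from the raw count columns."""
    count = int(row["ambiguous_snv_count"])
    callable_bases = int(row["callable_bases"])
    genome_length = int(row["genome_length"])
    return {
        # ASSUMPTION: density is normalized by callable bases.
        "snv_per_mb": count * 1_000_000 / callable_bases if callable_bases else 0.0,
        # ASSUMPTION: breadth is the callable fraction of the genome.
        "breadth": callable_bases / genome_length if genome_length else 0.0,
    }


for row in csv.DictReader(io.StringIO(example_tsv), delimiter="\t"):
    metrics = derived_metrics(row)
    print(row["sample"], metrics["snv_per_mb"], metrics["breadth"])
```

To read a real file, replace the `io.StringIO(...)` wrapper with `open("ambiguous_summary.tsv", newline="")`.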
### Optional Outputs
Enable with command-line flags:
#### `--emit-vcf`: Variant Call Format
- **File**: `{SAMPLE}.ambiguous_sites.vcf.gz` + `.tbi` index
- **Content**: Normalized, atomized variants passing thresholds
- **Annotations**: Custom INFO fields (MAF, AMBIG flag, etc.)
- **Format**: BGzip compressed, tabix indexed
#### `--emit-bed`: Browser Extensible Data
- **File**: `{SAMPLE}.ambiguous_sites.bed.gz` + `.tbi` index
- **Content**: Genomic intervals of ambiguous sites
- **Columns**: `chrom`, `start`, `end`, `name`, `score`, `strand`, `sample`, `variant_class`, `ref`, `alt`, `maf`, `dp`
- **Format**: BGzip compressed, tabix indexed
#### `--emit-matrix`: Depth×MAF Matrix
- **File**: `{SAMPLE}.ambiguity_matrix.tsv.gz`
- **Content**: 100×51 cumulative count matrix
- **Rows**: Depth thresholds (1-100)
- **Columns**: MAF bins (0-50, representing 0.00-0.50)
- **Values**: Cumulative variant counts
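One plausible reading of "cumulative" is that each cell holds the number of sites with depth at or above the row threshold and MAF at or above the column threshold. The sketch below builds such a matrix under that assumption; the function name, the floor-based MAF binning, and the capping of depth at 100 are illustrative choices rather than confirmed `ssiamb` behavior.

```python
def cumulative_matrix(sites, max_depth=100, maf_bins=51):
    """Build a max_depth x maf_bins matrix of cumulative site counts.

    sites: iterable of (depth, maf) pairs.
    Row i covers depth >= i + 1; column j covers MAF >= j / 100.
    """
    # Tally sites by (capped depth, MAF bin); depth index 0 stays unused.
    exact = [[0] * maf_bins for _ in range(max_depth + 1)]
    for depth, maf in sites:
        if depth < 1 or maf < 0:
            continue
        d = min(depth, max_depth)  # depths above the cap fall in the last row
        m = min(int(maf * 100 + 1e-9), maf_bins - 1)  # floor to a 0.01 bin
        exact[d][m] += 1

    # Suffix sums over both axes turn exact tallies into ">=" counts.
    cum = [[0] * maf_bins for _ in range(max_depth + 2)]
    for d in range(max_depth, 0, -1):
        running = 0
        for m in range(maf_bins - 1, -1, -1):
            running += exact[d][m]
            cum[d][m] = cum[d + 1][m] + running
    return cum[1 : max_depth + 1]


sites = [(30, 0.20), (12, 0.10), (150, 0.50), (5, 0.03)]
matrix = cumulative_matrix(sites)
print(matrix[9][10])  # sites with depth >= 10 and MAF >= 0.10; prints 3
```

With this layout, the count for any `(dp_min, maf_min)` pair on the grid is a single lookup at `matrix[dp_min - 1][round(maf_min * 100)]`.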
#### `--emit-per-contig`: Per-Contig Summary
- **File**: `{SAMPLE}.per_contig_summary.tsv`
- **Content**: Breakdown by chromosome/contig
- **Columns**: `sample`, `contig`, `length`, `callable_bases_10x`, `breadth_10x`, `ambiguous_snv_count`, etc.
#### `--emit-provenance`: Analysis Provenance
- **File**: `run_provenance.json`
- **Content**: Complete analysis parameters, tool versions, runtime info
- **Format**: JSON array (one entry per sample)
#### `--emit-multiqc`: MultiQC Integration
- **File**: `{SAMPLE}.multiqc.tsv`
- **Content**: Curated metrics for MultiQC reporting
- **Use case**: Integration with MultiQC pipelines
### Output Directory Structure
```
results/
├── ambiguous_summary.tsv # Always generated
├── MySample.ambiguous_sites.vcf.gz # --emit-vcf
├── MySample.ambiguous_sites.vcf.gz.tbi # VCF index
├── MySample.ambiguous_sites.bed.gz # --emit-bed
├── MySample.ambiguous_sites.bed.gz.tbi # BED index
├── MySample.ambiguity_matrix.tsv.gz # --emit-matrix
├── MySample.per_contig_summary.tsv # --emit-per-contig
├── MySample.multiqc.tsv # --emit-multiqc
└── run_provenance.json # --emit-provenance
```
## CLI Reference
### Global Options
| Option | Default | Description |
|--------|---------|-------------|
| `--threads` | `4` | Number of CPU threads |
| `--outdir` | `.` | Output directory |
| `--sample` | *inferred* | Sample name (required if auto-inference fails) |
| `--dp-min` | `10` | Minimum depth for ambiguous sites |
| `--maf-min` | `0.1` | Minimum minor-allele fraction (post-calling filter) |
| `--dp-cap` | `100` | Maximum depth cap (clipped to 100) |
| `--mapper` | `minimap2` | Alignment tool (`minimap2`, `bwa-mem2`) |
| `--caller` | `bbtools` | Variant caller (`bbtools`, `bcftools`) |
| `--depth-tool` | `mosdepth` | Depth analysis tool (`mosdepth`, `samtools`) |
| `--require-pass` | `False` | Only use PASS variants |
| `--tsv-mode` | `overwrite` | TSV handling (`overwrite`, `append`, `fail`) |
| `--stdout` | `False` | Write summary to stdout only |
### Command-Specific Options
#### `ssiamb self`
| Option | Required | Description |
|--------|----------|-------------|
| `--r1` | ✅ | Forward reads (FASTQ, gzipped OK) |
| `--r2` | ✅ | Reverse reads (FASTQ, gzipped OK) |
| `--assembly` | ✅ | Assembly FASTA file |
| `--vcf` | ❌ | Reuse existing VCF file |
| `--bam` | ❌ | Reuse existing BAM file |
#### `ssiamb ref`
| Option | Required / Default | Description |
|--------|----------|-------------|
| `--r1` | ✅ | Forward reads (FASTQ, gzipped OK) |
| `--r2` | ✅ | Reverse reads (FASTQ, gzipped OK) |
| `--reference` | ❌* | Direct reference FASTA |
| `--species` | ❌* | Species name for lookup |
| `--bracken` | ❌* | Bracken classification file |
| `--ref-dir` | ❌ | Admin reference directory |
| `--min-bracken-frac` | `0.70` | Minimum Bracken fraction |
| `--min-bracken-reads` | `100000` | Minimum Bracken reads |
| `--on-fail` | `error` | Bracken failure action (`error`, `self`) |
*One of `--reference`, `--species`, or `--bracken` is required.
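The two Bracken thresholds can be pictured as a gate on the top species-level row. The sketch below is an assumption about how such a gate might work, not the actual `ssiamb` selection logic: it reads the standard Bracken output columns and accepts the top species only if `fraction_total_reads` and `new_est_reads` both meet the minimums. The function name and the choice of `new_est_reads` as the read count are hypothetical.

```python
import csv
import io


def pick_species(bracken_text, min_frac=0.70, min_reads=100_000):
    """Return the dominant species name, or None if it fails the thresholds."""
    rows = csv.DictReader(io.StringIO(bracken_text), delimiter="\t")
    species_rows = [r for r in rows if r["taxonomy_lvl"] == "S"]
    if not species_rows:
        return None
    top = max(species_rows, key=lambda r: float(r["fraction_total_reads"]))
    passes_frac = float(top["fraction_total_reads"]) >= min_frac
    passes_reads = int(top["new_est_reads"]) >= min_reads  # assumed read column
    return top["name"] if passes_frac and passes_reads else None


# Invented example in the standard Bracken output layout.
example = (
    "name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\t"
    "added_reads\tnew_est_reads\tfraction_total_reads\n"
    "Escherichia coli\t562\tS\t900000\t50000\t950000\t0.95\n"
    "Shigella flexneri\t623\tS\t40000\t10000\t50000\t0.05\n"
)
print(pick_species(example))  # prints Escherichia coli
```

A `None` result corresponds to the situation that `--on-fail` controls: stop with an error or fall back to self-mapping.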
#### `ssiamb summarize`
| Option | Required | Description |
|--------|----------|-------------|
| `--vcf` | ✅ | VCF file to summarize |
| `--bam` | ✅ | BAM file for denominator |
| `--output` | ❌ | Output file path |
## Error Codes and Troubleshooting
### Exit Codes
`ssiamb` follows a structured exit code system for programmatic handling:
- **0**: Success
- **1**: CLI/input errors (missing files, invalid sample names, bad arguments)
- **2**: Reference mode selection errors (species not found, Bracken failures)
- **3**: Reuse compatibility errors (VCF/BAM mismatch with reference)
- **4**: External tool failures (missing tools, tool execution errors)
- **5**: QC failures (only when `--qc-action fail` is enabled)
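A wrapper script can branch on these codes. The following is a minimal sketch using only the standard library; the `EXIT_MEANINGS` table restates the list above, and `run_ssiamb` is a hypothetical helper, not something shipped with the package.

```python
import subprocess

# Meanings restated from the exit code list above.
EXIT_MEANINGS = {
    0: "success",
    1: "CLI/input error",
    2: "reference mode selection error",
    3: "reuse compatibility error",
    4: "external tool failure",
    5: "QC failure",
}


def describe_exit(code: int) -> str:
    """Map an ssiamb exit code to a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_ssiamb(args: list[str]) -> str:
    """Run ssiamb with the given arguments and describe the outcome."""
    try:
        completed = subprocess.run(["ssiamb", *args], check=False)
    except FileNotFoundError:
        return "ssiamb executable not found on PATH"
    return describe_exit(completed.returncode)


print(describe_exit(4))  # prints external tool failure
```

For example, `run_ssiamb(["self", "--r1", "reads_R1.fastq.gz", "--r2", "reads_R2.fastq.gz", "--assembly", "assembly.fna"])` returns the description for whatever code that run produced.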
### Common Issues and Solutions
#### Missing External Tools
**Error**: `Command not found: minimap2`
**Solution**: Install dependencies via conda:
```bash
conda install -c conda-forge -c bioconda minimap2 bwa-mem2 bcftools samtools mosdepth bbmap
```
#### Reference Directory Issues
**Error**: `Species 'Escherichia coli' not found in admin directory`
**Solutions**:
```bash
# Option 1: Download the species reference
python -m ssiamb.refseq download --species "Escherichia coli" --output-dir $SSIAMB_REF_DIR
# Option 2: Use direct reference file
ssiamb ref --reference /path/to/ecoli.fna --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz
# Option 3: Set environment variable
export SSIAMB_REF_DIR=/path/to/your/references
```
#### Index File Issues
**Error**: `Reference found but indexes missing for 'Escherichia_coli': minimap2 index`
**Solution**: Regenerate indexes:
```bash
cd $SSIAMB_REF_DIR
minimap2 -d Escherichia_coli.fna.mmi Escherichia_coli.fna
bwa-mem2 index Escherichia_coli.fna
```
#### Sample Name Issues
**Error**: `Sample name could not be safely inferred`
**Solution**: Explicitly provide sample name:
```bash
ssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna --sample MySample
```
#### Memory Issues
**Error**: `BBTools out of memory`
**Solutions**:
```bash
# Reduce threads
ssiamb self --threads 2 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
# Set BBTools memory limit
ssiamb self --bbtools-mem 4g --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
```
#### Permission Issues
**Error**: `Cannot write to output directory`
**Solutions**:
```bash
# Create directory with proper permissions
mkdir -p /path/to/output
chmod 755 /path/to/output
# Or use different output directory
ssiamb self --outdir ~/results --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
```
### Debugging Tips
1. **Use dry-run mode** to validate inputs:
```bash
ssiamb self --dry-run --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
```
2. **Check tool versions**:
```bash
minimap2 --version
bcftools --version
mosdepth --version
```
3. **Validate input files**:
```bash
# Check FASTQ files
zcat reads_R1.fastq.gz | head -4
# Check FASTA files
head assembly.fna
```
4. **Run smoke test**:
```bash
python smoke_test.py --skip-downloads
```
### Getting Help
- **CLI help**: `ssiamb --help` or `ssiamb COMMAND --help`
- **GitHub Issues**: [Report bugs or request features](https://github.com/ssi-dk/ssiamb/issues)
- **Contact**: Povilas Matusevicius <pmat@ssi.dk>
## Output
### Primary Output
- **`ambiguous_summary.tsv`**: Single-row summary with ambiguous site counts and quality metrics
### Optional Outputs (via flags)
- **`--emit-vcf`**: Variant calls with ambiguity annotations
- **`--emit-bed`**: BED file of ambiguous sites
- **`--emit-matrix`**: Depth×MAF cumulative count matrix
- **`--emit-per-contig`**: Per-contig breakdown
- **`--emit-provenance`**: Analysis provenance and parameters
- **`--emit-multiqc`**: MultiQC-compatible reports
## Testing
The project includes unit tests and an integration smoke test.
### Running Tests
```bash
# Run all unit tests
python -m pytest tests/ -v
# Run specific test modules
python -m pytest tests/test_refseq.py -v
python -m pytest tests/test_refdir.py -v
# Run smoke test (integration test)
python smoke_test.py --skip-downloads # Fast version
python smoke_test.py # Full version with downloads
```
### Test Coverage
The test suite covers:
- **Unit tests**: Core algorithms, species normalization, reference downloading
- **Integration tests**: Full pipeline validation via smoke test
- **Edge cases**: Error handling, malformed inputs, missing dependencies
- **Mock testing**: External API calls and tool dependencies
### Test Dependencies
Install test dependencies:
```bash
# Via conda environment (recommended)
conda env create -f environment.yml
# Or manually
pip install pytest numpy pysam biopython requests
```
## Development Status
This project has completed its major development milestones:
- ✅ **Planning & Specification** - Comprehensive requirements defined
- ✅ **Repository Bootstrap** - Package structure, CI/CD, documentation
- ✅ **Core Implementation** - CLI, models, and processing pipelines
- ✅ **External Tool Integration** - Mapping and variant calling workflows
- ✅ **Reference Management** - Automatic RefSeq download and indexing
- ✅ **Testing & Validation** - Unit tests, integration testing, smoke tests
- 🚧 **Packaging & Distribution** - Bioconda, containers, Galaxy tools
### Recent Features
- **RefSeq Integration**: Automatic reference genome downloading from NCBI
- **Robust Indexing**: Automatic minimap2 and bwa-mem2 index generation
- **Enhanced Testing**: Comprehensive unit tests and smoke testing
- **Improved CLI**: Better help text, error messages, and validation
- **Output Flexibility**: Centralized directory handling with fallbacks
## Multi-Sample Processing
`ssiamb` supports batch processing via manifest files for analyzing multiple samples efficiently.
### Manifest Files
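Create tab-separated (TSV) files listing samples and their inputs. Columns are separated by single tab characters, and an unused optional column is left as an empty cell so that later columns stay aligned: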
#### Self-Mode Manifest
```tsv
sample r1 r2 assembly bam vcf
Sample1 data/Sample1_R1.fastq.gz data/Sample1_R2.fastq.gz assemblies/Sample1.fna
Sample2 data/Sample2_R1.fastq.gz data/Sample2_R2.fastq.gz assemblies/Sample2.fna existing/Sample2.bam
Sample3 data/Sample3_R1.fastq.gz data/Sample3_R2.fastq.gz assemblies/Sample3.fna existing/Sample3.vcf.gz
```
#### Reference-Mode Manifest
```tsv
sample r1 r2 bracken reference species bam vcf
Sample1 data/Sample1_R1.fastq.gz data/Sample1_R2.fastq.gz Sample1.bracken
Sample2 data/Sample2_R1.fastq.gz data/Sample2_R2.fastq.gz ref/custom.fna
Sample3 data/Sample3_R1.fastq.gz data/Sample3_R2.fastq.gz Escherichia coli
```
### Running with Manifests
```bash
# Process self-mode manifest
ssiamb self --manifest samples.tsv --outdir results/
# Process reference-mode manifest
ssiamb ref --manifest samples.tsv --ref-dir $SSIAMB_REF_DIR --outdir results/
# With custom settings
ssiamb self --manifest samples.tsv --dp-min 15 --emit-vcf --threads 8
```
### Manifest Features
- **Relative paths**: Resolved relative to manifest file location
- **Optional columns**: Empty cells skip optional inputs (bam, vcf)
- **Comments**: Lines starting with `#` are ignored
- **Sequential processing**: Samples processed one at a time
- **Consolidated output**: Single `ambiguous_summary.tsv` with all samples
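The manifest rules above (tab-separated columns, `#` comments, empty optional cells, paths relative to the manifest) can be mirrored in a small reader, which is handy for validating a manifest before a long batch run. This is a sketch under those stated rules; `read_manifest` and `PATH_COLUMNS` are hypothetical names, and the parser inside `ssiamb` may differ in detail.

```python
import csv
from pathlib import Path

# Columns assumed to hold file paths, taken from the two manifest layouts.
PATH_COLUMNS = {"r1", "r2", "assembly", "bam", "vcf", "bracken", "reference"}


def read_manifest(path: Path) -> list[dict[str, str]]:
    """Parse a manifest, skipping comments and resolving relative paths."""
    samples = []
    with path.open(newline="") as handle:
        data_lines = (
            line for line in handle
            if line.strip() and not line.startswith("#")
        )
        for row in csv.DictReader(data_lines, delimiter="\t"):
            resolved = {}
            for column, value in row.items():
                if column is None:  # extra cells beyond the header
                    continue
                value = (value or "").strip()  # short rows yield None
                if column in PATH_COLUMNS and value:
                    # Relative paths are resolved against the manifest location.
                    value = str((path.parent / value).resolve())
                resolved[column] = value
            samples.append(resolved)
    return samples
```

Calling `read_manifest(Path("samples.tsv"))` returns one dictionary per sample, with empty strings for optional inputs that were left blank.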
## Performance Considerations
### Resource Usage
- **CPU**: Scales with `--threads` parameter (default: 4)
- **Memory**:
- BBTools: 4-8GB (adjustable with `--bbtools-mem`)
- bwa-mem2: 2-4GB
- minimap2: 1-2GB
- **Disk**: ~2-5x input size for intermediate files
- **Network**: Required for RefSeq downloads only
### Optimization Tips
1. **Use appropriate thread count**:
```bash
# For machines with many CPU cores (and enough memory for the chosen tools)
ssiamb self --threads 16 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
```
2. **Reuse intermediate files**:
```bash
# First run - saves BAM
ssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
# Subsequent runs - reuses BAM
ssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna --bam sample.sorted.bam --dp-min 15
```
3. **Pre-download references**:
```bash
# Download all needed references first
python -m ssiamb.refseq download --species "Escherichia coli" --output-dir $SSIAMB_REF_DIR
python -m ssiamb.refseq download --species "Salmonella enterica" --output-dir $SSIAMB_REF_DIR
```
4. **Use faster mappers for large datasets**:
```bash
# minimap2 is generally faster
ssiamb self --mapper minimap2 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
```
### Typical Runtimes
| Dataset Size | Mode | Mapper | Runtime (4 cores) |
|-------------|------|--------|--------------------|
| 1M PE reads | self | minimap2 | 2-5 minutes |
| 1M PE reads | ref | minimap2 | 3-7 minutes |
| 5M PE reads | self | minimap2 | 8-15 minutes |
| 5M PE reads | ref | bwa-mem2 | 15-25 minutes |
*Times vary based on genome size, coverage, and hardware*
## Contributing
This project is developed by the SSI team. The codebase is now feature-complete with comprehensive testing.
### Development Setup
```bash
# Clone repository
git clone https://github.com/ssi-dk/ssiamb.git
cd ssiamb
# Create development environment
conda env create -f environment.yml
conda activate ssiamb
# Install in editable mode
pip install -e .
# Run tests
python -m pytest tests/ -v
python smoke_test.py --skip-downloads
```
### Code Quality
Before submitting changes:
1. **Run the test suite**:
```bash
python -m pytest tests/ -v
```
2. **Run smoke tests**:
```bash
python smoke_test.py
```
3. **Check code style** (if configured):
```bash
black src/ tests/
isort src/ tests/
```
### Contact
For questions, contributions, or issues:
- **Primary Contact**: Povilas Matusevicius <pmat@ssi.dk>
- **GitHub Issues**: [Report bugs or request features](https://github.com/ssi-dk/ssiamb/issues)
- **Repository**: [https://github.com/ssi-dk/ssiamb](https://github.com/ssi-dk/ssiamb)
## Release Process
Publishing to PyPI is automated; publishing to Bioconda and the Galaxy ToolShed requires the manual steps described below. The release process is as follows:
### 1. Version Update
1. Update version in `pyproject.toml`:
```toml
[project]
version = "1.0.0" # Update this
```
2. Update version in `recipes/ssiamb/meta.yaml`:
```yaml
{% set version = "1.0.0" %} # Update this
```
3. Update version in `galaxy/ssiamb.xml`:
```xml
<tool id="ssiamb" name="Ambiguous Sites Counter" version="1.0.0+galaxy0">
```
### 2. Create Release
1. Commit version changes:
```bash
git add pyproject.toml recipes/ssiamb/meta.yaml galaxy/ssiamb.xml
git commit -m "Bump version to v1.0.0"
git push origin main
```
2. Create and push tag:
```bash
git tag v1.0.0
git push origin v1.0.0
```
### 3. Automated Publishing
#### PyPI Publishing (Automatic)
- GitHub Actions automatically publishes to PyPI on tag push
- Uses PyPI Trusted Publishing (OIDC) - no tokens needed
- Creates signed GitHub release with artifacts
#### Bioconda Publishing (Manual)
1. Wait for PyPI release to complete
2. Update `recipes/ssiamb/meta.yaml` with correct SHA256:
```bash
# Download the source distribution (not a wheel) and compute its SHA256
pip download ssiamb==1.0.0 --no-deps --no-binary :all:
shasum -a 256 ssiamb-1.0.0.tar.gz
```
3. Fork [bioconda/bioconda-recipes](https://github.com/bioconda/bioconda-recipes)
4. Copy this repository's `recipes/ssiamb/` directory to `recipes/ssiamb/` in the fork
5. Create pull request to bioconda-recipes
6. Address review feedback and wait for merge
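On systems without `shasum`, the checksum needed in step 2 can be computed with Python's standard library. This is a generic sketch; `sha256_of_file` is a hypothetical helper name, and the tarball filename in the usage note simply follows the example above.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Chunked reads keep memory use flat for large archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of_file(Path("ssiamb-1.0.0.tar.gz"))` yields the value to paste into `meta.yaml`.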
#### Galaxy ToolShed Publishing (Manual)
1. Install planemo: `pip install planemo`
2. Test wrapper: `planemo test galaxy/ssiamb.xml` (may fail until bioconda is available)
3. Create account on [Galaxy ToolShed](https://toolshed.g2.bx.psu.edu/)
4. Upload wrapper:
```bash
cd galaxy/
planemo shed_upload --shed_target toolshed
```
### 4. Post-Release
1. Verify all distributions:
- PyPI: <https://pypi.org/project/ssiamb/>
- Bioconda: <https://anaconda.org/bioconda/ssiamb>
- Galaxy ToolShed: <https://toolshed.g2.bx.psu.edu/>
- BioContainers: <https://quay.io/repository/biocontainers/ssiamb>
2. Update documentation if needed
3. Announce release
### Version Numbering
- Use semantic versioning: `MAJOR.MINOR.PATCH`
- Galaxy wrapper versions: `SOFTWARE_VERSION+galaxy0` (increment galaxy# for wrapper-only changes)
- Pre-releases: `1.0.0rc1`, `1.0.0a1`, etc.
### Troubleshooting
See `PYPI_SETUP.md` for PyPI Trusted Publishing configuration details.
## Citation
> **Note**: Citation information will be provided upon publication.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
Raw data
{
"_id": null,
"home_page": null,
"name": "ssiamb",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12",
"maintainer_email": null,
"keywords": "bacterial-genomics, bioinformatics, genomics, mapping, variant-calling",
"author": null,
"author_email": "SSI Denmark <support@ssi.dk>",
"download_url": "https://files.pythonhosted.org/packages/5a/4b/db01f3c0622a7c3f96dfe1ed77cd33eeeb825665c27087b01843eddc26e4/ssiamb-0.9.0.tar.gz",
"platform": null,
"description": "# ssiamb \u2014 Ambiguous Sites Counter\n\n[](https://opensource.org/licenses/MIT)\n[](https://www.python.org/downloads/)\n\n**Author:** Povilas Matusevicius <pmat@ssi.dk> \n**Repository:** [https://github.com/ssi-dk/ssiamb](https://github.com/ssi-dk/ssiamb) \n**License:** MIT \n**Minimum Python:** 3.12+\n\n## Overview\n\n`ssiamb` computes an \"ambiguous sites\" metric for bacterial whole genome sequencing (WGS) as a measure of within-sample heterogeneity. This tool modernizes and standardizes the lab's prior definition while providing robust packaging, CLI interface, and Galaxy integration capabilities.\n\n### What are \"Ambiguous Sites\"?\n\nAn ambiguous site is a genomic position with:\n \n- **Sufficient coverage**: Depth \u2265 `dp_min` (default: 10)\n- **Minor-allele signal**: Minor-allele fraction (MAF) \u2265 `maf_min` (default: 0.10)\n\nThese metrics are determined from variant calls after normalization and atomization, counting **once per locus** (multi-allelic sites count once if any ALT passes the thresholds).\n\n### Variant Calling and Filtering Strategy\n\n`ssiamb` uses a two-stage approach to ensure comprehensive and consistent variant detection:\n\n#### 1. Variant Calling (Capture All Variants)\n\n- **BBTools**: Configured with `minallelefraction=0.0` to capture all variant calls regardless of frequency\n- **BCFtools**: Configured without MAF filtering during the calling stage to capture all variants\n- **Rationale**: Ensures no potentially relevant variants are lost during the calling process\n\n#### 2. 
Analysis-Time Filtering (Apply Thresholds)\n\n- **MAF Threshold**: The `--maf-min` parameter (default: 0.10) is applied during analysis, not during variant calling\n- **Post-calling Filter**: All filtering for ambiguous site detection happens after variant calls are made\n- **Consistency**: Both callers use the same filtering approach, ensuring comparable results\n- **Flexibility**: Allows reanalysis of the same variant calls with different thresholds using `ssiamb summarize`\n\nThis approach maximizes sensitivity during variant detection while maintaining analytical flexibility and ensuring reproducible results across different caller technologies.\n\n### Supported Modes\n\n#### Self-mapping Mode (`ssiamb self`)\n- **Input**: Reads \u2192 Sample's own assembly\n- **Use case**: Analyze heterogeneity against the sample's assembled genome\n- **Mapping space**: Uses the assembly as reference\n\n#### Reference-mapped Mode (`ssiamb ref`)\n- **Input**: Reads \u2192 Species canonical reference\n- **Use case**: Compare against standardized reference genomes \n- **Reference selection**: Via admin directory, user override, or Bracken classification\n\n#### Summarize Mode (`ssiamb summarize`)\n- **Input**: Pre-computed VCF + BAM files\n- **Use case**: Reanalyze existing variant calls with different thresholds\n- **Speed**: Fast analysis without remapping or variant calling\n\n### Key Features\n\n- **\ud83e\uddec Automatic Reference Management**: Download and index references from NCBI RefSeq\n- **\ud83d\udd27 Flexible Mapping**: Support for minimap2 (default) and bwa-mem2\n- **\ud83d\udcca Multiple Variant Callers**: BBTools (default) and bcftools\n- **\ud83d\udccb Comprehensive Outputs**: Summary TSV (always) + optional VCF, BED, matrices, per-contig analysis\n- **\ud83d\udccf Depth Analysis**: Using mosdepth (default) or samtools\n- **\u267b\ufe0f Reusable Workflows**: Accept pre-computed BAM/VCF files\n- **\ud83e\uddea Galaxy Integration**: Designed for workflow 
environments\n- **\u2705 Quality Control**: Configurable thresholds with sensible defaults\n- **\ud83e\uddea Robust Testing**: Comprehensive test suite with smoke testing\n\n## Installation\n\n### Development Installation (Recommended)\n\nFor development or to get the latest features:\n\n```bash\n# Clone the repository\ngit clone https://github.com/ssi-dk/ssiamb.git\ncd ssiamb\n\n# Create conda environment with dependencies\nconda env create -f environment.yml\nconda activate ssiamb\n\n# Install in editable mode\npip install -e .\n\n# Verify installation\nssiamb --help\n```\n\n### Conda Environment Setup\n\nThe project includes an `environment.yml` file with all required dependencies:\n\n```bash\n# Create environment\nconda env create -f environment.yml\n\n# Activate environment \nconda activate ssiamb\n\n# Install ssiamb\npip install -e .\n```\n\n### Future Distribution Methods\n\nWhen stable releases are published, the package will also be available via:\n\n```bash\n# Future installation via pip (PyPI)\npip install ssiamb\n\n# Future installation via conda (Bioconda)\nconda install -c bioconda ssiamb\n```\n\n### External Tool Dependencies\n\n`ssiamb` requires several external bioinformatics tools. These are included in the conda environment:\n\n- **Mapping**: `minimap2` and/or `bwa-mem2`\n- **Variant calling**: BBTools and `bcftools`\n- **Depth analysis**: `mosdepth` and `samtools`\n- **VCF processing**: `bcftools` (for normalization)\n\n## Admin Reference Directory Setup\n\nFor reference-mapped mode, you need an admin reference directory containing indexed reference genomes. `ssiamb` includes a built-in downloader to automatically fetch and index references from NCBI RefSeq.\n\n### Setting Up References\n\n1. **Set environment variable** (recommended):\n ```bash\n export SSIAMB_REF_DIR=/path/to/references\n ```\n\n2. 
**Download common bacterial references**:\n ```bash\n # Download single species\n python -m ssiamb.refseq download --species \"Escherichia coli\" --output-dir $SSIAMB_REF_DIR\n \n # Download multiple species \n python -m ssiamb.refseq download --species \"Salmonella enterica\" --output-dir $SSIAMB_REF_DIR\n python -m ssiamb.refseq download --species \"Staphylococcus aureus\" --output-dir $SSIAMB_REF_DIR\n ```\n\n3. **Verify setup**:\n ```bash\n ls $SSIAMB_REF_DIR\n # Should show: Escherichia_coli.fna, Escherichia_coli.fna.mmi, Escherichia_coli.fna.bwa.*, etc.\n ```\n\n### Reference Downloader Features\n\n- **Automatic Selection**: Chooses best RefSeq reference genome (complete > chromosome > scaffold)\n- **Index Generation**: Creates both minimap2 (`.mmi`) and bwa-mem2 (`.bwa.*`) indexes\n- **Species Normalization**: Handles common name variations and aliases\n- **Progress Reporting**: Shows download progress with rich progress bars\n- **Fallback Logic**: Tries multiple genomes if primary choice fails\n\n### Common Species Examples\n\n```bash\n# Popular bacterial pathogens\npython -m ssiamb.refseq download --species \"Escherichia coli\" --output-dir $SSIAMB_REF_DIR\npython -m ssiamb.refseq download --species \"Salmonella enterica\" --output-dir $SSIAMB_REF_DIR \npython -m ssiamb.refseq download --species \"Staphylococcus aureus\" --output-dir $SSIAMB_REF_DIR\npython -m ssiamb.refseq download --species \"Streptococcus pneumoniae\" --output-dir $SSIAMB_REF_DIR\npython -m ssiamb.refseq download --species \"Klebsiella pneumoniae\" --output-dir $SSIAMB_REF_DIR\n```\n\n## Quick Start\n\n### Basic Usage\n\n```bash\n# Check what would be done (dry run)\nssiamb self --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --assembly sample.fna --dry-run\n\n# Self-mapping mode: analyze reads against sample's own assembly\nssiamb self --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --assembly sample.fna\n\n# Reference-mapped mode: analyze against species reference\nssiamb ref --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --species \"Escherichia coli\"\n\n# Summarize existing VCF and BAM files\nssiamb summarize --vcf sample.vcf.gz --bam sample.bam\n```\n\n### Comprehensive Examples\n\n#### Self-Mapping Mode\n\n```bash\n# Basic self-mapping\nssiamb self \\\n --r1 reads_R1.fastq.gz \\\n --r2 reads_R2.fastq.gz \\\n --assembly assembly.fna \\\n --sample MySample \\\n --outdir results/\n\n# With custom thresholds and optional outputs\nssiamb self \\\n --r1 reads_R1.fastq.gz \\\n --r2 reads_R2.fastq.gz \\\n --assembly assembly.fna \\\n --dp-min 15 \\\n --maf-min 0.05 \\\n --emit-vcf \\\n --emit-bed \\\n --emit-matrix \\\n --threads 8\n\n# Using bwa-mem2 instead of minimap2\nssiamb self \\\n --r1 reads_R1.fastq.gz \\\n --r2 reads_R2.fastq.gz \\\n --assembly assembly.fna \\\n --mapper bwa-mem2 \\\n --caller bcftools\n```\n\n#### Reference-Mapping Mode\n\n```bash\n# Using species name (requires admin reference directory)\nexport SSIAMB_REF_DIR=/path/to/references\nssiamb ref \\\n --r1 reads_R1.fastq.gz \\\n --r2 reads_R2.fastq.gz \\\n --species \"Escherichia coli\" \\\n --sample MySample\n\n# Using direct reference file\nssiamb ref \\\n --r1 reads_R1.fastq.gz \\\n --r2 reads_R2.fastq.gz \\\n --reference reference_genome.fna \\\n --sample MySample\n\n# Using Bracken classification results\nssiamb ref \\\n --r1 reads_R1.fastq.gz \\\n --r2 reads_R2.fastq.gz \\\n --bracken sample.bracken \\\n --ref-dir /path/to/references \\\n --min-bracken-frac
0.8\n```\n\n#### Advanced Usage\n\n```bash\n# Output to stdout (no files written)\nssiamb self \\\n --r1 reads_R1.fastq.gz \\\n --r2 reads_R2.fastq.gz \\\n --assembly assembly.fna \\\n --stdout\n\n# Reuse existing BAM file\nssiamb self \\\n --r1 reads_R1.fastq.gz \\\n --r2 reads_R2.fastq.gz \\\n --assembly assembly.fna \\\n --bam existing_alignment.bam\n\n# Append to existing TSV instead of overwriting\nssiamb self \\\n --r1 reads_R1.fastq.gz \\\n --r2 reads_R2.fastq.gz \\\n --assembly assembly.fna \\\n --tsv-mode append\n\n# Comprehensive output with provenance\nssiamb self \\\n --r1 reads_R1.fastq.gz \\\n --r2 reads_R2.fastq.gz \\\n --assembly assembly.fna \\\n --emit-vcf \\\n --emit-bed \\\n --emit-matrix \\\n --emit-per-contig \\\n --emit-provenance \\\n --emit-multiqc\n```\n\n### Testing Your Installation\n\nRun the built-in smoke test to verify everything works:\n\n```bash\n# Quick test (skips downloads)\npython smoke_test.py --skip-downloads\n\n# Full test including reference downloads (slower)\npython smoke_test.py\n```\n\n## Output Files and Formats\n\n### Primary Output\n\n#### `ambiguous_summary.tsv`\nAlways generated. 
Single-row summary with comprehensive metrics:\n\n| Column | Description | Example |\n|--------|-------------|---------|\n| `sample` | Sample identifier | `MySample` |\n| `mode` | Analysis mode | `self`, `ref` |\n| `mapper` | Alignment tool used | `minimap2`, `bwa-mem2` |\n| `caller` | Variant caller used | `bbtools`, `bcftools` |\n| `dp_min` | Minimum depth threshold | `10` |\n| `maf_min` | Minimum MAF threshold | `0.1` |\n| `ambiguous_snv_count` | Number of ambiguous SNVs | `42` |\n| `ambiguous_snv_per_mb` | SNVs per megabase | `15.23` |\n| `callable_bases` | Bases with adequate coverage | `4651234` |\n| `genome_length` | Total genome length | `4800000` |\n| `breadth_10x` | Fraction covered at 10x+ | `0.9691` |\n| `ref_label` | Reference identifier | `Escherichia_coli|GCF_000005825.2` |\n| `runtime_sec` | Analysis runtime | `245.67` |\n\n### Optional Outputs\n\nEnable with command-line flags:\n\n#### `--emit-vcf`: Variant Call Format\n- **File**: `{SAMPLE}.ambiguous_sites.vcf.gz` + `.tbi` index\n- **Content**: Normalized, atomized variants passing thresholds\n- **Annotations**: Custom INFO fields (MAF, AMBIG flag, etc.)\n- **Format**: BGzip compressed, tabix indexed\n\n#### `--emit-bed`: Browser Extensible Data\n- **File**: `{SAMPLE}.ambiguous_sites.bed.gz` + `.tbi` index \n- **Content**: Genomic intervals of ambiguous sites\n- **Columns**: `chrom`, `start`, `end`, `name`, `score`, `strand`, `sample`, `variant_class`, `ref`, `alt`, `maf`, `dp`\n- **Format**: BGzip compressed, tabix indexed\n\n#### `--emit-matrix`: Depth\u00d7MAF Matrix\n- **File**: `{SAMPLE}.ambiguity_matrix.tsv.gz`\n- **Content**: 100\u00d751 cumulative count matrix\n- **Rows**: Depth thresholds (1-100)\n- **Columns**: MAF bins (0-50, representing 0.00-0.50)\n- **Values**: Cumulative variant counts\n\n#### `--emit-per-contig`: Per-Contig Summary\n- **File**: `{SAMPLE}.per_contig_summary.tsv`\n- **Content**: Breakdown by chromosome/contig\n- **Columns**: `sample`, `contig`, `length`, 
`callable_bases_10x`, `breadth_10x`, `ambiguous_snv_count`, etc.\n\n#### `--emit-provenance`: Analysis Provenance\n- **File**: `run_provenance.json`\n- **Content**: Complete analysis parameters, tool versions, runtime info\n- **Format**: JSON array (one entry per sample)\n\n#### `--emit-multiqc`: MultiQC Integration\n- **File**: `{SAMPLE}.multiqc.tsv`\n- **Content**: Curated metrics for MultiQC reporting\n- **Use case**: Integration with MultiQC pipelines\n\n### Output Directory Structure\n\n```\nresults/\n\u251c\u2500\u2500 ambiguous_summary.tsv # Always generated\n\u251c\u2500\u2500 MySample.ambiguous_sites.vcf.gz # --emit-vcf\n\u251c\u2500\u2500 MySample.ambiguous_sites.vcf.gz.tbi # VCF index\n\u251c\u2500\u2500 MySample.ambiguous_sites.bed.gz # --emit-bed \n\u251c\u2500\u2500 MySample.ambiguous_sites.bed.gz.tbi # BED index\n\u251c\u2500\u2500 MySample.ambiguity_matrix.tsv.gz # --emit-matrix\n\u251c\u2500\u2500 MySample.per_contig_summary.tsv # --emit-per-contig\n\u251c\u2500\u2500 MySample.multiqc.tsv # --emit-multiqc\n\u2514\u2500\u2500 run_provenance.json # --emit-provenance\n```\n\n## CLI Reference\n\n### Global Options\n\n| Option | Default | Description |\n|--------|---------|-------------|\n| `--threads` | `4` | Number of CPU threads |\n| `--outdir` | `.` | Output directory |\n| `--sample` | *inferred* | Sample name (required if auto-inference fails) |\n| `--dp-min` | `10` | Minimum depth for ambiguous sites |\n| `--maf-min` | `0.1` | Minimum minor allele frequency (post-calling filter) |\n| `--dp-cap` | `100` | Maximum depth cap (clipped to 100) |\n| `--mapper` | `minimap2` | Alignment tool (`minimap2`, `bwa-mem2`) |\n| `--caller` | `bbtools` | Variant caller (`bbtools`, `bcftools`) |\n| `--depth-tool` | `mosdepth` | Depth analysis tool (`mosdepth`, `samtools`) |\n| `--require-pass` | `False` | Only use PASS variants |\n| `--tsv-mode` | `overwrite` | TSV handling (`overwrite`, `append`, `fail`) |\n| `--stdout` | `False` | Write summary to stdout only 
|\n\n### Command-Specific Options\n\n#### `ssiamb self`\n| Option | Required | Description |\n|--------|----------|-------------|\n| `--r1` | \u2705 | Forward reads (FASTQ, gzipped OK) |\n| `--r2` | \u2705 | Reverse reads (FASTQ, gzipped OK) |\n| `--assembly` | \u2705 | Assembly FASTA file |\n| `--vcf` | \u274c | Reuse existing VCF file |\n| `--bam` | \u274c | Reuse existing BAM file |\n\n#### `ssiamb ref`\n| Option | Required | Description |\n|--------|----------|-------------|\n| `--r1` | \u2705 | Forward reads (FASTQ, gzipped OK) |\n| `--r2` | \u2705 | Reverse reads (FASTQ, gzipped OK) |\n| `--reference` | \u274c* | Direct reference FASTA |\n| `--species` | \u274c* | Species name for lookup |\n| `--bracken` | \u274c* | Bracken classification file |\n| `--ref-dir` | \u274c | Admin reference directory |\n| `--min-bracken-frac` | `0.70` | Minimum Bracken fraction |\n| `--min-bracken-reads` | `100000` | Minimum Bracken reads |\n| `--on-fail` | `error` | Bracken failure action (`error`, `self`) |\n\n*One of `--reference`, `--species`, or `--bracken` is required.\n\n#### `ssiamb summarize`\n| Option | Required | Description |\n|--------|----------|-------------|\n| `--vcf` | \u2705 | VCF file to summarize |\n| `--bam` | \u2705 | BAM file for denominator |\n| `--output` | \u274c | Output file path |\n\n## Error Codes and Troubleshooting\n\n### Exit Codes\n\n`ssiamb` follows a structured exit code system for programmatic handling:\n\n- **0**: Success\n- **1**: CLI/input errors (missing files, invalid sample names, bad arguments)\n- **2**: Reference mode selection errors (species not found, Bracken failures)\n- **3**: Reuse compatibility errors (VCF/BAM mismatch with reference)\n- **4**: External tool failures (missing tools, tool execution errors)\n- **5**: QC failures (only when `--qc-action fail` is enabled)\n\n### Common Issues and Solutions\n\n#### Missing External Tools\n\n**Error**: `Command not found: minimap2`\n\n**Solution**: Install dependencies via 
conda:\n```bash\nconda install -c bioconda minimap2 bwa-mem2 bcftools samtools mosdepth bbmap\n```\n\n#### Reference Directory Issues\n\n**Error**: `Species 'Escherichia coli' not found in admin directory`\n\n**Solutions**:\n```bash\n# Option 1: Download the species reference\npython -m ssiamb.refseq download --species \"Escherichia coli\" --output-dir $SSIAMB_REF_DIR\n\n# Option 2: Use direct reference file\nssiamb ref --reference /path/to/ecoli.fna --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz\n\n# Option 3: Set environment variable\nexport SSIAMB_REF_DIR=/path/to/your/references\n```\n\n#### Index File Issues\n\n**Error**: `Reference found but indexes missing for 'Escherichia_coli': minimap2 index`\n\n**Solution**: Regenerate indexes:\n```bash\ncd $SSIAMB_REF_DIR\nminimap2 -d Escherichia_coli.fna.mmi Escherichia_coli.fna\nbwa-mem2 index Escherichia_coli.fna\n```\n\n#### Sample Name Issues\n\n**Error**: `Sample name could not be safely inferred`\n\n**Solution**: Explicitly provide sample name:\n```bash\nssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna --sample MySample\n```\n\n#### Memory Issues\n\n**Error**: `BBTools out of memory`\n\n**Solutions**:\n```bash\n# Reduce threads\nssiamb self --threads 2 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n\n# Set BBTools memory limit \nssiamb self --bbtools-mem 4g --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n```\n\n#### Permission Issues\n\n**Error**: `Cannot write to output directory`\n\n**Solutions**:\n```bash\n# Create directory with proper permissions\nmkdir -p /path/to/output\nchmod 755 /path/to/output\n\n# Or use different output directory\nssiamb self --outdir ~/results --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n```\n\n### Debugging Tips\n\n1. **Use dry-run mode** to validate inputs:\n ```bash\n ssiamb self --dry-run --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n ```\n\n2. 
**Check tool versions**:\n ```bash\n minimap2 --version\n bcftools --version\n mosdepth --version\n ```\n\n3. **Validate input files**:\n ```bash\n # Check FASTQ files\n zcat reads_R1.fastq.gz | head -4\n \n # Check FASTA files\n head assembly.fna\n ```\n\n4. **Run smoke test**:\n ```bash\n python smoke_test.py --skip-downloads\n ```\n\n### Getting Help\n\n- **CLI help**: `ssiamb --help` or `ssiamb COMMAND --help`\n- **GitHub Issues**: [Report bugs or request features](https://github.com/ssi-dk/ssiamb/issues)\n- **Contact**: Povilas Matusevicius <pmat@ssi.dk>\n\n## Output\n\n### Primary Output\n\n- **`ambiguous_summary.tsv`**: Single-row summary with ambiguous site counts and quality metrics\n\n### Optional Outputs (via flags)\n\n- **`--emit-vcf`**: Variant calls with ambiguity annotations\n- **`--emit-bed`**: BED file of ambiguous sites\n- **`--emit-matrix`**: Depth\u00d7MAF cumulative count matrix\n- **`--emit-per-contig`**: Per-contig breakdown\n- **`--emit-provenance`**: Analysis provenance and parameters\n- **`--emit-multiqc`**: MultiQC-compatible reports\n\n## Testing\n\nThe project includes comprehensive testing:\n\n### Running Tests\n\n```bash\n# Run all unit tests\npython -m pytest tests/ -v\n\n# Run specific test modules\npython -m pytest tests/test_refseq.py -v\npython -m pytest tests/test_refdir.py -v\n\n# Run smoke test (integration test)\npython smoke_test.py --skip-downloads # Fast version\npython smoke_test.py # Full version with downloads\n```\n\n### Test Coverage\n\nThe test suite covers:\n\n- **Unit tests**: Core algorithms, species normalization, reference downloading\n- **Integration tests**: Full pipeline validation via smoke test\n- **Edge cases**: Error handling, malformed inputs, missing dependencies\n- **Mock testing**: External API calls and tool dependencies\n\n### Test Dependencies\n\nInstall test dependencies:\n\n```bash\n# Via conda environment (recommended)\nconda env create -f environment.yml\n\n# Or manually\npip install pytest 
numpy pysam biopython requests\n```\n\n## Development Status\n\nThis project has completed its major development milestones:\n\n- \u2705 **Planning & Specification** - Comprehensive requirements defined\n- \u2705 **Repository Bootstrap** - Package structure, CI/CD, documentation\n- \u2705 **Core Implementation** - CLI, models, and processing pipelines\n- \u2705 **External Tool Integration** - Mapping and variant calling workflows\n- \u2705 **Reference Management** - Automatic RefSeq download and indexing\n- \u2705 **Testing & Validation** - Unit tests, integration testing, smoke tests\n- \ud83d\udea7 **Packaging & Distribution** - Bioconda, containers, Galaxy tools\n\n### Recent Features\n\n- **RefSeq Integration**: Automatic reference genome downloading from NCBI\n- **Robust Indexing**: Automatic minimap2 and bwa-mem2 index generation\n- **Enhanced Testing**: Comprehensive unit tests and smoke testing\n- **Improved CLI**: Better help text, error messages, and validation\n- **Output Flexibility**: Centralized directory handling with fallbacks\n\n \n## Multi-Sample Processing\n\n`ssiamb` supports batch processing via manifest files for analyzing multiple samples efficiently.\n\n### Manifest Files\n\nCreate TSV files listing samples and their inputs:\n\n#### Self-Mode Manifest\n\n```tsv\nsample\tr1\tr2\tassembly\tbam\tvcf\nSample1\tdata/Sample1_R1.fastq.gz\tdata/Sample1_R2.fastq.gz\tassemblies/Sample1.fna\t\t\nSample2\tdata/Sample2_R1.fastq.gz\tdata/Sample2_R2.fastq.gz\tassemblies/Sample2.fna\texisting/Sample2.bam\t\nSample3\tdata/Sample3_R1.fastq.gz\tdata/Sample3_R2.fastq.gz\tassemblies/Sample3.fna\t\texisting/Sample3.vcf.gz\n```\n\n#### Reference-Mode Manifest\n\n```tsv\nsample\tr1\tr2\tbracken\treference\tspecies\tbam\tvcf\nSample1\tdata/Sample1_R1.fastq.gz\tdata/Sample1_R2.fastq.gz\tSample1.bracken\t\t\t\nSample2\tdata/Sample2_R1.fastq.gz\tdata/Sample2_R2.fastq.gz\t\tref/custom.fna\t\t\nSample3\tdata/Sample3_R1.fastq.gz\tdata/Sample3_R2.fastq.gz\t\t\tEscherichia 
coli\t\n```\n\n### Running with Manifests\n\n```bash\n# Process self-mode manifest\nssiamb self --manifest samples.tsv --outdir results/\n\n# Process reference-mode manifest \nssiamb ref --manifest samples.tsv --ref-dir $SSIAMB_REF_DIR --outdir results/\n\n# With custom settings\nssiamb self --manifest samples.tsv --dp-min 15 --emit-vcf --threads 8\n```\n\n### Manifest Features\n\n- **Relative paths**: Resolved relative to manifest file location\n- **Optional columns**: Empty cells skip optional inputs (bam, vcf)\n- **Comments**: Lines starting with `#` are ignored\n- **Sequential processing**: Samples processed one at a time\n- **Consolidated output**: Single `ambiguous_summary.tsv` with all samples\n\n## Performance Considerations\n\n### Resource Usage\n\n- **CPU**: Scales with `--threads` parameter (default: 4)\n- **Memory**: \n - BBTools: 4-8GB (adjustable with `--bbtools-mem`)\n - bwa-mem2: 2-4GB \n - minimap2: 1-2GB\n- **Disk**: ~2-5x input size for intermediate files\n- **Network**: Required for RefSeq downloads only\n\n### Optimization Tips\n\n1. **Use appropriate thread count**:\n ```bash\n # For high-memory systems\n ssiamb self --threads 16 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n ```\n\n2. **Reuse intermediate files**:\n ```bash\n # First run - saves BAM\n ssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n \n # Subsequent runs - reuses BAM\n ssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna --bam sample.sorted.bam --dp-min 15\n ```\n\n3. **Pre-download references**:\n ```bash\n # Download all needed references first\n python -m ssiamb.refseq download --species \"Escherichia coli\" --output-dir $SSIAMB_REF_DIR\n python -m ssiamb.refseq download --species \"Salmonella enterica\" --output-dir $SSIAMB_REF_DIR\n ```\n\n4. 
**Use faster mappers for large datasets**:\n ```bash\n # minimap2 is generally faster\n ssiamb self --mapper minimap2 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n ```\n\n### Typical Runtimes\n\n| Dataset Size | Mode | Mapper | Runtime (4 cores) |\n|-------------|------|--------|--------------------|\n| 1M PE reads | self | minimap2 | 2-5 minutes |\n| 1M PE reads | ref | minimap2 | 3-7 minutes |\n| 5M PE reads | self | minimap2 | 8-15 minutes |\n| 5M PE reads | ref | bwa-mem2 | 15-25 minutes |\n\n*Times vary based on genome size, coverage, and hardware*\n\n## Contributing\n\nThis project is developed by the SSI team. The codebase is now feature-complete with comprehensive testing.\n\n### Development Setup\n\n```bash\n# Clone repository\ngit clone https://github.com/ssi-dk/ssiamb.git\ncd ssiamb\n\n# Create development environment\nconda env create -f environment.yml\nconda activate ssiamb\n\n# Install in editable mode\npip install -e .\n\n# Run tests\npython -m pytest tests/ -v\npython smoke_test.py --skip-downloads\n```\n\n### Code Quality\n\nBefore submitting changes:\n\n1. **Run the test suite**:\n ```bash\n python -m pytest tests/ -v\n ```\n\n2. **Run smoke tests**:\n ```bash\n python smoke_test.py\n ```\n\n3. **Check code style** (if configured):\n ```bash\n black src/ tests/\n isort src/ tests/\n ```\n\n### Contact\n\nFor questions, contributions, or issues:\n\n- **Primary Contact**: Povilas Matusevicius <pmat@ssi.dk>\n- **GitHub Issues**: [Report bugs or request features](https://github.com/ssi-dk/ssiamb/issues)\n- **Repository**: [https://github.com/ssi-dk/ssiamb](https://github.com/ssi-dk/ssiamb)\n\n## Release Process\n\nThis project uses automated publishing to PyPI, Bioconda, and Galaxy ToolShed. The release process is as follows:\n\n### 1. Version Update\n\n1. Update version in `pyproject.toml`:\n\n ```toml\n [project]\n version = \"1.0.0\" # Update this\n ```\n\n2. 
Update version in `recipes/ssiamb/meta.yaml`:\n\n ```yaml\n {% set version = \"1.0.0\" %} # Update this\n ```\n\n3. Update version in `galaxy/ssiamb.xml`:\n\n ```xml\n <tool id=\"ssiamb\" name=\"Ambiguous Sites Counter\" version=\"1.0.0+galaxy0\">\n ```\n\n### 2. Create Release\n\n1. Commit version changes:\n\n ```bash\n git add pyproject.toml recipes/ssiamb/meta.yaml galaxy/ssiamb.xml\n git commit -m \"Bump version to v1.0.0\"\n git push origin main\n ```\n\n2. Create and push tag:\n\n ```bash\n git tag v1.0.0\n git push origin v1.0.0\n ```\n\n### 3. Automated Publishing\n\n#### PyPI Publishing (Automatic)\n\n- GitHub Actions automatically publishes to PyPI on tag push\n- Uses PyPI Trusted Publishing (OIDC) - no tokens needed\n- Creates signed GitHub release with artifacts\n\n#### Bioconda Publishing (Manual)\n\n1. Wait for PyPI release to complete\n2. Update `recipes/ssiamb/meta.yaml` with correct SHA256:\n\n ```bash\n # Get SHA256 from PyPI release\n pip download ssiamb==1.0.0 --no-deps\n shasum -a 256 ssiamb-1.0.0.tar.gz\n ```\n\n3. Fork [bioconda/bioconda-recipes](https://github.com/bioconda/bioconda-recipes)\n4. Copy `recipes/ssiamb/` to `recipes/ssiamb/` in the fork\n5. Create pull request to bioconda-recipes\n6. Address review feedback and wait for merge\n\n#### Galaxy ToolShed Publishing (Manual)\n\n1. Install planemo: `pip install planemo`\n2. Test wrapper: `planemo test galaxy/ssiamb.xml` (may fail until bioconda is available)\n3. Create account on [Galaxy ToolShed](https://toolshed.g2.bx.psu.edu/)\n4. Upload wrapper:\n\n ```bash\n cd galaxy/\n planemo shed_upload --shed_target toolshed\n ```\n\n### 4. Post-Release\n\n1. Verify all distributions:\n - PyPI: <https://pypi.org/project/ssiamb/>\n - Bioconda: <https://anaconda.org/bioconda/ssiamb>\n - Galaxy ToolShed: <https://toolshed.g2.bx.psu.edu/>\n - BioContainers: <https://quay.io/repository/biocontainers/ssiamb>\n\n2. Update documentation if needed\n3. 
Announce release\n\n### Version Numbering\n\n- Use semantic versioning: `MAJOR.MINOR.PATCH`\n- Galaxy wrapper versions: `SOFTWARE_VERSION+galaxy0` (increment galaxy# for wrapper-only changes)\n- Pre-releases: `1.0.0rc1`, `1.0.0a1`, etc.\n\n### Troubleshooting\n\nSee `PYPI_SETUP.md` for PyPI Trusted Publishing configuration details.\n\n## Citation\n\n> **Note**: Citation information will be provided upon publication.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n",
"bugtrack_url": null,
"license": null,
"summary": "SSI Ambiguous Site Detection Tool",
"version": "0.9.0",
"project_urls": {
"Homepage": "https://github.com/ssi-dk/ssiamb",
"Issues": "https://github.com/ssi-dk/ssiamb/issues",
"Repository": "https://github.com/ssi-dk/ssiamb"
},
"split_keywords": [
"bacterial-genomics",
" bioinformatics",
" genomics",
" mapping",
" variant-calling"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "1428f80a36689976d96b7844b9ef8c21106213b7e32377017d72bfd1f503c459",
"md5": "7dd184864a17dbde74f76bdef1b4eb97",
"sha256": "f8ac3cbf81a8ff9d4ec481804168d9badec7b035bd1493401e290f6ae9b9d9fc"
},
"downloads": -1,
"filename": "ssiamb-0.9.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7dd184864a17dbde74f76bdef1b4eb97",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12",
"size": 92908,
"upload_time": "2025-10-09T12:18:31",
"upload_time_iso_8601": "2025-10-09T12:18:31.792367Z",
"url": "https://files.pythonhosted.org/packages/14/28/f80a36689976d96b7844b9ef8c21106213b7e32377017d72bfd1f503c459/ssiamb-0.9.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "5a4bdb01f3c0622a7c3f96dfe1ed77cd33eeeb825665c27087b01843eddc26e4",
"md5": "2edc48ea9e2384deb829da19eaaf68cc",
"sha256": "5339fb118c0f3d4bdec0b54d700fc7d31f433c7d01f68dd29827b919d5a6b396"
},
"downloads": -1,
"filename": "ssiamb-0.9.0.tar.gz",
"has_sig": false,
"md5_digest": "2edc48ea9e2384deb829da19eaaf68cc",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12",
"size": 79757430,
"upload_time": "2025-10-09T12:18:34",
"upload_time_iso_8601": "2025-10-09T12:18:34.337000Z",
"url": "https://files.pythonhosted.org/packages/5a/4b/db01f3c0622a7c3f96dfe1ed77cd33eeeb825665c27087b01843eddc26e4/ssiamb-0.9.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-09 12:18:34",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ssi-dk",
"github_project": "ssiamb",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "biopython",
"specs": [
[
"==",
"1.85"
]
]
},
{
"name": "black",
"specs": [
[
"==",
"25.1.0"
]
]
},
{
"name": "build",
"specs": [
[
"==",
"1.3.0"
]
]
},
{
"name": "click",
"specs": [
[
"==",
"8.2.1"
]
]
},
{
"name": "coverage",
"specs": [
[
"==",
"7.10.6"
]
]
},
{
"name": "flake8",
"specs": [
[
"==",
"7.3.0"
]
]
},
{
"name": "hatchling",
"specs": [
[
"==",
"1.27.0"
]
]
},
{
"name": "iniconfig",
"specs": [
[
"==",
"2.1.0"
]
]
},
{
"name": "markdown-it-py",
"specs": [
[
"==",
"4.0.0"
]
]
},
{
"name": "mccabe",
"specs": [
[
"==",
"0.7.0"
]
]
},
{
"name": "mdurl",
"specs": [
[
"==",
"0.1.2"
]
]
},
{
"name": "mypy",
"specs": [
[
"==",
"1.18.1"
]
]
},
{
"name": "mypy_extensions",
"specs": [
[
"==",
"1.1.0"
]
]
},
{
"name": "numpy",
"specs": [
[
"==",
"2.3.3"
]
]
},
{
"name": "packaging",
"specs": [
[
"==",
"25.0"
]
]
},
{
"name": "pandas",
"specs": [
[
"==",
"2.3.2"
]
]
},
{
"name": "pathspec",
"specs": [
[
"==",
"0.12.1"
]
]
},
{
"name": "platformdirs",
"specs": [
[
"==",
"4.4.0"
]
]
},
{
"name": "pluggy",
"specs": [
[
"==",
"1.6.0"
]
]
},
{
"name": "pycodestyle",
"specs": [
[
"==",
"2.14.0"
]
]
},
{
"name": "pyflakes",
"specs": [
[
"==",
"3.4.0"
]
]
},
{
"name": "Pygments",
"specs": [
[
"==",
"2.19.2"
]
]
},
{
"name": "pyproject_hooks",
"specs": [
[
"==",
"1.2.0"
]
]
},
{
"name": "pysam",
"specs": [
[
"==",
"0.23.3"
]
]
},
{
"name": "pytest",
"specs": [
[
"==",
"8.4.2"
]
]
},
{
"name": "pytest-cov",
"specs": [
[
"==",
"7.0.0"
]
]
},
{
"name": "python-dateutil",
"specs": [
[
"==",
"2.9.0.post0"
]
]
},
{
"name": "pytz",
"specs": [
[
"==",
"2025.2"
]
]
},
{
"name": "PyYAML",
"specs": [
[
"==",
"6.0.2"
]
]
},
{
"name": "rich",
"specs": [
[
"==",
"14.1.0"
]
]
},
{
"name": "setuptools",
"specs": [
[
"==",
"80.9.0"
]
]
},
{
"name": "shellingham",
"specs": [
[
"==",
"1.5.4"
]
]
},
{
"name": "six",
"specs": [
[
"==",
"1.17.0"
]
]
},
{
"name": "trove-classifiers",
"specs": [
[
"==",
"2025.9.11.17"
]
]
},
{
"name": "typer",
"specs": [
[
"==",
"0.17.4"
]
]
},
{
"name": "typing_extensions",
"specs": [
[
"==",
"4.15.0"
]
]
},
{
"name": "tzdata",
"specs": [
[
"==",
"2025.2"
]
]
},
{
"name": "wheel",
"specs": [
[
"==",
"0.45.1"
]
]
}
],
"lcname": "ssiamb"
}