ssiamb 0.9.0

- Summary: SSI Ambiguous Site Detection Tool
- Uploaded: 2025-10-09 12:18:34
- Requires Python: >=3.12
- Keywords: bacterial-genomics, bioinformatics, genomics, mapping, variant-calling
- Requirements: biopython, black, build, click, coverage, flake8, hatchling, iniconfig, markdown-it-py, mccabe, mdurl, mypy, mypy_extensions, numpy, packaging, pandas, pathspec, platformdirs, pluggy, pycodestyle, pyflakes, Pygments, pyproject_hooks, pysam, pytest, pytest-cov, python-dateutil, pytz, PyYAML, rich, setuptools, shellingham, six, trove-classifiers, typer, typing_extensions, tzdata, wheel

# ssiamb — Ambiguous Sites Counter

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)

**Author:** Povilas Matusevicius <pmat@ssi.dk>  
**Repository:** [https://github.com/ssi-dk/ssiamb](https://github.com/ssi-dk/ssiamb)  
**License:** MIT  
**Minimum Python:** 3.12+

## Overview

`ssiamb` computes an "ambiguous sites" metric for bacterial whole-genome sequencing (WGS) data as a measure of within-sample heterogeneity. It modernizes and standardizes the lab's prior definition and adds robust packaging, a command-line interface, and Galaxy integration.

### What are "Ambiguous Sites"?

An ambiguous site is a genomic position with:
 
- **Sufficient coverage**: Depth ≥ `dp_min` (default: 10)
- **Minor-allele signal**: Minor-allele fraction (MAF) ≥ `maf_min` (default: 0.10)

Both criteria are evaluated on variant calls after normalization and atomization, and each locus is counted **once** (a multi-allelic site counts once if any ALT allele passes the thresholds).
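
The definition above can be sketched in a few lines of Python. This is only an illustration, not `ssiamb`'s implementation: the `AltCall` record and both function names are hypothetical, and folding an allele fraction into the range 0 to 0.5 with `min(f, 1 - f)` is an assumption about how the minor-allele fraction is derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AltCall:
    """One atomized ALT allele at a locus (hypothetical record for illustration)."""
    contig: str
    pos: int             # 1-based position
    depth: int           # total read depth at the site
    alt_fraction: float  # fraction of reads supporting this ALT allele


def minor_allele_fraction(alt_fraction: float) -> float:
    """Fold an allele fraction into [0, 0.5] (assumed definition of MAF)."""
    return min(alt_fraction, 1.0 - alt_fraction)


def count_ambiguous_sites(calls, dp_min: int = 10, maf_min: float = 0.10) -> int:
    """Count loci where at least one ALT allele passes both thresholds."""
    passing_loci = set()
    for call in calls:
        if call.depth >= dp_min and minor_allele_fraction(call.alt_fraction) >= maf_min:
            # A set keyed by (contig, pos) makes a multi-allelic site count once.
            passing_loci.add((call.contig, call.pos))
    return len(passing_loci)


example_calls = [
    AltCall("contig_1", 100, 30, 0.20),  # passes
    AltCall("contig_1", 100, 30, 0.15),  # same locus, second ALT allele
    AltCall("contig_1", 200, 8, 0.40),   # depth below dp_min
    AltCall("contig_1", 300, 50, 0.95),  # folded fraction 0.05 is below maf_min
    AltCall("contig_2", 5, 10, 0.10),    # exactly on both thresholds, passes
]
print(count_ambiguous_sites(example_calls))
```

Keying the set on `(contig, pos)` is what enforces the once-per-locus rule; everything else is a plain threshold test.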

### Variant Calling and Filtering Strategy

`ssiamb` uses a two-stage approach to ensure comprehensive and consistent variant detection:

#### 1. Variant Calling (Capture All Variants)

- **BBTools**: Configured with `minallelefraction=0.0` to capture all variant calls regardless of frequency
- **BCFtools**: Configured without MAF filtering during the calling stage to capture all variants
- **Rationale**: Ensures no potentially relevant variants are lost during the calling process

#### 2. Analysis-Time Filtering (Apply Thresholds)

- **MAF Threshold**: The `--maf-min` parameter (default: 0.10) is applied during analysis, not during variant calling
- **Post-calling Filter**: All filtering for ambiguous site detection happens after variant calls are made
- **Consistency**: Both callers use the same filtering approach, ensuring comparable results
- **Flexibility**: Allows reanalysis of the same variant calls with different thresholds using `ssiamb summarize`

This approach maximizes sensitivity during variant detection while maintaining analytical flexibility and ensuring reproducible results across different caller technologies.
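
Because thresholds are applied only at analysis time, the same set of calls can be re-counted under several settings without re-running the caller. The sketch below illustrates that idea with made-up per-locus records; `SITES` and `count_passing` are hypothetical names, and the data are invented for the example.

```python
# Hypothetical per-locus records kept from the calling stage:
# (depth, minor-allele fraction). The values are invented for illustration.
SITES = [(35, 0.02), (12, 0.08), (40, 0.12), (9, 0.30), (60, 0.45)]


def count_passing(sites, dp_min: int, maf_min: float) -> int:
    """Apply the depth and MAF thresholds after calling, as described above."""
    return sum(1 for depth, maf in sites if depth >= dp_min and maf >= maf_min)


# Re-analyse the same calls under several MAF thresholds.
for maf_min in (0.05, 0.10, 0.20):
    print(f"maf_min={maf_min:.2f}: {count_passing(SITES, dp_min=10, maf_min=maf_min)} sites")
```

In practice this is the role `ssiamb summarize` plays: the expensive mapping and calling steps are done once, and only the cheap filter is repeated.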

### Supported Modes

#### Self-mapping Mode (`ssiamb self`)
- **Input**: Reads → Sample's own assembly
- **Use case**: Analyze heterogeneity against the sample's assembled genome
- **Mapping space**: Uses the assembly as reference

#### Reference-mapped Mode (`ssiamb ref`)
- **Input**: Reads → Species canonical reference
- **Use case**: Compare against standardized reference genomes  
- **Reference selection**: Via admin directory, user override, or Bracken classification

#### Summarize Mode (`ssiamb summarize`)
- **Input**: Pre-computed VCF + BAM files
- **Use case**: Reanalyze existing variant calls with different thresholds
- **Speed**: Fast analysis without remapping or variant calling

### Key Features

- **🧬 Automatic Reference Management**: Download and index references from NCBI RefSeq
- **🔧 Flexible Mapping**: Support for minimap2 (default) and bwa-mem2
- **📊 Multiple Variant Callers**: BBTools (default) and bcftools
- **📋 Comprehensive Outputs**: Summary TSV (always) + optional VCF, BED, matrices, per-contig analysis
- **📏 Depth Analysis**: Using mosdepth (default) or samtools
- **♻️ Reusable Workflows**: Accept pre-computed BAM/VCF files
- **🧪 Galaxy Integration**: Designed for workflow environments
- **✅ Quality Control**: Configurable thresholds with sensible defaults
- **🧪 Robust Testing**: Comprehensive test suite with smoke testing

## Installation

### Development Installation (Recommended)

For development or to get the latest features:

```bash
# Clone the repository
git clone https://github.com/ssi-dk/ssiamb.git
cd ssiamb

# Create conda environment with dependencies
conda env create -f environment.yml
conda activate ssiamb

# Install in editable mode
pip install -e .

# Verify installation
ssiamb --help
```

### Conda Environment Setup

The project includes an `environment.yml` file with all required dependencies:

```bash
# Create environment
conda env create -f environment.yml

# Activate environment  
conda activate ssiamb

# Install ssiamb
pip install -e .
```

### Future Distribution Methods

When stable releases are published, the package will also be available via:

```bash
# Future installation via pip (PyPI)
pip install ssiamb

# Future installation via conda (Bioconda)
conda install -c bioconda ssiamb
```

### External Tool Dependencies

`ssiamb` requires several external bioinformatics tools. These are included in the conda environment:

- **Mapping**: `minimap2` and/or `bwa-mem2`
- **Variant calling**: BBTools and `bcftools`
- **Depth analysis**: `mosdepth` and `samtools`
- **VCF processing**: `bcftools` (for normalization)

## Admin Reference Directory Setup

For reference-mapped mode, you need an admin reference directory containing indexed reference genomes. `ssiamb` includes a built-in downloader to automatically fetch and index references from NCBI RefSeq.

### Setting Up References

1. **Set environment variable** (recommended):
   ```bash
   export SSIAMB_REF_DIR=/path/to/references
   ```

2. **Download common bacterial references**:
   ```bash
   # Download single species
   python -m ssiamb.refseq download --species "Escherichia coli" --output-dir $SSIAMB_REF_DIR
   
   # Download multiple species  
   python -m ssiamb.refseq download --species "Salmonella enterica" --output-dir $SSIAMB_REF_DIR
   python -m ssiamb.refseq download --species "Staphylococcus aureus" --output-dir $SSIAMB_REF_DIR
   ```

3. **Verify setup**:
   ```bash
   ls $SSIAMB_REF_DIR
   # Should show: Escherichia_coli.fna, Escherichia_coli.fna.mmi, Escherichia_coli.fna.bwa.*, etc.
   ```
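
The verification step can also be scripted. The following is a minimal sketch that assumes the file naming shown in the listing above (`Genus_species.fna`, `.fna.mmi`, `.fna.bwa.*`); the helper names are hypothetical and are not part of `ssiamb`.

```python
from pathlib import Path


def species_to_stem(species: str) -> str:
    """'Escherichia coli' -> 'Escherichia_coli' (naming assumed from the listing above)."""
    return "_".join(species.split())


def missing_reference_files(ref_dir, species: str) -> list[str]:
    """Return the expected reference/index files that are absent from ref_dir."""
    ref_dir = Path(ref_dir)
    stem = species_to_stem(species)
    missing = []
    if not (ref_dir / f"{stem}.fna").is_file():
        missing.append(f"{stem}.fna")
    if not (ref_dir / f"{stem}.fna.mmi").is_file():
        missing.append(f"{stem}.fna.mmi")
    # bwa-mem2 writes several files, so any match on the prefix is accepted here.
    if not any(ref_dir.glob(f"{stem}.fna.bwa.*")):
        missing.append(f"{stem}.fna.bwa.*")
    return missing
```

For example, `missing_reference_files(os.environ["SSIAMB_REF_DIR"], "Escherichia coli")` returns an empty list when the FASTA and both index types are present.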

### Reference Downloader Features

- **Automatic Selection**: Chooses best RefSeq reference genome (complete > chromosome > scaffold)
- **Index Generation**: Creates both minimap2 (`.mmi`) and bwa-mem2 (`.bwa.*`) indexes
- **Species Normalization**: Handles common name variations and aliases
- **Progress Reporting**: Shows download progress with rich progress bars
- **Fallback Logic**: Tries multiple genomes if primary choice fails

### Common Species Examples

```bash
# Popular bacterial pathogens
python -m ssiamb.refseq download --species "Escherichia coli" --output-dir $SSIAMB_REF_DIR
python -m ssiamb.refseq download --species "Salmonella enterica" --output-dir $SSIAMB_REF_DIR  
python -m ssiamb.refseq download --species "Staphylococcus aureus" --output-dir $SSIAMB_REF_DIR
python -m ssiamb.refseq download --species "Streptococcus pneumoniae" --output-dir $SSIAMB_REF_DIR
python -m ssiamb.refseq download --species "Klebsiella pneumoniae" --output-dir $SSIAMB_REF_DIR
```

```bash
# Install in editable/development mode (recommended for contributors)
pip install -e .

# After editable install you can run the CLI via the console script or module:
ssiamb --help
# or
python -m ssiamb --help
```

 

## Quick Start

### Basic Usage

```bash
# Check what would be done (dry run)
ssiamb self --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --assembly sample.fna --dry-run

# Self-mapping mode: analyze reads against sample's own assembly
ssiamb self --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --assembly sample.fna

# Reference-mapped mode: analyze against species reference
ssiamb ref --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --species "Escherichia coli"

# Summarize existing VCF and BAM files
ssiamb summarize --vcf sample.vcf.gz --bam sample.bam
```

### Comprehensive Examples

#### Self-Mapping Mode

```bash
# Basic self-mapping
ssiamb self \
  --r1 reads_R1.fastq.gz \
  --r2 reads_R2.fastq.gz \
  --assembly assembly.fna \
  --sample MySample \
  --outdir results/

# With custom thresholds and optional outputs
ssiamb self \
  --r1 reads_R1.fastq.gz \
  --r2 reads_R2.fastq.gz \
  --assembly assembly.fna \
  --dp-min 15 \
  --maf-min 0.05 \
  --emit-vcf \
  --emit-bed \
  --emit-matrix \
  --threads 8

# Using bwa-mem2 instead of minimap2
ssiamb self \
  --r1 reads_R1.fastq.gz \
  --r2 reads_R2.fastq.gz \
  --assembly assembly.fna \
  --mapper bwa-mem2 \
  --caller bcftools
```

#### Reference-Mapping Mode

```bash
# Using species name (requires admin reference directory)
export SSIAMB_REF_DIR=/path/to/references
ssiamb ref \
  --r1 reads_R1.fastq.gz \
  --r2 reads_R2.fastq.gz \
  --species "Escherichia coli" \
  --sample MySample

# Using direct reference file
ssiamb ref \
  --r1 reads_R1.fastq.gz \
  --r2 reads_R2.fastq.gz \
  --reference reference_genome.fna \
  --sample MySample

# Using Bracken classification results
ssiamb ref \
  --r1 reads_R1.fastq.gz \
  --r2 reads_R2.fastq.gz \
  --bracken sample.bracken \
  --ref-dir /path/to/references \
  --min-bracken-frac 0.8
```

#### Advanced Usage

```bash
# Output to stdout (no files written)
ssiamb self \
  --r1 reads_R1.fastq.gz \
  --r2 reads_R2.fastq.gz \
  --assembly assembly.fna \
  --stdout

# Reuse existing BAM file
ssiamb self \
  --r1 reads_R1.fastq.gz \
  --r2 reads_R2.fastq.gz \
  --assembly assembly.fna \
  --bam existing_alignment.bam

# Append to existing TSV instead of overwriting
ssiamb self \
  --r1 reads_R1.fastq.gz \
  --r2 reads_R2.fastq.gz \
  --assembly assembly.fna \
  --tsv-mode append

# Comprehensive output with provenance
ssiamb self \
  --r1 reads_R1.fastq.gz \
  --r2 reads_R2.fastq.gz \
  --assembly assembly.fna \
  --emit-vcf \
  --emit-bed \
  --emit-matrix \
  --emit-per-contig \
  --emit-provenance \
  --emit-multiqc
```

### Testing Your Installation

Run the built-in smoke test to verify everything works:

```bash
# Quick test (skips downloads)
python smoke_test.py --skip-downloads

# Full test including reference downloads (slower)
python smoke_test.py
```

## Output Files and Formats

### Primary Output

#### `ambiguous_summary.tsv`
Always generated. Contains one summary row per sample (a single row for a single-sample run) with comprehensive metrics:

| Column | Description | Example |
|--------|-------------|---------|
| `sample` | Sample identifier | `MySample` |
| `mode` | Analysis mode | `self`, `ref` |
| `mapper` | Alignment tool used | `minimap2`, `bwa-mem2` |
| `caller` | Variant caller used | `bbtools`, `bcftools` |
| `dp_min` | Minimum depth threshold | `10` |
| `maf_min` | Minimum MAF threshold | `0.1` |
| `ambiguous_snv_count` | Number of ambiguous SNVs | `42` |
| `ambiguous_snv_per_mb` | SNVs per megabase | `15.23` |
| `callable_bases` | Bases with adequate coverage | `4651234` |
| `genome_length` | Total genome length | `4800000` |
| `breadth_10x` | Fraction covered at 10x+ | `0.9690` |
| `ref_label` | Reference identifier | `Escherichia_coli|GCF_000005825.2` |
| `runtime_sec` | Analysis runtime | `245.67` |

### Optional Outputs

Enable with command-line flags:

#### `--emit-vcf`: Variant Call Format
- **File**: `{SAMPLE}.ambiguous_sites.vcf.gz` + `.tbi` index
- **Content**: Normalized, atomized variants passing thresholds
- **Annotations**: Custom INFO fields (MAF, AMBIG flag, etc.)
- **Format**: BGzip compressed, tabix indexed

#### `--emit-bed`: Browser Extensible Data
- **File**: `{SAMPLE}.ambiguous_sites.bed.gz` + `.tbi` index  
- **Content**: Genomic intervals of ambiguous sites
- **Columns**: `chrom`, `start`, `end`, `name`, `score`, `strand`, `sample`, `variant_class`, `ref`, `alt`, `maf`, `dp`
- **Format**: BGzip compressed, tabix indexed
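
A BED record with the twelve columns listed above can be parsed as follows. The example line is invented for illustration, and `parse_bed_line` is a hypothetical helper. The only format fact relied on is that BED intervals are 0-based and half-open, so a single-base site at 1-based position `p` has `start = p - 1` and `end = p`.

```python
BED_COLUMNS = [
    "chrom", "start", "end", "name", "score", "strand",
    "sample", "variant_class", "ref", "alt", "maf", "dp",
]

# Invented example record; real values come from the decompressed BED file.
EXAMPLE_LINE = "contig_1\t1233\t1234\tsite1\t0\t.\tMySample\tSNV\tA\tG\t0.18\t42\n"


def parse_bed_line(line: str) -> dict:
    """Split one BED line into a dict and convert the numeric fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(BED_COLUMNS):
        raise ValueError(f"expected {len(BED_COLUMNS)} columns, got {len(fields)}")
    record = dict(zip(BED_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["maf"] = float(record["maf"])
    record["dp"] = int(record["dp"])
    return record


record = parse_bed_line(EXAMPLE_LINE)
one_based_position = record["start"] + 1  # convert BED start to a 1-based coordinate
print(record["chrom"], one_based_position, record["maf"], record["dp"])
```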

#### `--emit-matrix`: Depth×MAF Matrix
- **File**: `{SAMPLE}.ambiguity_matrix.tsv.gz`
- **Content**: 100×51 cumulative count matrix
- **Rows**: Depth thresholds (1-100)
- **Columns**: MAF bins (0-50, representing 0.00-0.50)
- **Values**: Cumulative variant counts
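
One way to read "cumulative" here is that the cell for depth threshold `d` and MAF bin `b` holds the number of sites with depth ≥ `d` and MAF ≥ `b/100`. That reading, the clipping of depths to 100, and the function name are assumptions made for this sketch; it is not taken from the `ssiamb` source.

```python
def build_cumulative_matrix(sites, max_depth: int = 100, maf_bins: int = 51):
    """Return matrix[d - 1][b] = number of sites with depth >= d and MAF >= b / 100.

    `sites` is an iterable of (depth, maf) pairs. Depths above `max_depth`
    are clipped and MAF is capped at the last bin (assumed behaviour).
    """
    matrix = [[0] * maf_bins for _ in range(max_depth)]
    for depth, maf in sites:
        if depth < 1:
            continue  # no row for zero depth
        d = min(depth, max_depth)
        # Small epsilon guards against 0.29 * 100 evaluating to 28.999...
        b = min(int(maf * 100 + 1e-9), maf_bins - 1)
        matrix[d - 1][b] += 1
    # Turn exact-bin counts into 2-D suffix sums, working from the far corner.
    for d in range(max_depth - 1, -1, -1):
        for b in range(maf_bins - 1, -1, -1):
            total = matrix[d][b]
            if d + 1 < max_depth:
                total += matrix[d + 1][b]
            if b + 1 < maf_bins:
                total += matrix[d][b + 1]
            if d + 1 < max_depth and b + 1 < maf_bins:
                total -= matrix[d + 1][b + 1]
            matrix[d][b] = total
    return matrix


example_sites = [(120, 0.50), (10, 0.10), (10, 0.29), (3, 0.02)]
example_matrix = build_cumulative_matrix(example_sites)
# Sites passing the default thresholds (depth >= 10, MAF >= 0.10):
print(example_matrix[10 - 1][10])
```

With such a matrix, the count for any `(dp_min, maf_min)` pair is a single lookup, which is what makes threshold exploration cheap.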

#### `--emit-per-contig`: Per-Contig Summary
- **File**: `{SAMPLE}.per_contig_summary.tsv`
- **Content**: Breakdown by chromosome/contig
- **Columns**: `sample`, `contig`, `length`, `callable_bases_10x`, `breadth_10x`, `ambiguous_snv_count`, etc.

#### `--emit-provenance`: Analysis Provenance
- **File**: `run_provenance.json`
- **Content**: Complete analysis parameters, tool versions, runtime info
- **Format**: JSON array (one entry per sample)

#### `--emit-multiqc`: MultiQC Integration
- **File**: `{SAMPLE}.multiqc.tsv`
- **Content**: Curated metrics for MultiQC reporting
- **Use case**: Integration with MultiQC pipelines

### Output Directory Structure

```
results/
├── ambiguous_summary.tsv                    # Always generated
├── MySample.ambiguous_sites.vcf.gz          # --emit-vcf
├── MySample.ambiguous_sites.vcf.gz.tbi      # VCF index
├── MySample.ambiguous_sites.bed.gz          # --emit-bed  
├── MySample.ambiguous_sites.bed.gz.tbi      # BED index
├── MySample.ambiguity_matrix.tsv.gz         # --emit-matrix
├── MySample.per_contig_summary.tsv          # --emit-per-contig
├── MySample.multiqc.tsv                     # --emit-multiqc
└── run_provenance.json                      # --emit-provenance
```

## CLI Reference

### Global Options

| Option | Default | Description |
|--------|---------|-------------|
| `--threads` | `4` | Number of CPU threads |
| `--outdir` | `.` | Output directory |
| `--sample` | *inferred* | Sample name (required if auto-inference fails) |
| `--dp-min` | `10` | Minimum depth for ambiguous sites |
| `--maf-min` | `0.1` | Minimum minor-allele fraction (post-calling filter) |
| `--dp-cap` | `100` | Depth cap; depths above this value are clipped to it |
| `--mapper` | `minimap2` | Alignment tool (`minimap2`, `bwa-mem2`) |
| `--caller` | `bbtools` | Variant caller (`bbtools`, `bcftools`) |
| `--depth-tool` | `mosdepth` | Depth analysis tool (`mosdepth`, `samtools`) |
| `--require-pass` | `False` | Only use PASS variants |
| `--tsv-mode` | `overwrite` | TSV handling (`overwrite`, `append`, `fail`) |
| `--stdout` | `False` | Write summary to stdout only |

### Command-Specific Options

#### `ssiamb self`
| Option | Required | Description |
|--------|----------|-------------|
| `--r1` | ✅ | Forward reads (FASTQ, gzipped OK) |
| `--r2` | ✅ | Reverse reads (FASTQ, gzipped OK) |
| `--assembly` | ✅ | Assembly FASTA file |
| `--vcf` | ❌ | Reuse existing VCF file |
| `--bam` | ❌ | Reuse existing BAM file |

#### `ssiamb ref`
| Option | Required | Default | Description |
|--------|----------|---------|-------------|
| `--r1` | ✅ | - | Forward reads (FASTQ, gzipped OK) |
| `--r2` | ✅ | - | Reverse reads (FASTQ, gzipped OK) |
| `--reference` | ❌* | - | Direct reference FASTA |
| `--species` | ❌* | - | Species name for lookup |
| `--bracken` | ❌* | - | Bracken classification file |
| `--ref-dir` | ❌ | - | Admin reference directory |
| `--min-bracken-frac` | ❌ | `0.70` | Minimum Bracken fraction |
| `--min-bracken-reads` | ❌ | `100000` | Minimum Bracken reads |
| `--on-fail` | ❌ | `error` | Bracken failure action (`error`, `self`) |

*One of `--reference`, `--species`, or `--bracken` is required.
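
The Bracken-related options suggest a simple acceptance rule, sketched below. The exact semantics are an assumption: the sketch takes the species with the highest fraction, accepts it only if both minimums are met, and otherwise either raises or falls back to self mode depending on `on_fail`. The function name and the tuple layout are hypothetical.

```python
def choose_reference_mode(candidates, min_frac: float = 0.70,
                          min_reads: int = 100_000, on_fail: str = "error"):
    """Pick a species from Bracken-style results, or apply the failure action.

    `candidates` is a list of (species, fraction, reads) tuples.
    Returns ("ref", species) on success or ("self", None) on fallback.
    """
    if candidates:
        species, fraction, reads = max(candidates, key=lambda c: c[1])
        if fraction >= min_frac and reads >= min_reads:
            return ("ref", species)
    if on_fail == "self":
        return ("self", None)  # assumed meaning of --on-fail self
    raise ValueError("no species passed the Bracken thresholds")


print(choose_reference_mode([
    ("Escherichia coli", 0.92, 850_000),
    ("Shigella sonnei", 0.05, 40_000),
]))
```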

#### `ssiamb summarize`
| Option | Required | Description |
|--------|----------|-------------|
| `--vcf` | ✅ | VCF file to summarize |
| `--bam` | ✅ | BAM file for denominator |
| `--output` | ❌ | Output file path |

## Error Codes and Troubleshooting

### Exit Codes

`ssiamb` follows a structured exit code system for programmatic handling:

- **0**: Success
- **1**: CLI/input errors (missing files, invalid sample names, bad arguments)
- **2**: Reference mode selection errors (species not found, Bracken failures)
- **3**: Reuse compatibility errors (VCF/BAM mismatch with reference)
- **4**: External tool failures (missing tools, tool execution errors)
- **5**: QC failures (only when `--qc-action fail` is enabled)
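
A wrapper script can branch on these codes. The sketch below maps each documented exit code to a short label and runs the CLI through `subprocess`; `describe_exit` and `run_ssiamb` are hypothetical helper names, and the labels simply paraphrase the list above.

```python
import subprocess

EXIT_MEANINGS = {
    0: "success",
    1: "CLI/input error",
    2: "reference mode selection error",
    3: "reuse compatibility error",
    4: "external tool failure",
    5: "QC failure",
}


def describe_exit(code: int) -> str:
    """Translate an ssiamb exit code into a short label."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def run_ssiamb(args: list[str]) -> tuple[int, str]:
    """Run `ssiamb` with the given arguments and report how it ended."""
    completed = subprocess.run(["ssiamb", *args], check=False)
    return completed.returncode, describe_exit(completed.returncode)
```

For example, `run_ssiamb(["self", "--r1", "reads_R1.fastq.gz", "--r2", "reads_R2.fastq.gz", "--assembly", "assembly.fna"])` returns a `(code, label)` pair that a pipeline can log or act on.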

### Common Issues and Solutions

#### Missing External Tools

**Error**: `Command not found: minimap2`

**Solution**: Install dependencies via conda:
```bash
conda install -c bioconda minimap2 bwa-mem2 bcftools samtools mosdepth bbmap
```

#### Reference Directory Issues

**Error**: `Species 'Escherichia coli' not found in admin directory`

**Solutions**:
```bash
# Option 1: Download the species reference
python -m ssiamb.refseq download --species "Escherichia coli" --output-dir $SSIAMB_REF_DIR

# Option 2: Use direct reference file
ssiamb ref --reference /path/to/ecoli.fna --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz

# Option 3: Set environment variable
export SSIAMB_REF_DIR=/path/to/your/references
```

#### Index File Issues

**Error**: `Reference found but indexes missing for 'Escherichia_coli': minimap2 index`

**Solution**: Regenerate indexes:
```bash
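cd "$SSIAMB_REF_DIR"
minimap2 -d Escherichia_coli.fna.mmi Escherichia_coli.fna
# -p sets the index prefix so the files match the .bwa.* naming shown in the reference directory listing
bwa-mem2 index -p Escherichia_coli.fna.bwa Escherichia_coli.fna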
```

#### Sample Name Issues

**Error**: `Sample name could not be safely inferred`

**Solution**: Explicitly provide sample name:
```bash
ssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna --sample MySample
```

#### Memory Issues

**Error**: `BBTools out of memory`

**Solutions**:
```bash
# Reduce threads
ssiamb self --threads 2 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna

# Set BBTools memory limit  
ssiamb self --bbtools-mem 4g --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
```

#### Permission Issues

**Error**: `Cannot write to output directory`

**Solutions**:
```bash
# Create directory with proper permissions
mkdir -p /path/to/output
chmod 755 /path/to/output

# Or use different output directory
ssiamb self --outdir ~/results --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
```

### Debugging Tips

1. **Use dry-run mode** to validate inputs:
   ```bash
   ssiamb self --dry-run --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
   ```

2. **Check tool versions**:
   ```bash
   minimap2 --version
   bcftools --version
   mosdepth --version
   ```

3. **Validate input files**:
   ```bash
   # Check FASTQ files
   gzip -cd reads_R1.fastq.gz | head -4
   
   # Check FASTA files
   head assembly.fna
   ```

4. **Run smoke test**:
   ```bash
   python smoke_test.py --skip-downloads
   ```

### Getting Help

- **CLI help**: `ssiamb --help` or `ssiamb COMMAND --help`
- **GitHub Issues**: [Report bugs or request features](https://github.com/ssi-dk/ssiamb/issues)
- **Contact**: Povilas Matusevicius <pmat@ssi.dk>

## Output

### Primary Output

- **`ambiguous_summary.tsv`**: Single-row summary with ambiguous site counts and quality metrics

### Optional Outputs (via flags)

- **`--emit-vcf`**: Variant calls with ambiguity annotations
- **`--emit-bed`**: BED file of ambiguous sites
- **`--emit-matrix`**: Depth×MAF cumulative count matrix
- **`--emit-per-contig`**: Per-contig breakdown
- **`--emit-provenance`**: Analysis provenance and parameters
- **`--emit-multiqc`**: MultiQC-compatible reports

## Testing

The project includes comprehensive testing:

### Running Tests

```bash
# Run all unit tests
python -m pytest tests/ -v

# Run specific test modules
python -m pytest tests/test_refseq.py -v
python -m pytest tests/test_refdir.py -v

# Run smoke test (integration test)
python smoke_test.py --skip-downloads  # Fast version
python smoke_test.py                   # Full version with downloads
```

### Test Coverage

The test suite covers:

- **Unit tests**: Core algorithms, species normalization, reference downloading
- **Integration tests**: Full pipeline validation via smoke test
- **Edge cases**: Error handling, malformed inputs, missing dependencies
- **Mock testing**: External API calls and tool dependencies

### Test Dependencies

Install test dependencies:

```bash
# Via conda environment (recommended)
conda env create -f environment.yml

# Or manually
pip install pytest numpy pysam biopython requests
```

## Development Status

This project has completed its major development milestones:

- ✅ **Planning & Specification** - Comprehensive requirements defined
- ✅ **Repository Bootstrap** - Package structure, CI/CD, documentation
- ✅ **Core Implementation** - CLI, models, and processing pipelines
- ✅ **External Tool Integration** - Mapping and variant calling workflows
- ✅ **Reference Management** - Automatic RefSeq download and indexing
- ✅ **Testing & Validation** - Unit tests, integration testing, smoke tests
- 🚧 **Packaging & Distribution** - Bioconda, containers, Galaxy tools

### Recent Features

- **RefSeq Integration**: Automatic reference genome downloading from NCBI
- **Robust Indexing**: Automatic minimap2 and bwa-mem2 index generation
- **Enhanced Testing**: Comprehensive unit tests and smoke testing
- **Improved CLI**: Better help text, error messages, and validation
- **Output Flexibility**: Centralized directory handling with fallbacks

 
## Multi-Sample Processing

`ssiamb` supports batch processing via manifest files for analyzing multiple samples efficiently.

### Manifest Files

Create TSV files listing samples and their inputs:

#### Self-Mode Manifest

```tsv
sample	r1	r2	assembly	bam	vcf
Sample1	data/Sample1_R1.fastq.gz	data/Sample1_R2.fastq.gz	assemblies/Sample1.fna		
Sample2	data/Sample2_R1.fastq.gz	data/Sample2_R2.fastq.gz	assemblies/Sample2.fna	existing/Sample2.bam	
Sample3	data/Sample3_R1.fastq.gz	data/Sample3_R2.fastq.gz	assemblies/Sample3.fna		existing/Sample3.vcf.gz
```

#### Reference-Mode Manifest

```tsv
sample	r1	r2	bracken	reference	species	bam	vcf
Sample1	data/Sample1_R1.fastq.gz	data/Sample1_R2.fastq.gz	Sample1.bracken			
Sample2	data/Sample2_R1.fastq.gz	data/Sample2_R2.fastq.gz		ref/custom.fna		
Sample3	data/Sample3_R1.fastq.gz	data/Sample3_R2.fastq.gz			Escherichia coli	
```

### Running with Manifests

```bash
# Process self-mode manifest
ssiamb self --manifest samples.tsv --outdir results/

# Process reference-mode manifest  
ssiamb ref --manifest samples.tsv --ref-dir $SSIAMB_REF_DIR --outdir results/

# With custom settings
ssiamb self --manifest samples.tsv --dp-min 15 --emit-vcf --threads 8
```

### Manifest Features

- **Relative paths**: Resolved relative to manifest file location
- **Optional columns**: Empty cells skip optional inputs (bam, vcf)
- **Comments**: Lines starting with `#` are ignored
- **Sequential processing**: Samples processed one at a time
- **Consolidated output**: Single `ambiguous_summary.tsv` with all samples
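
The manifest rules above (comment lines, empty optional cells, paths relative to the manifest) can be illustrated with a small reader for the self-mode layout. This is a sketch under those stated rules, not the parser `ssiamb` uses; `read_manifest` and `PATH_COLUMNS` are hypothetical names.

```python
import csv
from pathlib import Path

PATH_COLUMNS = ("r1", "r2", "assembly", "bam", "vcf")


def read_manifest(manifest_path):
    """Read a self-mode manifest into a list of dicts with resolved paths."""
    manifest_path = Path(manifest_path)
    base_dir = manifest_path.parent
    with manifest_path.open(newline="") as handle:
        # Drop blank lines and comment lines before handing rows to csv.
        lines = [line for line in handle if line.strip() and not line.startswith("#")]
    samples = []
    for row in csv.DictReader(lines, delimiter="\t"):
        entry = {"sample": row["sample"]}
        for column in PATH_COLUMNS:
            value = (row.get(column) or "").strip()
            # Empty cells mean "not provided"; other paths resolve against the manifest.
            entry[column] = (base_dir / value) if value else None
        samples.append(entry)
    return samples
```

Joining with `base_dir` leaves absolute paths unchanged, so a manifest may mix relative and absolute entries.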

## Performance Considerations

### Resource Usage

- **CPU**: Scales with `--threads` parameter (default: 4)
- **Memory**: 
  - BBTools: 4-8GB (adjustable with `--bbtools-mem`)
  - bwa-mem2: 2-4GB 
  - minimap2: 1-2GB
- **Disk**: ~2-5x input size for intermediate files
- **Network**: Required for RefSeq downloads only

### Optimization Tips

1. **Use appropriate thread count**:
   ```bash
   # For systems with many CPU cores and enough memory for the extra threads
   ssiamb self --threads 16 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
   ```

2. **Reuse intermediate files**:
   ```bash
   # First run - saves BAM
   ssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
   
   # Subsequent runs - reuses BAM
   ssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna --bam sample.sorted.bam --dp-min 15
   ```

3. **Pre-download references**:
   ```bash
   # Download all needed references first
   python -m ssiamb.refseq download --species "Escherichia coli" --output-dir $SSIAMB_REF_DIR
   python -m ssiamb.refseq download --species "Salmonella enterica" --output-dir $SSIAMB_REF_DIR
   ```

4. **Use faster mappers for large datasets**:
   ```bash
   # minimap2 is generally faster
   ssiamb self --mapper minimap2 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna
   ```

### Typical Runtimes

| Dataset Size | Mode | Mapper | Runtime (4 cores) |
|-------------|------|--------|--------------------|
| 1M PE reads | self | minimap2 | 2-5 minutes |
| 1M PE reads | ref | minimap2 | 3-7 minutes |
| 5M PE reads | self | minimap2 | 8-15 minutes |
| 5M PE reads | ref | bwa-mem2 | 15-25 minutes |

*Times vary based on genome size, coverage, and hardware*

## Contributing

This project is developed by the SSI team. The codebase is now feature-complete with comprehensive testing.

### Development Setup

```bash
# Clone repository
git clone https://github.com/ssi-dk/ssiamb.git
cd ssiamb

# Create development environment
conda env create -f environment.yml
conda activate ssiamb

# Install in editable mode
pip install -e .

# Run tests
python -m pytest tests/ -v
python smoke_test.py --skip-downloads
```

### Code Quality

Before submitting changes:

1. **Run the test suite**:
   ```bash
   python -m pytest tests/ -v
   ```

2. **Run smoke tests**:
   ```bash
   python smoke_test.py
   ```

3. **Check code style** (if configured):
   ```bash
   black src/ tests/
   isort src/ tests/
   ```

### Contact

For questions, contributions, or issues:

- **Primary Contact**: Povilas Matusevicius <pmat@ssi.dk>
- **GitHub Issues**: [Report bugs or request features](https://github.com/ssi-dk/ssiamb/issues)
- **Repository**: [https://github.com/ssi-dk/ssiamb](https://github.com/ssi-dk/ssiamb)

## Release Process

Publishing to PyPI is automated; the Bioconda and Galaxy ToolShed releases are manual follow-up steps. The release process is as follows:

### 1. Version Update

1. Update version in `pyproject.toml`:

   ```toml
   [project]
   version = "1.0.0"  # Update this
   ```

2. Update version in `recipes/ssiamb/meta.yaml`:

   ```yaml
   {% set version = "1.0.0" %}  # Update this
   ```

3. Update version in `galaxy/ssiamb.xml`:

   ```xml
   <tool id="ssiamb" name="Ambiguous Sites Counter" version="1.0.0+galaxy0">
   ```

### 2. Create Release

1. Commit version changes:

   ```bash
   git add pyproject.toml recipes/ssiamb/meta.yaml galaxy/ssiamb.xml
   git commit -m "Bump version to v1.0.0"
   git push origin main
   ```

2. Create and push tag:

   ```bash
   git tag v1.0.0
   git push origin v1.0.0
   ```

### 3. Automated Publishing

#### PyPI Publishing (Automatic)

- GitHub Actions automatically publishes to PyPI on tag push
- Uses PyPI Trusted Publishing (OIDC) - no tokens needed
- Creates signed GitHub release with artifacts

#### Bioconda Publishing (Manual)

1. Wait for PyPI release to complete
2. Update `recipes/ssiamb/meta.yaml` with correct SHA256:

   ```bash
   # Download the source distribution (not a wheel) from PyPI and compute its SHA256
   pip download ssiamb==1.0.0 --no-deps --no-binary ssiamb
   shasum -a 256 ssiamb-1.0.0.tar.gz
   ```

3. Fork [bioconda/bioconda-recipes](https://github.com/bioconda/bioconda-recipes)
4. Copy this repository's `recipes/ssiamb/` directory to `recipes/ssiamb/` in your fork
5. Create pull request to bioconda-recipes
6. Address review feedback and wait for merge
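
On systems without `shasum`, the checksum for the recipe can be computed with Python's standard library. This is a small, generic sketch; `sha256_of_file` is a hypothetical helper name and the tarball name follows the example above.

```python
import hashlib


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file("ssiamb-1.0.0.tar.gz")` gives the value to paste into the `sha256` field of `meta.yaml`.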

#### Galaxy ToolShed Publishing (Manual)

1. Install planemo: `pip install planemo`
2. Test wrapper: `planemo test galaxy/ssiamb.xml` (may fail until bioconda is available)
3. Create account on [Galaxy ToolShed](https://toolshed.g2.bx.psu.edu/)
4. Upload wrapper:

   ```bash
   cd galaxy/
   planemo shed_upload --shed_target toolshed
   ```

### 4. Post-Release

1. Verify all distributions:
   - PyPI: <https://pypi.org/project/ssiamb/>
   - Bioconda: <https://anaconda.org/bioconda/ssiamb>
   - Galaxy ToolShed: <https://toolshed.g2.bx.psu.edu/>
   - BioContainers: <https://quay.io/repository/biocontainers/ssiamb>

2. Update documentation if needed
3. Announce release

### Version Numbering

- Use semantic versioning: `MAJOR.MINOR.PATCH`
- Galaxy wrapper versions: `SOFTWARE_VERSION+galaxy0` (increment galaxy# for wrapper-only changes)
- Pre-releases: `1.0.0rc1`, `1.0.0a1`, etc.

### Troubleshooting

See `PYPI_SETUP.md` for PyPI Trusted Publishing configuration details.

## Citation

> **Note**: Citation information will be provided upon publication.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

            

installation via pip (PyPI)\npip install ssiamb\n\n# Future installation via conda (Bioconda)\nconda install -c bioconda ssiamb\n```\n\n## Quick Start\n\n### Basic Usage\n\n```bash\n# Check what would be done (dry run)\nssiamb self --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --assembly sample.fna --dry-run\n\n# Self-mapping mode: analyze reads against sample's own assembly\nssiamb self --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --assembly sample.fna\n\n# Reference-mapped mode: analyze against species reference\nssiamb ref --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz --species \"Escherichia coli\"\n\n# Summarize existing VCF and BAM files\nssiamb summarize --vcf sample.vcf.gz --bam sample.bam\n```\n\n### Comprehensive Examples\n\n#### Self-Mapping Mode\n\n```bash\n# Basic self-mapping\nssiamb self \\\n  --r1 reads_R1.fastq.gz \\\n  --r2 reads_R2.fastq.gz \\\n  --assembly assembly.fna \\\n  --sample MySample \\\n  --outdir results/\n\n# With custom thresholds and optional outputs\nssiamb self \\\n  --r1 reads_R1.fastq.gz \\\n  --r2 reads_R2.fastq.gz \\\n  --assembly assembly.fna \\\n  --dp-min 15 \\\n  --maf-min 0.05 \\\n  --emit-vcf \\\n  --emit-bed \\\n  --emit-matrix \\\n  --threads 8\n\n# Using bwa-mem2 instead of minimap2\nssiamb self \\\n  --r1 reads_R1.fastq.gz \\\n  --r2 reads_R2.fastq.gz \\\n  --assembly assembly.fna \\\n  --mapper bwa-mem2 \\\n  --caller bcftools\n```\n\n#### Reference-Mapping Mode\n\n```bash\n# Using species name (requires admin reference directory)\nexport SSIAMB_REF_DIR=/path/to/references\nssiamb ref \\\n  --r1 reads_R1.fastq.gz \\\n  --r2 reads_R2.fastq.gz \\\n  --species \"Escherichia coli\" \\\n  --sample MySample\n\n# Using direct reference file\nssiamb ref \\\n  --r1 reads_R1.fastq.gz \\\n  --r2 reads_R2.fastq.gz \\\n  --reference reference_genome.fna \\\n  --sample MySample\n\n# Using Bracken classification results\nssiamb ref \\\n  --r1 reads_R1.fastq.gz \\\n  --r2 reads_R2.fastq.gz \\\n  --bracken sample.bracken \\\n 
 --ref-dir /path/to/references \\\n  --min-bracken-frac 0.8\n```\n\n#### Advanced Usage\n\n```bash\n# Output to stdout (no files written)\nssiamb self \\\n  --r1 reads_R1.fastq.gz \\\n  --r2 reads_R2.fastq.gz \\\n  --assembly assembly.fna \\\n  --stdout\n\n# Reuse existing BAM file\nssiamb self \\\n  --r1 reads_R1.fastq.gz \\\n  --r2 reads_R2.fastq.gz \\\n  --assembly assembly.fna \\\n  --bam existing_alignment.bam\n\n# Append to existing TSV instead of overwriting\nssiamb self \\\n  --r1 reads_R1.fastq.gz \\\n  --r2 reads_R2.fastq.gz \\\n  --assembly assembly.fna \\\n  --tsv-mode append\n\n# Comprehensive output with provenance\nssiamb self \\\n  --r1 reads_R1.fastq.gz \\\n  --r2 reads_R2.fastq.gz \\\n  --assembly assembly.fna \\\n  --emit-vcf \\\n  --emit-bed \\\n  --emit-matrix \\\n  --emit-per-contig \\\n  --emit-provenance \\\n  --emit-multiqc\n```\n\n### Testing Your Installation\n\nRun the built-in smoke test to verify everything works:\n\n```bash\n# Quick test (skips downloads)\npython smoke_test.py --skip-downloads\n\n# Full test including reference downloads (slower)\npython smoke_test.py\n```\n\n## Output Files and Formats\n\n### Primary Output\n\n#### `ambiguous_summary.tsv`\nAlways generated. 
Single-row summary with comprehensive metrics:\n\n| Column | Description | Example |\n|--------|-------------|---------|\n| `sample` | Sample identifier | `MySample` |\n| `mode` | Analysis mode | `self`, `ref` |\n| `mapper` | Alignment tool used | `minimap2`, `bwa-mem2` |\n| `caller` | Variant caller used | `bbtools`, `bcftools` |\n| `dp_min` | Minimum depth threshold | `10` |\n| `maf_min` | Minimum MAF threshold | `0.1` |\n| `ambiguous_snv_count` | Number of ambiguous SNVs | `42` |\n| `ambiguous_snv_per_mb` | SNVs per megabase | `15.23` |\n| `callable_bases` | Bases with adequate coverage | `4651234` |\n| `genome_length` | Total genome length | `4800000` |\n| `breadth_10x` | Fraction covered at 10x+ | `0.9691` |\n| `ref_label` | Reference identifier | `Escherichia_coli\\|GCF_000005825.2` |\n| `runtime_sec` | Analysis runtime | `245.67` |\n\n### Optional Outputs\n\nEnable with command-line flags:\n\n#### `--emit-vcf`: Variant Call Format\n- **File**: `{SAMPLE}.ambiguous_sites.vcf.gz` + `.tbi` index\n- **Content**: Normalized, atomized variants passing thresholds\n- **Annotations**: Custom INFO fields (MAF, AMBIG flag, etc.)\n- **Format**: BGzip compressed, tabix indexed\n\n#### `--emit-bed`: Browser Extensible Data\n- **File**: `{SAMPLE}.ambiguous_sites.bed.gz` + `.tbi` index  \n- **Content**: Genomic intervals of ambiguous sites\n- **Columns**: `chrom`, `start`, `end`, `name`, `score`, `strand`, `sample`, `variant_class`, `ref`, `alt`, `maf`, `dp`\n- **Format**: BGzip compressed, tabix indexed\n\n#### `--emit-matrix`: Depth\u00d7MAF Matrix\n- **File**: `{SAMPLE}.ambiguity_matrix.tsv.gz`\n- **Content**: 100\u00d751 cumulative count matrix\n- **Rows**: Depth thresholds (1-100)\n- **Columns**: MAF bins (0-50, representing 0.00-0.50)\n- **Values**: Cumulative variant counts\n\n#### `--emit-per-contig`: Per-Contig Summary\n- **File**: `{SAMPLE}.per_contig_summary.tsv`\n- **Content**: Breakdown by chromosome/contig\n- **Columns**: `sample`, `contig`, `length`, 
`callable_bases_10x`, `breadth_10x`, `ambiguous_snv_count`, etc.\n\n#### `--emit-provenance`: Analysis Provenance\n- **File**: `run_provenance.json`\n- **Content**: Complete analysis parameters, tool versions, runtime info\n- **Format**: JSON array (one entry per sample)\n\n#### `--emit-multiqc`: MultiQC Integration\n- **File**: `{SAMPLE}.multiqc.tsv`\n- **Content**: Curated metrics for MultiQC reporting\n- **Use case**: Integration with MultiQC pipelines\n\n### Output Directory Structure\n\n```\nresults/\n\u251c\u2500\u2500 ambiguous_summary.tsv                    # Always generated\n\u251c\u2500\u2500 MySample.ambiguous_sites.vcf.gz          # --emit-vcf\n\u251c\u2500\u2500 MySample.ambiguous_sites.vcf.gz.tbi      # VCF index\n\u251c\u2500\u2500 MySample.ambiguous_sites.bed.gz          # --emit-bed  \n\u251c\u2500\u2500 MySample.ambiguous_sites.bed.gz.tbi      # BED index\n\u251c\u2500\u2500 MySample.ambiguity_matrix.tsv.gz         # --emit-matrix\n\u251c\u2500\u2500 MySample.per_contig_summary.tsv          # --emit-per-contig\n\u251c\u2500\u2500 MySample.multiqc.tsv                     # --emit-multiqc\n\u2514\u2500\u2500 run_provenance.json                      # --emit-provenance\n```\n\n## CLI Reference\n\n### Global Options\n\n| Option | Default | Description |\n|--------|---------|-------------|\n| `--threads` | `4` | Number of CPU threads |\n| `--outdir` | `.` | Output directory |\n| `--sample` | *inferred* | Sample name (required if auto-inference fails) |\n| `--dp-min` | `10` | Minimum depth for ambiguous sites |\n| `--maf-min` | `0.1` | Minimum minor allele frequency (post-calling filter) |\n| `--dp-cap` | `100` | Maximum depth cap (clipped to 100) |\n| `--mapper` | `minimap2` | Alignment tool (`minimap2`, `bwa-mem2`) |\n| `--caller` | `bbtools` | Variant caller (`bbtools`, `bcftools`) |\n| `--depth-tool` | `mosdepth` | Depth analysis tool (`mosdepth`, `samtools`) |\n| `--require-pass` | `False` | Only use PASS variants |\n| `--tsv-mode` | `overwrite` | 
TSV handling (`overwrite`, `append`, `fail`) |\n| `--stdout` | `False` | Write summary to stdout only |\n\n### Command-Specific Options\n\n#### `ssiamb self`\n| Option | Required | Description |\n|--------|----------|-------------|\n| `--r1` | \u2705 | Forward reads (FASTQ, gzipped OK) |\n| `--r2` | \u2705 | Reverse reads (FASTQ, gzipped OK) |\n| `--assembly` | \u2705 | Assembly FASTA file |\n| `--vcf` | \u274c | Reuse existing VCF file |\n| `--bam` | \u274c | Reuse existing BAM file |\n\n#### `ssiamb ref`\n| Option | Required | Description |\n|--------|----------|-------------|\n| `--r1` | \u2705 | Forward reads (FASTQ, gzipped OK) |\n| `--r2` | \u2705 | Reverse reads (FASTQ, gzipped OK) |\n| `--reference` | \u274c* | Direct reference FASTA |\n| `--species` | \u274c* | Species name for lookup |\n| `--bracken` | \u274c* | Bracken classification file |\n| `--ref-dir` | \u274c | Admin reference directory |\n| `--min-bracken-frac` | `0.70` | Minimum Bracken fraction |\n| `--min-bracken-reads` | `100000` | Minimum Bracken reads |\n| `--on-fail` | `error` | Bracken failure action (`error`, `self`) |\n\n*One of `--reference`, `--species`, or `--bracken` is required.\n\n#### `ssiamb summarize`\n| Option | Required | Description |\n|--------|----------|-------------|\n| `--vcf` | \u2705 | VCF file to summarize |\n| `--bam` | \u2705 | BAM file for denominator |\n| `--output` | \u274c | Output file path |\n\n## Error Codes and Troubleshooting\n\n### Exit Codes\n\n`ssiamb` follows a structured exit code system for programmatic handling:\n\n- **0**: Success\n- **1**: CLI/input errors (missing files, invalid sample names, bad arguments)\n- **2**: Reference mode selection errors (species not found, Bracken failures)\n- **3**: Reuse compatibility errors (VCF/BAM mismatch with reference)\n- **4**: External tool failures (missing tools, tool execution errors)\n- **5**: QC failures (only when `--qc-action fail` is enabled)\n\n### Common Issues and Solutions\n\n#### Missing External 
Tools\n\n**Error**: `Command not found: minimap2`\n\n**Solution**: Install dependencies via conda:\n```bash\nconda install -c bioconda minimap2 bwa-mem2 bcftools samtools mosdepth bbmap\n```\n\n#### Reference Directory Issues\n\n**Error**: `Species 'Escherichia coli' not found in admin directory`\n\n**Solutions**:\n```bash\n# Option 1: Download the species reference\npython -m ssiamb.refseq download --species \"Escherichia coli\" --output-dir $SSIAMB_REF_DIR\n\n# Option 2: Use direct reference file\nssiamb ref --reference /path/to/ecoli.fna --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz\n\n# Option 3: Set environment variable\nexport SSIAMB_REF_DIR=/path/to/your/references\n```\n\n#### Index File Issues\n\n**Error**: `Reference found but indexes missing for 'Escherichia_coli': minimap2 index`\n\n**Solution**: Regenerate indexes:\n```bash\ncd $SSIAMB_REF_DIR\nminimap2 -d Escherichia_coli.fna.mmi Escherichia_coli.fna\nbwa-mem2 index Escherichia_coli.fna\n```\n\n#### Sample Name Issues\n\n**Error**: `Sample name could not be safely inferred`\n\n**Solution**: Explicitly provide sample name:\n```bash\nssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna --sample MySample\n```\n\n#### Memory Issues\n\n**Error**: `BBTools out of memory`\n\n**Solutions**:\n```bash\n# Reduce threads\nssiamb self --threads 2 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n\n# Set BBTools memory limit  \nssiamb self --bbtools-mem 4g --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n```\n\n#### Permission Issues\n\n**Error**: `Cannot write to output directory`\n\n**Solutions**:\n```bash\n# Create directory with proper permissions\nmkdir -p /path/to/output\nchmod 755 /path/to/output\n\n# Or use different output directory\nssiamb self --outdir ~/results --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n```\n\n### Debugging Tips\n\n1. 
**Use dry-run mode** to validate inputs:\n   ```bash\n   ssiamb self --dry-run --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n   ```\n\n2. **Check tool versions**:\n   ```bash\n   minimap2 --version\n   bcftools --version\n   mosdepth --version\n   ```\n\n3. **Validate input files**:\n   ```bash\n   # Check FASTQ files\n   zcat reads_R1.fastq.gz | head -4\n   \n   # Check FASTA files\n   head assembly.fna\n   ```\n\n4. **Run smoke test**:\n   ```bash\n   python smoke_test.py --skip-downloads\n   ```\n\n### Getting Help\n\n- **CLI help**: `ssiamb --help` or `ssiamb COMMAND --help`\n- **GitHub Issues**: [Report bugs or request features](https://github.com/ssi-dk/ssiamb/issues)\n- **Contact**: Povilas Matusevicius <pmat@ssi.dk>\n\n## Output\n\n### Primary Output\n\n- **`ambiguous_summary.tsv`**: Single-row summary with ambiguous site counts and quality metrics\n\n### Optional Outputs (via flags)\n\n- **`--emit-vcf`**: Variant calls with ambiguity annotations\n- **`--emit-bed`**: BED file of ambiguous sites\n- **`--emit-matrix`**: Depth\u00d7MAF cumulative count matrix\n- **`--emit-per-contig`**: Per-contig breakdown\n- **`--emit-provenance`**: Analysis provenance and parameters\n- **`--emit-multiqc`**: MultiQC-compatible reports\n\n## Testing\n\nThe project includes comprehensive testing:\n\n### Running Tests\n\n```bash\n# Run all unit tests\npython -m pytest tests/ -v\n\n# Run specific test modules\npython -m pytest tests/test_refseq.py -v\npython -m pytest tests/test_refdir.py -v\n\n# Run smoke test (integration test)\npython smoke_test.py --skip-downloads  # Fast version\npython smoke_test.py                   # Full version with downloads\n```\n\n### Test Coverage\n\nThe test suite covers:\n\n- **Unit tests**: Core algorithms, species normalization, reference downloading\n- **Integration tests**: Full pipeline validation via smoke test\n- **Edge cases**: Error handling, malformed inputs, missing dependencies\n- **Mock testing**: External 
API calls and tool dependencies\n\n### Test Dependencies\n\nInstall test dependencies:\n\n```bash\n# Via conda environment (recommended)\nconda env create -f environment.yml\n\n# Or manually\npip install pytest numpy pysam biopython requests\n```\n\n## Development Status\n\nThis project has completed its major development milestones:\n\n- \u2705 **Planning & Specification** - Comprehensive requirements defined\n- \u2705 **Repository Bootstrap** - Package structure, CI/CD, documentation\n- \u2705 **Core Implementation** - CLI, models, and processing pipelines\n- \u2705 **External Tool Integration** - Mapping and variant calling workflows\n- \u2705 **Reference Management** - Automatic RefSeq download and indexing\n- \u2705 **Testing & Validation** - Unit tests, integration testing, smoke tests\n- \ud83d\udea7 **Packaging & Distribution** - Bioconda, containers, Galaxy tools\n\n### Recent Features\n\n- **RefSeq Integration**: Automatic reference genome downloading from NCBI\n- **Robust Indexing**: Automatic minimap2 and bwa-mem2 index generation\n- **Enhanced Testing**: Comprehensive unit tests and smoke testing\n- **Improved CLI**: Better help text, error messages, and validation\n- **Output Flexibility**: Centralized directory handling with fallbacks\n\n \n## Multi-Sample Processing\n\n`ssiamb` supports batch processing via manifest files for analyzing multiple samples efficiently.\n\n### Manifest Files\n\nCreate TSV files listing samples and their inputs:\n\n#### Self-Mode Manifest\n\n```tsv\nsample\tr1\tr2\tassembly\tbam\tvcf\nSample1\tdata/Sample1_R1.fastq.gz\tdata/Sample1_R2.fastq.gz\tassemblies/Sample1.fna\t\t\nSample2\tdata/Sample2_R1.fastq.gz\tdata/Sample2_R2.fastq.gz\tassemblies/Sample2.fna\texisting/Sample2.bam\t\nSample3\tdata/Sample3_R1.fastq.gz\tdata/Sample3_R2.fastq.gz\tassemblies/Sample3.fna\t\texisting/Sample3.vcf.gz\n```\n\n#### Reference-Mode 
Manifest\n\n```tsv\nsample\tr1\tr2\tbracken\treference\tspecies\tbam\tvcf\nSample1\tdata/Sample1_R1.fastq.gz\tdata/Sample1_R2.fastq.gz\tSample1.bracken\t\t\t\nSample2\tdata/Sample2_R1.fastq.gz\tdata/Sample2_R2.fastq.gz\t\tref/custom.fna\t\t\nSample3\tdata/Sample3_R1.fastq.gz\tdata/Sample3_R2.fastq.gz\t\t\tEscherichia coli\t\n```\n\n### Running with Manifests\n\n```bash\n# Process self-mode manifest\nssiamb self --manifest samples.tsv --outdir results/\n\n# Process reference-mode manifest  \nssiamb ref --manifest samples.tsv --ref-dir $SSIAMB_REF_DIR --outdir results/\n\n# With custom settings\nssiamb self --manifest samples.tsv --dp-min 15 --emit-vcf --threads 8\n```\n\n### Manifest Features\n\n- **Relative paths**: Resolved relative to manifest file location\n- **Optional columns**: Empty cells skip optional inputs (bam, vcf)\n- **Comments**: Lines starting with `#` are ignored\n- **Sequential processing**: Samples processed one at a time\n- **Consolidated output**: Single `ambiguous_summary.tsv` with all samples\n\n## Performance Considerations\n\n### Resource Usage\n\n- **CPU**: Scales with `--threads` parameter (default: 4)\n- **Memory**: \n  - BBTools: 4-8GB (adjustable with `--bbtools-mem`)\n  - bwa-mem2: 2-4GB \n  - minimap2: 1-2GB\n- **Disk**: ~2-5x input size for intermediate files\n- **Network**: Required for RefSeq downloads only\n\n### Optimization Tips\n\n1. **Use appropriate thread count**:\n   ```bash\n   # For high-memory systems\n   ssiamb self --threads 16 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n   ```\n\n2. **Reuse intermediate files**:\n   ```bash\n   # First run - saves BAM\n   ssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n   \n   # Subsequent runs - reuses BAM\n   ssiamb self --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna --bam sample.sorted.bam --dp-min 15\n   ```\n\n3. 
**Pre-download references**:\n   ```bash\n   # Download all needed references first\n   python -m ssiamb.refseq download --species \"Escherichia coli\" --output-dir $SSIAMB_REF_DIR\n   python -m ssiamb.refseq download --species \"Salmonella enterica\" --output-dir $SSIAMB_REF_DIR\n   ```\n\n4. **Use faster mappers for large datasets**:\n   ```bash\n   # minimap2 is generally faster\n   ssiamb self --mapper minimap2 --r1 reads_R1.fastq.gz --r2 reads_R2.fastq.gz --assembly assembly.fna\n   ```\n\n### Typical Runtimes\n\n| Dataset Size | Mode | Mapper | Runtime (4 cores) |\n|-------------|------|--------|--------------------|\n| 1M PE reads | self | minimap2 | 2-5 minutes |\n| 1M PE reads | ref | minimap2 | 3-7 minutes |\n| 5M PE reads | self | minimap2 | 8-15 minutes |\n| 5M PE reads | ref | bwa-mem2 | 15-25 minutes |\n\n*Times vary based on genome size, coverage, and hardware*\n\n## Contributing\n\nThis project is developed by the SSI team. The codebase is now feature-complete with comprehensive testing.\n\n### Development Setup\n\n```bash\n# Clone repository\ngit clone https://github.com/ssi-dk/ssiamb.git\ncd ssiamb\n\n# Create development environment\nconda env create -f environment.yml\nconda activate ssiamb\n\n# Install in editable mode\npip install -e .\n\n# Run tests\npython -m pytest tests/ -v\npython smoke_test.py --skip-downloads\n```\n\n### Code Quality\n\nBefore submitting changes:\n\n1. **Run the test suite**:\n   ```bash\n   python -m pytest tests/ -v\n   ```\n\n2. **Run smoke tests**:\n   ```bash\n   python smoke_test.py\n   ```\n\n3. 
**Check code style** (if configured):\n   ```bash\n   black src/ tests/\n   isort src/ tests/\n   ```\n\n### Contact\n\nFor questions, contributions, or issues:\n\n- **Primary Contact**: Povilas Matusevicius <pmat@ssi.dk>\n- **GitHub Issues**: [Report bugs or request features](https://github.com/ssi-dk/ssiamb/issues)\n- **Repository**: [https://github.com/ssi-dk/ssiamb](https://github.com/ssi-dk/ssiamb)\n\n## Release Process\n\nPublishing to PyPI is automated; publishing to Bioconda and Galaxy ToolShed is manual. The release process is as follows:\n\n### 1. Version Update\n\n1. Update version in `pyproject.toml`:\n\n   ```toml\n   [project]\n   version = \"1.0.0\"  # Update this\n   ```\n\n2. Update version in `recipes/ssiamb/meta.yaml`:\n\n   ```yaml\n   {% set version = \"1.0.0\" %}  # Update this\n   ```\n\n3. Update version in `galaxy/ssiamb.xml`:\n\n   ```xml\n   <tool id=\"ssiamb\" name=\"Ambiguous Sites Counter\" version=\"1.0.0+galaxy0\">\n   ```\n\n### 2. Create Release\n\n1. Commit version changes:\n\n   ```bash\n   git add pyproject.toml recipes/ssiamb/meta.yaml galaxy/ssiamb.xml\n   git commit -m \"Bump version to v1.0.0\"\n   git push origin main\n   ```\n\n2. Create and push tag:\n\n   ```bash\n   git tag v1.0.0\n   git push origin v1.0.0\n   ```\n\n### 3. Automated Publishing\n\n#### PyPI Publishing (Automatic)\n\n- GitHub Actions automatically publishes to PyPI on tag push\n- Uses PyPI Trusted Publishing (OIDC) - no tokens needed\n- Creates signed GitHub release with artifacts\n\n#### Bioconda Publishing (Manual)\n\n1. Wait for PyPI release to complete\n2. Update `recipes/ssiamb/meta.yaml` with correct SHA256:\n\n   ```bash\n   # Get SHA256 of the source distribution from the PyPI release\n   # (--no-binary :all: makes pip fetch the sdist instead of the wheel)\n   pip download ssiamb==1.0.0 --no-deps --no-binary :all:\n   shasum -a 256 ssiamb-1.0.0.tar.gz\n   ```\n\n3. Fork [bioconda/bioconda-recipes](https://github.com/bioconda/bioconda-recipes)\n4. Copy this repository's `recipes/ssiamb/` directory to `recipes/ssiamb/` in the fork\n5. Create pull request to bioconda-recipes\n6. 
Address review feedback and wait for merge\n\n#### Galaxy ToolShed Publishing (Manual)\n\n1. Install planemo: `pip install planemo`\n2. Test wrapper: `planemo test galaxy/ssiamb.xml` (may fail until bioconda is available)\n3. Create account on [Galaxy ToolShed](https://toolshed.g2.bx.psu.edu/)\n4. Upload wrapper:\n\n   ```bash\n   cd galaxy/\n   planemo shed_upload --shed_target toolshed\n   ```\n\n### 4. Post-Release\n\n1. Verify all distributions:\n   - PyPI: <https://pypi.org/project/ssiamb/>\n   - Bioconda: <https://anaconda.org/bioconda/ssiamb>\n   - Galaxy ToolShed: <https://toolshed.g2.bx.psu.edu/>\n   - BioContainers: <https://quay.io/repository/biocontainers/ssiamb>\n\n2. Update documentation if needed\n3. Announce release\n\n### Version Numbering\n\n- Use semantic versioning: `MAJOR.MINOR.PATCH`\n- Galaxy wrapper versions: `SOFTWARE_VERSION+galaxy0` (increment galaxy# for wrapper-only changes)\n- Pre-releases: `1.0.0rc1`, `1.0.0a1`, etc.\n\n### Troubleshooting\n\nSee `PYPI_SETUP.md` for PyPI Trusted Publishing configuration details.\n\n## Citation\n\n> **Note**: Citation information will be provided upon publication.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "SSI Ambiguous Site Detection Tool",
    "version": "0.9.0",
    "project_urls": {
        "Homepage": "https://github.com/ssi-dk/ssiamb",
        "Issues": "https://github.com/ssi-dk/ssiamb/issues",
        "Repository": "https://github.com/ssi-dk/ssiamb"
    },
    "split_keywords": [
        "bacterial-genomics",
        " bioinformatics",
        " genomics",
        " mapping",
        " variant-calling"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1428f80a36689976d96b7844b9ef8c21106213b7e32377017d72bfd1f503c459",
                "md5": "7dd184864a17dbde74f76bdef1b4eb97",
                "sha256": "f8ac3cbf81a8ff9d4ec481804168d9badec7b035bd1493401e290f6ae9b9d9fc"
            },
            "downloads": -1,
            "filename": "ssiamb-0.9.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7dd184864a17dbde74f76bdef1b4eb97",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.12",
            "size": 92908,
            "upload_time": "2025-10-09T12:18:31",
            "upload_time_iso_8601": "2025-10-09T12:18:31.792367Z",
            "url": "https://files.pythonhosted.org/packages/14/28/f80a36689976d96b7844b9ef8c21106213b7e32377017d72bfd1f503c459/ssiamb-0.9.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "5a4bdb01f3c0622a7c3f96dfe1ed77cd33eeeb825665c27087b01843eddc26e4",
                "md5": "2edc48ea9e2384deb829da19eaaf68cc",
                "sha256": "5339fb118c0f3d4bdec0b54d700fc7d31f433c7d01f68dd29827b919d5a6b396"
            },
            "downloads": -1,
            "filename": "ssiamb-0.9.0.tar.gz",
            "has_sig": false,
            "md5_digest": "2edc48ea9e2384deb829da19eaaf68cc",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.12",
            "size": 79757430,
            "upload_time": "2025-10-09T12:18:34",
            "upload_time_iso_8601": "2025-10-09T12:18:34.337000Z",
            "url": "https://files.pythonhosted.org/packages/5a/4b/db01f3c0622a7c3f96dfe1ed77cd33eeeb825665c27087b01843eddc26e4/ssiamb-0.9.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-09 12:18:34",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ssi-dk",
    "github_project": "ssiamb",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "biopython",
            "specs": [
                [
                    "==",
                    "1.85"
                ]
            ]
        },
        {
            "name": "black",
            "specs": [
                [
                    "==",
                    "25.1.0"
                ]
            ]
        },
        {
            "name": "build",
            "specs": [
                [
                    "==",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "click",
            "specs": [
                [
                    "==",
                    "8.2.1"
                ]
            ]
        },
        {
            "name": "coverage",
            "specs": [
                [
                    "==",
                    "7.10.6"
                ]
            ]
        },
        {
            "name": "flake8",
            "specs": [
                [
                    "==",
                    "7.3.0"
                ]
            ]
        },
        {
            "name": "hatchling",
            "specs": [
                [
                    "==",
                    "1.27.0"
                ]
            ]
        },
        {
            "name": "iniconfig",
            "specs": [
                [
                    "==",
                    "2.1.0"
                ]
            ]
        },
        {
            "name": "markdown-it-py",
            "specs": [
                [
                    "==",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "mccabe",
            "specs": [
                [
                    "==",
                    "0.7.0"
                ]
            ]
        },
        {
            "name": "mdurl",
            "specs": [
                [
                    "==",
                    "0.1.2"
                ]
            ]
        },
        {
            "name": "mypy",
            "specs": [
                [
                    "==",
                    "1.18.1"
                ]
            ]
        },
        {
            "name": "mypy_extensions",
            "specs": [
                [
                    "==",
                    "1.1.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "2.3.3"
                ]
            ]
        },
        {
            "name": "packaging",
            "specs": [
                [
                    "==",
                    "25.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.3.2"
                ]
            ]
        },
        {
            "name": "pathspec",
            "specs": [
                [
                    "==",
                    "0.12.1"
                ]
            ]
        },
        {
            "name": "platformdirs",
            "specs": [
                [
                    "==",
                    "4.4.0"
                ]
            ]
        },
        {
            "name": "pluggy",
            "specs": [
                [
                    "==",
                    "1.6.0"
                ]
            ]
        },
        {
            "name": "pycodestyle",
            "specs": [
                [
                    "==",
                    "2.14.0"
                ]
            ]
        },
        {
            "name": "pyflakes",
            "specs": [
                [
                    "==",
                    "3.4.0"
                ]
            ]
        },
        {
            "name": "Pygments",
            "specs": [
                [
                    "==",
                    "2.19.2"
                ]
            ]
        },
        {
            "name": "pyproject_hooks",
            "specs": [
                [
                    "==",
                    "1.2.0"
                ]
            ]
        },
        {
            "name": "pysam",
            "specs": [
                [
                    "==",
                    "0.23.3"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    "==",
                    "8.4.2"
                ]
            ]
        },
        {
            "name": "pytest-cov",
            "specs": [
                [
                    "==",
                    "7.0.0"
                ]
            ]
        },
        {
            "name": "python-dateutil",
            "specs": [
                [
                    "==",
                    "2.9.0.post0"
                ]
            ]
        },
        {
            "name": "pytz",
            "specs": [
                [
                    "==",
                    "2025.2"
                ]
            ]
        },
        {
            "name": "PyYAML",
            "specs": [
                [
                    "==",
                    "6.0.2"
                ]
            ]
        },
        {
            "name": "rich",
            "specs": [
                [
                    "==",
                    "14.1.0"
                ]
            ]
        },
        {
            "name": "setuptools",
            "specs": [
                [
                    "==",
                    "80.9.0"
                ]
            ]
        },
        {
            "name": "shellingham",
            "specs": [
                [
                    "==",
                    "1.5.4"
                ]
            ]
        },
        {
            "name": "six",
            "specs": [
                [
                    "==",
                    "1.17.0"
                ]
            ]
        },
        {
            "name": "trove-classifiers",
            "specs": [
                [
                    "==",
                    "2025.9.11.17"
                ]
            ]
        },
        {
            "name": "typer",
            "specs": [
                [
                    "==",
                    "0.17.4"
                ]
            ]
        },
        {
            "name": "typing_extensions",
            "specs": [
                [
                    "==",
                    "4.15.0"
                ]
            ]
        },
        {
            "name": "tzdata",
            "specs": [
                [
                    "==",
                    "2025.2"
                ]
            ]
        },
        {
            "name": "wheel",
            "specs": [
                [
                    "==",
                    "0.45.1"
                ]
            ]
        }
    ],
    "lcname": "ssiamb"
}
        