py-gbcms

Name: py-gbcms
Version: 2.0.0
Summary: Python implementation of GetBaseCountsMultiSample (gbcms) for calculating base counts in BAM files
Upload time: 2025-10-08 14:28:52
Requires Python: >=3.11
License: AGPL-3.0
Keywords: bam, base-counts, bioinformatics, gbcms, genomics, maf, vcf

# py-gbcms - Python Implementation of gbcms

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-AGPL%203.0-blue.svg)](https://opensource.org/licenses/AGPL-3.0)

A high-performance Python reimplementation of [GetBaseCountsMultiSample](https://github.com/msk-access/GetBaseCountsMultiSample) for calculating base counts in multiple BAM files at variant positions specified in VCF or MAF files.

**Package**: `py-gbcms` | **CLI**: `gbcms`

## ✨ Features

- 🚀 **High Performance**: Multi-threaded processing with efficient BAM file handling
  - **Smart hybrid counting** strategy: numba for simple SNPs (50-100x faster), counter.py for complex variants
  - **Numba JIT compilation** for optimized counting operations
  - **joblib** for efficient local parallelization
  - **Ray** support for distributed computing across clusters
- 🔬 **Strand Bias Analysis**: Statistical strand bias detection using Fisher's exact test
- 🎨 **Beautiful CLI**: Rich terminal output with progress bars and colored logging
- 🔒 **Type Safety**: Pydantic models for runtime validation and type checking
- 🐳 **Docker Support**: Containerized deployment for reproducibility
- 📊 **Multiple Formats**: Support for both VCF and MAF input/output formats with strand bias
- 🧪 **Well Tested**: Comprehensive unit tests with high coverage
- 🔧 **Modern Python**: Built with type hints, Pydantic models, and modern Python practices

## 🚀 Quick Start

### Installation

```bash
# Install with all features
uv pip install "py-gbcms[all]"

# Or with pip
pip install "py-gbcms[all]"

# For development (includes scipy-stubs for type checking)
uv pip install -e ".[dev]"
```

**Core Dependencies:**
- `pysam>=0.22.0` - BAM file processing
- `numpy>=1.24.0` - Numerical computations
- `scipy>=1.11.0` - Statistical analysis (Fisher's exact test for strand bias)
- `pandas>=2.0.0` - Data manipulation
- `pydantic>=2.0.0` - Runtime validation
- `numba>=0.58.0` - JIT compilation for performance
- `joblib>=1.3.0` - Parallel processing

**Optional Dependencies:**
- `cyvcf2>=0.30.0` - Fast VCF parsing (`py-gbcms[fast]`)
- `ray>=2.7.0` - Distributed computing (`py-gbcms[ray]`)

**Development Dependencies:**
- `scipy-stubs>=1.11.0` - Type stubs for scipy (for mypy type checking)

**Requirements:** Python 3.11 or later

### Basic Usage

```bash
# Run with a VCF file
gbcms count run \
    --fasta reference.fa \
    --bam sample1:sample1.bam \
    --vcf variants.vcf \
    --output counts.txt \
    --thread 8

# Run with MAF file
gbcms count run \
    --fasta reference.fa \
    --bam-fof bam_files.txt \
    --maf variants.maf \
    --output counts.txt
```

### Docker Usage

```bash
# Pull the image
docker pull ghcr.io/msk-access/getbasecounts:latest

# Run the container (--omaf requires MAF input)
docker run --rm \
    -v "$(pwd)/data:/data" \
    ghcr.io/msk-access/getbasecounts:latest \
    gbcms count run \
    --omaf \
    --fasta /data/reference.fa \
    --bam sample1:/data/sample1.bam \
    --maf /data/variants.maf \
    --output /data/counts.maf
```

### BAM File of Files Format

Create a tab-separated file (`bam_files.txt`):
```
sample1	/path/to/sample1.bam
sample2	/path/to/sample2.bam
sample3	/path/to/sample3.bam
```
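
To sanity-check a file in this format before a long run, a few lines of Python are enough. This is a minimal sketch, not the package's own loader: the function name `parse_bam_fof` and its error handling are assumptions made for illustration.

```python
from pathlib import Path


def parse_bam_fof(text: str) -> dict[str, Path]:
    """Parse 'sample<TAB>path' lines into a {sample_name: bam_path} mapping.

    Hypothetical helper for illustration; py-gbcms may validate differently.
    """
    samples: dict[str, Path] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(fields)}"
            )
        name, path = (field.strip() for field in fields)
        if name in samples:
            raise ValueError(f"line {line_no}: duplicate sample name {name!r}")
        samples[name] = Path(path)
    return samples


example = "sample1\t/path/to/sample1.bam\nsample2\t/path/to/sample2.bam\n"
bams = parse_bam_fof(example)
for name, path in bams.items():
    print(name, path)
```

Rejecting duplicate sample names early is a design choice of this sketch; it avoids two BAM files silently sharing one output column.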

Then use:

```bash
gbcms count run \
    --fasta reference.fa \
    --bam-fof bam_files.txt \
    --vcf variants.vcf \
    --output counts.txt
```

## Command Line Options

### Commands

gbcms uses subcommands for different operations:

- `gbcms count run`: Run base counting on variants (main command)
- `gbcms validate files`: Validate input files before processing
- `gbcms version`: Show version information
- `gbcms info`: Show tool capabilities and information

### Count Run Options

#### Required Arguments

- `--fasta`, `-f`: Reference genome FASTA file (must be indexed with .fai)
- `--output`, `-o`: Output file path

#### BAM Input (at least one required)

- `--bam`, `-b`: BAM file in format `SAMPLE_NAME:BAM_FILE` (can be specified multiple times)
- `--bam-fof`: File containing sample names and BAM paths (tab-separated)
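
The `SAMPLE_NAME:BAM_FILE` form can be split as sketched below. The helper name `parse_bam_spec` is hypothetical, and splitting on the first colon only is an assumption of this sketch (it keeps any later colons in the path intact); the CLI's actual parsing may differ.

```python
def parse_bam_spec(spec: str) -> tuple[str, str]:
    """Split 'SAMPLE_NAME:BAM_FILE' into (sample_name, bam_path).

    Hypothetical helper: only the first colon is treated as the separator.
    """
    name, sep, path = spec.partition(":")
    if not sep or not name or not path:
        raise ValueError(f"expected SAMPLE_NAME:BAM_FILE, got {spec!r}")
    return name, path


print(parse_bam_spec("sample1:/data/sample1.bam"))
```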

#### Variant Input (one required)

- `--maf`: Input variant file in MAF format (can be specified multiple times)
- `--vcf`: Input variant file in VCF format (can be specified multiple times)

#### Output Options

- `--omaf`: Output in MAF format (only with MAF input)
- `--positive-count/--no-positive-count`: Output positive strand counts (default: True)
- `--negative-count/--no-negative-count`: Output negative strand counts (default: False)
- `--fragment-count/--no-fragment-count`: Output fragment counts (default: False)
- `--fragment-fractional-weight`: Use fractional weights for fragments (default: False)

#### Quality Filters
- `--maq`: Mapping quality threshold (default: 20)
- `--baq`: Base quality threshold (default: 0)
- `--filter-duplicate`: Filter duplicate reads (default: True)
- `--filter-improper-pair`: Filter improperly paired reads (default: False)
- `--filter-qc-failed`: Filter QC failed reads (default: False)
- `--filter-indel/--no-filter-indel`: Filter reads with indels (default: False)
- `--filter-non-primary/--no-filter-non-primary`: Filter non-primary alignments (default: False)
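
The read-level filters above map naturally onto SAM flag bits. The following is an illustrative sketch only, not the package's implementation: the `ReadFilters` class and `keep_read` function are hypothetical names, the defaults are copied from the option list, and treating supplementary alignments as "non-primary" is an assumption.

```python
from dataclasses import dataclass

# Flag bits as defined in the SAM specification.
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


@dataclass(frozen=True)
class ReadFilters:
    """Hypothetical container; defaults mirror the documented CLI defaults."""
    maq: int = 20
    filter_duplicate: bool = True
    filter_improper_pair: bool = False
    filter_qc_failed: bool = False
    filter_non_primary: bool = False


def keep_read(flag: int, mapq: int, filters: ReadFilters = ReadFilters()) -> bool:
    """Return True if a read with this SAM flag and mapping quality is kept."""
    if mapq < filters.maq:
        return False
    if filters.filter_duplicate and (flag & FLAG_DUPLICATE):
        return False
    if filters.filter_qc_failed and (flag & FLAG_QC_FAIL):
        return False
    # Assumption: "non-primary" covers secondary and supplementary alignments.
    if filters.filter_non_primary and (flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY)):
        return False
    if (
        filters.filter_improper_pair
        and (flag & FLAG_PAIRED)
        and not (flag & FLAG_PROPER_PAIR)
    ):
        return False
    return True


print(keep_read(flag=FLAG_DUPLICATE, mapq=60))  # duplicate read under default filters
```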

#### Performance Options
- `--thread`, `-t`: Number of threads (default: 1)
- `--max-block-size`: Maximum variants per block (default: 10000)
- `--max-block-dist`: Maximum block distance in bp (default: 100000)
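
One plausible reading of the two block options is sketched below: sorted variants are grouped into a block until the block is full, the next variant lies too far from the block's first variant, or the chromosome changes. The function `make_blocks` and these exact semantics are assumptions for illustration; the package's real blocking logic may differ.

```python
from typing import Iterable, Iterator

Variant = tuple[str, int]  # (chromosome, 1-based position)


def make_blocks(
    variants: Iterable[Variant],
    max_block_size: int = 10000,
    max_block_dist: int = 100000,
) -> Iterator[list[Variant]]:
    """Group position-sorted variants into blocks (hypothetical semantics)."""
    block: list[Variant] = []
    for chrom, pos in variants:
        if block:
            start_chrom, start_pos = block[0]
            if (
                chrom != start_chrom
                or len(block) >= max_block_size
                or pos - start_pos > max_block_dist
            ):
                yield block  # close the current block and start a new one
                block = []
        block.append((chrom, pos))
    if block:
        yield block


variants = [("chr1", 100), ("chr1", 200), ("chr1", 500000), ("chr2", 10)]
for block in make_blocks(variants):
    print(block)
```

Under this reading, smaller blocks mean more, shorter BAM fetches, while larger blocks mean fewer fetches that each span more of the genome.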

#### Advanced Options
- `--generic-counting`: Use generic counting algorithm for complex variants
- `--suppress-warning`: Maximum warnings per type (default: 3)
- `--verbose`, `-v`: Enable verbose logging

### Validate Files Options

- `--fasta`, `-f`: Reference FASTA file to validate
- `--bam`, `-b`: BAM files to validate (SAMPLE:PATH format, multiple allowed)
- `--vcf`: VCF files to validate (multiple allowed)
- `--maf`: MAF files to validate (multiple allowed)

## Output Format

### VCF Format (Proper VCF with INFO fields)

**Extension**: `.vcf`

**Structure**: Standard VCF format with count and strand bias information in FORMAT and INFO fields

**Example**:
```vcf
##fileformat=VCFv4.2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth across all samples">
##INFO=<ID=SB,Number=3,Type=Float,Description="Strand bias p-value, odds ratio, direction">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total depth for this sample">
##FORMAT=<ID=SB,Number=3,Type=Float,Description="Strand bias for this sample">
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO                    FORMAT      SAMPLE1             SAMPLE2
chr1    100  .   A    T    .     .       DP=95;SB=0.001234,2.5,reverse  DP:RD:AD:SB  50:30:20:0.001234,2.5,reverse  45:25:20:0.95,1.2,none
```

**INFO Fields**:
- `DP`: Total depth across all samples
- `SB`: Strand bias (p-value,odds_ratio,direction) - most significant across samples
- `FSB`: Fragment strand bias (when fragment counting enabled)

**FORMAT Fields**:
- `DP`: Total depth for this sample
- `RD`: Reference allele depth for this sample
- `AD`: Alternate allele depth for this sample
- `DPP`: Positive strand depth (if enabled)
- `RDP`: Positive strand reference depth (if enabled)
- `ADP`: Positive strand alternate depth (if enabled)
- `DPF`: Fragment depth (if enabled)
- `RDF`: Fragment reference depth (if enabled)
- `ADF`: Fragment alternate depth (if enabled)
- `SB`: Strand bias (p-value,odds_ratio,direction) for this sample
- `FSB`: Fragment strand bias for this sample (if enabled)
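
Downstream scripts often need these values as numbers. The sketch below parses an INFO string of the documented shape using only the standard library; the helper names are hypothetical, and the three-part `SB` layout (p-value, odds ratio, direction) is taken from the field descriptions above.

```python
def parse_info(info: str) -> dict[str, object]:
    """Split a VCF INFO string into {key: [values]}; flag keys map to True."""
    fields: dict[str, object] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value.split(",") if sep else True
    return fields


def parse_strand_bias(values: list[str]) -> tuple[float, float, str]:
    """Convert an SB triple into (p_value, odds_ratio, direction)."""
    p_value, odds_ratio, direction = values
    return float(p_value), float(odds_ratio), direction


info = parse_info("DP=95;SB=0.001234,2.5,reverse")
depth = int(info["DP"][0])
sb = parse_strand_bias(info["SB"])
print(depth, sb)
```

For anything beyond quick inspection, a dedicated VCF parser is usually the safer choice.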

### MAF Format

When using `--omaf`, the output maintains the MAF format with updated count columns for tumor and normal samples, including strand bias information.

## Development

### Setup Development Environment

```bash
# Clone the repository
git clone https://github.com/msk-access/py-gbcms.git
cd py-gbcms

# Install with development dependencies (includes scipy-stubs for type checking)
uv pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=gbcms --cov-report=html

# Run specific test file
pytest tests/test_counter.py -v
```

### Code Quality

```bash
# Format code
black src/ tests/

# Lint code
ruff check src/ tests/

# Type checking (requires scipy-stubs)
mypy src/
```

### Building Docker Image

```bash
# Build the image
docker build -t gbcms:latest .

# Run tests in container
docker run --rm gbcms:latest pytest
```

## Performance Comparison

Compared to the original C++ implementation:

| Feature | C++ Version | Python (Smart Hybrid) | Python (Pure counter.py) |
|---------|-------------|----------------------|-------------------------|
| **Simple SNPs** | ~1x | **~50-100x faster** | ~0.8-1.2x |
| **Complex Variants** | ~1x | **~1x** | ~0.8-1.2x |
| **Overall** | Baseline | **~10-50x faster** | ~0.8-1.2x |
| Memory | Baseline | ~1.2x | ~1.2x |
| Multi-threading | OpenMP | joblib/Ray | joblib/Ray |
| Dependencies | bamtools, zlib | pysam, numpy, numba, joblib, ray | pysam, numpy, numba, joblib, ray |
| Scalability | Single machine | Single machine | Multi-node clusters |

**Smart Hybrid Strategy:**
- **Simple SNPs**: Uses `numba_counter` (50-100x faster than C++)
- **Complex variants**: Uses `counter.py` (C++ equivalent accuracy)
- **Automatic selection**: Optimal algorithm chosen per variant type

**Notes:**
- Performance varies with workload and Python version; Python 3.11+ shows significant improvements.
- The Smart Hybrid figures assume Numba JIT compilation and smart algorithm selection. See [Fast VCF Parsing](docs/CYVCF2_SUPPORT.md) for benchmarks.

## Architecture

The package is organized into distinct modules with clear responsibilities:

```
py-gbcms/
├── src/gbcms/
│   ├── __init__.py
│   ├── cli.py              # 🎨 Typer CLI interface with Rich
│   ├── config.py           # ⚙️  Configuration dataclasses (legacy)
│   ├── models.py           # 🔒 Pydantic models for type safety ⭐
│   ├── variant.py          # 📄 Variant loading (VCF/MAF)
│   ├── counter.py          # 🐢 Pure Python counting (baseline)
│   ├── numba_counter.py    # ⚡ Numba-optimized counting (50-100x faster) ⭐
│   ├── parallel.py         # 🔄 joblib/Ray parallelization ⭐
│   ├── reference.py        # 🧬 Reference sequence handling
│   ├── output.py           # 📤 Output formatting with strand bias ⭐
│   └── processor.py        # 🎯 Main processing pipeline
├── tests/                  # 🧪 Comprehensive test suite
├── scripts/                # 🛠️  Setup and verification scripts
├── Dockerfile              # 🐳 Production container
├── pyproject.toml          # 📦 Package configuration
└── docs/
    ├── README.md           # Main documentation
    ├── ARCHITECTURE.md     # Module relationships ⭐
    ├── INSTALLATION.md     # Setup guide
    ├── QUICKSTART.md       # 5-minute start
    ├── ADVANCED_FEATURES.md # Pydantic, Numba, Ray ⭐
    └── CLI_FEATURES.md     # CLI documentation
```

### Module Relationships

```
CLI (cli.py)
    ↓
Configuration (models.py)
    ↓
Processor (processor.py)
    ├─→ Variant Loader (variant.py)
    ├─→ Reference (reference.py)
    ├─→ Smart Counting Engine ⭐
    │   ├─→ counter.py (Pure Python - accurate, slower)
    │   └─→ numba_counter.py (JIT compiled - 50-100x faster for SNPs) ⭐
    ├─→ Strand Bias Analysis (counter.py + output.py) ⭐
    ├─→ Parallelization (parallel.py)
    └─→ Output (output.py)
```

**Key Algorithm Selection:**
- **`Smart Hybrid Strategy`**: Automatically chooses optimal algorithm per variant
- **`counter.py`**: Pure Python, easy to debug, baseline performance
- **`numba_counter.py`**: JIT-compiled, 50-100x faster for simple SNPs, for production
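
The per-variant choice described above can be pictured as a small dispatch function. This sketch is an assumption about what "simple SNP" means (a single-base substitution between unambiguous bases); the function names are hypothetical, and the returned strings only label which module would handle the variant.

```python
SIMPLE_BASES = frozenset("ACGT")


def is_simple_snp(ref: str, alt: str) -> bool:
    """True for a single-base substitution between unambiguous bases (assumed definition)."""
    ref, alt = ref.upper(), alt.upper()
    return (
        len(ref) == 1
        and len(alt) == 1
        and ref != alt
        and ref in SIMPLE_BASES
        and alt in SIMPLE_BASES
    )


def choose_counter(ref: str, alt: str) -> str:
    """Return the name of the module that would count this variant."""
    return "numba_counter" if is_simple_snp(ref, alt) else "counter"


for ref, alt in [("A", "T"), ("AT", "A"), ("A", "ATG"), ("AC", "GT")]:
    print(ref, alt, "->", choose_counter(ref, alt))
```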

**New Features:**
- **Strand Bias Analysis**: Statistical detection using Fisher's exact test ⭐
- **Enhanced Output**: VCF/MAF formats with strand bias columns ⭐

See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed module relationships.

## Documentation

📚 **[Complete Documentation](docs/README.md)** | [Quick Start](docs/QUICKSTART.md) | [Contributing Guide](CONTRIBUTING.md) | [Package Structure](docs/PACKAGE_STRUCTURE.md) | [Testing Guide](docs/TESTING_GUIDE.md)

- **User Guide**
  - [Input & Output Formats](docs/INPUT_OUTPUT.md)

- **Advanced**
  - [Advanced Features](docs/ADVANCED_FEATURES.md)
  - [Parallelization Guide](docs/PARALLELIZATION_GUIDE.md)
  - [Fast VCF Parsing (cyvcf2)](docs/CYVCF2_SUPPORT.md)

- **Reference**
  - [Architecture](docs/ARCHITECTURE.md)
  - [C++ Comparison](docs/CPP_FEATURE_COMPARISON.md)
  - [FAQ](docs/FAQ.md)

- **Docker & Deployment**
  - [Docker Guide](docs/DOCKER_GUIDE.md)

## Advanced Features

py-gbcms includes several advanced features for performance and scalability:

### 🔒 Type Safety with Pydantic

```python
from pathlib import Path

from gbcms.models import GetBaseCountsConfig

# Runtime validation of all inputs
config = GetBaseCountsConfig(
    fasta_file=Path("reference.fa"),
    bam_files=[...],
    variant_files=[...],
)
```

### 🔬 Strand Bias Analysis

```python
# Strand bias is automatically calculated for all variants
# Uses Fisher's exact test for statistical rigor
from gbcms.counter import BaseCounter

counter = BaseCounter(config)
# Strand bias calculated during counting and included in output
```

**Features:**
- **Fisher's exact test** for statistically sound strand bias detection
- **Automatic direction detection** (forward, reverse, or none)
- **Minimum depth filtering** (10 reads) for reliable calculations
- **VCF and MAF output** with strand bias columns
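
To make the statistics concrete, here is a dependency-free sketch of a two-sided Fisher's exact test on the 2x2 table of reference/alternate counts by forward/reverse strand. The package lists scipy for this step; the version below uses only the standard library so it runs anywhere. The function names, the 0.05 significance cutoff, and the rule used to pick a direction are assumptions for illustration, not the package's documented behaviour.

```python
from math import comb, inf

MIN_DEPTH = 10  # minimum total depth, taken from the feature list above


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(p_x for p_x in map(prob, range(lo, hi + 1)) if p_x <= p_obs * (1 + 1e-9))
    return min(p, 1.0)


def strand_bias(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int, alpha: float = 0.05):
    """Return (p_value, odds_ratio, direction), or None below MIN_DEPTH.

    Hypothetical direction rule: if the test is significant at `alpha`, report
    the strand carrying more alternate reads; otherwise report "none".
    """
    if ref_fwd + ref_rev + alt_fwd + alt_rev < MIN_DEPTH:
        return None
    p_value = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    cross = ref_rev * alt_fwd
    odds_ratio = (ref_fwd * alt_rev) / cross if cross else inf  # inf when the denominator is zero
    if p_value >= alpha or alt_fwd == alt_rev:
        direction = "none"
    else:
        direction = "forward" if alt_fwd > alt_rev else "reverse"
    return p_value, odds_ratio, direction


print(strand_bias(ref_fwd=15, ref_rev=15, alt_fwd=1, alt_rev=19))
```

The exact test is preferred over a chi-squared approximation here because strand counts at a single site are often small.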

### ⚡ Performance with Smart Hybrid Strategy

```python
# Automatic algorithm selection based on variant complexity
from gbcms.counter import BaseCounter

counter = BaseCounter(config)
# Automatically uses:
# - numba_counter for simple SNPs (50-100x faster)
# - counter.py for complex variants (maximum accuracy)
```

### 🔄 Parallelization with joblib

```bash
# Use joblib for efficient local parallelization
gbcms count run --thread 16 --backend joblib ...
```

### 🌐 Distributed Computing with Ray

```bash
# Install with Ray support
uv pip install "py-gbcms[ray]"

# Use Ray for distributed processing
gbcms count run --thread 32 --backend ray --use-ray ...
```

See [ADVANCED_FEATURES.md](docs/ADVANCED_FEATURES.md) for detailed documentation and benchmarks.

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Make your changes with tests
4. Run the test suite and linters
5. Submit a pull request

## Citation

If you use this tool in your research, please cite:

```
GetBaseCountsMultiSample: A tool for calculating base counts in multiple BAM files
MSK-ACCESS Team
https://github.com/msk-access/GetBaseCountsMultiSample
```

## License

AGPL-3.0 License - See [LICENSE](LICENSE) for details.

## Support

- 🐛 Report bugs: [GitHub Issues](https://github.com/msk-access/py-gbcms/issues)
- 💬 Ask questions: [GitHub Discussions](https://github.com/msk-access/py-gbcms/discussions)
- 📧 Email: shahr2@mskcc.org

## Acknowledgments

This is a Python reimplementation of the original C++ tool developed by the MSK-ACCESS team. Special thanks to the original authors and contributors.

            
