# py-gbcms - Python Implementation of gbcms
A high-performance Python reimplementation of [GetBaseCountsMultiSample](https://github.com/msk-access/GetBaseCountsMultiSample) for calculating base counts in multiple BAM files at variant positions specified in VCF or MAF files.
**Package**: `py-gbcms` | **CLI**: `gbcms`
## ✨ Features
- 🚀 **High Performance**: Multi-threaded processing with efficient BAM file handling
  - **Smart hybrid counting** strategy: numba for simple SNPs (50-100x faster), counter.py for complex variants
  - **Numba JIT compilation** for optimized counting operations
  - **joblib** for efficient local parallelization
  - **Ray** support for distributed computing across clusters
- 🔬 **Strand Bias Analysis**: Statistical strand bias detection using Fisher's exact test
- 🎨 **Beautiful CLI**: Rich terminal output with progress bars and colored logging
- 🔒 **Type Safety**: Pydantic models for runtime validation and type checking
- 🐳 **Docker Support**: Containerized deployment for reproducibility
- 📊 **Multiple Formats**: Support for both VCF and MAF input/output formats with strand bias
- 🧪 **Well Tested**: Comprehensive unit tests with high coverage
- 🔧 **Modern Python**: Built with type hints, Pydantic models, and modern Python practices
## 🚀 Quick Start
### Installation
```bash
# Install with all features
uv pip install "py-gbcms[all]"
# Or with pip
pip install "py-gbcms[all]"
# For development, from a repository checkout (adds scipy-stubs for type checking)
uv pip install -e ".[dev]"
```
**Core Dependencies:**
- `pysam>=0.22.0` - BAM file processing
- `numpy>=1.24.0` - Numerical computations
- `scipy>=1.11.0` - Statistical analysis (Fisher's exact test for strand bias)
- `pandas>=2.0.0` - Data manipulation
- `pydantic>=2.0.0` - Runtime validation
- `numba>=0.58.0` - JIT compilation for performance
- `joblib>=1.3.0` - Parallel processing
**Optional Dependencies:**
- `cyvcf2>=0.30.0` - Fast VCF parsing (`py-gbcms[fast]`)
- `ray>=2.7.0` - Distributed computing (`py-gbcms[ray]`)
**Development Dependencies:**
- `scipy-stubs>=1.11.0` - Type stubs for scipy (for mypy type checking)
**Requirements:** Python 3.11 or later
### Basic Usage
```bash
# Run with a VCF file
gbcms count run \
    --fasta reference.fa \
    --bam sample1:sample1.bam \
    --vcf variants.vcf \
    --output counts.txt \
    --thread 8
# Run with a MAF file
gbcms count run \
    --fasta reference.fa \
    --bam-fof bam_files.txt \
    --maf variants.maf \
    --output counts.txt
```
### Docker Usage
```bash
# Pull the image
docker pull ghcr.io/msk-access/getbasecounts:latest
# Run the container (--omaf requires MAF input)
docker run --rm \
    -v "$(pwd)/data:/data" \
    ghcr.io/msk-access/getbasecounts:latest \
    gbcms count run \
    --omaf \
    --fasta /data/reference.fa \
    --bam sample1:/data/sample1.bam \
    --maf /data/variants.maf \
    --output /data/counts.maf
```
### BAM File of Files Format
Create a tab-separated file (`bam_files.txt`):
```
sample1 /path/to/sample1.bam
sample2 /path/to/sample2.bam
sample3 /path/to/sample3.bam
```
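If the sample list lives elsewhere (a sample sheet, for example), the file can be produced and sanity-checked in Python. This is a minimal sketch, not part of py-gbcms: `format_bam_fof` and `parse_bam_fof` are hypothetical helper names, and the only assumption about the format is the two tab-separated columns shown above.

```python
def format_bam_fof(samples: dict[str, str]) -> str:
    """Render one `sample<TAB>bam_path` line per sample (hypothetical helper)."""
    return "".join(f"{name}\t{bam_path}\n" for name, bam_path in samples.items())


def parse_bam_fof(text: str) -> dict[str, str]:
    """Parse file-of-files text, rejecting malformed rows and duplicate names."""
    samples: dict[str, str] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(fields)}"
            )
        name, bam_path = fields
        if name in samples:
            raise ValueError(f"line {line_no}: duplicate sample name {name!r}")
        samples[name] = bam_path
    return samples


fof_text = format_bam_fof(
    {"sample1": "/path/to/sample1.bam", "sample2": "/path/to/sample2.bam"}
)
parsed = parse_bam_fof(fof_text)
```

Writing `fof_text` to `bam_files.txt` gives a file in the layout above; a common mistake this check catches is separating the columns with spaces instead of a tab.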
Then use:
```bash
gbcms count run \
    --fasta reference.fa \
    --bam-fof bam_files.txt \
    --vcf variants.vcf \
    --output counts.txt
```
### Commands
gbcms uses subcommands for different operations:
- `gbcms count run`: Run base counting on variants (main command)
- `gbcms validate files`: Validate input files before processing
- `gbcms version`: Show version information
- `gbcms info`: Show tool capabilities and information
### Count Run Options
#### Required Arguments
- `--fasta`, `-f`: Reference genome FASTA file (must be indexed with .fai)
- `--output`, `-o`: Output file path
#### BAM Input (at least one required)
- `--bam`, `-b`: BAM file in format `SAMPLE_NAME:BAM_FILE` (can be specified multiple times)
- `--bam-fof`: File containing sample names and BAM paths (tab-separated)
#### Variant Input (one required)
- `--maf`: Input variant file in MAF format (can be specified multiple times)
- `--vcf`: Input variant file in VCF format (can be specified multiple times)
#### Output Options
- `--omaf`: Output in MAF format (only with MAF input)
- `--positive-count/--no-positive-count`: Output positive strand counts (default: True)
- `--negative-count/--no-negative-count`: Output negative strand counts (default: False)
- `--fragment-count/--no-fragment-count`: Output fragment counts (default: False)
- `--fragment-fractional-weight`: Use fractional weights for fragments (default: False)
#### Quality Filters
- `--maq`: Mapping quality threshold (default: 20)
- `--baq`: Base quality threshold (default: 0)
- `--filter-duplicate`: Filter duplicate reads (default: True)
- `--filter-improper-pair`: Filter improperly paired reads (default: False)
- `--filter-qc-failed`: Filter QC failed reads (default: False)
- `--filter-indel/--no-filter-indel`: Filter reads with indels (default: False)
- `--filter-non-primary/--no-filter-non-primary`: Filter non-primary alignments (default: False)
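To make the read-level filters concrete, the sketch below expresses them as checks on the SAM flag, mapping quality, and CIGAR string of a read. It is an illustration under stated assumptions, not the package's code: `ReadFilters` and `keep_read` are hypothetical names, reads with mapping quality below `--maq` are assumed to be dropped, "non-primary" is assumed to cover both secondary and supplementary alignments, and the per-base `--baq` check is not shown. The flag bit values themselves come from the SAM specification.

```python
from dataclasses import dataclass

# Flag bits defined by the SAM specification.
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


@dataclass
class ReadFilters:
    """Hypothetical container mirroring the CLI defaults listed above."""

    maq: int = 20
    filter_duplicate: bool = True
    filter_improper_pair: bool = False
    filter_qc_failed: bool = False
    filter_indel: bool = False
    filter_non_primary: bool = False


def keep_read(flag: int, mapq: int, cigar: str, filters: ReadFilters) -> bool:
    """Return True if the read survives every enabled filter."""
    if mapq < filters.maq:  # assumption: threshold is inclusive
        return False
    if filters.filter_duplicate and (flag & FLAG_DUPLICATE):
        return False
    # Assumption: only paired reads can be "improperly paired".
    if filters.filter_improper_pair and (flag & FLAG_PAIRED) and not (flag & FLAG_PROPER_PAIR):
        return False
    if filters.filter_qc_failed and (flag & FLAG_QC_FAIL):
        return False
    # Assumption: secondary and supplementary alignments both count as non-primary.
    if filters.filter_non_primary and (flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY)):
        return False
    # Insertions and deletions appear as I and D operations in the CIGAR string.
    if filters.filter_indel and ("I" in cigar or "D" in cigar):
        return False
    return True
```

With the defaults, a duplicate-flagged read is dropped while a secondary alignment with good mapping quality is kept, which matches the default values listed above.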
#### Performance Options
- `--thread`, `-t`: Number of threads (default: 1)
- `--max-block-size`: Maximum variants per block (default: 10000)
- `--max-block-dist`: Maximum block distance in bp (default: 100000)
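One way to picture how `--max-block-size` and `--max-block-dist` interact is a greedy grouping of sorted variants. The sketch below is an assumption about the general technique, not the package's implementation: `Variant` and `make_blocks` are hypothetical names, the input is assumed to be sorted by chromosome and position, and distance is assumed to be measured from the first variant in the block.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int


def make_blocks(
    variants: list[Variant],
    max_block_size: int = 10000,
    max_block_dist: int = 100000,
) -> list[list[Variant]]:
    """Greedily group sorted variants into blocks bounded by size and span."""
    blocks: list[list[Variant]] = []
    current: list[Variant] = []
    for variant in variants:
        if current:
            first = current[0]
            same_chrom = variant.chrom == first.chrom
            within_dist = variant.pos - first.pos <= max_block_dist
            # Start a new block when any limit would be exceeded.
            if len(current) >= max_block_size or not same_chrom or not within_dist:
                blocks.append(current)
                current = []
        current.append(variant)
    if current:
        blocks.append(current)
    return blocks


variants = [
    Variant("chr1", 100),
    Variant("chr1", 200),
    Variant("chr1", 200000),
    Variant("chr2", 50),
]
blocks = make_blocks(variants)
```

Here the first two variants share a block, the third starts a new one because it lies more than 100000 bp from the block start, and the fourth starts another because it is on a different chromosome.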
#### Advanced Options
- `--generic-counting`: Use generic counting algorithm for complex variants
- `--suppress-warning`: Maximum warnings per type (default: 3)
- `--verbose`, `-v`: Enable verbose logging
### Validate Files Options
- `--fasta`, `-f`: Reference FASTA file to validate
- `--bam`, `-b`: BAM files to validate (SAMPLE:PATH format, multiple allowed)
- `--vcf`: VCF files to validate (multiple allowed)
- `--maf`: MAF files to validate (multiple allowed)
## Output Format
### VCF Format (Proper VCF with INFO fields)
**Extension**: `.vcf`
**Structure**: Standard VCF format with count and strand bias information in FORMAT and INFO fields
**Example**:
```vcf
##fileformat=VCFv4.2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth across all samples">
##INFO=<ID=SB,Number=3,Type=Float,Description="Strand bias p-value, odds ratio, direction">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total depth for this sample">
##FORMAT=<ID=SB,Number=3,Type=Float,Description="Strand bias for this sample">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
chr1 100 . A T . . DP=95;SB=0.001234,2.5,reverse DP:RD:AD:SB 50:30:20:0.001234,2.5,reverse 45:25:20:0.95,1.2,none
```
**INFO Fields**:
- `DP`: Total depth across all samples
- `SB`: Strand bias (p-value,odds_ratio,direction) - most significant across samples
- `FSB`: Fragment strand bias (when fragment counting enabled)
**FORMAT Fields**:
- `DP`: Total depth for this sample
- `RD`: Reference allele depth for this sample
- `AD`: Alternate allele depth for this sample
- `DPP`: Positive strand depth (if enabled)
- `RDP`: Positive strand reference depth (if enabled)
- `ADP`: Positive strand alternate depth (if enabled)
- `DPF`: Fragment depth (if enabled)
- `RDF`: Fragment reference depth (if enabled)
- `ADF`: Fragment alternate depth (if enabled)
- `SB`: Strand bias (p-value,odds_ratio,direction) for this sample
- `FSB`: Fragment strand bias for this sample (if enabled)
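Downstream scripts usually need the three `SB` components as separate values. The following sketch parses the INFO string from the example record; `parse_info` and `parse_strand_bias` are hypothetical helper names, and the only format assumption is the comma-separated `p-value,odds_ratio,direction` layout described above.

```python
def parse_info(info: str) -> dict[str, str]:
    """Split a VCF INFO string such as 'DP=95;SB=...' into a key/value dict."""
    fields: dict[str, str] = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def parse_strand_bias(value: str) -> tuple[float, float, str]:
    """Split 'p-value,odds_ratio,direction' into typed components."""
    p_value, odds_ratio, direction = value.split(",")
    return float(p_value), float(odds_ratio), direction


info = parse_info("DP=95;SB=0.001234,2.5,reverse")
depth = int(info["DP"])
p_value, odds_ratio, direction = parse_strand_bias(info["SB"])
```

For production use, a dedicated VCF parser is a better choice than string splitting; this only shows how the field is laid out.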
### MAF Format
When using `--omaf`, the output maintains the MAF format with updated count columns for tumor and normal samples, including strand bias information.
## Development
### Setup Development Environment
```bash
# Clone the repository
git clone https://github.com/msk-access/py-gbcms.git
cd py-gbcms
# Install with development dependencies (includes scipy-stubs for type checking)
uv pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
```
### Running Tests
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=gbcms --cov-report=html
# Run specific test file
pytest tests/test_counter.py -v
```
### Code Quality
```bash
# Format code
black src/ tests/
# Lint code
ruff check src/ tests/
# Type checking (requires scipy-stubs)
mypy src/
```
### Building Docker Image
```bash
# Build the image
docker build -t gbcms:latest .
# Run tests in container
docker run --rm gbcms:latest pytest
```
## Performance Comparison
Compared to the original C++ implementation:
| Feature | C++ Version | Python (Smart Hybrid) | Python (Pure counter.py) |
|---------|-------------|----------------------|-------------------------|
| **Simple SNPs** | ~1x | **~50-100x faster** | ~0.8-1.2x |
| **Complex Variants** | ~1x | **~1x** | ~0.8-1.2x |
| **Overall** | Baseline | **~10-50x faster** | ~0.8-1.2x |
| Memory | Baseline | ~1.2x | ~1.2x |
| Multi-threading | OpenMP | joblib/Ray | joblib/Ray |
| Dependencies | bamtools, zlib | pysam, numpy, numba, joblib, ray |
| Scalability | Single machine | Single machine | Multi-node clusters |
**Smart Hybrid Strategy:**
- **Simple SNPs**: Uses `numba_counter` (50-100x faster than C++)
- **Complex variants**: Uses `counter.py` (C++ equivalent accuracy)
- **Automatic selection**: Optimal algorithm chosen per variant type

**Notes:**
- Performance varies with workload and Python version; Python 3.11+ shows significant improvements.
- The Smart Hybrid figures assume Numba JIT compilation and automatic algorithm selection. See [Fast VCF Parsing](docs/CYVCF2_SUPPORT.md) for benchmarks.
## Architecture
The package is organized into distinct modules with clear responsibilities:
```
py-gbcms/
├── src/gbcms/
│   ├── __init__.py
│   ├── cli.py               # 🎨 Typer CLI interface with Rich
│   ├── config.py            # ⚙️ Configuration dataclasses (legacy)
│   ├── models.py            # 🔒 Pydantic models for type safety ⭐
│   ├── variant.py           # 📄 Variant loading (VCF/MAF)
│   ├── counter.py           # 🐢 Pure Python counting (baseline)
│   ├── numba_counter.py     # ⚡ Numba-optimized counting (50-100x faster) ⭐
│   ├── parallel.py          # 🔄 joblib/Ray parallelization ⭐
│   ├── reference.py         # 🧬 Reference sequence handling
│   ├── output.py            # 📤 Output formatting with strand bias ⭐
│   └── processor.py         # 🎯 Main processing pipeline
├── tests/                   # 🧪 Comprehensive test suite
├── scripts/                 # 🛠️ Setup and verification scripts
├── Dockerfile               # 🐳 Production container
├── pyproject.toml           # 📦 Package configuration
└── docs/
    ├── README.md            # Main documentation
    ├── ARCHITECTURE.md      # Module relationships ⭐
    ├── INSTALLATION.md      # Setup guide
    ├── QUICKSTART.md        # 5-minute start
    ├── ADVANCED_FEATURES.md # Pydantic, Numba, Ray ⭐
    └── CLI_FEATURES.md      # CLI documentation
```
### Module Relationships
```
CLI (cli.py)
    ↓
Configuration (models.py)
    ↓
Processor (processor.py)
    ├─→ Variant Loader (variant.py)
    ├─→ Reference (reference.py)
    ├─→ Smart Counting Engine ⭐
    │   ├─→ counter.py (Pure Python - accurate, slower)
    │   └─→ numba_counter.py (JIT compiled - 50-100x faster for SNPs) ⭐
    ├─→ Strand Bias Analysis (counter.py + output.py) ⭐
    ├─→ Parallelization (parallel.py)
    └─→ Output (output.py)
```
**Key Algorithm Selection:**
- **`Smart Hybrid Strategy`**: Automatically chooses optimal algorithm per variant
- **`counter.py`**: Pure Python, easy to debug, baseline performance
- **`numba_counter.py`**: JIT-compiled, 50-100x faster for simple SNPs, for production
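The per-variant choice described in this list can be pictured as a small dispatch function. This is a sketch of the idea only: `is_simple_snp` and `choose_counter` are hypothetical names, and the definition of a "simple SNP" (one reference base replaced by one different base) is an assumption rather than the package's documented rule.

```python
BASES = {"A", "C", "G", "T"}


def is_simple_snp(ref: str, alt: str) -> bool:
    """Assumed definition: a single base substituted by a different single base."""
    ref, alt = ref.upper(), alt.upper()
    return ref in BASES and alt in BASES and ref != alt


def choose_counter(ref: str, alt: str) -> str:
    """Return the name of the module that would handle this variant."""
    return "numba_counter" if is_simple_snp(ref, alt) else "counter"


snp_choice = choose_counter("A", "T")
deletion_choice = choose_counter("AT", "A")
insertion_choice = choose_counter("A", "AT")
```

Routing on allele shape keeps the JIT-compiled path limited to the case it handles well, while insertions, deletions, and multi-base substitutions fall through to the general implementation.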
**New Features:**
- **Strand Bias Analysis**: Statistical detection using Fisher's exact test ⭐
- **Enhanced Output**: VCF/MAF formats with strand bias columns ⭐
See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed module relationships.
## Documentation
📚 **[Complete Documentation](docs/README.md)** | [Quick Start](docs/QUICKSTART.md) | [Contributing Guide](CONTRIBUTING.md) | [Package Structure](docs/PACKAGE_STRUCTURE.md) | [Testing Guide](docs/TESTING_GUIDE.md)
- **User Guide**
  - [Input & Output Formats](docs/INPUT_OUTPUT.md)
- **Advanced**
  - [Advanced Features](docs/ADVANCED_FEATURES.md)
  - [Parallelization Guide](docs/PARALLELIZATION_GUIDE.md)
  - [Fast VCF Parsing (cyvcf2)](docs/CYVCF2_SUPPORT.md)
- **Reference**
  - [Architecture](docs/ARCHITECTURE.md)
  - [C++ Comparison](docs/CPP_FEATURE_COMPARISON.md)
  - [FAQ](docs/FAQ.md)
- **Docker & Deployment**
  - [Docker Guide](docs/DOCKER_GUIDE.md)
## Advanced Features
py-gbcms includes several advanced features for performance and scalability:
### 🔒 Type Safety with Pydantic
```python
from pathlib import Path

from gbcms.models import GetBaseCountsConfig

# Runtime validation of all inputs
# ([...] marks placeholders for your own BAM and variant file entries)
config = GetBaseCountsConfig(
    fasta_file=Path("reference.fa"),
    bam_files=[...],
    variant_files=[...],
)
```
### 🔬 Strand Bias Analysis
```python
# Strand bias is automatically calculated for all variants
# Uses Fisher's exact test for statistical rigor
from gbcms.counter import BaseCounter

# `config` is the GetBaseCountsConfig built in the Pydantic example above
counter = BaseCounter(config)
# Strand bias is calculated during counting and included in the output
```
**Features:**
- **Fisher's exact test** for statistically sound strand bias detection
- **Automatic direction detection** (forward, reverse, or none)
- **Minimum depth filtering** (10 reads) for reliable calculations
- **VCF and MAF output** with strand bias columns
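The statistics behind these bullet points can be shown with a small worked example. This is a sketch under stated assumptions, not the package's implementation: py-gbcms lists scipy as its dependency for this test, while the version below uses only the standard library so it runs without extra installs. The 2x2 table layout (rows: reference and alternate allele; columns: forward and reverse strand), the function names, and applying the 10-read minimum to the total depth are all assumptions. Direction detection is not shown.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_observed * (1 + 1e-7)
    )
    return min(total, 1.0)


def strand_bias(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int, min_depth: int = 10):
    """Return (p_value, odds_ratio), or None when total depth is below min_depth."""
    if ref_fwd + ref_rev + alt_fwd + alt_rev < min_depth:
        return None
    p_value = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    cross = ref_rev * alt_fwd
    if cross == 0:
        odds_ratio = float("inf") if ref_fwd * alt_rev > 0 else float("nan")
    else:
        odds_ratio = (ref_fwd * alt_rev) / cross
    return p_value, odds_ratio


result = strand_bias(ref_fwd=8, ref_rev=2, alt_fwd=1, alt_rev=5)
```

In this example the reference allele sits mostly on the forward strand and the alternate allele mostly on the reverse strand, so the test returns a small p-value; a site with fewer than 10 reads returns `None` instead of an unreliable estimate.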
### ⚡ Performance with Smart Hybrid Strategy
```python
# Automatic algorithm selection based on variant complexity
from gbcms.counter import BaseCounter

# `config` is the GetBaseCountsConfig built in the Pydantic example above
counter = BaseCounter(config)
# Automatically uses:
# - numba_counter for simple SNPs (50-100x faster)
# - counter.py for complex variants (maximum accuracy)
```
### 🔄 Parallelization with joblib
```bash
# Use joblib for efficient local parallelization
gbcms count run --thread 16 --backend joblib ...
```
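The underlying pattern is to split the work into independent units and map a worker function over them. The sketch below shows that pattern with the standard library's `concurrent.futures` as a stand-in, so it runs without joblib installed; it is not how py-gbcms is implemented, and `count_block` and `run_parallel` are hypothetical names with a placeholder workload.

```python
from concurrent.futures import ThreadPoolExecutor


def count_block(block: list) -> int:
    """Placeholder for per-block work; here it only counts the items."""
    return len(block)


def run_parallel(blocks: list[list], n_threads: int = 1) -> list[int]:
    """Process blocks concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Executor.map preserves input order, so results line up with blocks.
        return list(pool.map(count_block, blocks))


results = run_parallel([["v1", "v2"], ["v3"], []], n_threads=2)
```

With joblib, the same shape is usually written as `Parallel(n_jobs=n)(delayed(count_block)(b) for b in blocks)`.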
### 🌐 Distributed Computing with Ray
```bash
# Install with Ray support
uv pip install "py-gbcms[ray]"
# Use Ray for distributed processing
gbcms count run --thread 32 --backend ray --use-ray ...
```
See [ADVANCED_FEATURES.md](docs/ADVANCED_FEATURES.md) for detailed documentation and benchmarks.
## Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Make your changes with tests
4. Run the test suite and linters
5. Submit a pull request
## Citation
If you use this tool in your research, please cite:
```
GetBaseCountsMultiSample: A tool for calculating base counts in multiple BAM files
MSK-ACCESS Team
https://github.com/msk-access/GetBaseCountsMultiSample
```
## License
AGPL-3.0 License - See [LICENSE](LICENSE) for details.
## Support
- 🐛 Report bugs: [GitHub Issues](https://github.com/msk-access/py-gbcms/issues)
- 💬 Ask questions: [GitHub Discussions](https://github.com/msk-access/py-gbcms/discussions)
- 📧 Email: shahr2@mskcc.org
## Acknowledgments
This is a Python reimplementation of the original C++ tool developed by the MSK-ACCESS team. Special thanks to the original authors and contributors.