<div align="right">
<img src="biometal_logo.png" alt="biometal logo" width="150"/>
</div>
# biometal
**ARM-native bioinformatics library with streaming architecture and evidence-based optimization**
[crates.io](https://crates.io/crates/biometal) | [docs.rs](https://docs.rs/biometal) | [License](https://github.com/shandley/biometal#license)
---
## What Makes biometal Different?
Most bioinformatics tools require you to download entire datasets before analysis. **biometal** streams data directly from the network, enabling analysis of terabyte-scale datasets on consumer hardware without downloading.
### Key Features
1. **Streaming Architecture** (Rule 5)
   - Constant ~5 MB memory footprint regardless of dataset size
   - Analyze 5TB datasets on laptops without downloading
   - 99.5% memory reduction compared to batch processing

2. **ARM-Native Performance** (Rule 1)
   - 16-25× speedup using ARM NEON SIMD
   - Works across Mac (Apple Silicon), AWS Graviton, Ampere, Raspberry Pi
   - Automatic fallback to scalar code on x86_64

3. **Network Streaming** (Rule 6)
   - Stream directly from HTTP/HTTPS sources
   - SRA toolkit integration (no local copy needed)
   - Smart LRU caching minimizes network requests
   - Background prefetching hides latency

4. **Intelligent I/O** (Rules 3-4)
   - 6.5× speedup from parallel bgzip decompression
   - Additional 2.5× from memory-mapped I/O (large files on macOS)
   - Combined 16.3× I/O speedup

5. **Evidence-Based Design**
   - Every optimization validated with statistical rigor (N=30, 95% CI)
   - 1,357 experiments, 40,710 measurements
   - Full methodology: [apple-silicon-bio-bench](https://github.com/shandley/apple-silicon-bio-bench)
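
The constant-memory behaviour listed under Streaming Architecture comes from handling one record at a time instead of loading the whole file. A minimal pure-Python sketch of that pattern follows. It is not biometal's implementation, and `FastqRecord` and `iter_fastq` are hypothetical names used only for illustration:

```python
import gzip
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical stand-in, not biometal's record type)."""
    id: str
    sequence: bytes
    quality: bytes


def iter_fastq(path: str) -> Iterator[FastqRecord]:
    """Yield records one at a time so only the current record is held in memory."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        while True:
            header = handle.readline()
            if not header:
                return  # clean end of file
            if not header.strip():
                continue  # tolerate blank lines between records
            sequence = handle.readline().rstrip(b"\r\n")
            plus = handle.readline()
            quality = handle.readline().rstrip(b"\r\n")
            if not header.startswith(b"@") or not plus.startswith(b"+"):
                raise ValueError("malformed FASTQ record")
            if len(sequence) != len(quality):
                raise ValueError("sequence and quality lengths differ")
            yield FastqRecord(header[1:].strip().decode(), sequence, quality)
```
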
---
## Quick Start
### Rust Installation
```toml
[dependencies]
biometal = "1.0"
```
### Python Installation
```bash
# Option 1: Install from PyPI
pip install biometal-rs
# Option 2: Build from source (requires Rust toolchain)
pip install maturin
git clone https://github.com/shandley/biometal
cd biometal
maturin develop --release --features python
```
**Requirements**:
- Python 3.9+ (tested on 3.14)
- Rust toolchain (for building from source)
---
## Usage
### Rust: Basic Usage
```rust
use biometal::FastqStream;
// Stream FASTQ from local file (constant memory)
let stream = FastqStream::from_path("large_dataset.fq.gz")?;
for record in stream {
    let record = record?;
    // Process one record at a time
    // Memory stays constant at ~5 MB
}
```
### Network Streaming
```rust
use biometal::io::DataSource;
use biometal::FastqStream;
// Stream directly from URL (no download!)
let source = DataSource::Http("https://example.com/huge_dataset.fq.gz".to_string());
let stream = FastqStream::new(source)?;
// Analyze 5TB dataset without downloading
for record in stream {
    let record = record?;
    // Smart caching + prefetching in background
}
```
### SRA Streaming (No Download!)
```rust
use biometal::io::DataSource;
use biometal::FastqStream;
// Stream directly from NCBI SRA (no local download!)
let source = DataSource::Sra("SRR390728".to_string()); // E. coli dataset
let stream = FastqStream::new(source)?;
for record in stream {
    let record = record?;
    // Process 40 MB dataset with only ~5 MB memory
    // Background prefetching hides network latency
}
```
### Operations with Auto-Optimization
```rust
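use biometal::operations;
// Example input; any byte slice of bases works
let sequence = b"ATGCATGC".to_vec();
// ARM NEON automatically enabled on ARM platforms
// (function names and signatures follow the SRA example later in this README)
let counts = operations::count_bases(&sequence);
let gc = operations::gc_content(&sequence);
// 16-25× faster on ARM, automatic scalar fallback on x86_64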
```
### Python: Basic Usage
```python
import biometal
# Stream FASTQ from local file (constant memory)
stream = biometal.FastqStream.from_path("large_dataset.fq.gz")
for record in stream:
    # Process one record at a time
    # Memory stays constant at ~5 MB
    gc = biometal.gc_content(record.sequence)
    print(f"{record.id}: GC={gc:.2%}")
```
### Python: ARM NEON Operations
```python
import biometal
# ARM NEON automatically enabled on ARM platforms
# 16-25× faster on Mac ARM, automatic scalar fallback on x86_64
# GC content calculation
sequence = b"ATGCATGC"
gc = biometal.gc_content(sequence) # 20.3× speedup on ARM
# Base counting
counts = biometal.count_bases(sequence) # 16.7× speedup on ARM
print(f"A:{counts['A']}, C:{counts['C']}, G:{counts['G']}, T:{counts['T']}")
# Quality scoring (sample quality string; in a pipeline, pass record.quality)
quality = b"IIIIIIII"
mean_q = biometal.mean_quality(quality) # 25.1× speedup on ARM
# K-mer extraction (for ML preprocessing)
kmers = biometal.extract_kmers(sequence, k=6)
print(f"6-mers: {kmers}")
```
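
When checking results from the accelerated functions above, a plain-Python reference can be handy. The sketch below rests on assumptions that are not confirmed by this README: `gc_content` is taken to return a fraction between 0 and 1, `count_bases` a dict keyed by `A`, `C`, `G`, `T`, and quality strings are taken to be Phred+33 encoded. The `*_reference` names are hypothetical and not part of biometal:

```python
def gc_content_reference(sequence: bytes) -> float:
    """Fraction of G and C bases (case-insensitive); 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count(b"G") + upper.count(b"C")) / len(sequence)


def count_bases_reference(sequence: bytes) -> dict:
    """Counts of A, C, G and T; other symbols such as N are ignored."""
    upper = sequence.upper()
    return {base: upper.count(base.encode()) for base in "ACGT"}


def mean_quality_reference(quality: bytes, offset: int = 33) -> float:
    """Mean Phred score, assuming Phred+33 ASCII encoding by default."""
    if not quality:
        return 0.0
    return sum(quality) / len(quality) - offset
```

Comparing these against the library on a few hundred reads is a cheap sanity check before trusting throughput numbers.
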
### Python: Example Workflow
```python
import biometal
# Analyze FASTQ file with streaming (constant memory)
stream = biometal.FastqStream.from_path("data.fq.gz")
total_reads = 0
total_bases = 0
total_gc = 0.0
high_quality = 0
for record in stream:
    total_reads += 1
    # Count bases (ARM NEON accelerated)
    counts = biometal.count_bases(record.sequence)
    total_bases += sum(counts.values())
    # Calculate GC content (ARM NEON accelerated)
    gc = biometal.gc_content(record.sequence)
    total_gc += gc
    # Check quality (ARM NEON accelerated)
    if biometal.mean_quality(record.quality) > 30.0:
        high_quality += 1
print(f"Total bases: {total_bases}")
# Average over the reads counted in the loop (a consumed stream cannot report its length)
if total_reads:
    print(f"Average GC: {total_gc / total_reads:.2%}")
print(f"High quality reads: {high_quality}")
```
---
## Performance
### Memory Efficiency
| Dataset Size | Traditional | biometal | Reduction |
|--------------|-------------|----------|-----------|
| 100K sequences | 134 MB | 5 MB | 96.3% |
| 1M sequences | 1,344 MB | 5 MB | 99.5% |
| **5TB dataset** | **5,000 GB** | **5 MB** | **99.9999%** |
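
The Reduction column is `1 - biometal / traditional`, expressed as a percentage. A small sketch of that arithmetic, with `reduction_percent` as a hypothetical helper name and the 5 TB row taken as 5,000,000 MB (decimal units):

```python
def reduction_percent(traditional_mb: float, streaming_mb: float) -> float:
    """Percentage of memory saved relative to the traditional footprint."""
    return 100.0 * (1.0 - streaming_mb / traditional_mb)


# (traditional MB, streaming MB) pairs taken from the table above
rows = {
    "100K sequences": (134, 5),
    "1M sequences": (1_344, 5),
    "5TB dataset": (5_000_000, 5),
}
for name, (traditional, streaming) in rows.items():
    print(f"{name}: {reduction_percent(traditional, streaming):.4f}%")
```

With the footprint taken as exactly 5 MB, the 1M-sequence row works out to about 99.6%; the table's 99.5% would be consistent with a measured footprint slightly above 5 MB.
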
### ARM NEON Speedup (Mac Apple Silicon)
**Optimized for Apple Silicon** - All optimizations validated on Mac M3 Max (1,357 experiments, N=30):
| Operation | Scalar | NEON | Speedup |
|-----------|--------|------|---------|
| Base counting | 315 Kseq/s | 5,254 Kseq/s | **16.7×** |
| GC content | 294 Kseq/s | 5,954 Kseq/s | **20.3×** |
| Quality filter | 245 Kseq/s | 6,143 Kseq/s | **25.1×** |
### Cross-Platform Performance (Validated Nov 2025)
| Platform | Base Counting | GC Content | Quality | Status |
|----------|---------------|------------|---------|--------|
| **Mac M3** (target) | 16.7× | 20.3× | 25.1× | ✅ Optimized |
| **AWS Graviton** | 10.7× | 6.9× | 1.9× | ✅ Works (portable) |
| **x86_64 Intel** | 1.0× | 1.0× | 1.0× | ✅ Works (portable) |
**Note**: biometal is optimized for Mac ARM, in line with its goal of making consumer hardware sufficient for genomics work. Other platforms run correct, production-ready code but have not been specifically tuned. See [Cross-Platform Testing Results](results/cross_platform/FINDINGS.md) for details.
### I/O Optimization
| File Size | Standard | Optimized | Speedup |
|-----------|----------|-----------|---------|
| Small (<50 MB) | 12.3s | 1.9s | 6.5× |
| Large (≥50 MB) | 12.3s | 0.75s | **16.3×** |
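
The speedups in this table are ratios of wall-clock times, and the large-file figure is the product of the two independent stages (parallel bgzip, then memory-mapped I/O). A short sketch of both calculations, using only numbers quoted in this README (`speedup` is a hypothetical helper name). The timing ratio gives about 16.4 and the product gives 16.25, both close to the quoted 16.3×:

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Speedup factor: how many times faster the optimized run is."""
    return baseline_s / optimized_s


small = speedup(12.3, 1.9)    # parallel bgzip only
large = speedup(12.3, 0.75)   # parallel bgzip plus memory-mapped I/O
combined = 6.5 * 2.5          # per-stage factors multiply when stages are independent

print(f"small: {small:.1f}x, large: {large:.1f}x, combined: {combined:.2f}x")
```
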
---
## Democratizing Bioinformatics
biometal addresses four barriers that lock researchers out of genomics:
### 1. Economic Barrier
- **Problem**: Most tools require $50K+ servers
- **Solution**: Consumer ARM laptops ($1,400) deliver production performance
- **Impact**: Small labs and LMIC researchers can compete
### 2. Environmental Barrier
- **Problem**: HPC clusters consume massive energy (300× excess for many workloads)
- **Solution**: ARM efficiency inherent in architecture
- **Impact**: Reduced carbon footprint for genomics research
### 3. Portability Barrier
- **Problem**: Vendor lock-in (x86-only, cloud-only tools)
- **Solution**: Works across ARM ecosystem (Mac, Graviton, Ampere, RPi)
- **Impact**: No platform dependencies, true portability
### 4. Data Access Barrier ⭐
- **Problem**: 5TB datasets require 5TB storage + days to download
- **Solution**: Network streaming with smart caching
- **Impact**: Analyze 5TB datasets on 24GB laptops without downloading
---
## Evidence Base
biometal's design is grounded in comprehensive experimental validation:
- **Experiments**: 1,357 total (40,710 measurements with N=30)
- **Statistical rigor**: 95% confidence intervals, Cohen's d effect sizes
- **Cross-platform**: Mac M3 Max, AWS Graviton 3
- **Lab notebook**: 33 entries documenting full experimental log
See [OPTIMIZATION_RULES.md](OPTIMIZATION_RULES.md) for detailed evidence links.
**Full methodology**: [apple-silicon-bio-bench](https://github.com/shandley/apple-silicon-bio-bench)
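
For readers unfamiliar with the statistics named above, here is a minimal sketch of a 95% confidence interval and Cohen's d for N=30 timing samples, using only the standard library. It illustrates the general technique, not the benchmark suite's actual code; the function names are hypothetical, and 2.045 is the two-sided Student-t critical value for 29 degrees of freedom:

```python
import math
import statistics

T_CRIT_95_DF29 = 2.045  # two-sided 95% Student-t critical value for N=30 (df=29)


def mean_ci95(samples):
    """Return (mean, half-width of the 95% confidence interval) for 30 samples."""
    if len(samples) != 30:
        raise ValueError("the critical value above is only valid for N=30")
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, T_CRIT_95_DF29 * sem


def cohens_d(a, b):
    """Cohen's d effect size using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)
```

An optimization would typically be accepted only when the intervals of the scalar and optimized runs do not overlap and the effect size is large.
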
**Publications** (in preparation):
1. DAG Framework: BMC Bioinformatics
2. biometal Library: Bioinformatics (Application Note) or JOSS
3. Four-Pillar Democratization: GigaScience
---
## Platform Support
### Optimization Strategy
biometal is **optimized for Mac ARM** (M1/M2/M3/M4) based on 1,357 experiments on Mac M3 Max. This aligns with our democratization mission: enable world-class bioinformatics on **affordable consumer hardware** ($1,000-2,000 MacBooks, not $50,000 servers).
Other platforms are **supported with portable, correct code** but not specifically optimized:
| Platform | Performance | Test Status | Strategy |
|----------|-------------|-------------|----------|
| **Mac ARM** (M1/M2/M3/M4) | **16-25× speedup** | ✅ 121/121 tests pass | **Optimized** (target platform) |
| **AWS Graviton** | 1.9-10.7× speedup | ✅ 121/121 tests pass | Portable (works well) |
| **Linux x86_64** | 1× (scalar) | ✅ 118/118 tests pass | Portable (fallback) |
### Feature Support Matrix
| Feature | macOS ARM | Linux ARM | Linux x86_64 |
|---------|-----------|-----------|--------------|
| ARM NEON SIMD | ✅ | ✅ | ❌ (scalar fallback) |
| Parallel Bgzip | ✅ | ✅ | ✅ |
| Smart mmap | ✅ | ⏳ | ❌ |
| Network Streaming | ✅ | ✅ | ✅ |
| Python Bindings | ✅ | ✅ | ✅ |
**Validation**: Cross-platform testing completed Nov 2025 on AWS Graviton 3 and x86_64. All tests pass. See [results/cross_platform/FINDINGS.md](results/cross_platform/FINDINGS.md) for full details.
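
To see which column of the matrix applies to the current machine from Python, the architecture string is enough. This sketch only inspects `platform.machine()`; `is_arm64` is a hypothetical helper and not part of biometal:

```python
import platform
from typing import Optional


def is_arm64(machine: Optional[str] = None) -> bool:
    """True when the architecture string indicates 64-bit ARM (NEON available)."""
    name = (machine or platform.machine()).lower()
    return name in {"arm64", "aarch64"}


if __name__ == "__main__":
    path = "NEON" if is_arm64() else "scalar fallback"
    print(f"{platform.system()} {platform.machine()}: expected code path = {path}")
```
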
---
## Roadmap
**v1.0.0** (Released November 5, 2025) ✅
- Streaming FASTQ/FASTA parsers (constant memory)
- ARM NEON operations (16-25× speedup)
- Network streaming (HTTP/HTTPS, SRA)
- Python bindings (PyO3 0.27, Python 3.9-3.14)
- Cross-platform validation (Mac ARM, Graviton, x86_64)
- Production-grade quality (121 tests, Grade A+)
**Future Considerations** (Community Driven)
- Extended operation coverage (alignment, assembly)
- Additional format support (BAM/SAM, VCF)
- Publish to crates.io and PyPI
- Metal GPU acceleration (Mac-specific)
---
## SRA Streaming: Analysis Without Downloads
One of biometal's most powerful features is direct streaming from NCBI's Sequence Read Archive (SRA) without local downloads.
### Why This Matters
**Traditional workflow:**
1. Download 5 GB SRA dataset → 30 minutes + 5 GB disk space
2. Decompress → 15 GB disk space
3. Process → Additional memory
4. **Total:** 45 minutes + 20 GB resources before analysis even starts
**biometal workflow:**
1. Start analysis immediately → 0 wait time, ~5 MB memory
2. Stream directly from NCBI S3 → No disk space needed
3. Background prefetching hides latency → Near-local performance
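
Background prefetching, as in step 3, generally means a worker fetches the next few blocks while the consumer processes the current one. The sketch below shows that idea with a bounded queue and a thread. It is an illustration under assumptions, not biometal's implementation: `prefetching` and `fetch_block` are hypothetical names, and `fetch_block` stands in for an HTTP range request:

```python
import queue
import threading
from typing import Callable, Iterator


def prefetching(
    fetch_block: Callable[[int], bytes], n_blocks: int, depth: int = 4
) -> Iterator[bytes]:
    """Yield blocks in order while a worker thread fetches up to `depth` ahead."""
    buffer: "queue.Queue" = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker() -> None:
        try:
            for index in range(n_blocks):
                buffer.put(fetch_block(index))  # blocks while the buffer is full
        except Exception as exc:
            buffer.put(exc)  # hand the error to the consumer
        else:
            buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

The bounded queue is what keeps memory constant: at most `depth` blocks are ever waiting, regardless of how large the remote file is.
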
### Supported Accessions
- **SRR** (Run): Most common, represents a sequencing run
- **SRX** (Experiment): Collection of runs
- **SRS** (Sample): Biological sample
- **SRP** (Study): Collection of experiments
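
The four prefixes above can be told apart with a simple pattern match. This is a hedged sketch of such a classifier, assuming an accession is one of those prefixes followed by digits; `accession_kind` is a hypothetical helper, and biometal's own validation rules may differ:

```python
import re
from typing import Optional

ACCESSION_KINDS = {"SRR": "run", "SRX": "experiment", "SRS": "sample", "SRP": "study"}
_ACCESSION_RE = re.compile(r"(SRR|SRX|SRS|SRP)[0-9]+")


def accession_kind(accession: str) -> Optional[str]:
    """Return 'run', 'experiment', 'sample' or 'study', or None if unrecognized."""
    match = _ACCESSION_RE.fullmatch(accession.strip().upper())
    return ACCESSION_KINDS[match.group(1)] if match else None
```
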
### Basic SRA Usage
```rust
use biometal::io::DataSource;
use biometal::operations::{count_bases, gc_content};
use biometal::FastqStream;
// Stream from SRA accession
let source = DataSource::Sra("SRR390728".to_string());
let stream = FastqStream::new(source)?;
for record in stream {
    let record = record?;
    // ARM NEON-optimized operations (16-25× speedup)
    let bases = count_bases(&record.sequence);
    let gc = gc_content(&record.sequence);
    // Memory: Constant ~5 MB
}
```
### Real-World Example: E. coli Analysis
```bash
# Run the E. coli streaming example
cargo run --example sra_ecoli --features network
# Process ~250,000 reads with only ~5 MB memory
# No download required!
```
See [examples/sra_ecoli.rs](examples/sra_ecoli.rs) for complete example.
### Performance Tuning
biometal automatically configures optimal settings for most use cases. For custom tuning:
```rust
use biometal::io::{HttpReader, sra_to_url};
let url = sra_to_url("SRR390728")?;
let reader = HttpReader::new(&url)?
    .with_prefetch_count(8) // Prefetch 8 blocks ahead
    .with_chunk_size(128 * 1024); // 128 KB chunks
// See docs/PERFORMANCE_TUNING.md for detailed guide
```
### SRA URL Conversion
```rust
use biometal::io::{is_sra_accession, sra_to_url};
// Check if string is SRA accession
if is_sra_accession("SRR390728") {
    // Convert to direct NCBI S3 URL
    let url = sra_to_url("SRR390728")?;
    // → https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR390728/SRR390728
}
```
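
The same mapping can be sketched in Python by following the URL pattern in the comment above. This covers run (SRR) accessions only and mirrors that single example; how biometal resolves SRX, SRS, or SRP accessions is not shown here, and `sra_run_url` is a hypothetical name:

```python
import re

SRA_BUCKET = "https://sra-pub-run-odp.s3.amazonaws.com/sra"


def sra_run_url(accession: str) -> str:
    """Build the S3 URL for a run accession, following the pattern shown above."""
    acc = accession.strip().upper()
    if not re.fullmatch(r"SRR[0-9]+", acc):
        raise ValueError(f"not a run accession: {accession!r}")
    return f"{SRA_BUCKET}/{acc}/{acc}"
```
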
### Memory Guarantees
- **Streaming buffer:** ~5 MB (constant)
- **LRU cache:** 50 MB (byte-bounded, automatic eviction)
- **Prefetch:** ~256 KB (4 blocks × 64 KB)
- **Total:** ~55 MB regardless of SRA file size
Compared with first downloading a 5 GB SRA file, this is roughly a **99% reduction** in local resources (about 55 MB versus 5,000 MB).
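
A byte-bounded LRU cache, as listed above, evicts the least recently used blocks once the total cached size exceeds a fixed budget. The following is a minimal sketch of that data structure built on `collections.OrderedDict`. It is not biometal's cache; `ByteBoundedLRU` is a hypothetical name, and the 50 MB budget would be passed as `max_bytes`:

```python
from collections import OrderedDict
from typing import Optional


class ByteBoundedLRU:
    """LRU cache that evicts least-recently-used blocks beyond a byte budget."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self._blocks: "OrderedDict[int, bytes]" = OrderedDict()

    def get(self, key: int) -> Optional[bytes]:
        block = self._blocks.get(key)
        if block is not None:
            self._blocks.move_to_end(key)  # mark as most recently used
        return block

    def put(self, key: int, block: bytes) -> None:
        old = self._blocks.pop(key, None)
        if old is not None:
            self.current_bytes -= len(old)
        if len(block) > self.max_bytes:
            return  # a block larger than the whole budget is never cached
        self._blocks[key] = block
        self.current_bytes += len(block)
        while self.current_bytes > self.max_bytes:
            _, evicted = self._blocks.popitem(last=False)  # oldest entry first
            self.current_bytes -= len(evicted)
```

Bounding by bytes rather than by entry count is what makes the ~55 MB total predictable: 5 MB streaming buffer + 50 MB cache + 0.25 MB prefetch.
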
### Examples
| Example | Dataset | Size | Demo |
|---------|---------|------|------|
| [sra_streaming.rs](examples/sra_streaming.rs) | Demo mode | N/A | Capabilities overview |
| [sra_ecoli.rs](examples/sra_ecoli.rs) | E. coli K-12 | ~40 MB | Real SRA streaming |
| [prefetch_tuning.rs](examples/prefetch_tuning.rs) | E. coli K-12 | ~40 MB | Performance tuning |
---
## Example Use Cases
### 1. Large-Scale Quality Control
```rust
use biometal::{FastqStream, operations};
// Stream 5TB dataset without downloading
let stream = FastqStream::from_url("https://sra.example.com/huge.fq.gz")?;
let mut total = 0;
let mut high_quality = 0;
for record in stream {
    let record = record?;
    total += 1;
    // ARM NEON accelerated (16-25×)
    if operations::mean_quality(&record.quality) > 30.0 {
        high_quality += 1;
    }
}
println!("High quality: {}/{} ({:.1}%)",
    high_quality, total, 100.0 * high_quality as f64 / total as f64);
```
### 2. BERT Preprocessing Pipeline
```rust
use biometal::{FastqStream, kmer};
// Stream from SRA (no local copy!)
let stream = FastqStream::from_sra("SRR12345678")?;
// Extract k-mers for DNABert training
for record in stream {
    let record = record?;
    let kmers = kmer::extract_overlapping(&record.sequence, 6)?;
    // Feed to BERT training pipeline
    // Constant memory even for TB-scale datasets
}
```
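
Overlapping k-mer extraction slides a window of length k across the read one base at a time, so a read of length n yields n - k + 1 k-mers. A pure-Python sketch of the technique, written as a generator so memory stays constant (`iter_kmers` is a hypothetical name, not biometal's API):

```python
from typing import Iterator


def iter_kmers(sequence: bytes, k: int) -> Iterator[bytes]:
    """Yield every overlapping k-mer; yields nothing if the read is shorter than k."""
    if k <= 0:
        raise ValueError("k must be positive")
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


if __name__ == "__main__":
    # Space-separated k-mer tokens, the usual input form for k-mer language models
    print(" ".join(kmer.decode() for kmer in iter_kmers(b"ATGCATGC", 6)))
```
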
### 3. Metagenomics Filtering
```rust
// Assumes FastqWriter is exported at the crate root alongside FastqStream
use biometal::{FastqStream, FastqWriter, operations};
let input = FastqStream::from_path("metagen.fq.gz")?;
let mut output = FastqWriter::create("filtered.fq.gz")?;
for record in input {
    let record = record?;
    // Filter low-complexity sequences (ARM NEON accelerated)
    if operations::complexity_score(&record.sequence) > 0.5 {
        output.write(&record)?;
    }
}
// Memory: constant ~5 MB
// Speed: 16-25× faster on ARM
```
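
This README does not define how `complexity_score` is computed. One common way to score sequence complexity is normalized Shannon entropy of the base composition, sketched below purely as an illustration. The formula and the name `shannon_complexity` are assumptions, and the library's score may be defined differently:

```python
import math
from collections import Counter


def shannon_complexity(sequence: bytes) -> float:
    """Base-composition entropy scaled to 0..1 (1.0 = all four bases equally frequent)."""
    if not sequence:
        return 0.0
    counts = Counter(sequence.upper())
    total = len(sequence)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # log2(4) = 2 bits is the maximum for A/C/G/T; clamp in case other symbols appear
    return min(entropy / 2.0, 1.0)
```
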
---
## Contributing
We welcome contributions! biometal is built on evidence-based optimization, so new features should:
1. Have clear use cases
2. Be validated experimentally (when adding optimizations)
3. Maintain platform portability
4. Follow the optimization rules in [OPTIMIZATION_RULES.md](OPTIMIZATION_RULES.md)
See [CLAUDE.md](CLAUDE.md) for development guidelines.
---
## License
Licensed under either of:
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
---
## Citation
If you use biometal in your research, please cite:
```bibtex
@software{biometal2025,
author = {Handley, Scott},
title = {biometal: ARM-native bioinformatics with streaming architecture},
year = {2025},
url = {https://github.com/shandley/biometal}
}
```
For the experimental methodology, see:
```bibtex
@misc{asbb2025,
author = {Handley, Scott},
title = {Apple Silicon Bio Bench: Systematic Hardware Characterization for Bioinformatics},
year = {2025},
url = {https://github.com/shandley/apple-silicon-bio-bench}
}
```
---
**Status**: v1.0.0 - Production Release 🎉
**Released**: November 5, 2025
**Grade**: A+ (rust-code-quality-reviewer)
**Tests**: 121 passing (87 unit + 7 integration + 27 doc)
**Evidence Base**: 1,357 experiments, 40,710 measurements
**Mission**: Democratizing bioinformatics compute
Raw data
{
"_id": null,
"home_page": "https://github.com/shandley/biometal",
"name": "biometal-rs",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "bioinformatics, genomics, arm, neon, streaming",
"author": null,
"author_email": "Scott Handley <scott.handley@example.com>",
"download_url": "https://files.pythonhosted.org/packages/30/38/1dbab783fc9e8f1be081f7ed5c09cdf8ca5fd385af0f7ca5bd4f4f6df2a2/biometal_rs-1.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT OR Apache-2.0",
"summary": "ARM-native bioinformatics library with streaming architecture and evidence-based optimization",
"version": "1.0.0",
"project_urls": {
"Documentation": "https://docs.rs/biometal",
"Homepage": "https://github.com/shandley/biometal",
"Repository": "https://github.com/shandley/biometal"
},
"split_keywords": [
"bioinformatics",
" genomics",
" arm",
" neon",
" streaming"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "88a3728ca697a620c5b1f6b4a1ca7cffcac8166cddc255757f15525f3e6d5bab",
"md5": "4d9d38085628a5d5537404e0674517e7",
"sha256": "ec1aa1ae6622f2321e8e3a9ba6abf3b899b9dc151579b05fb877d9764feb7e48"
},
"downloads": -1,
"filename": "biometal_rs-1.0.0-cp311-cp311-macosx_10_12_x86_64.whl",
"has_sig": false,
"md5_digest": "4d9d38085628a5d5537404e0674517e7",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.9",
"size": 1100484,
"upload_time": "2025-11-05T19:27:01",
"upload_time_iso_8601": "2025-11-05T19:27:01.286308Z",
"url": "https://files.pythonhosted.org/packages/88/a3/728ca697a620c5b1f6b4a1ca7cffcac8166cddc255757f15525f3e6d5bab/biometal_rs-1.0.0-cp311-cp311-macosx_10_12_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "a5d39e424efe1d3ff4adda96c366991ae06976e4c02140b25e85462492789f89",
"md5": "fc63d2afb0b622ac76adf717ceb9ebc2",
"sha256": "91d10deb2d0c39f90dc2ed901368bad0bda347ed93b0b3182949ad4fc85ba981"
},
"downloads": -1,
"filename": "biometal_rs-1.0.0-cp311-cp311-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "fc63d2afb0b622ac76adf717ceb9ebc2",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.9",
"size": 1054313,
"upload_time": "2025-11-05T19:27:03",
"upload_time_iso_8601": "2025-11-05T19:27:03.129975Z",
"url": "https://files.pythonhosted.org/packages/a5/d3/9e424efe1d3ff4adda96c366991ae06976e4c02140b25e85462492789f89/biometal_rs-1.0.0-cp311-cp311-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "fd7c827fe7d966f6d5ac84992978ea6e5ead04716f2bc19dedf1bfc8975a85a3",
"md5": "11cb460500718543bfd0de7b54b69b45",
"sha256": "78869bb0cb75378d529fcaf6746a6f9411257da7a529ca210d1cb0ef0e174a5b"
},
"downloads": -1,
"filename": "biometal_rs-1.0.0-cp311-cp311-manylinux_2_34_x86_64.whl",
"has_sig": false,
"md5_digest": "11cb460500718543bfd0de7b54b69b45",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.9",
"size": 3360298,
"upload_time": "2025-11-05T19:27:04",
"upload_time_iso_8601": "2025-11-05T19:27:04.586972Z",
"url": "https://files.pythonhosted.org/packages/fd/7c/827fe7d966f6d5ac84992978ea6e5ead04716f2bc19dedf1bfc8975a85a3/biometal_rs-1.0.0-cp311-cp311-manylinux_2_34_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "30381dbab783fc9e8f1be081f7ed5c09cdf8ca5fd385af0f7ca5bd4f4f6df2a2",
"md5": "4b4759efc09a5ea93753eb7b4c86d249",
"sha256": "fc583d234725dfbc7b1467985f0b2a9d6dc552237704025024ee23d12b87a611"
},
"downloads": -1,
"filename": "biometal_rs-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "4b4759efc09a5ea93753eb7b4c86d249",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 352337,
"upload_time": "2025-11-05T19:27:06",
"upload_time_iso_8601": "2025-11-05T19:27:06.234000Z",
"url": "https://files.pythonhosted.org/packages/30/38/1dbab783fc9e8f1be081f7ed5c09cdf8ca5fd385af0f7ca5bd4f4f6df2a2/biometal_rs-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-11-05 19:27:06",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "shandley",
"github_project": "biometal",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "biometal-rs"
}