biometal-rs


- Name: biometal-rs
- Version: 1.0.0
- Home page: https://github.com/shandley/biometal
- Summary: ARM-native bioinformatics library with streaming architecture and evidence-based optimization
- Upload time: 2025-11-05 19:27:06
- Requires Python: >=3.9
- License: MIT OR Apache-2.0
- Keywords: bioinformatics, genomics, arm, neon, streaming

<div align="right">
  <img src="biometal_logo.png" alt="biometal logo" width="150"/>
</div>

# biometal

**ARM-native bioinformatics library with streaming architecture and evidence-based optimization**

[![Crates.io](https://img.shields.io/crates/v/biometal.svg)](https://crates.io/crates/biometal)
[![Documentation](https://docs.rs/biometal/badge.svg)](https://docs.rs/biometal)
[![License](https://img.shields.io/crates/l/biometal.svg)](https://github.com/shandley/biometal#license)

---

## What Makes biometal Different?

Most bioinformatics tools require you to download entire datasets before analysis. **biometal** streams data directly from the network, enabling analysis of terabyte-scale datasets on consumer hardware without downloading.

### Key Features

1. **Streaming Architecture** (Rule 5)
   - Constant ~5 MB memory footprint regardless of dataset size
   - Analyze 5TB datasets on laptops without downloading
   - 99.5% memory reduction compared to batch processing

2. **ARM-Native Performance** (Rule 1)
   - 16-25× speedup using ARM NEON SIMD
   - Works across Mac (Apple Silicon), AWS Graviton, Ampere, Raspberry Pi
   - Automatic fallback to scalar code on x86_64

3. **Network Streaming** (Rule 6)
   - Stream directly from HTTP/HTTPS sources
   - SRA toolkit integration (no local copy needed)
   - Smart LRU caching minimizes network requests
   - Background prefetching hides latency

4. **Intelligent I/O** (Rules 3-4)
   - 6.5× speedup from parallel bgzip decompression
   - Additional 2.5× from memory-mapped I/O (large files on macOS)
   - Combined 16.3× I/O speedup

5. **Evidence-Based Design**
   - Every optimization validated with statistical rigor (N=30, 95% CI)
   - 1,357 experiments, 40,710 measurements
   - Full methodology: [apple-silicon-bio-bench](https://github.com/shandley/apple-silicon-bio-bench)
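
The constant-memory claim in item 1 rests on handling one record at a time. The block below is a minimal sketch of that idea in plain Python, not biometal's implementation: `FastqRecord` and `iter_fastq` are hypothetical names, and the parser assumes four-line FASTQ records with no wrapped sequence lines.

```python
import gzip
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical type for this sketch)."""
    id: str
    sequence: bytes
    quality: bytes


def iter_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records one at a time; only the current record is held in memory."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip(b"\r\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip(b"\r\n")
        if not header.startswith(b"@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record")
        yield FastqRecord(header[1:].strip().decode("ascii"), sequence, quality)


# Demo: gzip two records in memory, then stream them back one by one.
raw = b"@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n"
with gzip.open(io.BytesIO(gzip.compress(raw)), "rb") as fh:
    ids = [record.id for record in iter_fastq(fh)]
```

Because the generator never accumulates records, peak memory depends on the longest single record and the decompression buffer, not on the file size.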

---

## Quick Start

### Rust Installation

```toml
[dependencies]
biometal = "1.0"
```

### Python Installation

```bash
# Option 1: Install from PyPI
pip install biometal-rs

# Option 2: Build from source (requires Rust toolchain)
pip install maturin
git clone https://github.com/shandley/biometal
cd biometal
maturin develop --release --features python
```

**Requirements**:
- Python 3.9+ (tested on 3.14)
- Rust toolchain (for building from source)

---

## Usage

### Rust: Basic Usage

```rust
use biometal::FastqStream;

// Stream FASTQ from local file (constant memory)
let stream = FastqStream::from_path("large_dataset.fq.gz")?;

for record in stream {
    let record = record?;
    // Process one record at a time
    // Memory stays constant at ~5 MB
}
```

### Network Streaming

```rust
use biometal::io::DataSource;
use biometal::FastqStream;

// Stream directly from URL (no download!)
let source = DataSource::Http("https://example.com/huge_dataset.fq.gz".to_string());
let stream = FastqStream::new(source)?;

// Analyze 5TB dataset without downloading
for record in stream {
    let record = record?;
    // Smart caching + prefetching in background
}
```

### SRA Streaming (No Download!)

```rust
use biometal::io::DataSource;
use biometal::FastqStream;

// Stream directly from NCBI SRA (no local download!)
let source = DataSource::Sra("SRR390728".to_string());  // E. coli dataset
let stream = FastqStream::new(source)?;

for record in stream {
    let record = record?;
    // Process 40 MB dataset with only ~5 MB memory
    // Background prefetching hides network latency
}
```

### Operations with Auto-Optimization

```rust
use biometal::operations;

// Example input; in practice this is `record.sequence` from a FastqStream
let sequence = b"ATGCATGC";

// ARM NEON automatically enabled on ARM platforms
let counts = operations::count_bases(sequence);
let gc = operations::gc_content(sequence);

// 16-25× faster on ARM, automatic scalar fallback on x86_64
```

### Python: Basic Usage

```python
import biometal

# Stream FASTQ from local file (constant memory)
stream = biometal.FastqStream.from_path("large_dataset.fq.gz")

for record in stream:
    # Process one record at a time
    # Memory stays constant at ~5 MB
    gc = biometal.gc_content(record.sequence)
    print(f"{record.id}: GC={gc:.2%}")
```

### Python: ARM NEON Operations

```python
import biometal

# ARM NEON automatically enabled on ARM platforms
# 16-25× faster on Mac ARM, automatic scalar fallback on x86_64

# GC content calculation
sequence = b"ATGCATGC"
gc = biometal.gc_content(sequence)  # 20.3× speedup on ARM

# Base counting
counts = biometal.count_bases(sequence)  # 16.7× speedup on ARM
print(f"A:{counts['A']}, C:{counts['C']}, G:{counts['G']}, T:{counts['T']}")

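# Quality scoring (take a record from a FastqStream, as in "Python: Basic Usage")
record = next(iter(biometal.FastqStream.from_path("large_dataset.fq.gz")))
mean_q = biometal.mean_quality(record.quality)  # 25.1× speedup on ARM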

# K-mer extraction (for ML preprocessing)
kmers = biometal.extract_kmers(sequence, k=6)
print(f"6-mers: {kmers}")
```
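
To make clear what these operations compute, here is a plain-Python scalar reference. It is a sketch for checking results, not the library's code. The `scalar_*` names are hypothetical, the return shapes are inferred from the usage above (a fraction for GC content, a dict keyed by base for counts), and Phred+33 quality encoding is an assumption.

```python
from typing import Dict, List


def scalar_gc_content(sequence: bytes) -> float:
    """Fraction of G/C bases (0.0 to 1.0); 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count(b"G") + upper.count(b"C")) / len(sequence)


def scalar_count_bases(sequence: bytes) -> Dict[str, int]:
    """Counts of A, C, G and T; other symbols such as N are ignored."""
    upper = sequence.upper()
    return {base: upper.count(base.encode()) for base in "ACGT"}


def scalar_mean_quality(quality: bytes, offset: int = 33) -> float:
    """Mean Phred score, assuming Phred+33 ASCII encoding."""
    if not quality:
        return 0.0
    return sum(quality) / len(quality) - offset


def scalar_extract_kmers(sequence: bytes, k: int) -> List[bytes]:
    """All overlapping k-mers, left to right; empty if the sequence is shorter than k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


gc = scalar_gc_content(b"ATGCATGC")           # 4 of 8 bases are G or C
kmers = scalar_extract_kmers(b"ATGCATGC", 6)  # 8 - 6 + 1 = 3 overlapping 6-mers
```

A reference like this is useful mainly for testing: an accelerated path should return the same values on the same input.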

### Python: Example Workflow

```python
import biometal

# Analyze FASTQ file with streaming (constant memory)
stream = biometal.FastqStream.from_path("data.fq.gz")

total_reads = 0
total_bases = 0
total_gc = 0.0
high_quality = 0

for record in stream:
    total_reads += 1

    # Count bases (ARM NEON accelerated)
    counts = biometal.count_bases(record.sequence)
    total_bases += sum(counts.values())

    # Calculate GC content (ARM NEON accelerated)
    gc = biometal.gc_content(record.sequence)
    total_gc += gc

    # Check quality (ARM NEON accelerated)
    if biometal.mean_quality(record.quality) > 30.0:
        high_quality += 1

print(f"Total bases: {total_bases}")
# A stream has no len(); divide by the number of records actually read
print(f"Average GC: {total_gc / max(total_reads, 1):.2%}")
print(f"High quality reads: {high_quality}")
```

---

## Performance

### Memory Efficiency

| Dataset Size | Traditional | biometal | Reduction |
|--------------|-------------|----------|-----------|
| 100K sequences | 134 MB | 5 MB | 96.3% |
| 1M sequences | 1,344 MB | 5 MB | 99.5% |
| **5TB dataset** | **5,000 GB** | **5 MB** | **99.9999%** |

### ARM NEON Speedup (Mac Apple Silicon)

**Optimized for Apple Silicon** - All optimizations validated on Mac M3 Max (1,357 experiments, N=30):

| Operation | Scalar | NEON | Speedup |
|-----------|--------|------|---------|
| Base counting | 315 Kseq/s | 5,254 Kseq/s | **16.7×** |
| GC content | 294 Kseq/s | 5,954 Kseq/s | **20.3×** |
| Quality filter | 245 Kseq/s | 6,143 Kseq/s | **25.1×** |
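
Each speedup in this table is the ratio of NEON throughput to scalar throughput. The short check below recomputes the last column from the table's own figures; nothing in it is a new measurement.

```python
# Throughput in thousands of sequences per second, copied from the table above.
throughput = {
    "Base counting": (315, 5254),
    "GC content": (294, 5954),
    "Quality filter": (245, 6143),
}

# speedup = NEON throughput / scalar throughput
speedups = {op: neon / scalar for op, (scalar, neon) in throughput.items()}

for op, speedup in speedups.items():
    print(f"{op}: {speedup:.1f}x")
```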

### Cross-Platform Performance (Validated Nov 2025)

| Platform | Base Counting | GC Content | Quality | Status |
|----------|---------------|------------|---------|--------|
| **Mac M3** (target) | 16.7× | 20.3× | 25.1× | ✅ Optimized |
| **AWS Graviton** | 10.7× | 6.9× | 1.9× | ✅ Works (portable) |
| **x86_64 Intel** | 1.0× | 1.0× | 1.0× | ✅ Works (portable) |

**Note**: biometal is optimized for Mac ARM, in line with its goal of making genomics practical on consumer hardware. Other platforms run correct, production-ready code but are not specifically tuned. See [Cross-Platform Testing Results](results/cross_platform/FINDINGS.md) for details.

### I/O Optimization

| File Size | Standard | Optimized | Speedup |
|-----------|----------|-----------|---------|
| Small (<50 MB) | 12.3s | 1.9s | 6.5× |
| Large (≥50 MB) | 12.3s | 0.75s | **16.3×** |
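
The combined figure follows from multiplying the two stage speedups listed under Key Features. The arithmetic below is a sketch under the assumption that the two gains are independent and compose multiplicatively; the variable names are illustrative.

```python
baseline_s = 12.3      # "Standard" time from the table
bgzip_speedup = 6.5    # parallel bgzip decompression
mmap_speedup = 2.5     # memory-mapped I/O, large files only

combined = bgzip_speedup * mmap_speedup     # 16.25, listed as 16.3x
small_file_s = baseline_s / bgzip_speedup   # about 1.9 s
large_file_s = baseline_s / combined        # about 0.76 s
```

Small files see only the bgzip gain because, per the table, the memory-mapped path applies at 50 MB and above.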

---

## Democratizing Bioinformatics

biometal addresses four barriers that lock researchers out of genomics:

### 1. Economic Barrier
- **Problem**: Most tools require $50K+ servers
- **Solution**: Consumer ARM laptops ($1,400) deliver production performance
- **Impact**: Small labs and LMIC researchers can compete

### 2. Environmental Barrier
- **Problem**: HPC clusters consume massive energy (300× excess for many workloads)
- **Solution**: ARM's energy efficiency is inherent in the architecture
- **Impact**: Reduced carbon footprint for genomics research

### 3. Portability Barrier
- **Problem**: Vendor lock-in (x86-only, cloud-only tools)
- **Solution**: Works across ARM ecosystem (Mac, Graviton, Ampere, RPi)
- **Impact**: No platform dependencies, true portability

### 4. Data Access Barrier ⭐
- **Problem**: 5TB datasets require 5TB storage + days to download
- **Solution**: Network streaming with smart caching
- **Impact**: Analyze 5TB datasets on 24GB laptops without downloading

---

## Evidence Base

biometal's design is grounded in comprehensive experimental validation:

- **Experiments**: 1,357 total (40,710 measurements with N=30)
- **Statistical rigor**: 95% confidence intervals, Cohen's d effect sizes
- **Cross-platform**: Mac M4 Max, AWS Graviton 3
- **Lab notebook**: 33 entries documenting full experimental log
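
For readers unfamiliar with the statistics named above, here is a minimal sketch of a 95% confidence interval and Cohen's d. It assumes a Student t interval and the pooled-standard-deviation form of d; the linked methodology may use different variants, and the function names are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple

T_CRIT_95_DF29 = 2.045  # two-sided 95% t critical value for N=30 (29 degrees of freedom)


def mean_ci95(samples: Sequence[float], t_crit: float = T_CRIT_95_DF29) -> Tuple[float, float]:
    """Return (mean, half-width) of the 95% confidence interval."""
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))  # standard error of the mean
    return mean, t_crit * sem


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Effect size: difference of means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)
```

The default critical value is only correct for 30 samples; pass a different `t_crit` for other sample sizes.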

See [OPTIMIZATION_RULES.md](OPTIMIZATION_RULES.md) for detailed evidence links.

**Full methodology**: [apple-silicon-bio-bench](https://github.com/shandley/apple-silicon-bio-bench)

**Publications** (in preparation):
1. DAG Framework: BMC Bioinformatics
2. biometal Library: Bioinformatics (Application Note) or JOSS
3. Four-Pillar Democratization: GigaScience

---

## Platform Support

### Optimization Strategy

biometal is **optimized for Mac ARM** (M1/M2/M3/M4) based on 1,357 experiments on Mac M3 Max. This aligns with our democratization mission: enable world-class bioinformatics on **affordable consumer hardware** ($1,000-2,000 MacBooks, not $50,000 servers).

Other platforms are **supported with portable, correct code** but not specifically optimized:

| Platform | Performance | Test Status | Strategy |
|----------|-------------|-------------|----------|
| **Mac ARM** (M1/M2/M3/M4) | **16-25× speedup** | ✅ 121/121 tests pass | **Optimized** (target platform) |
| **AWS Graviton** | 1.9-10.7× speedup | ✅ 121/121 tests pass | Portable (works well) |
| **Linux x86_64** | 1× (scalar) | ✅ 118/118 tests pass | Portable (fallback) |

### Feature Support Matrix

| Feature | macOS ARM | Linux ARM | Linux x86_64 |
|---------|-----------|-----------|--------------|
| ARM NEON SIMD | ✅ | ✅ | ❌ (scalar fallback) |
| Parallel Bgzip | ✅ | ✅ | ✅ |
| Smart mmap | ✅ | ⏳ | ❌ |
| Network Streaming | ✅ | ✅ | ✅ |
| Python Bindings | ✅ | ✅ | ✅ |

**Validation**: Cross-platform testing completed Nov 2025 on AWS Graviton 3 and x86_64. All tests pass. See [results/cross_platform/FINDINGS.md](results/cross_platform/FINDINGS.md) for full details.
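
The NEON row of the matrix amounts to a dispatch on CPU architecture. The sketch below shows that decision at runtime in Python purely as an illustration; it is not how biometal selects its code path. `has_neon` and `select_backend` are hypothetical names, and treating every 64-bit ARM machine as NEON-capable is an assumption of this sketch.

```python
import platform


def has_neon(machine: str) -> bool:
    """True for 64-bit ARM machine strings (assumed NEON-capable in this sketch)."""
    return machine.lower() in {"arm64", "aarch64"}


def select_backend(machine: str = "") -> str:
    """Pick 'neon' or 'scalar' for the given machine string, or for the current host."""
    return "neon" if has_neon(machine or platform.machine()) else "scalar"


backend = select_backend()  # depends on the host this runs on
```

Any other architecture string, including `x86_64` and 32-bit ARM values such as `armv7l`, falls through to the scalar path here.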

---

## Roadmap

**v1.0.0** (Released November 5, 2025) ✅
- Streaming FASTQ/FASTA parsers (constant memory)
- ARM NEON operations (16-25× speedup)
- Network streaming (HTTP/HTTPS, SRA)
- Python bindings (PyO3 0.27, Python 3.9-3.14)
- Cross-platform validation (Mac ARM, Graviton, x86_64)
- Production-grade quality (121 tests, Grade A+)

**Future Considerations** (Community Driven)
- Extended operation coverage (alignment, assembly)
- Additional format support (BAM/SAM, VCF)
- Publish to crates.io and PyPI
- Metal GPU acceleration (Mac-specific)

---

## SRA Streaming: Analysis Without Downloads

One of biometal's most powerful features is direct streaming from NCBI's Sequence Read Archive (SRA) without local downloads.

### Why This Matters

**Traditional workflow:**
1. Download 5 GB SRA dataset → 30 minutes + 5 GB disk space
2. Decompress → 15 GB disk space
3. Process → Additional memory
4. **Total:** 45 minutes + 20 GB resources before analysis even starts

**biometal workflow:**
1. Start analysis immediately → 0 wait time, ~5 MB memory
2. Stream directly from NCBI S3 → No disk space needed
3. Background prefetching hides latency → Near-local performance
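
Step 3 can be pictured as a producer thread that fetches blocks ahead of the consumer through a bounded queue. This is a sketch of the general technique, not biometal's prefetcher; `prefetching_reader` and `fake_fetch` are hypothetical, and the fetch function here stands in for a network read.

```python
import queue
import threading
from typing import Callable, Iterator


def prefetching_reader(fetch: Callable[[int], bytes], n_blocks: int, depth: int = 4) -> Iterator[bytes]:
    """Yield blocks in order while a worker thread fetches up to `depth` ahead."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker() -> None:
        try:
            for index in range(n_blocks):
                buffer.put(fetch(index))  # blocks while the buffer is full
            buffer.put(done)
        except Exception as exc:
            buffer.put(exc)  # hand the failure to the consumer

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item


def fake_fetch(index: int) -> bytes:
    """Stand-in for a network read of one block."""
    return f"block-{index}".encode()


blocks = list(prefetching_reader(fake_fetch, n_blocks=6))
```

The bounded queue is what keeps memory flat: the worker can run at most `depth` blocks ahead, so a slow consumer throttles the downloads instead of letting them pile up.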

### Supported Accessions

- **SRR** (Run): Most common, represents a sequencing run
- **SRX** (Experiment): Collection of runs
- **SRS** (Sample): Biological sample
- **SRP** (Study): Collection of experiments

### Basic SRA Usage

```rust
use biometal::io::DataSource;
use biometal::operations::{count_bases, gc_content};
use biometal::FastqStream;

// Stream from SRA accession
let source = DataSource::Sra("SRR390728".to_string());
let stream = FastqStream::new(source)?;

for record in stream {
    let record = record?;

    // ARM NEON-optimized operations (16-25× speedup)
    let bases = count_bases(&record.sequence);
    let gc = gc_content(&record.sequence);

    // Memory: Constant ~5 MB
}
```

### Real-World Example: E. coli Analysis

```bash
# Run the E. coli streaming example
cargo run --example sra_ecoli --features network

# Process ~250,000 reads with only ~5 MB memory
# No download required!
```

See [examples/sra_ecoli.rs](examples/sra_ecoli.rs) for the complete example.

### Performance Tuning

biometal automatically configures optimal settings for most use cases. For custom tuning:

```rust
use biometal::io::{HttpReader, sra_to_url};

let url = sra_to_url("SRR390728")?;
let reader = HttpReader::new(&url)?
    .with_prefetch_count(8)      // Prefetch 8 blocks ahead
    .with_chunk_size(128 * 1024); // 128 KB chunks

// See docs/PERFORMANCE_TUNING.md for detailed guide
```
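
The two tuning knobs trade memory for read-ahead. The sketch below works out the prefetch window size and the byte ranges it would cover, assuming chunks are fetched with HTTP Range requests; that transport detail, and both function names, are assumptions rather than documented behaviour.

```python
from typing import List


def range_headers(start_block: int, count: int, chunk_size: int) -> List[str]:
    """HTTP Range header values for `count` consecutive chunks."""
    headers = []
    for block in range(start_block, start_block + count):
        first = block * chunk_size
        last = first + chunk_size - 1  # the end of a Range is inclusive
        headers.append(f"bytes={first}-{last}")
    return headers


def prefetch_bytes(count: int, chunk_size: int) -> int:
    """Upper bound on the memory held by the prefetch window."""
    return count * chunk_size


default_window = prefetch_bytes(4, 64 * 1024)   # 262,144 bytes = 256 KB
tuned_window = prefetch_bytes(8, 128 * 1024)    # 1,048,576 bytes = 1 MB
first_two = range_headers(0, 2, 128 * 1024)
```

With the settings shown above (8 blocks of 128 KB), the window is four times the 4 × 64 KB figure quoted under Memory Guarantees, which is still small next to the 50 MB cache.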

### SRA URL Conversion

```rust
use biometal::io::{is_sra_accession, sra_to_url};

// Check if string is SRA accession
if is_sra_accession("SRR390728") {
    // Convert to direct NCBI S3 URL
    let url = sra_to_url("SRR390728")?;
    // → https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR390728/SRR390728
}
```
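
The same mapping can be sketched in Python for use outside Rust. This follows the URL pattern shown in the comment above and the accession prefixes listed earlier. It is not the library's `is_sra_accession` or `sra_to_url`: the function names are hypothetical, and the URL pattern is only applied to run (SRR) accessions because that is the only case shown.

```python
import re

_ACCESSION = re.compile(r"(SRR|SRX|SRS|SRP)\d+")
_RUN = re.compile(r"SRR\d+")
_RUN_URL = "https://sra-pub-run-odp.s3.amazonaws.com/sra/{acc}/{acc}"


def looks_like_sra_accession(text: str) -> bool:
    """True if `text` is one of the listed prefixes followed by digits."""
    return _ACCESSION.fullmatch(text) is not None


def run_url(accession: str) -> str:
    """Build the S3 URL for a run accession, following the pattern shown above."""
    if _RUN.fullmatch(accession) is None:
        raise ValueError(f"not a run accession: {accession!r}")
    return _RUN_URL.format(acc=accession)


url = run_url("SRR390728")
```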

### Memory Guarantees

- **Streaming buffer:** ~5 MB (constant)
- **LRU cache:** 50 MB (byte-bounded, automatic eviction)
- **Prefetch:** ~256 KB (4 blocks × 64 KB)
- **Total:** ~55 MB regardless of SRA file size

Compared with a 5 GB SRA file held locally, ~55 MB is roughly a **99% reduction**.
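
A byte-bounded LRU cache differs from the usual entry-counted one in that eviction is driven by total payload size. The sketch below shows the technique with `collections.OrderedDict`; it is an illustration under that assumption, not biometal's cache, and `ByteBoundedLru` is a hypothetical name.

```python
from collections import OrderedDict
from typing import Optional


class ByteBoundedLru:
    """LRU cache whose limit is total payload bytes rather than entry count."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self._entries = OrderedDict()  # block index -> bytes, oldest first

    def get(self, key: int) -> Optional[bytes]:
        block = self._entries.get(key)
        if block is not None:
            self._entries.move_to_end(key)  # mark as most recently used
        return block

    def put(self, key: int, block: bytes) -> None:
        old = self._entries.pop(key, None)
        if old is not None:
            self.current_bytes -= len(old)
        if len(block) > self.max_bytes:
            return  # larger than the whole budget, so never cached
        self._entries[key] = block
        self.current_bytes += len(block)
        while self.current_bytes > self.max_bytes:
            _, evicted = self._entries.popitem(last=False)  # least recently used
            self.current_bytes -= len(evicted)


cache = ByteBoundedLru(max_bytes=50 * 1024 * 1024)  # the 50 MB budget listed above
cache.put(0, b"\x00" * 65536)                       # one 64 KB block
```

Bounding by bytes keeps the memory ceiling fixed even when the chunk size is tuned, which an entry-count limit would not.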

### Examples

| Example | Dataset | Size | Demo |
|---------|---------|------|------|
| [sra_streaming.rs](examples/sra_streaming.rs) | Demo mode | N/A | Capabilities overview |
| [sra_ecoli.rs](examples/sra_ecoli.rs) | E. coli K-12 | ~40 MB | Real SRA streaming |
| [prefetch_tuning.rs](examples/prefetch_tuning.rs) | E. coli K-12 | ~40 MB | Performance tuning |

---

## Example Use Cases

### 1. Large-Scale Quality Control

```rust
use biometal::{FastqStream, operations};

// Stream 5TB dataset without downloading
let stream = FastqStream::from_url("https://sra.example.com/huge.fq.gz")?;

let mut total = 0;
let mut high_quality = 0;

for record in stream {
    let record = record?;
    total += 1;
    
    // ARM NEON accelerated (16-25×)
    if operations::mean_quality(&record.quality) > 30.0 {
        high_quality += 1;
    }
}

println!("High quality: {}/{} ({:.1}%)", 
    high_quality, total, 100.0 * high_quality as f64 / total as f64);
```

### 2. BERT Preprocessing Pipeline

```rust
use biometal::{FastqStream, kmer};

// Stream from SRA (no local copy!)
let stream = FastqStream::from_sra("SRR12345678")?;

// Extract k-mers for DNABert training
for record in stream {
    let record = record?;
    let kmers = kmer::extract_overlapping(&record.sequence, 6)?;
    
    // Feed to BERT training pipeline
    // Constant memory even for TB-scale datasets
}
```

### 3. Metagenomics Filtering

```rust
// FastqWriter is assumed to be exported at the crate root alongside FastqStream
use biometal::{FastqStream, FastqWriter, operations};

let input = FastqStream::from_path("metagen.fq.gz")?;
let mut output = FastqWriter::create("filtered.fq.gz")?;

for record in input {
    let record = record?;
    
    // Filter low-complexity sequences (ARM NEON accelerated)
    if operations::complexity_score(&record.sequence) > 0.5 {
        output.write(&record)?;
    }
}
// Memory: constant ~5 MB
// Speed: 16-25× faster on ARM
```
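
The example does not say how the complexity score is defined. One common choice, shown here purely as an assumption, is the fraction of distinct k-mers among all k-mer positions; `kmer_complexity` is a hypothetical stand-in and may not match what `operations::complexity_score` computes.

```python
def kmer_complexity(sequence: bytes, k: int = 3) -> float:
    """Distinct k-mers divided by the maximum possible, clamped to 0.0-1.0."""
    windows = len(sequence) - k + 1
    if windows <= 0:
        return 0.0
    distinct = len({sequence[i:i + k] for i in range(windows)})
    # At most 4**k distinct k-mers exist over the A/C/G/T alphabet.
    return min(1.0, distinct / min(windows, 4 ** k))


low = kmer_complexity(b"AAAAAAAAAA")     # homopolymer: a single distinct 3-mer
high = kmer_complexity(b"ACGTTGCAAGCT")  # every 3-mer window is different
```

Under this definition a threshold of 0.5, as used in the filter above, would drop homopolymer runs and short tandem repeats while keeping varied sequence.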

---

## Contributing

We welcome contributions! biometal is built on evidence-based optimization, so new features should:
1. Have clear use cases
2. Be validated experimentally (when adding optimizations)
3. Maintain platform portability
4. Follow the optimization rules in [OPTIMIZATION_RULES.md](OPTIMIZATION_RULES.md)

See [CLAUDE.md](CLAUDE.md) for development guidelines.

---

## License

Licensed under either of:

- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

---

## Citation

If you use biometal in your research, please cite:

```bibtex
@software{biometal2025,
  author = {Handley, Scott},
  title = {biometal: ARM-native bioinformatics with streaming architecture},
  year = {2025},
  url = {https://github.com/shandley/biometal}
}
```

For the experimental methodology, see:
```bibtex
@misc{asbb2025,
  author = {Handley, Scott},
  title = {Apple Silicon Bio Bench: Systematic Hardware Characterization for Bioinformatics},
  year = {2025},
  url = {https://github.com/shandley/apple-silicon-bio-bench}
}
```

---

- **Status**: v1.0.0 - Production Release 🎉
- **Released**: November 5, 2025
- **Grade**: A+ (rust-code-quality-reviewer)
- **Tests**: 121 passing (87 unit + 7 integration + 27 doc)
- **Evidence Base**: 1,357 experiments, 40,710 measurements
- **Mission**: Democratizing bioinformatics compute


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/shandley/biometal",
    "name": "biometal-rs",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "bioinformatics, genomics, arm, neon, streaming",
    "author": null,
    "author_email": "Scott Handley <scott.handley@example.com>",
    "download_url": "https://files.pythonhosted.org/packages/30/38/1dbab783fc9e8f1be081f7ed5c09cdf8ca5fd385af0f7ca5bd4f4f6df2a2/biometal_rs-1.0.0.tar.gz",
    "platform": null,
    "description": "<div align=\"right\">\n  <img src=\"biometal_logo.png\" alt=\"biometal logo\" width=\"150\"/>\n</div>\n\n# biometal\n\n**ARM-native bioinformatics library with streaming architecture and evidence-based optimization**\n\n[![Crates.io](https://img.shields.io/crates/v/biometal.svg)](https://crates.io/crates/biometal)\n[![Documentation](https://docs.rs/biometal/badge.svg)](https://docs.rs/biometal)\n[![License](https://img.shields.io/crates/l/biometal.svg)](https://github.com/shandley/biometal#license)\n\n---\n\n## What Makes biometal Different?\n\nMost bioinformatics tools require you to download entire datasets before analysis. **biometal** streams data directly from the network, enabling analysis of terabyte-scale datasets on consumer hardware without downloading.\n\n### Key Features\n\n1. **Streaming Architecture** (Rule 5)\n   - Constant ~5 MB memory footprint regardless of dataset size\n   - Analyze 5TB datasets on laptops without downloading\n   - 99.5% memory reduction compared to batch processing\n\n2. **ARM-Native Performance** (Rule 1)\n   - 16-25\u00d7 speedup using ARM NEON SIMD\n   - Works across Mac (Apple Silicon), AWS Graviton, Ampere, Raspberry Pi\n   - Automatic fallback to scalar code on x86_64\n\n3. **Network Streaming** (Rule 6)\n   - Stream directly from HTTP/HTTPS sources\n   - SRA toolkit integration (no local copy needed)\n   - Smart LRU caching minimizes network requests\n   - Background prefetching hides latency\n\n4. **Intelligent I/O** (Rules 3-4)\n   - 6.5\u00d7 speedup from parallel bgzip decompression\n   - Additional 2.5\u00d7 from memory-mapped I/O (large files on macOS)\n   - Combined 16.3\u00d7 I/O speedup\n\n5. 
**Evidence-Based Design**\n   - Every optimization validated with statistical rigor (N=30, 95% CI)\n   - 1,357 experiments, 40,710 measurements\n   - Full methodology: [apple-silicon-bio-bench](https://github.com/shandley/apple-silicon-bio-bench)\n\n---\n\n## Quick Start\n\n### Rust Installation\n\n```toml\n[dependencies]\nbiometal = \"1.0\"\n```\n\n### Python Installation\n\n```bash\n# Option 1: Install from PyPI (when published)\npip install biometal-rs\n\n# Option 2: Build from source (requires Rust toolchain)\npip install maturin\ngit clone https://github.com/shandley/biometal\ncd biometal\nmaturin develop --release --features python\n```\n\n**Requirements**:\n- Python 3.9+ (tested on 3.14)\n- Rust toolchain (for building from source)\n\n---\n\n## Usage\n\n### Rust: Basic Usage\n\n```rust\nuse biometal::FastqStream;\n\n// Stream FASTQ from local file (constant memory)\nlet stream = FastqStream::from_path(\"large_dataset.fq.gz\")?;\n\nfor record in stream {\n    let record = record?;\n    // Process one record at a time\n    // Memory stays constant at ~5 MB\n}\n```\n\n### Network Streaming\n\n```rust\nuse biometal::io::DataSource;\nuse biometal::FastqStream;\n\n// Stream directly from URL (no download!)\nlet source = DataSource::Http(\"https://example.com/huge_dataset.fq.gz\".to_string());\nlet stream = FastqStream::new(source)?;\n\n// Analyze 5TB dataset without downloading\nfor record in stream {\n    // Smart caching + prefetching in background\n}\n```\n\n### SRA Streaming (No Download!)\n\n```rust\nuse biometal::io::DataSource;\nuse biometal::FastqStream;\n\n// Stream directly from NCBI SRA (no local download!)\nlet source = DataSource::Sra(\"SRR390728\".to_string());  // E. 
coli dataset\nlet stream = FastqStream::new(source)?;\n\nfor record in stream {\n    let record = record?;\n    // Process 40 MB dataset with only ~5 MB memory\n    // Background prefetching hides network latency\n}\n```\n\n### Operations with Auto-Optimization\n\n```rust\nuse biometal::operations;\n\n// ARM NEON automatically enabled on ARM platforms\nlet counts = operations::base_counting(&sequence)?;\nlet gc = operations::gc_content(&sequence)?;\n\n// 16-25\u00d7 faster on ARM, automatic scalar fallback on x86_64\n```\n\n### Python: Basic Usage\n\n```python\nimport biometal\n\n# Stream FASTQ from local file (constant memory)\nstream = biometal.FastqStream.from_path(\"large_dataset.fq.gz\")\n\nfor record in stream:\n    # Process one record at a time\n    # Memory stays constant at ~5 MB\n    gc = biometal.gc_content(record.sequence)\n    print(f\"{record.id}: GC={gc:.2%}\")\n```\n\n### Python: ARM NEON Operations\n\n```python\nimport biometal\n\n# ARM NEON automatically enabled on ARM platforms\n# 16-25\u00d7 faster on Mac ARM, automatic scalar fallback on x86_64\n\n# GC content calculation\nsequence = b\"ATGCATGC\"\ngc = biometal.gc_content(sequence)  # 20.3\u00d7 speedup on ARM\n\n# Base counting\ncounts = biometal.count_bases(sequence)  # 16.7\u00d7 speedup on ARM\nprint(f\"A:{counts['A']}, C:{counts['C']}, G:{counts['G']}, T:{counts['T']}\")\n\n# Quality scoring\nquality = record.quality\nmean_q = biometal.mean_quality(quality)  # 25.1\u00d7 speedup on ARM\n\n# K-mer extraction (for ML preprocessing)\nkmers = biometal.extract_kmers(sequence, k=6)\nprint(f\"6-mers: {kmers}\")\n```\n\n### Python: Example Workflow\n\n```python\nimport biometal\n\n# Analyze FASTQ file with streaming (constant memory)\nstream = biometal.FastqStream.from_path(\"data.fq.gz\")\n\ntotal_bases = 0\ntotal_gc = 0.0\nhigh_quality = 0\n\nfor record in stream:\n    # Count bases (ARM NEON accelerated)\n    counts = biometal.count_bases(record.sequence)\n    total_bases += 
sum(counts.values())\n\n    # Calculate GC content (ARM NEON accelerated)\n    gc = biometal.gc_content(record.sequence)\n    total_gc += gc\n\n    # Check quality (ARM NEON accelerated)\n    if biometal.mean_quality(record.quality) > 30.0:\n        high_quality += 1\n\nprint(f\"Total bases: {total_bases}\")\nprint(f\"Average GC: {total_gc/len(stream):.2%}\")\nprint(f\"High quality reads: {high_quality}\")\n```\n\n---\n\n## Performance\n\n### Memory Efficiency\n\n| Dataset Size | Traditional | biometal | Reduction |\n|--------------|-------------|----------|-----------|\n| 100K sequences | 134 MB | 5 MB | 96.3% |\n| 1M sequences | 1,344 MB | 5 MB | 99.5% |\n| **5TB dataset** | **5,000 GB** | **5 MB** | **99.9999%** |\n\n### ARM NEON Speedup (Mac Apple Silicon)\n\n**Optimized for Apple Silicon** - All optimizations validated on Mac M3 Max (1,357 experiments, N=30):\n\n| Operation | Scalar | NEON | Speedup |\n|-----------|--------|------|---------|\n| Base counting | 315 Kseq/s | 5,254 Kseq/s | **16.7\u00d7** |\n| GC content | 294 Kseq/s | 5,954 Kseq/s | **20.3\u00d7** |\n| Quality filter | 245 Kseq/s | 6,143 Kseq/s | **25.1\u00d7** |\n\n### Cross-Platform Performance (Validated Nov 2025)\n\n| Platform | Base Counting | GC Content | Quality | Status |\n|----------|---------------|------------|---------|--------|\n| **Mac M3** (target) | 16.7\u00d7 | 20.3\u00d7 | 25.1\u00d7 | \u2705 Optimized |\n| **AWS Graviton** | 10.7\u00d7 | 6.9\u00d7 | 1.9\u00d7 | \u2705 Works (portable) |\n| **x86_64 Intel** | 1.0\u00d7 | 1.0\u00d7 | 1.0\u00d7 | \u2705 Works (portable) |\n\n**Note**: biometal is optimized for Mac ARM (consumer hardware democratization). Other platforms are supported with correct, production-ready code but not specifically optimized. 
See [Cross-Platform Testing Results](results/cross_platform/FINDINGS.md) for details.\n\n### I/O Optimization\n\n| File Size | Standard | Optimized | Speedup |\n|-----------|----------|-----------|---------|\n| Small (<50 MB) | 12.3s | 1.9s | 6.5\u00d7 |\n| Large (\u226550 MB) | 12.3s | 0.75s | **16.3\u00d7** |\n\n---\n\n## Democratizing Bioinformatics\n\nbiometal addresses four barriers that lock researchers out of genomics:\n\n### 1. Economic Barrier\n- **Problem**: Most tools require $50K+ servers\n- **Solution**: Consumer ARM laptops ($1,400) deliver production performance\n- **Impact**: Small labs and LMIC researchers can compete\n\n### 2. Environmental Barrier\n- **Problem**: HPC clusters consume massive energy (300\u00d7 excess for many workloads)\n- **Solution**: ARM efficiency inherent in architecture\n- **Impact**: Reduced carbon footprint for genomics research\n\n### 3. Portability Barrier\n- **Problem**: Vendor lock-in (x86-only, cloud-only tools)\n- **Solution**: Works across ARM ecosystem (Mac, Graviton, Ampere, RPi)\n- **Impact**: No platform dependencies, true portability\n\n### 4. Data Access Barrier \u2b50\n- **Problem**: 5TB datasets require 5TB storage + days to download\n- **Solution**: Network streaming with smart caching\n- **Impact**: Analyze 5TB datasets on 24GB laptops without downloading\n\n---\n\n## Evidence Base\n\nbiometal's design is grounded in comprehensive experimental validation:\n\n- **Experiments**: 1,357 total (40,710 measurements with N=30)\n- **Statistical rigor**: 95% confidence intervals, Cohen's d effect sizes\n- **Cross-platform**: Mac M4 Max, AWS Graviton 3\n- **Lab notebook**: 33 entries documenting full experimental log\n\nSee [OPTIMIZATION_RULES.md](OPTIMIZATION_RULES.md) for detailed evidence links.\n\n**Full methodology**: [apple-silicon-bio-bench](https://github.com/shandley/apple-silicon-bio-bench)\n\n**Publications** (in preparation):\n1. DAG Framework: BMC Bioinformatics\n2. 
biometal Library: Bioinformatics (Application Note) or JOSS\n3. Four-Pillar Democratization: GigaScience\n\n---\n\n## Platform Support\n\n### Optimization Strategy\n\nbiometal is **optimized for Mac ARM** (M1/M2/M3/M4) based on 1,357 experiments on Mac M3 Max. This aligns with our democratization mission: enable world-class bioinformatics on **affordable consumer hardware** ($1,000-2,000 MacBooks, not $50,000 servers).\n\nOther platforms are **supported with portable, correct code** but not specifically optimized:\n\n| Platform | Performance | Test Status | Strategy |\n|----------|-------------|-------------|----------|\n| **Mac ARM** (M1/M2/M3/M4) | **16-25\u00d7 speedup** | \u2705 121/121 tests pass | **Optimized** (target platform) |\n| **AWS Graviton** | 6-10\u00d7 speedup | \u2705 121/121 tests pass | Portable (works well) |\n| **Linux x86_64** | 1\u00d7 (scalar) | \u2705 118/118 tests pass | Portable (fallback) |\n\n### Feature Support Matrix\n\n| Feature | macOS ARM | Linux ARM | Linux x86_64 |\n|---------|-----------|-----------|--------------|\n| ARM NEON SIMD | \u2705 | \u2705 | \u274c (scalar fallback) |\n| Parallel Bgzip | \u2705 | \u2705 | \u2705 |\n| Smart mmap | \u2705 | \u23f3 | \u274c |\n| Network Streaming | \u2705 | \u2705 | \u2705 |\n| Python Bindings | \u2705 | \u2705 | \u2705 |\n\n**Validation**: Cross-platform testing completed Nov 2025 on AWS Graviton 3 and x86_64. All tests pass. 
See [results/cross_platform/FINDINGS.md](results/cross_platform/FINDINGS.md) for full details.\n\n---\n\n## Roadmap\n\n**v1.0.0** (Released November 5, 2025) \u2705\n- Streaming FASTQ/FASTA parsers (constant memory)\n- ARM NEON operations (16-25\u00d7 speedup)\n- Network streaming (HTTP/HTTPS, SRA)\n- Python bindings (PyO3 0.27, Python 3.9-3.14)\n- Cross-platform validation (Mac ARM, Graviton, x86_64)\n- Production-grade quality (121 tests, Grade A+)\n\n**Future Considerations** (Community Driven)\n- Extended operation coverage (alignment, assembly)\n- Additional format support (BAM/SAM, VCF)\n- Publish to crates.io and PyPI\n- Metal GPU acceleration (Mac-specific)\n\n---\n\n## SRA Streaming: Analysis Without Downloads\n\nOne of biometal's most powerful features is direct streaming from NCBI's Sequence Read Archive (SRA) without local downloads.\n\n### Why This Matters\n\n**Traditional workflow:**\n1. Download 5 GB SRA dataset \u2192 30 minutes + 5 GB disk space\n2. Decompress \u2192 15 GB disk space\n3. Process \u2192 Additional memory\n4. **Total:** 45 minutes + 20 GB resources before analysis even starts\n\n**biometal workflow:**\n1. Start analysis immediately \u2192 0 wait time, ~5 MB memory\n2. Stream directly from NCBI S3 \u2192 No disk space needed\n3. 
Background prefetching hides latency \u2192 Near-local performance\n\n### Supported Accessions\n\n- **SRR** (Run): Most common, represents a sequencing run\n- **SRX** (Experiment): Collection of runs\n- **SRS** (Sample): Biological sample\n- **SRP** (Study): Collection of experiments\n\n### Basic SRA Usage\n\n```rust\nuse biometal::io::DataSource;\nuse biometal::operations::{count_bases, gc_content};\nuse biometal::FastqStream;\n\n// Stream from SRA accession\nlet source = DataSource::Sra(\"SRR390728\".to_string());\nlet stream = FastqStream::new(source)?;\n\nfor record in stream {\n    let record = record?;\n\n    // ARM NEON-optimized operations (16-25\u00d7 speedup)\n    let bases = count_bases(&record.sequence);\n    let gc = gc_content(&record.sequence);\n\n    // Memory: Constant ~5 MB\n}\n```\n\n### Real-World Example: E. coli Analysis\n\n```bash\n# Run the E. coli streaming example\ncargo run --example sra_ecoli --features network\n\n# Process ~250,000 reads with only ~5 MB memory\n# No download required!\n```\n\nSee [examples/sra_ecoli.rs](examples/sra_ecoli.rs) for complete example.\n\n### Performance Tuning\n\nbiometal automatically configures optimal settings for most use cases. 
For custom tuning:\n\n```rust\nuse biometal::io::{HttpReader, sra_to_url};\n\nlet url = sra_to_url(\"SRR390728\")?;\nlet reader = HttpReader::new(&url)?\n    .with_prefetch_count(8)      // Prefetch 8 blocks ahead\n    .with_chunk_size(128 * 1024); // 128 KB chunks\n\n// See docs/PERFORMANCE_TUNING.md for detailed guide\n```\n\n### SRA URL Conversion\n\n```rust\nuse biometal::io::{is_sra_accession, sra_to_url};\n\n// Check if string is SRA accession\nif is_sra_accession(\"SRR390728\") {\n    // Convert to direct NCBI S3 URL\n    let url = sra_to_url(\"SRR390728\")?;\n    // \u2192 https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR390728/SRR390728\n}\n```\n\n### Memory Guarantees\n\n- **Streaming buffer:** ~5 MB (constant)\n- **LRU cache:** 50 MB (byte-bounded, automatic eviction)\n- **Prefetch:** ~256 KB (4 blocks \u00d7 64 KB)\n- **Total:** ~55 MB regardless of SRA file size\n\nCompare to downloading a 5 GB SRA file \u2192 **99%+ memory savings**\n\n### Examples\n\n| Example | Dataset | Size | Demo |\n|---------|---------|------|------|\n| [sra_streaming.rs](examples/sra_streaming.rs) | Demo mode | N/A | Capabilities overview |\n| [sra_ecoli.rs](examples/sra_ecoli.rs) | E. coli K-12 | ~40 MB | Real SRA streaming |\n| [prefetch_tuning.rs](examples/prefetch_tuning.rs) | E. coli K-12 | ~40 MB | Performance tuning |\n\n---\n\n## Example Use Cases\n\n### 1. Large-Scale Quality Control\n\n```rust\nuse biometal::{FastqStream, operations};\n\n// Stream 5TB dataset without downloading\nlet stream = FastqStream::from_url(\"https://sra.example.com/huge.fq.gz\")?;\n\nlet mut total = 0;\nlet mut high_quality = 0;\n\nfor record in stream {\n    let record = record?;\n    total += 1;\n    \n    // ARM NEON accelerated (16-25\u00d7)\n    if operations::mean_quality(&record.quality) > 30.0 {\n        high_quality += 1;\n    }\n}\n\nprintln!(\"High quality: {}/{} ({:.1}%)\", \n    high_quality, total, 100.0 * high_quality as f64 / total as f64);\n```\n\n### 2. 
BERT Preprocessing Pipeline\n\n```rust\nuse biometal::{FastqStream, kmer};\n\n// Stream from SRA (no local copy!)\nlet stream = FastqStream::from_sra(\"SRR12345678\")?;\n\n// Extract k-mers for DNABERT training\nfor record in stream {\n    let record = record?;\n    let kmers = kmer::extract_overlapping(&record.sequence, 6)?;\n    \n    // Feed to BERT training pipeline\n    // Constant memory even for TB-scale datasets\n}\n```\n\n### 3. Metagenomics Filtering\n\n```rust\n// Assumption: FastqWriter is exported at the crate root; adjust the path if it lives elsewhere\nuse biometal::{FastqStream, FastqWriter, operations};\n\nlet input = FastqStream::from_path(\"metagen.fq.gz\")?;\nlet mut output = FastqWriter::create(\"filtered.fq.gz\")?;\n\nfor record in input {\n    let record = record?;\n    \n    // Drop low-complexity sequences, keeping scores above 0.5 (ARM NEON accelerated)\n    if operations::complexity_score(&record.sequence) > 0.5 {\n        output.write(&record)?;\n    }\n}\n// Memory: constant ~5 MB\n// Speed: 16-25\u00d7 faster on ARM\n```\n\n---\n\n## Contributing\n\nWe welcome contributions! biometal is built on evidence-based optimization, so new features should:\n1. Have clear use cases\n2. Be validated experimentally (when adding optimizations)\n3. Maintain platform portability\n4. 
Follow the optimization rules in [OPTIMIZATION_RULES.md](OPTIMIZATION_RULES.md)\n\nSee [CLAUDE.md](CLAUDE.md) for development guidelines.\n\n---\n\n## License\n\nLicensed under either of:\n\n- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)\n- MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)\n\nat your option.\n\n---\n\n## Citation\n\nIf you use biometal in your research, please cite:\n\n```bibtex\n@software{biometal2025,\n  author = {Handley, Scott},\n  title = {biometal: ARM-native bioinformatics with streaming architecture},\n  year = {2025},\n  url = {https://github.com/shandley/biometal}\n}\n```\n\nFor the experimental methodology, see:\n```bibtex\n@misc{asbb2025,\n  author = {Handley, Scott},\n  title = {Apple Silicon Bio Bench: Systematic Hardware Characterization for Bioinformatics},\n  year = {2025},\n  url = {https://github.com/shandley/apple-silicon-bio-bench}\n}\n```\n\n---\n\n**Status**: v1.0.0 - Production Release \ud83c\udf89\n**Released**: November 5, 2025\n**Grade**: A+ (rust-code-quality-reviewer)\n**Tests**: 121 passing (87 unit + 7 integration + 27 doc)\n**Evidence Base**: 1,357 experiments, 40,710 measurements\n**Mission**: Democratizing bioinformatics compute\n\n",
    "bugtrack_url": null,
    "license": "MIT OR Apache-2.0",
    "summary": "ARM-native bioinformatics library with streaming architecture and evidence-based optimization",
    "version": "1.0.0",
    "project_urls": {
        "Documentation": "https://docs.rs/biometal",
        "Homepage": "https://github.com/shandley/biometal",
        "Repository": "https://github.com/shandley/biometal"
    },
    "split_keywords": [
        "bioinformatics",
        " genomics",
        " arm",
        " neon",
        " streaming"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "88a3728ca697a620c5b1f6b4a1ca7cffcac8166cddc255757f15525f3e6d5bab",
                "md5": "4d9d38085628a5d5537404e0674517e7",
                "sha256": "ec1aa1ae6622f2321e8e3a9ba6abf3b899b9dc151579b05fb877d9764feb7e48"
            },
            "downloads": -1,
            "filename": "biometal_rs-1.0.0-cp311-cp311-macosx_10_12_x86_64.whl",
            "has_sig": false,
            "md5_digest": "4d9d38085628a5d5537404e0674517e7",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.9",
            "size": 1100484,
            "upload_time": "2025-11-05T19:27:01",
            "upload_time_iso_8601": "2025-11-05T19:27:01.286308Z",
            "url": "https://files.pythonhosted.org/packages/88/a3/728ca697a620c5b1f6b4a1ca7cffcac8166cddc255757f15525f3e6d5bab/biometal_rs-1.0.0-cp311-cp311-macosx_10_12_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a5d39e424efe1d3ff4adda96c366991ae06976e4c02140b25e85462492789f89",
                "md5": "fc63d2afb0b622ac76adf717ceb9ebc2",
                "sha256": "91d10deb2d0c39f90dc2ed901368bad0bda347ed93b0b3182949ad4fc85ba981"
            },
            "downloads": -1,
            "filename": "biometal_rs-1.0.0-cp311-cp311-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "fc63d2afb0b622ac76adf717ceb9ebc2",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.9",
            "size": 1054313,
            "upload_time": "2025-11-05T19:27:03",
            "upload_time_iso_8601": "2025-11-05T19:27:03.129975Z",
            "url": "https://files.pythonhosted.org/packages/a5/d3/9e424efe1d3ff4adda96c366991ae06976e4c02140b25e85462492789f89/biometal_rs-1.0.0-cp311-cp311-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "fd7c827fe7d966f6d5ac84992978ea6e5ead04716f2bc19dedf1bfc8975a85a3",
                "md5": "11cb460500718543bfd0de7b54b69b45",
                "sha256": "78869bb0cb75378d529fcaf6746a6f9411257da7a529ca210d1cb0ef0e174a5b"
            },
            "downloads": -1,
            "filename": "biometal_rs-1.0.0-cp311-cp311-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "11cb460500718543bfd0de7b54b69b45",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.9",
            "size": 3360298,
            "upload_time": "2025-11-05T19:27:04",
            "upload_time_iso_8601": "2025-11-05T19:27:04.586972Z",
            "url": "https://files.pythonhosted.org/packages/fd/7c/827fe7d966f6d5ac84992978ea6e5ead04716f2bc19dedf1bfc8975a85a3/biometal_rs-1.0.0-cp311-cp311-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "30381dbab783fc9e8f1be081f7ed5c09cdf8ca5fd385af0f7ca5bd4f4f6df2a2",
                "md5": "4b4759efc09a5ea93753eb7b4c86d249",
                "sha256": "fc583d234725dfbc7b1467985f0b2a9d6dc552237704025024ee23d12b87a611"
            },
            "downloads": -1,
            "filename": "biometal_rs-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "4b4759efc09a5ea93753eb7b4c86d249",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 352337,
            "upload_time": "2025-11-05T19:27:06",
            "upload_time_iso_8601": "2025-11-05T19:27:06.234000Z",
            "url": "https://files.pythonhosted.org/packages/30/38/1dbab783fc9e8f1be081f7ed5c09cdf8ca5fd385af0f7ca5bd4f4f6df2a2/biometal_rs-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-11-05 19:27:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "shandley",
    "github_project": "biometal",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "biometal-rs"
}
        