# fastaccess
Efficient random access to subsequences in large FASTA files using byte-level seeking.
## Installation
```bash
# Copy the fastaccess directory to your project
cp -r fastaccess /your/project/
# Or install as an editable package (run from the source directory)
pip install -e .
```
## Quick Start
```python
from fastaccess import FastaStore
# Load FASTA (builds index, caches for next time)
fa = FastaStore("genome.fa")
# Fetch subsequence (1-based inclusive coordinates)
seq = fa.fetch("chr1", 1000, 2000)
# Get sequence info
info = fa.get_info("chr1")
print(info['description']) # Full FASTA header
# Batch fetch
sequences = fa.fetch_many([
    ("chr1", 1, 100),
    ("chr2", 500, 600)
])
```
## API Reference
### `FastaStore(path, use_cache=True)`
Initialize and build index.
**Parameters:**
- `path` (str): Path to FASTA file
- `use_cache` (bool): Save/load index from `.fidx` cache file (default: True)
**Example:**
```python
fa = FastaStore("genome.fa") # Uses cache (45x faster reload)
fa = FastaStore("genome.fa", False) # No cache
```
---
### `fetch(name, start, stop)` → str
Fetch a subsequence using 1-based inclusive coordinates.
**Parameters:**
- `name` (str): Sequence name
- `start` (int): Start position (≥1)
- `stop` (int): Stop position (≥start, ≤length)
**Returns:** Uppercase sequence string
**Example:**
```python
seq = fa.fetch("chr1", 1000, 2000) # Returns 1001 bases
```
---
### `fetch_many(queries)` → List[str]
Batch fetch multiple subsequences.
**Parameters:**
- `queries` (List[Tuple[str, int, int]]): List of (name, start, stop)
**Returns:** List of uppercase sequences
**Example:**
```python
seqs = fa.fetch_many([
    ("chr1", 1, 100),
    ("chr2", 500, 600)
])
```
---
### `list_sequences()` → List[str]
Get all sequence names.
**Example:**
```python
names = fa.list_sequences() # ["chr1", "chr2", "chrM"]
```
---
### `get_length(name)` → int
Get sequence length in bases.
**Example:**
```python
length = fa.get_length("chr1") # 248956422
```
---
### `get_description(name)` → str
Get full FASTA header description.
**Example:**
```python
desc = fa.get_description("U00096.3")
# "Escherichia coli str. K-12 substr. MG1655, complete genome"
```
---
### `get_info(name)` → dict
Get all metadata at once.
**Returns:** Dictionary with keys: `name`, `description`, `length`
**Example:**
```python
info = fa.get_info("chr1")
# {
# 'name': 'chr1',
# 'description': 'Homo sapiens chromosome 1...',
# 'length': 248956422
# }
```
---
### `rebuild_index()`
Force rebuild index and update cache.
**Example:**
```python
fa.rebuild_index() # If FASTA file was modified
```
## Features
### Random Access
Uses `seek()` and `read()` to fetch only the required bytes, so the whole file never has to be loaded into memory.
### Index Caching
Automatically saves index to `.fidx` file for 45-4300x faster reloading:
```python
fa = FastaStore("genome.fa") # First load: builds index (30 ms)
fa = FastaStore("genome.fa") # Second load: uses cache (0.7 ms)
```
Cache is automatically invalidated when FASTA file changes.
### Coordinate System
**1-based inclusive** (standard in bioinformatics):
```python
# Sequence: ACGTACGT...
fa.fetch("seq", 1, 4) # Returns "ACGT" (positions 1,2,3,4)
fa.fetch("seq", 5, 8) # Returns "ACGT" (positions 5,6,7,8)
```
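For readers used to Python slicing, the convention above corresponds to a 0-based half-open slice. The sketch below shows the mapping on a plain in-memory string; `slice_1based` is a hypothetical helper for illustration and is not part of fastaccess.

```python
def slice_1based(sequence: str, start: int, stop: int) -> str:
    """Return bases start..stop (1-based, inclusive) from an in-memory string."""
    # 1-based inclusive [start, stop] equals 0-based half-open [start - 1, stop).
    return sequence[start - 1:stop]


seq = "ACGTACGTAA"
print(slice_1based(seq, 1, 4))         # ACGT
print(slice_1based(seq, 5, 8))         # ACGT
print(len(slice_1based(seq, 3, 10)))   # stop - start + 1 = 8
```

The number of bases returned is always `stop - start + 1`, which is why `fetch("chr1", 1000, 2000)` yields 1001 bases.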
### Format Support
- **Wrapped sequences** (fixed-width lines, e.g., 60 bp/line)
- **Unwrapped sequences** (single-line)
- **Unix line endings** (`\n`)
- **Windows line endings** (`\r\n`)
### Output
All sequences returned as **uppercase** strings.
## Examples
### E. coli Genome
```python
from fastaccess import FastaStore
# Load E. coli genome
fa = FastaStore("ecoli.fasta")
# Get sequence ID
seq_id = fa.list_sequences()[0] # "U00096.3"
# Get metadata
info = fa.get_info(seq_id)
print(f"Organism: {info['description']}")
print(f"Size: {info['length']:,} bp")
# Fetch origin of replication region
oriC = fa.fetch(seq_id, 3923000, 3923500)
print(f"oriC: {oriC[:50]}...")
```
### Extract Gene Sequences
```python
# Gene coordinates
genes = [
    ("TP53", "chr17", 7668402, 7687550),
    ("BRCA1", "chr17", 43044295, 43170245),
]
fa = FastaStore("hg38.fa")
for gene_name, seq_id, start, end in genes:
    sequence = fa.fetch(seq_id, start, end)
    print(f"{gene_name}: {len(sequence):,} bp")
    # Save to file, analyze, etc.
```
### Process in Chunks
```python
fa = FastaStore("genome.fa")
seq_id = "chr1"
length = fa.get_length(seq_id)
chunk_size = 10000
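# range() excludes its end value, so use length + 1 to keep a final chunk
# that starts exactly at the last base.
for start in range(1, length + 1, chunk_size):
    stop = min(start + chunk_size - 1, length)
    chunk = fa.fetch(seq_id, start, stop)
    # Process chunk...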
```
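The chunk boundaries can be computed and checked independently of any FASTA file. This is a minimal sketch; `chunk_ranges` is a hypothetical helper, not a fastaccess function. It iterates up to `length + 1` so that a final chunk consisting of only the last base is not dropped.

```python
from typing import Iterator, Tuple


def chunk_ranges(length: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) pairs, 1-based inclusive, that cover 1..length."""
    for start in range(1, length + 1, chunk_size):
        yield start, min(start + chunk_size - 1, length)


print(list(chunk_ranges(25, 10)))  # [(1, 10), (11, 20), (21, 25)]
print(list(chunk_ranges(21, 10)))  # [(1, 10), (11, 20), (21, 21)]
```

Each pair can be passed straight to `fa.fetch(seq_id, start, stop)`.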
### Extract BED Regions
```python
# Regions as 1-based inclusive coordinates.
# BED files are 0-based, half-open: add 1 to each BED start before fetching.
regions = [
    ("chr1", 1000, 2000),
    ("chr1", 5000, 6000),
    ("chr2", 10000, 11000),
]
fa = FastaStore("genome.fa")
sequences = fa.fetch_many(regions)
# Write to new FASTA
with open("features.fa", "w") as out:
    for i, (seq_id, start, stop) in enumerate(regions):
        out.write(f">feature_{i+1} {seq_id}:{start}-{stop}\n")
        out.write(f"{sequences[i]}\n")
```
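BED files store 0-based, half-open intervals, while `fetch` and `fetch_many` expect 1-based inclusive coordinates, so real BED lines need a small conversion first. The sketch below assumes whitespace-separated BED text with at least three columns; `bed_to_queries` is a hypothetical helper, not part of fastaccess.

```python
from typing import List, Tuple


def bed_to_queries(bed_text: str) -> List[Tuple[str, int, int]]:
    """Convert BED lines to (name, start, stop) tuples in 1-based inclusive form."""
    queries = []
    for line in bed_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, bed_start, bed_end = fields[0], int(fields[1]), int(fields[2])
        # 0-based half-open [bed_start, bed_end) -> 1-based inclusive [bed_start + 1, bed_end]
        queries.append((chrom, bed_start + 1, bed_end))
    return queries


bed = "chr1\t999\t2000\tfeatureA\nchr2\t0\t10\n"
print(bed_to_queries(bed))  # [('chr1', 1000, 2000), ('chr2', 1, 10)]
```

The resulting list has the shape `fetch_many` takes. Zero-length BED features (start equal to end) would produce `start > stop` and should be filtered out beforehand.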
## Performance
**Index Building (with cache):**
```
First load: 30 ms (parses FASTA, saves cache)
Second load: 0.7 ms (loads cache - 45x faster!)
```
**Subsequence Fetching:**
```
Small (100 bp): 0.031 ms
Medium (10 KB): 0.133 ms
Large (100 KB): 1.072 ms
```
**Large Genomes:**
```
Human genome (3 GB):
  Without cache: 30 seconds
  With cache: 0.05 seconds (600x faster!)
```
## How It Works
### Index Structure
```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str         # "chr1"
    description: str  # "Homo sapiens chromosome 1..."
    length: int       # 248956422
    line_blen: int    # 60 (bases per line, 0 if unwrapped)
    line_len: int     # 61 (bytes per line with \n)
    offset: int       # Byte offset to sequence data
```
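One way such entries could be collected is a single pass over the file in binary mode, tracking byte offsets as lines are read. The following is a sketch under stated assumptions, not the package's `build_index()`: the name `build_index_sketch` is hypothetical, the record name is taken as the header text up to the first space, and the first sequence line's width is always recorded (the "0 if unwrapped" convention is not reproduced here).

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    name: str
    description: str
    length: int
    line_blen: int
    line_len: int
    offset: int


def build_index_sketch(path: str) -> Dict[str, Entry]:
    """Scan a FASTA file once and record where each record's bases start."""
    index: Dict[str, Entry] = {}
    current: Optional[Entry] = None
    pos = 0  # byte offset of the line currently being processed
    with open(path, "rb") as handle:
        for raw in handle:
            if raw.startswith(b">"):
                header = raw[1:].strip().decode("ascii")
                name, _, description = header.partition(" ")
                # Sequence data begins right after the header line.
                current = Entry(name, description, 0, 0, 0, pos + len(raw))
                index[name] = current
            elif current is not None:
                bases = len(raw.rstrip(b"\r\n"))
                if bases and current.line_len == 0:
                    # The first sequence line defines the line geometry.
                    current.line_blen = bases
                    current.line_len = len(raw)
                current.length += bases
            pos += len(raw)
    return index


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fa")
    with open(demo_path, "wb") as out:
        out.write(b">seq1 demo record\nACGTAC\nGTACGT\nAC\n>seq2\nGGGG\n")
    idx = build_index_sketch(demo_path)

print(idx["seq1"])
print(idx["seq2"])
```

Reading in binary mode keeps the offsets exact for both `\n` and `\r\n` files, because `line_len` counts the terminator bytes as they appear on disk.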
### Random Access Math
**Wrapped sequences (60 bp/line):**
```python
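# For position 10001 (1-based) in a record wrapped at 60 bases per line:
offset = 123456                     # example byte offset of the first base
zero_based = 10001 - 1              # 10000
line_number = zero_based // 60      # 166 full lines precede the target
position_in_line = zero_based % 60  # 40
byte_offset = offset + line_number * 61 + position_in_line  # 61 bytes per line, including \n
# Seek to byte_offset, read across lines, skip newlines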
```
**Unwrapped sequences:**
```python
byte_offset = offset + (start - 1)
# Simple seek and read
```
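Putting the arithmetic together, a range fetch can be sketched as one `seek()`, one `read()`, and a pass that drops line terminators. This illustrates the technique described above and is not the package's `fetch_subseq()`; the function name `fetch_sketch` and its parameter list are assumptions, and it expects a wrapped record with a non-zero `line_blen`.

```python
import os
import tempfile


def fetch_sketch(path, offset, line_blen, line_len, start, stop):
    """Read bases start..stop (1-based inclusive) from a fixed-width record."""
    first = start - 1  # 0-based index of the first requested base
    last = stop - 1    # 0-based index of the last requested base
    # Byte position of a base = offset + bytes of full lines before it + column.
    begin = offset + (first // line_blen) * line_len + first % line_blen
    end = offset + (last // line_blen) * line_len + last % line_blen
    with open(path, "rb") as handle:
        handle.seek(begin)
        raw = handle.read(end - begin + 1)
    # Remove the line terminators picked up when the range spans lines.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii").upper()


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fa")
    with open(demo_path, "wb") as out:
        # The header is 11 bytes; sequence lines hold 6 bases + "\n" (7 bytes).
        out.write(b">seq1 demo\nacgtac\ngtacgt\nac\n")
    middle = fetch_sketch(demo_path, 11, 6, 7, 5, 9)
    whole = fetch_sketch(demo_path, 11, 6, 7, 1, 14)

print(middle)  # ACGTA
print(whole)   # ACGTACGTACGTAC
```

Only the bytes between the first and last requested base are read, so the cost depends on the size of the range rather than the size of the file.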
## File Structure
```
fastaccess/
├── __init__.py # Package exports
├── index.py # Entry dataclass, build_index()
├── store.py # fetch_subseq() with random access
└── api.py # FastaStore class with caching
```
**Total:** ~500 lines, zero dependencies (stdlib only)
## Error Handling
```python
try:
    seq = fa.fetch("chr1", 1, 1000)
except KeyError:
    print("Sequence not found")
except ValueError as e:
    print(f"Invalid coordinates: {e}")
```
**Validation:**
- `start >= 1`
- `stop >= start`
- `stop <= sequence_length`
- Sequence name exists
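
The checks listed above can be sketched as a standalone function. This is an illustration only: `validate_query` and the `lengths` mapping are hypothetical, and the error messages are not taken from fastaccess. The exception types follow the error-handling example (`KeyError` for an unknown name, `ValueError` for bad coordinates).

```python
from typing import Mapping


def validate_query(lengths: Mapping[str, int], name: str, start: int, stop: int) -> None:
    """Raise KeyError or ValueError if the query breaks one of the rules."""
    if name not in lengths:
        raise KeyError(name)
    if start < 1:
        raise ValueError(f"start must be >= 1, got {start}")
    if stop < start:
        raise ValueError(f"stop ({stop}) must be >= start ({start})")
    if stop > lengths[name]:
        raise ValueError(f"stop ({stop}) exceeds sequence length ({lengths[name]})")


lengths = {"chr1": 100}
validate_query(lengths, "chr1", 1, 100)  # valid: returns without raising

for bad in [("chr1", 0, 10), ("chr1", 20, 10), ("chr1", 1, 101)]:
    try:
        validate_query(lengths, *bad)
    except ValueError as err:
        print(f"rejected {bad}: {err}")
```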
## Testing
```bash
pytest fastaccess/tests/ # 19 tests, all passing
```
**Test coverage:**
- Wrapped/unwrapped sequences
- Unix/Windows line endings
- Cross-line fetching
- Edge cases (first/last bases)
- Input validation
- Batch operations
- Cache functionality
## Requirements
- Python 3.7+
- No external dependencies
## Cache Files
Index is cached to `{fasta_file}.fidx` as JSON:
```json
{
  "fasta_mtime": 1760684220.34,
  "sequences": {
    "chr1": {
      "name": "chr1",
      "description": "Homo sapiens chromosome 1...",
      "length": 248956422,
      "line_blen": 60,
      "line_len": 61,
      "offset": 123456
    }
  }
}
```
**Cache invalidation:** Automatic when FASTA file is modified
**Disable caching:**
```python
fa = FastaStore("genome.fa", use_cache=False)
```
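
Since the cache stores `fasta_mtime`, a freshness check could compare that value with the FASTA file's current modification time. The sketch below is an assumption about how such a check might look, not the package's actual logic: `cache_is_fresh` is a hypothetical name, and whether fastaccess uses exact equality is not stated in this document.

```python
import json
import os
import tempfile


def cache_is_fresh(fasta_path: str) -> bool:
    """Return True if the .fidx file next to fasta_path matches its mtime."""
    cache_path = fasta_path + ".fidx"
    try:
        with open(cache_path, "r", encoding="utf-8") as handle:
            cached = json.load(handle)
    except (OSError, ValueError):
        # A missing, unreadable, or corrupt cache counts as stale.
        return False
    return cached.get("fasta_mtime") == os.path.getmtime(fasta_path)


with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "demo.fa")
    with open(fasta, "w") as out:
        out.write(">seq1\nACGT\n")
    with open(fasta + ".fidx", "w", encoding="utf-8") as out:
        json.dump({"fasta_mtime": os.path.getmtime(fasta), "sequences": {}}, out)
    fresh_before = cache_is_fresh(fasta)
    # Simulate an edit by moving the modification time forward.
    info = os.stat(fasta)
    os.utime(fasta, (info.st_atime, info.st_mtime + 10))
    fresh_after = cache_is_fresh(fasta)

print(fresh_before, fresh_after)  # True False
```

If a FASTA file is replaced by a tool that preserves timestamps, an mtime comparison cannot detect the change; calling `rebuild_index()` covers that case.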
## Limitations
- Sequences must be ASCII (DNA/RNA)
- Index rebuilt on each init (unless cached)
- No support for compressed FASTA (.gz)
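
Byte-level seeking needs an uncompressed file, so one workaround for `.gz` input is to decompress it once with the standard library and point `FastaStore` at the result. This is a minimal sketch; `decompress_fasta` is a hypothetical helper and the file names are examples.

```python
import gzip
import os
import shutil
import tempfile


def decompress_fasta(gz_path: str, out_path: str) -> str:
    """Stream a gzip-compressed FASTA to a plain file and return its path."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # copyfileobj streams in blocks, so large genomes do not need to fit in memory.
        shutil.copyfileobj(src, dst)
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "demo.fa.gz")
    with gzip.open(gz_path, "wb") as out:
        out.write(b">seq1\nACGT\n")
    plain_path = decompress_fasta(gz_path, os.path.join(tmp, "demo.fa"))
    with open(plain_path, "rb") as handle:
        restored = handle.read()

print(restored)  # b'>seq1\nACGT\n'
```

The decompressed path can then be passed to `FastaStore(...)` as usual.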
## License
MIT License
Raw data
{
"_id": null,
"home_page": "https://github.com/yourusername/fastaccess",
"name": "fastaccess",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "bioinformatics fasta genomics sequence-analysis random-access",
"author": "Bioinformatics Developer",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/57/7c/146674ed88a3c0a5087d38a43cfec3899e79f3e2de45f169e3c4126aec5f/fastaccess-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Efficient random access to subsequences in large FASTA files",
"version": "0.1.0",
"project_urls": {
"Homepage": "https://github.com/yourusername/fastaccess"
},
"split_keywords": [
"bioinformatics",
"fasta",
"genomics",
"sequence-analysis",
"random-access"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "dc4442b9a57c94c849fa4426c2412a0ff2616beeb616165dd89d94ce653d5a07",
"md5": "902512a4b0cd09f9f113638513290c73",
"sha256": "50789f214b7f2f740f3d0f291f2c3d93b302f66a845065eb081e65aa35166b3b"
},
"downloads": -1,
"filename": "fastaccess-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "902512a4b0cd09f9f113638513290c73",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 10269,
"upload_time": "2025-10-29T20:23:36",
"upload_time_iso_8601": "2025-10-29T20:23:36.987907Z",
"url": "https://files.pythonhosted.org/packages/dc/44/42b9a57c94c849fa4426c2412a0ff2616beeb616165dd89d94ce653d5a07/fastaccess-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "577c146674ed88a3c0a5087d38a43cfec3899e79f3e2de45f169e3c4126aec5f",
"md5": "0faab62c771786181bf900934481e87c",
"sha256": "f94478a050bdadfdbfc3103c7ae4601786586bbc5bb980f3403c62336059cd00"
},
"downloads": -1,
"filename": "fastaccess-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "0faab62c771786181bf900934481e87c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 12211,
"upload_time": "2025-10-29T20:23:38",
"upload_time_iso_8601": "2025-10-29T20:23:38.197135Z",
"url": "https://files.pythonhosted.org/packages/57/7c/146674ed88a3c0a5087d38a43cfec3899e79f3e2de45f169e3c4126aec5f/fastaccess-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-29 20:23:38",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "yourusername",
"github_project": "fastaccess",
"github_not_found": true,
"lcname": "fastaccess"
}