fastaccess


Name: fastaccess
Version: 0.1.0
Home page: https://github.com/yourusername/fastaccess
Summary: Efficient random access to subsequences in large FASTA files
Upload time: 2025-10-29 20:23:38
Maintainer: None
Docs URL: None
Author: Bioinformatics Developer
Requires Python: >=3.7
License: None
Keywords: bioinformatics, fasta, genomics, sequence-analysis, random-access
# fastaccess

Efficient random access to subsequences in large FASTA files using byte-level seeking.

## Installation

```bash
# Copy the fastaccess directory to your project
cp -r fastaccess /your/project/

# Or install it as an editable package from a source checkout
pip install -e .
```

## Quick Start

```python
from fastaccess import FastaStore

# Load FASTA (builds index, caches for next time)
fa = FastaStore("genome.fa")

# Fetch subsequence (1-based inclusive coordinates)
seq = fa.fetch("chr1", 1000, 2000)

# Get sequence info
info = fa.get_info("chr1")
print(info['description'])  # Full FASTA header

# Batch fetch
sequences = fa.fetch_many([
    ("chr1", 1, 100),
    ("chr2", 500, 600)
])
```

## API Reference

### `FastaStore(path, use_cache=True)`

Initialize and build index.

**Parameters:**
- `path` (str): Path to FASTA file
- `use_cache` (bool): Save/load index from `.fidx` cache file (default: True)

**Example:**
```python
fa = FastaStore("genome.fa")          # Uses cache (45x faster reload)
fa = FastaStore("genome.fa", False)   # No cache
```

---

### `fetch(name, start, stop)` → str

Fetch a subsequence using 1-based inclusive coordinates.

**Parameters:**
- `name` (str): Sequence name
- `start` (int): Start position (≥1)
- `stop` (int): Stop position (≥start, ≤length)

**Returns:** Uppercase sequence string

**Example:**
```python
seq = fa.fetch("chr1", 1000, 2000)  # Returns 1001 bases
```

---

### `fetch_many(queries)` → List[str]

Batch fetch multiple subsequences.

**Parameters:**
- `queries` (List[Tuple[str, int, int]]): List of (name, start, stop)

**Returns:** List of uppercase sequences

**Example:**
```python
seqs = fa.fetch_many([
    ("chr1", 1, 100),
    ("chr2", 500, 600)
])
```

---

### `list_sequences()` → List[str]

Get all sequence names.

**Example:**
```python
names = fa.list_sequences()  # ["chr1", "chr2", "chrM"]
```

---

### `get_length(name)` → int

Get sequence length in bases.

**Example:**
```python
length = fa.get_length("chr1")  # 248956422
```

---

### `get_description(name)` → str

Get full FASTA header description.

**Example:**
```python
desc = fa.get_description("U00096.3")
# "Escherichia coli str. K-12 substr. MG1655, complete genome"
```

---

### `get_info(name)` → dict

Get all metadata at once.

**Returns:** Dictionary with keys: `name`, `description`, `length`

**Example:**
```python
info = fa.get_info("chr1")
# {
#   'name': 'chr1',
#   'description': 'Homo sapiens chromosome 1...',
#   'length': 248956422
# }
```

---

### `rebuild_index()`

Force rebuild index and update cache.

**Example:**
```python
fa.rebuild_index()  # If FASTA file was modified
```

## Features

### Random Access
Uses `seek()` and `read()` to fetch only the bytes a query needs, so the whole file never has to be loaded into memory.

### Index Caching
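Automatically saves the index to a `.fidx` file, which makes reloading roughly 45x faster for a small file and about 600x faster for a human-sized genome (see Performance):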
```python
fa = FastaStore("genome.fa")  # First load: builds index (30 ms)
fa = FastaStore("genome.fa")  # Second load: uses cache (0.7 ms)
```

The cache records the FASTA file's modification time and is invalidated automatically when the file changes.

### Coordinate System
**1-based inclusive** (standard in bioinformatics):
```python
# Sequence: ACGTACGT...
fa.fetch("seq", 1, 4)  # Returns "ACGT" (positions 1,2,3,4)
fa.fetch("seq", 5, 8)  # Returns "ACGT" (positions 5,6,7,8)
```
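
For readers used to Python's zero-based, half-open slices, the mapping can be sketched as follows. `slice_one_based` is a hypothetical helper written for illustration, not part of fastaccess, and it works on an in-memory string rather than a file.

```python
def slice_one_based(sequence, start, stop):
    """Return bases start..stop (1-based, inclusive) of an in-memory string."""
    # 1-based inclusive [start, stop] is the Python slice [start - 1:stop]
    return sequence[start - 1:stop]


seq = "ACGTACGT"
print(slice_one_based(seq, 1, 4))  # ACGT
print(slice_one_based(seq, 5, 8))  # ACGT
```

The result always has `stop - start + 1` characters, which is why `fetch("chr1", 1000, 2000)` yields 1001 bases.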

### Format Support
- **Wrapped sequences** (fixed-width lines, e.g., 60 bp/line)
- **Unwrapped sequences** (single-line)
- **Unix line endings** (`\n`)
- **Windows line endings** (`\r\n`)

### Output
All sequences are returned as **uppercase** strings.

## Examples

### E. coli Genome

```python
from fastaccess import FastaStore

# Load E. coli genome
fa = FastaStore("ecoli.fasta")

# Get sequence ID
seq_id = fa.list_sequences()[0]  # "U00096.3"

# Get metadata
info = fa.get_info(seq_id)
print(f"Organism: {info['description']}")
print(f"Size: {info['length']:,} bp")

# Fetch origin of replication region
oriC = fa.fetch(seq_id, 3923000, 3923500)
print(f"oriC: {oriC[:50]}...")
```

### Extract Gene Sequences

```python
# Gene coordinates
genes = [
    ("TP53", "chr17", 7668402, 7687550),
    ("BRCA1", "chr17", 43044295, 43170245),
]

fa = FastaStore("hg38.fa")

for gene_name, seq_id, start, end in genes:
    sequence = fa.fetch(seq_id, start, end)
    print(f"{gene_name}: {len(sequence):,} bp")
    # Save to file, analyze, etc.
```

### Process in Chunks

```python
fa = FastaStore("genome.fa")

seq_id = "chr1"
length = fa.get_length(seq_id)
chunk_size = 10000

for start in range(1, length + 1, chunk_size):  # + 1 so the final base is never skipped
    stop = min(start + chunk_size - 1, length)
    chunk = fa.fetch(seq_id, start, stop)
    # Process chunk...
```
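
The windowing arithmetic can also be factored into a small generator. This is a sketch only: `chunk_ranges` is a hypothetical helper, not part of the fastaccess API, and it yields 1-based inclusive pairs suitable for passing to `fetch`.

```python
def chunk_ranges(length, chunk_size):
    """Yield (start, stop) pairs, 1-based inclusive, that cover 1..length."""
    # range() excludes its end value, so use length + 1 to include the last base
    for start in range(1, length + 1, chunk_size):
        yield start, min(start + chunk_size - 1, length)


print(list(chunk_ranges(25, 10)))  # [(1, 10), (11, 20), (21, 25)]
```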

### Extract BED Regions

```python
# Regions in 1-based inclusive coordinates
# (BED files are 0-based, half-open: add 1 to each BED start first)
regions = [
    ("chr1", 1000, 2000),
    ("chr1", 5000, 6000),
    ("chr2", 10000, 11000),
]

fa = FastaStore("genome.fa")
sequences = fa.fetch_many(regions)

# Write to new FASTA
with open("features.fa", "w") as out:
    for i, (seq_id, start, stop) in enumerate(regions):
        out.write(f">feature_{i+1} {seq_id}:{start}-{stop}\n")
        out.write(f"{sequences[i]}\n")
```
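
When the regions come from an actual BED file, the coordinates need converting first, because BED uses 0-based, half-open intervals. A minimal sketch, assuming whitespace-separated BED lines; `bed_to_queries` is a hypothetical helper, not part of fastaccess:

```python
def bed_to_queries(lines):
    """Convert BED lines (0-based, half-open) to 1-based inclusive tuples."""
    queries = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        # BED start is 0-based, so add 1. BED end is exclusive, so it already
        # equals the 1-based inclusive stop.
        queries.append((chrom, int(start) + 1, int(end)))
    return queries


bed_lines = ["chr1\t999\t2000", "chr2\t9999\t11000\tpeak_7"]
print(bed_to_queries(bed_lines))
# [('chr1', 1000, 2000), ('chr2', 10000, 11000)]
```

The resulting list has the `(name, start, stop)` shape that `fetch_many` expects.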

## Performance

**Index Building (with cache):**
```
First load:   30 ms    (parses FASTA, saves cache)
Second load:  0.7 ms   (loads cache - 45x faster!)
```

**Subsequence Fetching:**
```
Small (100 bp):    0.031 ms
Medium (10 kb):    0.133 ms
Large (100 kb):    1.072 ms
```

**Large Genomes:**
```
Human genome (3 GB):
  Without cache: 30 seconds
  With cache:    0.05 seconds  (600x faster!)
```

## How It Works

### Index Structure
```python
@dataclass
class Entry:
    name: str          # "chr1"
    description: str   # "Homo sapiens chromosome 1..."
    length: int        # 248956422
    line_blen: int     # 60 (bases per line, 0 if unwrapped)
    line_len: int      # 61 (bytes per line with \n)
    offset: int        # Byte offset to sequence data
```
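
To make the fields concrete, here is a rough sketch of how such entries could be collected in a single pass over the file. It is not the package's `build_index()`; the parsing rules are assumptions: the name is taken as the first whitespace-delimited token of the header, the description as the remainder, and the width of the first sequence line is recorded even for unwrapped records (the real index stores 0 there).

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    description: str
    length: int
    line_blen: int
    line_len: int
    offset: int


def build_index(path):
    """Scan a FASTA file once and return {name: Entry}."""
    entries = {}
    current = None
    pos = 0  # byte position of the start of the current line
    with open(path, "rb") as handle:
        for raw in handle:
            if raw.startswith(b">"):
                parts = raw[1:].strip().decode("ascii").split(None, 1)
                name = parts[0] if parts else ""
                description = parts[1] if len(parts) > 1 else ""
                # Sequence data starts right after the header line
                current = Entry(name, description, 0, 0, 0, pos + len(raw))
                entries[name] = current
            elif current is not None:
                bases = len(raw.rstrip(b"\r\n"))
                if current.line_len == 0 and bases:
                    current.line_blen = bases      # bases on the first line
                    current.line_len = len(raw)    # bytes including terminator
                current.length += bases
            pos += len(raw)
    return entries
```

Reading in binary mode keeps `pos` in bytes, which is what `seek()` needs later, and lets `\r\n` files show up as `line_len == line_blen + 2`.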

### Random Access Math

**Wrapped sequences (60 bp/line):**
```python
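# For 1-based position 10001 (60 bases per line, 61 bytes per line):
offset = 123456                       # example value; taken from the index entry
zero_based = 10001 - 1                # 10000
line_number = zero_based // 60        # 166
position_in_line = zero_based % 60    # 40
byte_offset = offset + line_number * 61 + position_in_line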

# Seek to byte_offset, read across lines, skip newlines
```
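
The arithmetic above can be turned into a short, self-contained reader. This is a sketch under stated assumptions, not the package's `fetch_subseq()`: `fetch_wrapped` is a hypothetical function that takes the index fields as plain arguments and reads from any binary file-like object.

```python
import io


def fetch_wrapped(handle, offset, line_blen, line_len, start, stop):
    """Read bases start..stop (1-based, inclusive) from a wrapped record."""
    first = start - 1  # zero-based index of the first base
    last = stop - 1    # zero-based index of the last base
    begin = offset + (first // line_blen) * line_len + first % line_blen
    end = offset + (last // line_blen) * line_len + last % line_blen
    handle.seek(begin)
    raw = handle.read(end - begin + 1)  # may include line terminators
    # Drop \n and \r, then normalise to uppercase
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii").upper()


# Demo: 6 bases per line, 7 bytes per line, data starts after the 11-byte header
data = io.BytesIO(b">seq1 demo\nACGTAC\nGTACGT\nAC\n")
print(fetch_wrapped(data, offset=11, line_blen=6, line_len=7, start=5, stop=8))
# ACGT
```

Computing the byte position of both endpoints means a single `read()` covers the whole span, and the embedded line terminators are stripped afterwards.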

**Unwrapped sequences:**
```python
byte_offset = offset + (start - 1)
# Simple seek and read
```

## File Structure

```
fastaccess/
├── __init__.py       # Package exports
├── index.py          # Entry dataclass, build_index()
├── store.py          # fetch_subseq() with random access
└── api.py            # FastaStore class with caching
```

**Total:** ~500 lines, zero dependencies (stdlib only)

## Error Handling

```python
try:
    seq = fa.fetch("chr1", 1, 1000)
except KeyError:
    print("Sequence not found")
except ValueError as e:
    print(f"Invalid coordinates: {e}")
```

**Validation:**
- `start >= 1`
- `stop >= start`
- `stop <= sequence_length`
- Sequence name exists
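
The checks listed above could be expressed roughly as follows. This is an illustrative sketch, not the library's code: `validate_query` and the `lengths` mapping are hypothetical, and the exception messages are invented for the example.

```python
def validate_query(lengths, name, start, stop):
    """Raise KeyError or ValueError if a 1-based inclusive query is invalid."""
    if name not in lengths:
        raise KeyError(name)
    length = lengths[name]
    if start < 1:
        raise ValueError(f"start must be >= 1, got {start}")
    if stop < start:
        raise ValueError(f"stop ({stop}) must be >= start ({start})")
    if stop > length:
        raise ValueError(f"stop ({stop}) exceeds sequence length ({length})")


lengths = {"chr1": 1000}
validate_query(lengths, "chr1", 1, 1000)  # valid: returns None without raising
```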

## Testing

```bash
pytest fastaccess/tests/  # 19 tests, all passing
```

**Test coverage:**
- Wrapped/unwrapped sequences
- Unix/Windows line endings
- Cross-line fetching
- Edge cases (first/last bases)
- Input validation
- Batch operations
- Cache functionality

## Requirements

- Python 3.7+
- No external dependencies

## Cache Files

Index is cached to `{fasta_file}.fidx` as JSON:
```json
{
  "fasta_mtime": 1760684220.34,
  "sequences": {
    "chr1": {
      "name": "chr1",
      "description": "Homo sapiens chromosome 1...",
      "length": 248956422,
      "line_blen": 60,
      "line_len": 61,
      "offset": 123456
    }
  }
}
```

**Cache invalidation:** Automatic when FASTA file is modified
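
One way such a modification-time check could work is sketched below. The functions `save_index` and `load_index` are hypothetical and only mirror the JSON layout shown above; the package's actual caching logic may differ.

```python
import json
import os


def save_index(fasta_path, sequences):
    """Write the index next to the FASTA file, stamped with its mtime."""
    payload = {
        "fasta_mtime": os.path.getmtime(fasta_path),
        "sequences": sequences,
    }
    with open(fasta_path + ".fidx", "w") as handle:
        json.dump(payload, handle)


def load_index(fasta_path):
    """Return the cached sequences dict, or None if the cache is unusable."""
    try:
        with open(fasta_path + ".fidx") as handle:
            cache = json.load(handle)
    except (OSError, ValueError):
        # Missing, unreadable, or corrupt cache: the caller should rebuild
        return None
    if not isinstance(cache, dict):
        return None
    if cache.get("fasta_mtime") != os.path.getmtime(fasta_path):
        # The FASTA file changed after the cache was written
        return None
    return cache.get("sequences")
```

A caller would try `load_index` first and fall back to parsing the FASTA and calling `save_index` when it returns `None`.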

**Disable caching:**
```python
fa = FastaStore("genome.fa", use_cache=False)
```

## Limitations

- Sequences must be ASCII (DNA/RNA)
- The index is rebuilt on every initialization unless a valid cache exists
- No support for compressed FASTA (.gz)

## License

MIT License

            
