easyfasta

Name: easyfasta
Version: 1.0.13
Summary: A lightweight Python library for efficient FASTA file parsing and DNA sequence manipulation.
Upload time: 2025-08-08 01:00:58
Requires Python: >=3.8
License: MIT
Keywords: bioinformatics, fasta
Requirements: No requirements were recorded.

# Easy Fasta

A lightweight functional Python library for efficient FASTA file parsing and DNA sequence manipulation. No OOP bloat, only data.

## Features

- **Memory-efficient parsing**: Stream through large FASTA files without loading everything into memory
- **Random access**: Jump directly to specific sequences with position tracking
- **Sequence extraction**: Filter sequences by identifiers
- **DNA manipulation**: Complete IUPAC-compliant complement and reverse complement operations
- **Formatting**: Convert sequences to multi-line FASTA format
- **No input validation**: users are responsible for providing correctly formatted files.

## Installation

Requires Python 3.8+.

```
pip install easyfasta
```

Alternatively, copy the module into your project.

## Quick Start

```python
from easyfasta import *
# Parse FASTA file sequence by sequence (memory efficient)
with open('sequences.fasta') as f:
    for header, sequence in fasta_iter(f):
        print(f">{header}")
        print(sequence[:50])  # First 50 bases

# Load entire FASTA into dictionary
sequences = load_fasta('sequences.fasta')
print(sequences['sequence_id'])

# Extract specific sequences
target_ids = ['seq1', 'seq2', 'seq3']
found = get_sequence_id('sequences.fasta', target_ids)
for header, seq in found:
    print(f"Found: {header}")

# Extract specific sequences using indexes
index = build_index('sequences.fasta')
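# The index can be saved and reloaded with pickle (files must be opened in binary mode)
# import pickle
# with open("save_index_file.pkl", "wb") as fh:
#     pickle.dump(index, fh)
# with open("save_index_file.pkl", "rb") as fh:
#     index = pickle.load(fh)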
target_ids = ['seq1', 'seq2', 'seq3']
found = get_sequence_index('sequences.fasta', target_ids, index, ignore_unfound=True)
for header, seq in found:
    print(f"Found: {header}")

# DNA manipulation
dna = "ATCGGTAA"
print(complement(dna))           # TAGCCATT
print(reverse_complement(dna))   # TTACCGAT
```

## API Reference

### Parsing Functions

#### `fasta_iter(open_file: TextIO) -> Generator[tuple[str, str], None, None]`

Memory-efficient iterator over FASTA sequences.

```python
with open('large_file.fasta') as f:
    for header, sequence in fasta_iter(f):
        # Process one sequence at a time
        process_sequence(header, sequence)
```
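
To make the streaming idea concrete, here is a minimal sketch of how a generator can yield one record at a time. The helper name `iter_fasta_records` is hypothetical and this is not easyfasta's own code; it only assumes a well-formed file in which every record starts with a `>` header line.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def iter_fasta_records(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time.

    Hypothetical helper for illustration only, not part of easyfasta.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            # Sequence lines accumulate until the next header.
            chunks.append(line)
    # Emit the last record once the file is exhausted.
    if header is not None:
        yield header, "".join(chunks)


demo = io.StringIO(">seq1 first record\nATCG\nGGTA\n>seq2\nTTAA\n")
records = list(iter_fasta_records(demo))
print(records)
```

Only the current record is buffered, so memory use depends on the longest sequence rather than on the size of the file.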

#### `load_fasta(fasta_path: str|Path) -> dict[str, str]`

Load entire FASTA file into a dictionary mapping sequence IDs to sequences.

```python
sequences = load_fasta('sequences.fasta')
my_sequence = sequences['sequence_id']
```

#### `get_sequence_id(fasta_file: str|Path, identifiers: Iterable[str], identifier_only: bool = True) -> list[tuple[str, str]]`

Extract sequences matching specific identifiers.

- `identifier_only`: If True, match only the first part of headers (before whitespace)

```python
wanted = ['seq1', 'seq2']
results = get_sequence_id('sequences.fasta', wanted)
```
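
The effect of `identifier_only` can be sketched as follows. The helpers `header_key` and `filter_records` are hypothetical, and the behaviour with `identifier_only=False` (comparing against the full header) is an assumption made for this illustration, not something confirmed by the library.

```python
from typing import Iterable, List, Tuple


def header_key(header: str, identifier_only: bool = True) -> str:
    """Return the part of a header used for matching (hypothetical helper)."""
    if not identifier_only:
        # Assumption: the whole header is compared in this mode.
        return header
    parts = header.split(None, 1)  # split on the first run of whitespace
    return parts[0] if parts else ""


def filter_records(
    records: Iterable[Tuple[str, str]],
    wanted: Iterable[str],
    identifier_only: bool = True,
) -> List[Tuple[str, str]]:
    """Keep the records whose key is in `wanted` (hypothetical helper)."""
    wanted_set = set(wanted)  # set lookup keeps each membership test cheap
    return [
        (header, seq)
        for header, seq in records
        if header_key(header, identifier_only) in wanted_set
    ]


records = [("seq1 Homo sapiens", "ATCG"), ("seq2", "GGTA"), ("seq3 E. coli", "TTAA")]
by_id = filter_records(records, ["seq1", "seq3"])
by_full_header = filter_records(records, ["seq1", "seq2"], identifier_only=False)
print(by_id)
print(by_full_header)
```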

#### `build_index(fasta_file: str|Path) -> dict[str, int]`

Build a FASTA index as a dictionary mapping each sequence identifier to an integer position in the file.

```python
index = build_index('sequences.fasta')
```

#### `get_sequence_index(fasta_file: str|Path, identifiers:Iterable[str], index_dict:dict[str, int], ignore_unfound: bool = True) -> list[tuple[str, str]]`

Retrieve sequences by identifier using a prebuilt index, which is faster than scanning the whole file.

```python
fasta_file = 'sequences.fasta'
index = build_index(fasta_file)
wanted = ['seq1', 'seq2']
results = get_sequence_index(fasta_file, wanted, index)
```
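
One common way to implement this kind of random access is to record the byte offset of each header line and `seek` to it later. The sketch below illustrates that technique with two hypothetical helpers, `build_offset_index` and `fetch_records`. It assumes the index stores byte offsets keyed by the first word of each header, and it silently skips unknown identifiers; neither detail is confirmed for easyfasta itself.

```python
import os
import tempfile
from typing import Dict, Iterable, List, Tuple


def build_offset_index(path: str) -> Dict[str, int]:
    """Map each identifier to the byte offset of its header line."""
    index: Dict[str, int] = {}
    # Binary mode keeps tell() and seek() as plain byte offsets.
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode()] = offset
    return index


def fetch_records(
    path: str, identifiers: Iterable[str], index: Dict[str, int]
) -> List[Tuple[str, str]]:
    """Jump to each indexed record and read it up to the next header."""
    found: List[Tuple[str, str]] = []
    with open(path, "rb") as handle:
        for identifier in identifiers:
            if identifier not in index:
                continue  # this sketch skips identifiers missing from the index
            handle.seek(index[identifier])
            header = handle.readline().decode().rstrip("\r\n")[1:]
            chunks: List[str] = []
            while True:
                line = handle.readline()
                if not line or line.startswith(b">"):
                    break
                chunks.append(line.decode().strip())
            found.append((header, "".join(chunks)))
    return found


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta")
    with open(demo_path, "w", newline="\n") as out:
        out.write(">seq1 first\nATCG\nGGTA\n>seq2\nTTAA\n")
    demo_index = build_offset_index(demo_path)
    demo_hits = fetch_records(demo_path, ["seq2", "missing", "seq1"], demo_index)

print(demo_index)
print(demo_hits)
```

Because each lookup is a single `seek`, the cost of retrieving a few sequences does not depend on where they sit in the file.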

### Sequence Manipulation

#### `complement(seq: str) -> str`
Return the complement of a DNA sequence (A↔T, C↔G, supports all IUPAC codes).

#### `reverse(seq: str) -> str`
Return the reverse of a sequence.

#### `reverse_complement(seq: str) -> str`
Return the reverse complement of a DNA sequence.
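
As an illustration of IUPAC-aware complementing, the sketch below uses a `str.translate` table built from the standard IUPAC nucleotide complement pairs. The function names are hypothetical, and the handling of lowercase input is a choice made for this sketch, not a statement about how easyfasta treats case.

```python
# IUPAC nucleotide codes and their complements, aligned position by position:
# A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H; S, W and N are their own complements.
_IUPAC_CODES = "ACGTRYSWKMBDHVN"
_IUPAC_COMPLEMENTS = "TGCAYRSWMKVHDBN"

# One translation table covering upper and lower case.
_COMPLEMENT_TABLE = str.maketrans(
    _IUPAC_CODES + _IUPAC_CODES.lower(),
    _IUPAC_COMPLEMENTS + _IUPAC_COMPLEMENTS.lower(),
)


def complement_sketch(seq: str) -> str:
    """Complement every base; characters outside the table pass through."""
    return seq.translate(_COMPLEMENT_TABLE)


def reverse_complement_sketch(seq: str) -> str:
    """Complement the sequence, then reverse it."""
    return complement_sketch(seq)[::-1]


print(complement_sketch("ATCGGTAA"))          # TAGCCATT
print(reverse_complement_sketch("ATCGGTAA"))  # TTACCGAT
print(complement_sketch("RYKMBDHVN"))         # YRMKVHDBN
```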

#### `wrap_sequence(sequence: str, chunk_size: int = 80) -> str`
Format sequence with line breaks every `chunk_size` characters (standard multiline FASTA format).

```python
formatted = wrap_sequence("ATCGATCGATCG" * 10, 60)
print(formatted)  # 60 characters per line

# Write a wrapped record to a file
out_file = 'output.fasta'  # example output path
with open(out_file, 'w') as fo:
    fo.write(">{}\n{}\n".format('seq_id', wrap_sequence("ATCGATCGATCG" * 10, 80)))
```
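
Wrapping of this kind can be sketched with plain slicing. The helper `wrap_sketch` is hypothetical; it assumes no trailing newline is appended, which may differ from what `wrap_sequence` returns.

```python
def wrap_sketch(sequence: str, chunk_size: int = 80) -> str:
    """Join fixed-width slices of the sequence with newlines."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return "\n".join(
        sequence[start:start + chunk_size]
        for start in range(0, len(sequence), chunk_size)
    )


wrapped = wrap_sketch("ATCGATCGATCG" * 10, 60)  # 120 bases in total
line_lengths = [len(line) for line in wrapped.split("\n")]
print(line_lengths)  # [60, 60]
```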

## Design Philosophy

This library prioritizes:

- **Memory efficiency**: Built for large genomic files that don't fit in RAM
- **Simplicity**: Clean, predictable API with no external dependencies. No OOP bloat, only data.
- **Performance**: Stream-based parsing with O(1) memory in the number of records (one sequence is handled at a time)
- **Standards compliance**: Full IUPAC nucleotide code support

## Use Cases

- Processing large FASTA files (e.g. metagenomes)
- Common DNA sequence manipulation
- Common FASTA operations, including parsing, indexing, and sequence retrieval
- Bioinformatics workflows requiring memory efficiency

## Requirements

- Python 3.8+
- No external dependencies

## License
MIT

## Contributing
Feel free to request new features. I kept the library lightweight because these are the features I use most, and I wanted to start from a solid foundation.

I have used this library for years, and it has been extensively tested. As such, I will only address issues that come with a minimal reproducible example.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "easyfasta",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Romain JSB Lannes <romain.lannes@protonmail.com>",
    "keywords": "bioinformatics, fasta",
    "author": null,
    "author_email": "Romain JSB Lannes <romain.lannes@protonmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/95/8e/ba12e7c6712a5345a985095f7cbbcb2a36ac37eed6876958ab75e576f060/easyfasta-1.0.13.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A lightweight Python library for efficient FASTA file parsing and DNA sequence manipulation.",
    "version": "1.0.13",
    "project_urls": {
        "Documentation": "https://github.com/rLannes/easyfasta",
        "Homepage": "https://github.com/rLannes/easyfasta",
        "Repository": "https://github.com/rLannes/easyfasta"
    },
    "split_keywords": [
        "bioinformatics",
        " fasta"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3995f473773ce01d49f6f34f285ebf9a9d4f59b6dc805c000c639ebbafa61ba2",
                "md5": "316a99a06a0a2544397fab1f05d24b20",
                "sha256": "e92306d7a83a5d71d936338538f18e207dec9d02bbe01762e25a523a42765428"
            },
            "downloads": -1,
            "filename": "easyfasta-1.0.13-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "316a99a06a0a2544397fab1f05d24b20",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 6419,
            "upload_time": "2025-08-08T01:00:57",
            "upload_time_iso_8601": "2025-08-08T01:00:57.637835Z",
            "url": "https://files.pythonhosted.org/packages/39/95/f473773ce01d49f6f34f285ebf9a9d4f59b6dc805c000c639ebbafa61ba2/easyfasta-1.0.13-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "958eba12e7c6712a5345a985095f7cbbcb2a36ac37eed6876958ab75e576f060",
                "md5": "10f292d79237b4e75b3809d85556615d",
                "sha256": "9b13be1a8e5216e4e244e821b4855399b137eb2dd913e4d234faf1c335cc66cd"
            },
            "downloads": -1,
            "filename": "easyfasta-1.0.13.tar.gz",
            "has_sig": false,
            "md5_digest": "10f292d79237b4e75b3809d85556615d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 5453,
            "upload_time": "2025-08-08T01:00:58",
            "upload_time_iso_8601": "2025-08-08T01:00:58.861029Z",
            "url": "https://files.pythonhosted.org/packages/95/8e/ba12e7c6712a5345a985095f7cbbcb2a36ac37eed6876958ab75e576f060/easyfasta-1.0.13.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-08 01:00:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "rLannes",
    "github_project": "easyfasta",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "easyfasta"
}
        