| Field | Value |
|---|---|
| Name | needletail |
| Version | 0.6.0 |
| Summary | FASTX parsing and k-mer methods |
| Upload time | 2024-11-01 21:35:12 |
| Author | Roderick Bovee <rbovee@gmail.com>, Vincent Prouillet <vincent@onecodex.com> |
| License | MIT |
| Keywords | fasta, fastq, kmer, bioinformatics |

![CI](https://github.com/onecodex/needletail/workflows/CI/badge.svg)
[![crates.io](https://img.shields.io/crates/v/needletail.svg)](https://crates.io/crates/needletail)

# Needletail

Needletail is an MIT-licensed, minimal-copying FASTA/FASTQ parser and _k_-mer processing library for Rust.

The goal is to provide a fast *and* well-tested set of functions that more specialized bioinformatics programs can build on.
Needletail aims to be as fast as the [readfq](https://github.com/lh3/readfq) C library at parsing FASTX files and much faster (roughly 25 times) than equivalent Python implementations at _k_-mer counting.

## Example

```rust
extern crate needletail;
use needletail::{parse_fastx_file, Sequence, FastxReader};

fn main() {
    let filename = "tests/data/28S.fasta";

    let mut n_bases = 0;
    let mut n_valid_kmers = 0;
    let mut reader = parse_fastx_file(&filename).expect("valid path/file");
    while let Some(record) = reader.next() {
        let seqrec = record.expect("invalid record");
        // keep track of the total number of bases
        n_bases += seqrec.num_bases();
        // normalize to make sure all the bases are consistently capitalized and
        // that we remove the newlines since this is FASTA
        let norm_seq = seqrec.normalize(false);
        // we make a reverse complemented copy of the sequence first for
        // `canonical_kmers` to draw the complemented sequences from.
        let rc = norm_seq.reverse_complement();
        // now we count the AAAAs (or TTTTs, via canonicalization) in the
        // file; each item is a tuple of (position, canonical kmer, whether
        // the kmer was reverse complemented). The position is useful
        // because `N`-containing kmers are skipped.
        for (_, kmer, _) in norm_seq.canonical_kmers(4, &rc) {
            if kmer == b"AAAA" {
                n_valid_kmers += 1;
            }
        }
    }
    println!("There are {} bases in your file.", n_bases);
    println!("There are {} AAAAs in your file.", n_valid_kmers);
}
```
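
To make the idea of canonical _k_-mers concrete, here is a minimal pure-Python sketch of the counting the Rust example performs. It does not use needletail. The helper names (`reverse_complement`, `canonical_kmers`, `count_canonical`) are hypothetical, and it assumes that the canonical form of a _k_-mer is the lexicographically smaller of the _k_-mer and its reverse complement.

```python
# Illustration only: a pure-Python sketch, not needletail's implementation.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Swap A<->T and C<->G, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int):
    """Yield (position, canonical k-mer, was_reverse_complemented) tuples.

    K-mers containing anything other than A, C, G or T are skipped, so the
    yielded positions are not necessarily consecutive.
    """
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        if set(kmer) - set("ACGT"):
            continue
        rc = reverse_complement(kmer)
        # Assumption: the canonical form is the lexicographically smaller strand.
        if rc < kmer:
            yield pos, rc, True
        else:
            yield pos, kmer, False


def count_canonical(seq: str, target: str) -> int:
    """Count canonical k-mers equal to `target` (k is taken from its length)."""
    k = len(target)
    return sum(1 for _, kmer, _ in canonical_kmers(seq.upper(), k) if kmer == target)


n_aaaa = count_canonical("AAAAATTTTNAAAA", "AAAA")
```

A `TTTT` on the forward strand is counted as `AAAA` here, which is why the Rust example only needs to compare against `b"AAAA"`.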

## Installation

Needletail requires `rust` and `cargo` to be installed.
Please use either your local package manager (`homebrew`, `apt-get`, `pacman`, etc) or install these via [rustup](https://www.rustup.rs/).

Once you have Rust set up, you can add needletail to your `Cargo.toml` file as follows:
```toml
[dependencies]
needletail = "0.6.0"
```

To set up needletail itself for development:
```shell
git clone https://github.com/onecodex/needletail
cd needletail
cargo test  # to run tests
```

### Python

#### Documentation

For a real example, you can refer to `test_python.py`.

The Python library raises only one type of exception: `NeedletailError`.

There are two ways to parse FASTA/FASTQ data: from a string with `parse_fastx_string(content: str)`, or from a file path with
`parse_fastx_file(path: str)`. Both functions return an iterator over records, and both raise if the file is not found or the content
is invalid.


```python
from needletail import parse_fastx_file, NeedletailError, reverse_complement, normalize_seq

try:
    for record in parse_fastx_file("myfile.fastq"):
        print(record.id)
        print(record.seq)
        print(record.qual)
except NeedletailError:
    print("Invalid Fastq file")
```

A record has the following shape:

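```python
from typing import Optional

class Record:
    id: str
    seq: str
    qual: Optional[str]  # None for FASTA records

    def is_fasta(self) -> bool: ...
    def is_fastq(self) -> bool: ...
    def normalize(self, iupac: bool) -> None: ...  # mutates self.seq in place
```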
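
To show how these fields and predicates fit together, here is a small stand-in written as a plain dataclass. `RecordSketch` is a hypothetical name, not the class exported by the library. The sketch assumes that a record counts as FASTA exactly when it has no quality string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordSketch:
    """Hypothetical stand-in mirroring the documented fields; not the real class."""

    id: str
    seq: str
    qual: Optional[str] = None

    def is_fasta(self) -> bool:
        # Assumption: a record without quality scores is treated as FASTA.
        return self.qual is None

    def is_fastq(self) -> bool:
        # Assumption: a record with quality scores is treated as FASTQ.
        return self.qual is not None


fasta_rec = RecordSketch(id="seq1", seq="ACGT")
fastq_rec = RecordSketch(id="read1", seq="ACGT", qual="IIII")
```

Code that consumes records can branch on `is_fastq()` before touching `qual`, rather than checking for `None` directly.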

Note that `normalize` (see <https://docs.rs/needletail/0.4.1/needletail/sequence/fn.normalize.html> for what it does) mutates `self.seq` in place.
The same operation is available as the `normalize_seq(seq: str, iupac: bool)` function, which returns the normalized sequence instead.
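
As a rough picture of what normalization involves, the following pure-Python sketch uppercases bases and strips whitespace, as the Rust example's comments describe. The remaining rules (mapping `U` to `T`, turning `.` and `~` into gaps, and replacing anything unrecognized with `N`) are assumptions based on one reading of the linked Rust documentation. The name `normalize_sketch` is hypothetical, and the linked documentation remains the authority on the real behavior.

```python
def normalize_sketch(seq: str, iupac: bool = False) -> str:
    """Illustrative approximation of sequence normalization (not the library code)."""
    allowed = set("ACGTN-")
    if iupac:
        # Assumption: with iupac=True, IUPAC ambiguity codes are kept as-is.
        allowed |= set("RYSWKMBDHV")

    out = []
    for ch in seq:
        if ch.isspace():
            continue  # drop newlines and other whitespace
        ch = ch.upper()
        if ch == "U":
            ch = "T"  # assumption: RNA bases are converted to DNA
        elif ch in ".~":
            ch = "-"  # assumption: these punctuation marks become gaps
        # Assumption: anything unrecognized is replaced with N.
        out.append(ch if ch in allowed else "N")
    return "".join(out)


strict = normalize_sketch("acgu\nRx")
relaxed = normalize_sketch("acgu\nRx", iupac=True)
```

The sketch returns a new string, like `normalize_seq`, whereas `Record.normalize` updates `self.seq` instead.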

Lastly, `reverse_complement(seq: str)` returns the reverse complement of a sequence. It does not raise an error if the sequence contains
invalid characters.
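
One way a reverse complement can tolerate invalid input is sketched below in pure Python. The name `reverse_complement_sketch` is hypothetical, and passing unrecognized characters through unchanged is an assumption: the text above only states that no error is raised.

```python
# Illustration only: unknown characters have no entry in the table,
# so str.translate leaves them untouched instead of raising.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(seq: str) -> str:
    """Complement each base, then reverse the whole string."""
    return seq.translate(_COMPLEMENT)[::-1]


clean = reverse_complement_sketch("AACG")
messy = reverse_complement_sketch("AC!G")
```

Because nothing is validated, callers who need strict input checking would have to do it themselves before calling such a function.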

#### Building

To work on the Python library on a macOS/Unix system (requires Python 3):
```bash
pip install maturin

# install the library in the local virtualenv
maturin develop --cargo-extra-args="--features=python"
```

To build the binary wheels and push them to PyPI:

```bash
# The Mac build requires switching through a few different python versions
maturin build --features python --release --strip

# The linux build is automated through cross-compiling in a docker image
docker run --rm -v $(pwd):/io ghcr.io/pyo3/maturin:main build --features=python --release --strip -f
twine upload target/wheels/*
```

## Releasing A New Version

There is a GitHub workflow that builds Python wheels for macOS (x86 and
ARM) and Ubuntu (x86). To run it, create a new release.

## Getting Help

Questions are best asked as GitHub issues. We plan to add more documentation soon, but in the meantime "doc" comments are included in the source.

## Contributing

Please do! We're happy to discuss possible additions and/or accept pull requests.

## Acknowledgements
Starting from 0.4, the parser algorithms are taken from [seq_io](https://github.com/markschl/seq_io). Although slightly modified, they mainly
come from that library. Links to the original files are available in `src/parser/fast{a,q}.rs`.


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "needletail",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "FASTA, FASTQ, kmer, bioinformatics",
    "author": "Roderick Bovee <rbovee@gmail.com>, Vincent Prouillet <vincent@onecodex.com>",
    "author_email": "Roderick Bovee <rbovee@gmail.com>, Vincent Prouillet <vincent@onecodex.com>",
    "download_url": null,
    "platform": null,
    "description": "![CI](https://github.com/onecodex/needletail/workflows/CI/badge.svg)\n[![crates.io](https://img.shields.io/crates/v/needletail.svg)](https://crates.io/crates/needletail)\n\n# Needletail\n\nNeedletail is a MIT-licensed, minimal-copying FASTA/FASTQ parser and _k_-mer processing library for Rust.\n\nThe goal is to write a fast *and* well-tested set of functions that more specialized bioinformatics programs can use.\nNeedletail's goal is to be as fast as the [readfq](https://github.com/lh3/readfq) C library at parsing FASTX files and much (i.e. 25 times) faster than equivalent Python implementations at _k_-mer counting.\n\n## Example\n\n```rust\nextern crate needletail;\nuse needletail::{parse_fastx_file, Sequence, FastxReader};\n\nfn main() {\n    let filename = \"tests/data/28S.fasta\";\n\n    let mut n_bases = 0;\n    let mut n_valid_kmers = 0;\n    let mut reader = parse_fastx_file(&filename).expect(\"valid path/file\");\n    while let Some(record) = reader.next() {\n        let seqrec = record.expect(\"invalid record\");\n        // keep track of the total number of bases\n        n_bases += seqrec.num_bases();\n        // normalize to make sure all the bases are consistently capitalized and\n        // that we remove the newlines since this is FASTA\n        let norm_seq = seqrec.normalize(false);\n        // we make a reverse complemented copy of the sequence first for\n        // `canonical_kmers` to draw the complemented sequences from.\n        let rc = norm_seq.reverse_complement();\n        // now we keep track of the number of AAAAs (or TTTTs via\n        // canonicalization) in the file; note we also get the position (i.0;\n        // in the event there were `N`-containing kmers that were skipped)\n        // and whether the sequence was complemented (i.2) in addition to\n        // the canonical kmer (i.1)\n        for (_, kmer, _) in norm_seq.canonical_kmers(4, &rc) {\n            if kmer == b\"AAAA\" {\n                n_valid_kmers 
+= 1;\n            }\n        }\n    }\n    println!(\"There are {} bases in your file.\", n_bases);\n    println!(\"There are {} AAAAs in your file.\", n_valid_kmers);\n}\n```\n\n## Installation\n\nNeedletail requires `rust` and `cargo` to be installed.\nPlease use either your local package manager (`homebrew`, `apt-get`, `pacman`, etc) or install these via [rustup](https://www.rustup.rs/).\n\nOnce you have Rust set up, you can include needletail in your `Cargo.toml` file like:\n```shell\n[dependencies]\nneedletail = \"0.6.0\"\n```\n\nTo install needletail itself for development:\n```shell\ngit clone https://github.com/onecodex/needletail\ncargo test  # to run tests\n```\n\n### Python\n\n#### Documentation\n\nFor a real example, you can refer to `test_python.py`.\n\nThe python library only raise one type of exception: `NeedletailError`.\n\nThere are 2 ways to parse a FASTA/FASTQ: one if you have a string (`parse_fastx_string(content: str)`) or a path to a file\n(`parse_fastx_file(path: str)`). 
Those functions will raise if the file is not found or if the content is invalid and will return\nan iterator.\n\n\n```python\nfrom needletail import parse_fastx_file, NeedletailError, reverse_complement, normalize_seq\n\ntry:\n    for record in parse_fastx_file(\"myfile.fastq\"):\n        print(record.id)\n        print(record.seq)\n        print(record.qual)\nexcept NeedletailError:\n    print(\"Invalid Fastq file\")\n```\n\nA record has the following shape:\n\n```python\nclass Record:\n    id: str\n    seq: str\n    qual: Optional[str]\n\n    def is_fasta(self) -> bool\n    def is_fastq(self) -> bool\n    def normalize(self, iupac: bool)\n```\n\nNote that `normalize` (see <https://docs.rs/needletail/0.4.1/needletail/sequence/fn.normalize.html> for what it does) will mutate `self.seq`.\nIt is also available as the `normalize_seq(seq: str, iupac: bool)` function which will return the normalized sequence in this case.\n\nLastly, there is also a `reverse_complement(seq: str)` that will do exactly what it says. This will not raise an error if you pass some invalid\ncharacters.\n\n#### Building\n\nTo work on the Python library on a Mac OS X/Unix system (requires Python 3):\n```bash\npip install maturin\n\n# finally, install the library in the local virtualenv\nmaturin develop --cargo-extra-args=\"--features=python\"\n```\n\nTo build the binary wheels and push to PyPI\n\n```\n# The Mac build requires switching through a few different python versions\nmaturin build --features python --release --strip\n\n# The linux build is automated through cross-compiling in a docker image\ndocker run --rm -v $(pwd):/io ghcr.io/pyo3/maturin:main build --features=python --release --strip -f\ntwine upload target/wheels/*\n```\n\n## Releasing A New Version\n\nThere is a Github Workflow that will build Python wheels for macOS (x86 and\nARM) and Ubuntu (x86). To run, create a new release.\n\n## Getting Help\n\nQuestions are best directed as GitHub issues. 
We plan to add more documentation soon, but in the meantime \"doc\" comments are included in the source.\n\n## Contributing\n\nPlease do! We're happy to discuss possible additions and/or accept pull requests.\n\n## Acknowledgements\nStarting from 0.4, the parsers algorithms is taken from [seq_io](https://github.com/markschl/seq_io). While it has been slightly modified, it is mainly\ncoming from that library. Links to the original files are available in `src/parser/fast{a,q}.rs`.\n\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "FASTX parsing and k-mer methods",
    "version": "0.6.0",
    "project_urls": {
        "Source Code": "https://github.com/onecodex/needletail"
    },
    "split_keywords": [
        "fasta",
        " fastq",
        " kmer",
        " bioinformatics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "288cfebe405f81289c0944a9dffb8e397e9d7a2f07f78fc20062c4147d021ba4",
                "md5": "d98b047538dc56fe512c7fb6544fbc47",
                "sha256": "e7257f39e1599eba394c33d71e94504fc1efb759be8903be8a9ded37efcc7d32"
            },
            "downloads": -1,
            "filename": "needletail-0.6.0-cp311-cp311-macosx_10_12_x86_64.whl",
            "has_sig": false,
            "md5_digest": "d98b047538dc56fe512c7fb6544fbc47",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": null,
            "size": 306269,
            "upload_time": "2024-11-01T21:35:12",
            "upload_time_iso_8601": "2024-11-01T21:35:12.178482Z",
            "url": "https://files.pythonhosted.org/packages/28/8c/febe405f81289c0944a9dffb8e397e9d7a2f07f78fc20062c4147d021ba4/needletail-0.6.0-cp311-cp311-macosx_10_12_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0e85808a1fd10b2739051828f574d12b4d0e00f7b82754c26c766659061cf91a",
                "md5": "c7c7a3acf573d5ac97251885aed82274",
                "sha256": "587885e4317bb981822919ab50039ebcb8a3731d7887faa4abfed86096bffa9f"
            },
            "downloads": -1,
            "filename": "needletail-0.6.0-cp311-cp311-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "c7c7a3acf573d5ac97251885aed82274",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": null,
            "size": 293194,
            "upload_time": "2024-11-01T21:35:13",
            "upload_time_iso_8601": "2024-11-01T21:35:13.759010Z",
            "url": "https://files.pythonhosted.org/packages/0e/85/808a1fd10b2739051828f574d12b4d0e00f7b82754c26c766659061cf91a/needletail-0.6.0-cp311-cp311-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "785d5bf496e492b34db78de761038de2f5927f44eaca5a5c55864efdc99e0ce0",
                "md5": "79ddadd750d531eeae32b44f4b2828be",
                "sha256": "d1b97cd1de077ba858796261a9aca9d971ce4583ade79ad350de7780a3f186ad"
            },
            "downloads": -1,
            "filename": "needletail-0.6.0-cp311-cp311-manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "79ddadd750d531eeae32b44f4b2828be",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": null,
            "size": 375038,
            "upload_time": "2024-11-01T21:35:31",
            "upload_time_iso_8601": "2024-11-01T21:35:31.970491Z",
            "url": "https://files.pythonhosted.org/packages/78/5d/5bf496e492b34db78de761038de2f5927f44eaca5a5c55864efdc99e0ce0/needletail-0.6.0-cp311-cp311-manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f18639dfeaf34b5bbf0210580416ea0dd212e09a7c65e529dbda8d4cb8da9fab",
                "md5": "84c4e7b885ddf00ee56c990a64119571",
                "sha256": "b3f0ac812cded0d868ccc46746ecdbd07f3f9c18950b929cfd04c14365bbef17"
            },
            "downloads": -1,
            "filename": "needletail-0.6.0-cp312-cp312-macosx_10_12_x86_64.whl",
            "has_sig": false,
            "md5_digest": "84c4e7b885ddf00ee56c990a64119571",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": null,
            "size": 305437,
            "upload_time": "2024-11-01T21:35:33",
            "upload_time_iso_8601": "2024-11-01T21:35:33.195201Z",
            "url": "https://files.pythonhosted.org/packages/f1/86/39dfeaf34b5bbf0210580416ea0dd212e09a7c65e529dbda8d4cb8da9fab/needletail-0.6.0-cp312-cp312-macosx_10_12_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "84000209b9f6be0c0274e3be4e75404d18759625e4b6963f40eda6cd8fbac58d",
                "md5": "a244bd1fc46b7ab2f530b49f829ee70e",
                "sha256": "ea8a5656839c82a8169297a23088c70758d36b3414a5d07cccc4e12f5ab74763"
            },
            "downloads": -1,
            "filename": "needletail-0.6.0-cp312-cp312-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "a244bd1fc46b7ab2f530b49f829ee70e",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": null,
            "size": 292561,
            "upload_time": "2024-11-01T21:35:34",
            "upload_time_iso_8601": "2024-11-01T21:35:34.947142Z",
            "url": "https://files.pythonhosted.org/packages/84/00/0209b9f6be0c0274e3be4e75404d18759625e4b6963f40eda6cd8fbac58d/needletail-0.6.0-cp312-cp312-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a65d41e0c8d66bceb14f08c27d01ede2bb9d5bf556f5f7da61cdbd93721f30bf",
                "md5": "1c02ed3f619934ba7b1393b800721169",
                "sha256": "c06e7c57eebb0168e67881f457f72133b8080ac5fe7e6aeeca8003b9612ee5c1"
            },
            "downloads": -1,
            "filename": "needletail-0.6.0-cp312-cp312-manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "1c02ed3f619934ba7b1393b800721169",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": null,
            "size": 374204,
            "upload_time": "2024-11-01T21:35:36",
            "upload_time_iso_8601": "2024-11-01T21:35:36.493450Z",
            "url": "https://files.pythonhosted.org/packages/a6/5d/41e0c8d66bceb14f08c27d01ede2bb9d5bf556f5f7da61cdbd93721f30bf/needletail-0.6.0-cp312-cp312-manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-01 21:35:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "onecodex",
    "github_project": "needletail",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "needletail"
}
        