nnfasta

Name: nnfasta
Version: 0.1.36
Home page: https://github.com/arabidopsis/nnfasta.git
Summary: Lightweight Neural Net efficient FASTA
Upload time: 2024-05-12 14:14:56
Author: Ian Castleden
Requires Python: >=3.9
License: MIT
Keywords: fasta, torch, tensorflow, neural nets, training
# nnfasta

A lightweight, memory-efficient FASTA dataset suitable for neural-net training.

It stays memory efficient across process boundaries, so it is useful as input
to torch/tensorflow dataloaders with multiple workers etc.
(see [this issue](https://github.com/pytorch/pytorch/issues/13246#issuecomment-905703662))

Presents a list of FASTA files as a simple `abc.Sequence`,
so you can ask for `len(dataset)` and retrieve
`Record`s randomly with `dataset[i]`.

Uses Python's `mmap.mmap` under the hood.

The underlying FASTA files should be well formed, since only
minimal sanity checking is done.
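For illustration, here is a minimal well-formed FASTA file and a naive split into records (the sample data is made up; nnfasta itself computes byte offsets rather than splitting strings):

```python
# A minimal "well formed" FASTA: each record is a '>' header line
# (id plus free-text description) followed by one or more sequence lines.
fasta = """>sp|P12345 example protein 1
MKTAYIAKQR
QISFVKSHFS
>sp|P67890 example protein 2
GATTACA
"""

# Naive string split into records -- illustration only.
records = [r for r in fasta.split(">") if r]
for rec in records:
    header, *seq_lines = rec.splitlines()
    print(header, "".join(seq_lines))
```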

## Install

Install with:

```bash
pip install nnfasta
```

**There are no dependencies**; you just need a modern (>= 3.9) Python (the package itself is < 12K of code).

## Usage

```python
from nnfasta import nnfastas

dataset = nnfastas(['athaliana.fasta','triticum.fasta','zmays.fasta'])

# display the number of sequences
print(len(dataset))

# get a particular record
rec = dataset[20]
print('sequence', rec.id, rec.description, rec.seq)
```

**Warning**: No checks are made for the existence of
the FASTA files. Also, zero-length files will be rejected
by `mmap`.
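If that matters for your input, a quick pre-filter with the standard library is easy (a sketch; the file names here are hypothetical):

```python
import tempfile
from pathlib import Path

# Hypothetical candidates: one non-empty file, one empty file, one missing.
tmp = Path(tempfile.mkdtemp())
good = tmp / "good.fasta"
good.write_text(">id desc\nACGT\n")
empty = tmp / "empty.fasta"
empty.touch()
missing = tmp / "missing.fasta"

# Keep only files that exist and are non-empty (mmap rejects empty files).
usable = [p for p in (good, empty, missing) if p.exists() and p.stat().st_size > 0]
print(usable)  # only good.fasta survives the filter
```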

A `Record` mimics biopython's [`SeqRecord`](https://biopython.org/wiki/SeqRecord) and is simply:

```python
@dataclass
class Record:
    id: str
    """Sequence ID"""
    description: str
    """Line prefixed by '>'"""
    seq: str
    """Sequence stripped of whitespace and uppercased"""

    @property
    def name(self) -> str:
        return self.id
```

The major difference is that `seq` is just a simple `str` not a biopython `Seq` object
(We just don't want the `Bio` dependency -- `nnfasta` has _no_ dependencies).

## Arguments

You can give `nnfastas` a filename, a `Path`, the actual
bytes of the file, or an open file pointer (opened with `mode="rb"`),
_or_ a list of any of these, e.g.:

```python
from nnfasta import nnfastas

my = "my.fasta"
fa = nnfastas([my, open(my, mode="rb"),
               open(my, mode="rb").read()])
```

## Encoding

The files are assumed to be encoded as ASCII. If this is not the
case, `nnfastas` accepts an `encoding` argument. All files
presented to `nnfastas` are assumed to share that encoding. You can
alter the decoding behaviour with the `errors` keyword (default `"strict"`).
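These keywords follow Python's standard codec machinery; for reference, here is how the standard error handlers behave on non-ASCII bytes (a sketch of the mechanism only, not nnfasta's internals):

```python
data = b"s\xc3\xa9quence"  # UTF-8 bytes for "séquence"

# errors="strict" (the default) raises on bytes ASCII cannot decode.
strict_failed = False
try:
    data.decode("ascii")
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = data.decode("ascii", errors="replace")
print(replaced)

# Decoding with the right codec recovers the original text.
print(data.decode("utf-8"))
```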

## Test and Train Split best practice

Use `SubsetFasta`:

```python
from nnfasta import nnfastas, SubsetFasta
from sklearn.model_selection import train_test_split

dataset = nnfastas(['athaliana.fasta', 'triticum.fasta', 'zmays.fasta'])
train_idx, test_idx = train_test_split(range(len(dataset)), test_size=0.1, shuffle=True)

# these are still Sequence[Record] objects
train_data = SubsetFasta(dataset, train_idx)
test_data = SubsetFasta(dataset, test_idx)

# *OR* ... this is basically the same
import torch
train_data, test_data = torch.utils.data.random_split(dataset, [0.9, 0.1])
```

See the pytorch `Subset` logic [here](https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset)
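For intuition, a `Subset`-style wrapper is just an index indirection over the base `Sequence`; a minimal sketch in plain Python (not nnfasta's actual `SubsetFasta` implementation):

```python
from collections.abc import Sequence


class MiniSubset(Sequence):
    """Present a subset of `base` selected by `indices` (illustrative only)."""

    def __init__(self, base, indices):
        self.base = base
        self.indices = list(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # Indirect through the index list into the base sequence.
        return self.base[self.indices[i]]


letters = ["a", "b", "c", "d", "e"]
evens = MiniSubset(letters, [0, 2, 4])
print(len(evens), evens[1])  # 3 c
```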


## How it works

We memory-map the input files and use Python's `re` package to scan them
for `b"\r>|\n>|^>"` bytes, from which we compute a (start, end) pair for each
record and store these offsets in an in-memory `array.array`.

The operating system will ensure that identical mmapped pages in different
processes are shared.
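The scan described above can be sketched with the standard library alone (a simplified illustration over made-up sample data, not nnfasta's exact code):

```python
import array
import mmap
import re
import tempfile

# Write a small FASTA file to scan.
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as fh:
    fh.write(b">rec1 first\nACGT\nTTGG\n>rec2 second\nGATTACA\n")
    path = fh.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Find the byte offset of every record header: a '>' at the start of the
# file or just after a line break. Store the starts in an array.array.
starts = array.array("q", (m.end() - 1 for m in re.finditer(rb"\r>|\n>|^>", mm)))

# Record i spans [starts[i], starts[i+1]); the last record runs to EOF.
ends = list(starts[1:]) + [len(mm)]
records = [bytes(mm[s:e]) for s, e in zip(starts, ends)]
print(len(records))  # 2
```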

Enjoy, peeps!

            
