# nnfasta
Lightweight, memory-efficient FASTA dataset suitable for training neural nets.
It stays memory efficient across process boundaries, so it is useful as input to
torch/tensorflow dataloaders with multiple workers
(see [this issue](https://github.com/pytorch/pytorch/issues/13246#issuecomment-905703662)).
It presents a list of FASTA files as a simple `abc.Sequence`,
so you can query `len(dataset)` and retrieve
`Record`s randomly with `dataset[i]`.
Uses Python's `mmap.mmap` under the hood.
The underlying FASTA files should be well formed, since only
minimal sanity checking is done.
## Install
Install with:
```bash
pip install nnfasta
```
**There are no dependencies**; you just need a modern Python (>= 3.9). The whole package is under 12K of code.
## Usage
```python
from nnfasta import nnfastas
dataset = nnfastas(['athaliana.fasta','triticum.fasta','zmays.fasta'])
# display the number of sequences
print(len(dataset))
# get a particular record
rec = dataset[20]
print('sequence', rec.id, rec.description, rec.seq)
```
**Warning**: No checks are made for the existence of
the FASTA files. Also, zero-length files will be rejected
by `mmap`.
A `Record` mimics biopython's [`SeqRecord`](https://biopython.org/wiki/SeqRecord) and is simply:
```python
from dataclasses import dataclass

@dataclass
class Record:
    id: str
    """Sequence ID"""
    description: str
    """Line prefixed by '>'"""
    seq: str
    """Sequence stripped of whitespace and uppercased"""

    @property
    def name(self) -> str:
        return self.id
```
The major difference is that `seq` is just a simple `str`, not a biopython `Seq` object
(we just don't want the `Bio` dependency -- `nnfasta` has _no_ dependencies).
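In practice a `Record` behaves like a lightweight `SeqRecord`; here is a small sketch reusing the example files from the Usage section above:
```python
from nnfasta import nnfastas

dataset = nnfastas(["athaliana.fasta"])

rec = dataset[0]
assert rec.name == rec.id  # `name` is just an alias for `id`
print(len(rec.seq))        # `seq` is a plain str, so len() ...
print(rec.seq[:10])        # ... and slicing just work
```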
## Arguments
You can give `nnfastas` either a filename, a `Path`, the actual
bytes in the file, or an open file pointer (opened with `mode="rb"`)
_OR_ a list of any of these. e.g.:
```python
from nnfasta import nnfastas
my = "my.fasta"
fa = nnfastas([my, open(my, mode="rb"),
               open(my, mode="rb").read()])
```
## Encoding
The files are assumed to be ASCII-encoded. If that is not the
case, `nnfastas` accepts an `encoding` argument. All files
presented to `nnfastas` are assumed to share the same encoding. You can
alter how decoding errors are handled with the `errors` keyword (default `"strict"`).
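For example (the file name here is hypothetical; any codec name accepted by Python's `bytes.decode` should work):
```python
from nnfasta import nnfastas

# "legacy.fasta" is a hypothetical Latin-1 encoded file;
# errors="replace" substitutes undecodable bytes instead of raising
dataset = nnfastas(["legacy.fasta"], encoding="latin-1", errors="replace")
```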
## Test and Train Split best practice
Use `SubsetFasta`:
```python
from nnfasta import nnfastas, SubsetFasta
from sklearn.model_selection import train_test_split
dataset = nnfastas(['athaliana.fasta','triticum.fasta','zmays.fasta'])
train_idx, test_idx = train_test_split(range(len(dataset)), test_size=0.1, shuffle=True)
# these are still Sequence[Record] objects.
train_data = SubsetFasta(dataset, train_idx)
test_data = SubsetFasta(dataset, test_idx)
# *OR* ... this is basically the same
import torch
train_data, test_data = torch.utils.data.random_split(dataset, [.9, .1])
```
See the pytorch `Subset` logic [here](https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset).
## How it works
We memory-map the input files and use Python's `re` module to scan them
for the pattern `b"\r>|\n>|^>"`, from which we compute a start and end offset for each
record, stored in an in-memory `array.array`.

The operating system ensures that identical memory-mapped pages in different processes
are shared.
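A minimal sketch of the scanning step (assuming a single, well-formed input file; `scan_starts` is illustrative, not part of the library's API):
```python
import mmap
import re
from array import array

def scan_starts(filename: str) -> array:
    """Find the byte offset of each '>' record header in a FASTA file."""
    starts = array("q")  # compact in-memory array of 64-bit offsets
    with open(filename, "rb") as fp:
        mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
        for m in re.finditer(b"\r>|\n>|^>", mm):
            starts.append(m.end() - 1)  # offset of the '>' itself
        mm.close()
    return starts
```
Record `i` then spans `starts[i]` up to `starts[i+1]` (or the end of the file), and slicing the mmap yields just those bytes without reading the whole file into memory.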
Enjoy, peeps!