# lazy_dataset
![Run python tests](https://github.com/fgnt/lazy_dataset/workflows/Run%20python%20tests/badge.svg?branch=master)
[![codecov.io](https://codecov.io/github/fgnt/lazy_dataset/coverage.svg?branch=master)](https://codecov.io/github/fgnt/lazy_dataset?branch=master)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/fgnt/lazy_dataset/blob/master/LICENSE)
Lazy_dataset is a helper to deal with large datasets that do not fit into memory.
It allows you to define transformations that are applied lazily
(e.g. a mapping function that reads data from disk). All transformations are applied
on the fly when you iterate over the dataset.
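For example, a mapped function is only recorded when `map` is called and only executed during iteration. A minimal sketch (the dataset contents are made up for illustration):

```python
import lazy_dataset

ds = lazy_dataset.new({'a': {'value': 1}, 'b': {'value': 2}})

def load(example):
    print('loading')   # side effect to make the evaluation visible
    return example

ds = ds.map(load)      # nothing is printed here: the map is only recorded
examples = list(ds)    # now 'loading' is printed once per example
```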
Supported transformations (a short sketch combining several of them follows this list):
- `dataset.map(map_fn)`: Apply the function `map_fn` to each example ([builtins.map](https://docs.python.org/3/library/functions.html#map))
- `dataset[2]`: Get example at index `2`.
- `dataset['example_id']`: Get the example with the example ID `'example_id'`.
- `dataset[10:20]`: Get a sub dataset that contains only the examples in the slice 10 to 20.
- `dataset.filter(filter_fn, lazy=True)`: Drops examples for which `filter_fn(example)` returns false ([builtins.filter](https://docs.python.org/3/library/functions.html#filter)).
- `dataset.concatenate(*others)`: Concatenates two or more datasets ([numpy.concatenate](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.concatenate.html))
- `dataset.intersperse(*others)`: Combines two or more datasets such that the examples of each input dataset are evenly spaced (https://stackoverflow.com/a/19293603).
- `dataset.zip(*others)`: Zips two or more datasets ([builtins.zip](https://docs.python.org/3/library/functions.html#zip))
- `dataset.shuffle(reshuffle=False)`: Shuffles the dataset. When `reshuffle` is `True`, the dataset is reshuffled every time you iterate over it.
- `dataset.tile(reps, shuffle=False)`: Repeats the dataset `reps` times and concatenates it ([numpy.tile](https://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html))
- `dataset.cycle()`: Repeats the dataset endlessly ([itertools.cycle](https://docs.python.org/3/library/itertools.html#itertools.cycle) but without caching)
- `dataset.groupby(group_fn)`: Groups examples together. In contrast to `itertools.groupby`, a prior sort is not necessary, as in pandas ([itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby), [pandas.DataFrame.groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html))
- `dataset.sort(key_fn, sort_fn=sorted)`: Sorts the examples depending on the values `key_fn(example)` ([list.sort](https://docs.python.org/3/library/stdtypes.html#list.sort))
- `dataset.batch(batch_size, drop_last=False)`: Batches `batch_size` examples together as a list. Usually followed by a map that collates the batch, as sketched after the example below ([tensorflow.data.Dataset.batch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch))
- `dataset.random_choice()`: Get a random example ([numpy.random.choice](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html))
- `dataset.cache()`: Cache in RAM (similar to ESPnet's `keep_all_data_on_mem`)
- `dataset.diskcache()`: Cache to a cache directory on the local filesystem (useful on compute clusters with slow network filesystems)
- ...
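A short sketch chaining several of the transformations above (the datasets and values are made up for illustration):

```python
import lazy_dataset

train = lazy_dataset.new({f'train_{i}': {'value': i} for i in range(6)})
extra = lazy_dataset.new({f'extra_{i}': {'value': 10 + i} for i in range(4)})

ds = train.concatenate(extra)          # 10 examples in total
ds = ds.sort(lambda ex: ex['value'])   # order by the 'value' key
ds = ds.shuffle(reshuffle=True)        # new order on every iteration
for batch in ds.batch(4, drop_last=True):
    print(len(batch))                  # two batches of 4; the last 2 examples are dropped
```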
```python
>>> from IPython.lib.pretty import pprint
>>> import lazy_dataset
>>> examples = {
... 'example_id_1': {
... 'observation': [1, 2, 3],
... 'label': 1,
... },
... 'example_id_2': {
... 'observation': [4, 5, 6],
... 'label': 2,
... },
... 'example_id_3': {
... 'observation': [7, 8, 9],
... 'label': 3,
... },
... }
>>> for example_id, example in examples.items():
... example['example_id'] = example_id
>>> ds = lazy_dataset.new(examples)
>>> ds
  DictDataset(len=3)
MapDataset(_pickle.loads)
>>> ds.keys()
('example_id_1', 'example_id_2', 'example_id_3')
>>> for example in ds:
... print(example)
{'observation': [1, 2, 3], 'label': 1, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 2, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 3, 'example_id': 'example_id_3'}
>>> def transform(example):
... example['label'] *= 10
... return example
>>> ds = ds.map(transform)
>>> for example in ds:
... print(example)
{'observation': [1, 2, 3], 'label': 10, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds = ds.filter(lambda example: example['label'] > 15)
>>> for example in ds:
... print(example)
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds['example_id_2']
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
>>> ds
      DictDataset(len=3)
    MapDataset(_pickle.loads)
  MapDataset(<function transform at 0x7ff74efb6620>)
FilterDataset(<function <lambda> at 0x7ff74efb67b8>)
```
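As mentioned in the transformation list, `batch` is usually followed by a `map` that collates the list of examples into arrays. A sketch continuing the example above, assuming numpy is available; `collate` is a hypothetical helper, not part of lazy_dataset:

```python
import numpy as np

def collate(batch):
    # batch is a list of example dicts; stack matching fields into arrays
    return {
        'observation': np.array([ex['observation'] for ex in batch]),
        'label': np.array([ex['label'] for ex in batch]),
        'example_id': [ex['example_id'] for ex in batch],
    }

batched = ds.batch(2).map(collate)
for batch in batched:
    print(batch['observation'].shape)  # (2, 3): two remaining examples, three values each
```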
## Comparison with PyTorch's DataLoader
See [here](comparison/comparison.md) for a feature and throughput comparison of lazy_dataset with PyTorch's DataLoader.
## Installation
If you just want to use it, install it directly with pip:
```bash
pip install lazy_dataset
```
If you want to make changes or need the most recent version, clone the repository and install it in editable mode:
```bash
git clone https://github.com/fgnt/lazy_dataset.git
cd lazy_dataset
pip install --editable .
```