# lazy_dataset
![Run python tests](https://github.com/fgnt/lazy_dataset/workflows/Run%20python%20tests/badge.svg?branch=master)
[![codecov.io](https://codecov.io/github/fgnt/lazy_dataset/coverage.svg?branch=master)](https://codecov.io/github/fgnt/lazy_dataset?branch=master)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/fgnt/lazy_dataset/blob/master/LICENSE)
Lazy_dataset is a helper to deal with large datasets that do not fit into memory.
It allows you to define transformations that are applied lazily
(e.g. a mapping function that reads data from disk). All transformations are applied
on the fly when you iterate over the dataset.
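For example, a mapped function is only recorded when `map` is called and only executed during iteration. A minimal sketch (the dataset contents are made up for illustration):

```python
import lazy_dataset

ds = lazy_dataset.new({'a': {'value': 1}, 'b': {'value': 2}})

def load(example):
    print('loading')   # side effect to make the evaluation visible
    return example

ds = ds.map(load)      # nothing is printed here: the map is only recorded
examples = list(ds)    # now 'loading' is printed once per example
```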
Supported transformations (a short sketch combining several of them follows this list):
- `dataset.map(map_fn)`: Apply the function `map_fn` to each example ([builtins.map](https://docs.python.org/3/library/functions.html#map))
- `dataset[2]`: Get example at index `2`.
- `dataset['example_id']`: Get the example with the example ID `'example_id'`.
- `dataset[10:20]`: Get a sub dataset that contains only the examples in the slice 10 to 20.
- `dataset.filter(filter_fn, lazy=True)`: Drops examples for which `filter_fn(example)` returns false ([builtins.filter](https://docs.python.org/3/library/functions.html#filter)).
- `dataset.concatenate(*others)`: Concatenates two or more datasets ([numpy.concatenate](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.concatenate.html))
- `dataset.intersperse(*others)`: Combines two or more datasets such that the examples of each input dataset are evenly spaced (https://stackoverflow.com/a/19293603).
- `dataset.zip(*others)`: Zips two or more datasets ([builtins.zip](https://docs.python.org/3/library/functions.html#zip))
- `dataset.shuffle(reshuffle=False)`: Shuffles the dataset. When `reshuffle` is `True`, the dataset is reshuffled every time you iterate over it.
- `dataset.tile(reps, shuffle=False)`: Repeats the dataset `reps` times and concatenates it ([numpy.tile](https://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html))
- `dataset.cycle()`: Repeats the dataset endlessly ([itertools.cycle](https://docs.python.org/3/library/itertools.html#itertools.cycle) but without caching)
- `dataset.groupby(group_fn)`: Groups examples together. In contrast to `itertools.groupby`, a prior sort is not necessary, as in pandas ([itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby), [pandas.DataFrame.groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html))
- `dataset.sort(key_fn, sort_fn=sorted)`: Sorts the examples depending on the values `key_fn(example)` ([list.sort](https://docs.python.org/3/library/stdtypes.html#list.sort))
- `dataset.batch(batch_size, drop_last=False)`: Batches `batch_size` examples together as a list. Usually followed by a map that collates the batch, as sketched after the example below ([tensorflow.data.Dataset.batch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch))
- `dataset.random_choice()`: Get a random example ([numpy.random.choice](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html))
- `dataset.cache()`: Cache in RAM (similar to ESPnet's `keep_all_data_on_mem`)
- `dataset.diskcache()`: Cache to a cache directory on the local filesystem (useful on compute clusters with slow network filesystems)
- ...
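A short sketch chaining several of the transformations above (the datasets and values are made up for illustration):

```python
import lazy_dataset

train = lazy_dataset.new({f'train_{i}': {'value': i} for i in range(6)})
extra = lazy_dataset.new({f'extra_{i}': {'value': 10 + i} for i in range(4)})

ds = train.concatenate(extra)          # 10 examples in total
ds = ds.sort(lambda ex: ex['value'])   # order by the 'value' key
ds = ds.shuffle(reshuffle=True)        # new order on every iteration
for batch in ds.batch(4, drop_last=True):
    print(len(batch))                  # two batches of 4; the last 2 examples are dropped
```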
```python
>>> from IPython.lib.pretty import pprint
>>> import lazy_dataset
>>> examples = {
... 'example_id_1': {
... 'observation': [1, 2, 3],
... 'label': 1,
... },
... 'example_id_2': {
... 'observation': [4, 5, 6],
... 'label': 2,
... },
... 'example_id_3': {
... 'observation': [7, 8, 9],
... 'label': 3,
... },
... }
>>> for example_id, example in examples.items():
... example['example_id'] = example_id
>>> ds = lazy_dataset.new(examples)
>>> ds
  DictDataset(len=3)
MapDataset(_pickle.loads)
>>> ds.keys()
('example_id_1', 'example_id_2', 'example_id_3')
>>> for example in ds:
... print(example)
{'observation': [1, 2, 3], 'label': 1, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 2, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 3, 'example_id': 'example_id_3'}
>>> def transform(example):
... example['label'] *= 10
... return example
>>> ds = ds.map(transform)
>>> for example in ds:
... print(example)
{'observation': [1, 2, 3], 'label': 10, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds = ds.filter(lambda example: example['label'] > 15)
>>> for example in ds:
... print(example)
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds['example_id_2']
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
>>> ds
      DictDataset(len=3)
    MapDataset(_pickle.loads)
  MapDataset(<function transform at 0x7ff74efb6620>)
FilterDataset(<function <lambda> at 0x7ff74efb67b8>)
```
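As mentioned in the transformation list, `batch` is usually followed by a `map` that collates the list of examples into arrays. A sketch continuing the example above, assuming numpy is available; `collate` is a hypothetical helper, not part of lazy_dataset:

```python
import numpy as np

def collate(batch):
    # batch is a list of example dicts; stack matching fields into arrays
    return {
        'observation': np.array([ex['observation'] for ex in batch]),
        'label': np.array([ex['label'] for ex in batch]),
        'example_id': [ex['example_id'] for ex in batch],
    }

batched = ds.batch(2).map(collate)
for batch in batched:
    print(batch['observation'].shape)  # (2, 3): two remaining examples, three values each
```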
## Comparison with PyTorch's DataLoader
See [here](comparison/comparison.md) for a feature and throughput comparison of lazy_dataset with PyTorch's DataLoader.
## Installation
If you just want to use it, install it directly with pip:
```bash
pip install lazy_dataset
```
If you want to make changes or need the most recent version, clone the repository and install it in editable mode:
```bash
git clone https://github.com/fgnt/lazy_dataset.git
cd lazy_dataset
pip install --editable .
```