thunderpack


Name: thunderpack
Version: 0.0.2
Summary: Dataset library for blazingly fast data loading and decoding
Upload time: 2024-02-07 18:39:48
Requires Python: >=3.9
License: MIT License

Copyright (c) 2022 Jose Javier

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Keywords: thunderpack, formats, machine learning, encoding

# ⚡ ThunderPack

_Blazingly fast multi-modal data format for training deep neural networks_

## ❓ TL;DR

Most deep learning datasets composed of media (image, video, sound) are distributed as plain individual files, which can be very inefficient because filesystems perform poorly when many files are involved. ThunderPack solves this by using **LMDB**, a lightweight database that supports blazingly fast reads.

![](https://github.com/JJGO/thunderpack/blob/assets/read-time.png)
_Benchmark of random read access on various tensor storage solutions (lower read time is better). ThunderPack scales very well to large numbers of datapoints, whereas other solutions slow down significantly even on local SSD hardware._

## 🌟 Features

- **Optimized read support** - ThunderPack offers specialized read-only APIs that are orders of magnitude faster than common generic interfaces that also support writing.
- **Concurrency Support** - ThunderPack is thread safe and can be used with `torch.utils.data.DataLoader` without issue.
- **Memory mapping** - Thanks to the use of LMDB, data is memory-mapped by default, drastically reducing the read times for entries already present in the filesystem cache.
- **Data Locality** - Data and labels are stored together, reducing read times.
- **Immense Flexibility** - Unlike other dataloader formats that trade usability for speed, ThunderPack keeps a human-friendly dictionary-like interface that supports arbitrary sampling and can be edited after creation.
- **Improved Tensor I/O** - Faster primitives for (de)serializing NumPy `ndarray`s, Torch `Tensor`s and JAX `ndarray`s.
- **Extensive extension support** - ThunderPack supports a wide variety of data formats out of the box.
- **Customizability** - ThunderPack can easily be extended to support other formats (also feel free to submit a PR with your extension of choice).
<!-- - **Cloud Native** - Compatible with streaming data schemes, and with built-in sharding support. -->
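
The memory-mapping and data-locality points above can be illustrated with the standard library alone. The following is a minimal sketch, not ThunderPack's or LMDB's implementation: it packs many small records into one file and reads them back through `mmap`, using a hypothetical in-memory `(offset, length)` index. The names `pack_records` and `read_record` are assumptions made for this example.

```python
import mmap
import os
import tempfile


def pack_records(path, records):
    """Write all payloads back to back into one file.

    Returns a dict mapping each key to its (offset, length) in the file.
    """
    index = {}
    with open(path, "wb") as f:
        for key, payload in records.items():
            index[key] = (f.tell(), len(payload))
            f.write(payload)
    return index


def read_record(mm, index, key):
    """Slice a single record out of the memory-mapped file."""
    offset, length = index[key]
    return mm[offset:offset + length]


# Build 100 small records and store them in a single packed file
records = {f"sample{i:02d}": f"payload-{i}".encode() for i in range(100)}
path = os.path.join(tempfile.mkdtemp(), "packed.bin")
index = pack_records(path, records)

# Random access goes through the page cache instead of opening 100 files
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    first = read_record(mm, index, "sample00")
    last = read_record(mm, index, "sample99")

print(first, last)
```

A real key-value store such as LMDB keeps the index on disk as well and handles concurrent readers; the sketch only shows why a single mapped file avoids per-file filesystem overhead.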

<!-- ## 🚀 Quickstart -->

## 💾 Installation

<!--
ThunderPack can be installed via `pip`. For the stable version:

```shell
pip install thunderpack
```
-->
Install the latest version directly from GitHub with `pip`:

```shell
pip install git+https://github.com/JJGO/thunderpack.git
```

You can also install it **manually** by cloning the repository, installing its dependencies, and adding it to your `PYTHONPATH`:


```shell
git clone https://github.com/JJGO/thunderpack
python -m pip install -r ./thunderpack/requirements.txt

export PYTHONPATH="$PYTHONPATH:$(realpath ./thunderpack)"
```

## Quickstart

ThunderPack has asymmetric APIs for writing and reading data in order to maximize read throughput.

First, we create a dataset using the `ThunderDB` object, which behaves like a dictionary and automatically and
transparently encodes data when values are assigned. Keys are completely arbitrary and the schema is left to the user.
In this case we store a `metadata` entry, a `samples` entry listing the datapoint keys, and one entry per datapoint:

```python
import numpy as np
from thunderpack import ThunderDB

with ThunderDB.open('/tmp/thunderpack_test', 'c') as db:
    db['metadata'] = {'version': '0.1', 'n_samples': 100}

    keys = []
    for i in range(100):
        key = f'sample{i:02d}'
        x = np.random.normal(size=(128, 128))
        y = np.random.normal(size=(128, 128))
        # ThunderPack will serialize the tuple and NumPy arrays automatically
        db[key] = (x, y)
        keys.append(key)
    # Store the list of sample keys so readers can enumerate the datapoints
    db['samples'] = keys
```

Once created, we can read the data using `ThunderReader`, which has a dict-like API:

```python
from thunderpack import ThunderReader

reader = ThunderReader('/tmp/thunderpack_test')
print(reader['metadata'])
# {'version': '0.1', 'n_samples': 100}
print(reader['samples'][:5])
# ['sample00', 'sample01', 'sample02', 'sample03', 'sample04']
print(reader['sample00'][0].shape)
# (128, 128)
```

ThunderPack provides a PyTorch-compatible Dataset object via `ThunderDataset`, which
exposes the underlying `ThunderReader` object through its `._db` attribute:

```python
from thunderpack.torch import ThunderDataset


class MyDataset(ThunderDataset):
    
    def __init__(self, file):
        super().__init__(file)
        # Access through self._db attribute
        self.samples = self._db['samples']
    
    def __len__(self):
        return len(self.samples)
    
    def __getitem__(self, idx):
        return self._db[self.samples[idx]]

d = MyDataset('/tmp/thunderpack_test')
print(len(d))
# 100
print(d[0][0].shape)
# (128, 128)
```
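
Because the dataset is indexed through a stored list of keys, arbitrary sampling orders are easy to build on top of it. The sketch below mirrors the `__len__`/`__getitem__` pattern above in pure Python, with a plain `dict` standing in for the reader so it runs without ThunderPack or PyTorch installed. `KeyedDataset` and `random_batches` are hypothetical names for this example and are not part of the library.

```python
import random


class KeyedDataset:
    """Map-style dataset over any dict-like store that has a 'samples' key list."""

    def __init__(self, store):
        self._db = store
        self.samples = store["samples"]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self._db[self.samples[idx]]


def random_batches(dataset, batch_size, seed=0):
    """Yield lists of items in a shuffled order, batch_size items at a time."""
    order = list(range(len(dataset)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [dataset[i] for i in order[start:start + batch_size]]


# A plain dict stands in for the reader: ten datapoints plus the key list
store = {f"sample{i:02d}": (i, i * i) for i in range(10)}
store["samples"] = sorted(store)

batches = list(random_batches(KeyedDataset(store), batch_size=4))
print([len(b) for b in batches])
```

In practice the same role is played by a PyTorch `DataLoader` with `shuffle=True` wrapped around the `ThunderDataset` subclass.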


## 📁 Supported Formats

ThunderPack supports a wide range of data formats out of the box:

|  | Modality | Supported Formats |
| :-: | :-- | :-- |
| 🧮 | Tensor | npy, npz, pt, safetensors |
| 📷 | Image | jpeg, png, bmp, webp |
| 🎧 | Audio | wav, flac, ogg, mp3 |
| 🗂️ | Tabular | csv, parquet, feather, jsonl |
| 📄 | Documents | json, yaml, msgpack, txt |
| 🗜️ | Compression | lz4, zstd, gzip, bz2, snappy, brotli |
| 🧸 | Object | pickle |


## ↔ Type-Format mappings

ThunderPack automatically maps common Python data types to efficient data formats:

| Type | Format |
|:-- | :--: |
| `PIL.Image` | PNG or JPEG |
| `pandas.DataFrame` | Parquet |
| `np.ndarray`, `torch.Tensor` | NumpyPack (LZ4) |
| `bool`, `int`, `float`, `complex`, `str` | MessagePack (LZ4) |
| `list`, `dict`, `tuple` | ThunderPack (LZ4) |
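
The idea behind such a mapping is a dispatch on the value's type followed by a tagged, compressed payload. The following sketch shows that pattern using only the standard library and is not ThunderPack's implementation. As stated assumptions, `zlib` stands in for LZ4, JSON stands in for MessagePack, `pickle` handles containers, and the one-byte tags `J` and `P` are invented for this example.

```python
import json
import pickle
import zlib


def encode(value):
    """Choose a format from the Python type and prefix the payload with a 1-byte tag."""
    if isinstance(value, (bool, int, float, str)):
        # Scalars: compact text encoding (JSON here, as a stand-in for MessagePack)
        return b"J" + zlib.compress(json.dumps(value).encode("utf-8"))
    if isinstance(value, (list, tuple, dict)):
        # Containers: pickle keeps the tuple/list distinction intact
        return b"P" + zlib.compress(pickle.dumps(value))
    raise TypeError(f"no format registered for {type(value).__name__}")


def decode(blob):
    """Read the tag, decompress the body, and apply the matching decoder."""
    tag, body = blob[:1], zlib.decompress(blob[1:])
    if tag == b"J":
        return json.loads(body.decode("utf-8"))
    if tag == b"P":
        return pickle.loads(body)
    raise ValueError(f"unknown format tag {tag!r}")


print(decode(encode(42)))
print(decode(encode((1, [2, 3], {"a": 4}))))
```

Storing the tag alongside the payload is what lets a reader decode any entry without knowing its type in advance.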


<!-- ## Performance Benchmarks

>>> Compare loading times of Miniplaces, OxfordFlowers, ImageNet, OASIS3d

## Tutorial

#### 1. Writing a dataset

#### 2. Reading a dataset

#### 3. Creating a PyTorch wrapper

#### 4. Defining a custom format  -->

## ✍️ Citation

```
@misc{ortiz2023thunderpack,
    author = {Jose Javier Gonzalez Ortiz},
    title = {The ThunderPack Data Format},
    year = {2023},
    howpublished = {\url{https://github.com/JJGO/thunderpack/}},
}
```

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "thunderpack",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "thunderpack,formats,machine learning,encoding",
    "author": "",
    "author_email": "Jose Javier Gonzalez Ortiz <josejg@mit.edu>",
    "download_url": "https://files.pythonhosted.org/packages/41/af/0015a85986270773448412539ab1dff9c151ec5ee91bfb669cfabad2d645/thunderpack-0.0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2022 Jose Javier  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Dataset library for blazingly fast data loading and decoding",
    "version": "0.0.2",
    "project_urls": {
        "Homepage": "https://github.com/jjgo/thunderpack",
        "Repository": "https://github.com/jjgo/thunderpack"
    },
    "split_keywords": [
        "thunderpack",
        "formats",
        "machine learning",
        "encoding"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5a7d445bc06e98a6f332f864c41ac0440004c7f4c2609b9a114b2a401f81a98a",
                "md5": "b26070d50bdb4b8f284efb14df4359ac",
                "sha256": "02b30ce18d25b9e7cd066d819d26d28b5eedbb6fcbc4476cfcb89bc664fb1248"
            },
            "downloads": -1,
            "filename": "thunderpack-0.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b26070d50bdb4b8f284efb14df4359ac",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 13425,
            "upload_time": "2024-02-07T18:39:47",
            "upload_time_iso_8601": "2024-02-07T18:39:47.072827Z",
            "url": "https://files.pythonhosted.org/packages/5a/7d/445bc06e98a6f332f864c41ac0440004c7f4c2609b9a114b2a401f81a98a/thunderpack-0.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "41af0015a85986270773448412539ab1dff9c151ec5ee91bfb669cfabad2d645",
                "md5": "3f7bce041386423c494fb053ac9da42a",
                "sha256": "b2087205289e216c1873c487d03a55a3fb214c3c7836bc83bac8bba0ab0a6405"
            },
            "downloads": -1,
            "filename": "thunderpack-0.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "3f7bce041386423c494fb053ac9da42a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 14833,
            "upload_time": "2024-02-07T18:39:48",
            "upload_time_iso_8601": "2024-02-07T18:39:48.829589Z",
            "url": "https://files.pythonhosted.org/packages/41/af/0015a85986270773448412539ab1dff9c151ec5ee91bfb669cfabad2d645/thunderpack-0.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-07 18:39:48",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jjgo",
    "github_project": "thunderpack",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "thunderpack"
}
        