| Field | Value |
| --- | --- |
| Name | dpshdl |
| Version | 0.0.21 |
| Summary | Framework-agnostic library for loading data |
| Author | Benjamin Bolte |
| Home page | https://github.com/dpshai/dpshdl |
| Requires Python | >=3.11 |
| Upload time | 2024-03-15 02:56:38 |
# dpshdl
A framework-agnostic library for loading data.
## Installation
Install the package using:
```bash
pip install dpshdl
```
Or, to install the latest version from the master branch:
```bash
pip install 'dpshdl @ git+https://github.com/kscalelabs/dpshdl.git@master'
```
## Usage
Datasets should override a single method, `next`, which returns a single sample.
```python
from dpshdl.dataset import Dataset
from dpshdl.dataloader import Dataloader
import numpy as np


class MyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        return 1


# Loops forever.
with Dataloader(MyDataset(), batch_size=2) as loader:
    for sample in loader:
        assert sample.shape == (2,)
```
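The two type parameters on `Dataset` are the type of a single sample returned by `next` (here `int`) and the type of a collated batch (here `np.ndarray`); see the Collating section below for how samples are combined into batches.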
### Error Handling
You can wrap any dataset in an `ErrorHandlingDataset` to catch and log errors:
```python
from dpshdl.dataset import ErrorHandlingDataset

with Dataloader(ErrorHandlingDataset(MyDataset()), batch_size=2) as loader:
    ...
```
This wrapper catches errors raised in the `next` function and logs error summaries instead of letting them crash the entire program.
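For example, here is a minimal sketch (the flaky dataset below is hypothetical, not part of the library) showing that a `next` method which occasionally raises no longer brings down the loop once wrapped:

```python
import random

import numpy as np

from dpshdl.dataset import Dataset, ErrorHandlingDataset
from dpshdl.dataloader import Dataloader


class FlakyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        # Hypothetical failure mode: roughly one sample in ten raises.
        if random.random() < 0.1:
            raise RuntimeError("corrupt sample")
        return 1


# Errors raised in FlakyDataset.next are caught and summarized in the logs
# rather than propagating out of the loop.
with Dataloader(ErrorHandlingDataset(FlakyDataset()), batch_size=2) as loader:
    for i, batch in enumerate(loader):
        if i >= 10:
            break
```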
### Ad-hoc Testing
While developing datasets, you usually want to loop through a few samples to make sure everything is working. You can do this easily as follows:
```python
MyDataset().test(
    max_samples=100,
    handle_errors=True,  # Automatically wraps the dataset in an ErrorHandlingDataset.
    print_fn=lambda i, sample: print(f"Sample {i}: {sample}"),
)
```
### Collating
This package provides a default implementation of dataset collating, which can be used as follows:
```python
from dpshdl.collate import collate


class MyDataset(Dataset[int, np.ndarray]):
    def collate(self, items: list[int]) -> np.ndarray:
        return collate(items)
```
Alternatively, you can implement your own custom collating strategy:
```python
class MyDataset(Dataset[int, list[int]]):
    def collate(self, items: list[int]) -> list[int]:
        return items
```
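Here the collated "batch" is just the raw list of samples, which can be a reasonable choice when downstream code does its own batching or works with ragged data directly.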
There are additional arguments that can be passed to the `collate` function to automatically handle padding and batching:
```python
from dpshdl.collate import collate, pad_all, pad_sequence
import functools
import random

import numpy as np

items = [np.random.random(random.randint(5, 10)) for _ in range(5)]  # Randomly sized arrays.
collate(items)  # Will fail because the arrays are of different sizes.
collate(items, pad=True)  # Use the default padding strategy.
collate(items, pad=functools.partial(pad_all, left_pad=True))  # Left-padding.
collate(items, pad=functools.partial(pad_sequence, dim=0, left_pad=True))  # Pads a specific dimension.
```
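Assuming the default strategy pads each array out to the longest item in the batch, the calls with padding enabled return a single stacked array of shape `(5, max_length)` rather than failing.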
### Prefetching
Sometimes it is a good idea to trigger a host-to-device transfer before a batch of samples is needed, so that it can take place asynchronously while other computation is happening. This is called prefetching. This package provides a simple utility class to do this:
```python
from dpshdl.dataset import Dataset
from dpshdl.dataloader import Dataloader
from dpshdl.prefetcher import Prefetcher
import numpy as np
import torch
from torch import Tensor


class MyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        return 1


def to_device_func(sample: np.ndarray) -> Tensor:
    # Because this is non-blocking, the H2D transfer can take place in the
    # background while other computation is happening.
    return torch.from_numpy(sample).to("cuda", non_blocking=True)


with Prefetcher(to_device_func, Dataloader(MyDataset(), batch_size=2)) as loader:
    for sample in loader:
        assert sample.device.type == "cuda"
```
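One caveat worth noting (this is general PyTorch behavior, not something specific to this library): `non_blocking=True` only overlaps the host-to-device copy with other work when the source tensor lives in pinned (page-locked) host memory. A variant of the transfer function that pins the host tensor first might look like this:

```python
def to_device_func(sample: np.ndarray) -> Tensor:
    # pin_memory() copies the host tensor into page-locked memory, which is
    # what allows the subsequent CUDA copy to actually run asynchronously.
    return torch.from_numpy(sample).pin_memory().to("cuda", non_blocking=True)
```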