| Field | Value |
| --- | --- |
| Name | dpshdl |
| Version | 0.0.21 |
| Summary | Framework-agnostic library for loading data |
| Author | Benjamin Bolte |
| Home page | https://github.com/dpshai/dpshdl |
| Requires Python | >=3.11 |
| Upload time | 2024-03-15 02:56:38 |
# dpshdl
A framework-agnostic library for loading data.
## Installation
Install the package using:
```bash
pip install dpshdl
```
Or, to install the latest version from the master branch:
```bash
pip install 'dpshdl @ git+https://github.com/kscalelabs/dpshdl.git@master'
```
## Usage
Datasets should override a single method, `next`, which returns a single sample.
```python
from dpshdl.dataset import Dataset
from dpshdl.dataloader import Dataloader
import numpy as np


class MyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        return 1


# Loops forever.
with Dataloader(MyDataset(), batch_size=2) as loader:
    for sample in loader:
        assert sample.shape == (2,)
```
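The two type parameters on `Dataset` are the type of a single sample returned by `next` (here `int`) and the type of a collated batch (here `np.ndarray`); see the Collating section below for how samples are combined into batches.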
### Error Handling
You can wrap any dataset in an `ErrorHandlingDataset` to catch and log errors:
```python
from dpshdl.dataset import ErrorHandlingDataset

with Dataloader(ErrorHandlingDataset(MyDataset()), batch_size=2) as loader:
    ...
```
This wrapper catches errors raised in the `next` function and logs error summaries instead of letting them crash the entire program.
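For example, here is a minimal sketch (the flaky dataset below is hypothetical, not part of the library) showing that a `next` method which occasionally raises no longer brings down the loop once wrapped:

```python
import random

import numpy as np

from dpshdl.dataset import Dataset, ErrorHandlingDataset
from dpshdl.dataloader import Dataloader


class FlakyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        # Hypothetical failure mode: roughly one sample in ten raises.
        if random.random() < 0.1:
            raise RuntimeError("corrupt sample")
        return 1


# Errors raised in FlakyDataset.next are caught and summarized in the logs
# rather than propagating out of the loop.
with Dataloader(ErrorHandlingDataset(FlakyDataset()), batch_size=2) as loader:
    for i, batch in enumerate(loader):
        if i >= 10:
            break
```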
### Ad-hoc Testing
While developing datasets, you usually want to loop through a few samples to make sure everything is working. You can do this easily as follows:
```python
MyDataset().test(
    max_samples=100,
    handle_errors=True,  # Automatically wraps the dataset in an ErrorHandlingDataset.
    print_fn=lambda i, sample: print(f"Sample {i}: {sample}"),
)
```
### Collating
This package provides a default implementation of dataset collating, which can be used as follows:
```python
from dpshdl.collate import collate


class MyDataset(Dataset[int, np.ndarray]):
    def collate(self, items: list[int]) -> np.ndarray:
        return collate(items)
```
Alternatively, you can implement your own custom collating strategy:
```python
class MyDataset(Dataset[int, list[int]]):
    def collate(self, items: list[int]) -> list[int]:
        return items
```
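Here the collated "batch" is just the raw list of samples, which can be a reasonable choice when downstream code does its own batching or works with ragged data directly.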
There are additional arguments that can be passed to the `collate` function to automatically handle padding and batching:
```python
from dpshdl.collate import collate, pad_all, pad_sequence
import functools
import random

import numpy as np

items = [np.random.random(random.randint(5, 10)) for _ in range(5)]  # Randomly sized arrays.
collate(items)  # Will fail because the arrays are of different sizes.
collate(items, pad=True)  # Use the default padding strategy.
collate(items, pad=functools.partial(pad_all, left_pad=True))  # Left-padding.
collate(items, pad=functools.partial(pad_sequence, dim=0, left_pad=True))  # Pads a specific dimension.
```
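Assuming the default strategy pads each array out to the longest item in the batch, the calls with padding enabled return a single stacked array of shape `(5, max_length)` rather than failing.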
### Prefetching
Sometimes it is a good idea to trigger a host-to-device transfer before a batch of samples is needed, so that it can take place asynchronously while other computation is happening. This is called prefetching. This package provides a simple utility class to do this:
```python
from dpshdl.dataset import Dataset
from dpshdl.dataloader import Dataloader
from dpshdl.prefetcher import Prefetcher
import numpy as np
import torch
from torch import Tensor


class MyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        return 1


def to_device_func(sample: np.ndarray) -> Tensor:
    # Because this is non-blocking, the H2D transfer can take place in the
    # background while other computation is happening.
    return torch.from_numpy(sample).to("cuda", non_blocking=True)


with Prefetcher(to_device_func, Dataloader(MyDataset(), batch_size=2)) as loader:
    for sample in loader:
        assert sample.device.type == "cuda"
```
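One caveat worth noting (this is general PyTorch behavior, not something specific to this library): `non_blocking=True` only overlaps the host-to-device copy with other work when the source tensor lives in pinned (page-locked) host memory. A variant of the transfer function that pins the host tensor first might look like this:

```python
def to_device_func(sample: np.ndarray) -> Tensor:
    # pin_memory() copies the host tensor into page-locked memory, which is
    # what allows the subsequent CUDA copy to actually run asynchronously.
    return torch.from_numpy(sample).pin_memory().to("cuda", non_blocking=True)
```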