# streaming-wds (Streaming WebDataset)
`streaming-wds` is a Python library that streams WebDataset-format datasets from boto3-compatible object stores (such as S3) into PyTorch. It is designed for large-scale datasets, particularly in distributed training contexts.
## Features
- Streaming of WebDataset-format data from S3-compatible object stores
- Efficient sharding of data across both torch distributed workers and dataloader multiprocessing workers (see the sketch after this list)
- Supports (approximate) shard-level mid-epoch resumption when used with `StreamingDataLoader`
- Blazing fast data loading with local caching and explicit control over memory consumption
- Customizable decoding of dataset elements via `StreamingDataset.process_sample`
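
When the loader runs under `torch.distributed`, shard assignment across ranks and dataloader workers is handled by the library itself. Below is a minimal sketch of a typical launch under `torchrun`, using only the constructor arguments documented in this README; the bucket URI and schema are placeholders.

```python
# Minimal distributed sketch (launch with: torchrun --nproc_per_node=4 train.py).
# Only constructor arguments documented in this README are used; the bucket
# URI and schema are placeholders.
import torch.distributed as dist

from streaming_wds import StreamingDataLoader, StreamingWebDataset

dist.init_process_group(backend="nccl")  # or "gloo" for CPU-only runs

dataset = StreamingWebDataset(
    remote="s3://your-bucket/your-dataset",
    split="train",
    shuffle=True,
    schema={".jpg": "PIL", ".json": "json"},
)

# Shards are split across distributed ranks and dataloader workers internally,
# so no manual sharding logic is needed here.
dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

for batch in dataloader:
    ...  # per-rank training step
```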
## TODO
- Faster tar extraction in C++ threads (using pybind11)
- Key-level mid-epoch resumption
- Tensor Parallel replication strategy
## Installation
You can install `streaming-wds` using pip:
```bash
pip install streaming-wds
```
## Quick Start
Here's a basic example of how to use `streaming-wds`:
```python
import torch
import torchvision.transforms.v2 as T

from streaming_wds import StreamingWebDataset, StreamingDataLoader

# Create the dataset
dataset = StreamingWebDataset(
    remote="s3://your-bucket/your-dataset",
    split="train",
    profile="your_aws_profile",
    shuffle=True,
    max_workers=4,
    schema={".jpg": "PIL", ".json": "json"},
)

# Or subclass StreamingWebDataset to customize per-sample decoding
class ImageNetWebDataset(StreamingWebDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.transforms = T.Compose([
            T.ToImage(),
            T.Resize((64,)),
            T.ToDtype(torch.float32),
            T.Normalize(mean=(128,), std=(128,)),
        ])

    def process_sample(self, sample):
        sample[".jpg"] = self.transforms(sample[".jpg"])
        return sample

# Create a StreamingDataLoader for mid-epoch resumption
dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

# Iterate through the data
for batch in dataloader:
    # Your training loop here
    pass

# You can save the state for resumption
state_dict = dataloader.state_dict()

# Later, you can resume from this state
dataloader.load_state_dict(state_dict)
```
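
To make resumption survive a process restart, the loader state can be persisted alongside the rest of your training checkpoint. The following is a minimal sketch assuming a plain `torch.save`/`torch.load` flow; the checkpoint path is a placeholder.

```python
import torch

# Persist the loader state together with your other training state.
# "checkpoint.pt" is a placeholder path.
torch.save({"dataloader": dataloader.state_dict()}, "checkpoint.pt")

# After a restart, rebuild the dataset and dataloader as above, then restore
# the (approximate, shard-level) position before iterating again.
checkpoint = torch.load("checkpoint.pt")
dataloader.load_state_dict(checkpoint["dataloader"])
```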
## Configuration

The following arguments can be passed to the `StreamingWebDataset` constructor:
- `remote` (str): The S3 URI of the dataset.
- `split` (Optional[str]): The dataset split (e.g., "train", "val", "test"). Defaults to None.
- `profile` (str): The AWS profile to use for authentication. Defaults to "default".
- `shuffle` (bool): Whether to shuffle the data. Defaults to False.
- `max_workers` (int): Maximum number of worker threads for download and extraction. Defaults to 2.
- `schema` (Dict[str, str]): A dictionary defining the decoding method for each data field. Defaults to {}.
- `memory_buffer_limit_bytes` (Union[Bytes, int, str]): The maximum size of the memory buffer in bytes per worker. Defaults to "2GB".
- `file_cache_limit_bytes` (Union[Bytes, int, str]): The maximum size of the file cache in bytes per worker. Defaults to "2GB".
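
For example, the buffer and cache limits accept either raw byte counts or human-readable strings such as `"2GB"`. Below is a minimal sketch using only the parameters documented above; the bucket URI and limit values are illustrative placeholders.

```python
from streaming_wds import StreamingWebDataset

# Constructor arguments mirror the Configuration list above; values are placeholders.
dataset = StreamingWebDataset(
    remote="s3://your-bucket/your-dataset",
    split="val",
    profile="default",
    shuffle=False,
    max_workers=2,
    schema={".jpg": "PIL", ".json": "json"},
    memory_buffer_limit_bytes="512MB",   # cap on in-memory buffering per worker
    file_cache_limit_bytes=1 * 1024**3,  # integer byte counts are accepted as well
)
```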
## Contributing
Contributions to `streaming-wds` are welcome! Please feel free to submit a pull request.
## License
MIT License