<img width="1530" height="492" alt="image" src="https://github.com/user-attachments/assets/ebf0d101-eae7-4908-bb73-a264bf89a479" />
Fast dataloader and conversion utility for webdataset tar shards. Rust core with Python bindings.
Built for streaming large video and image datasets, but handles any byte data.
## Install
```bash
pip install webshart
```
## What is this?
Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.
**The indexed format** provides massive performance benefits:
- **Random access**: Jump to any file instantly
- **Selective downloads**: Only fetch the files you need
- **True parallelism**: Read from multiple shards simultaneously
- **Cloud-optimized**: Works efficiently with HTTP range requests
- **Aspect bucketing**: Indices can optionally include the image geometry hints `width`, `height` and `aspect`, so samples can be bucketed by shape
- **Custom DataLoader**: Includes state dict methods on the DataLoader so that you can resume training deterministically
- **Rate-limit friendly**: Local caching allows high-frequency random seeking without encountering storage provider rate limits
- **Instant start-up** with pre-sorted aspect buckets
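To illustrate what state dict methods buy you when resuming, here is a minimal, generic sketch of a sampler that can be checkpointed mid-epoch. It is not webshart's DataLoader: the `ResumableSampler` class and its fields are hypothetical and exist only to show the pattern.

```python
import random


class ResumableSampler:
    """Hypothetical sampler: yields indices in a seeded shuffled order and can resume mid-epoch."""

    def __init__(self, num_samples, seed=0):
        self.num_samples = num_samples
        self.seed = seed
        self.position = 0  # number of indices already yielded

    def _order(self):
        # The same seed always produces the same permutation.
        order = list(range(self.num_samples))
        random.Random(self.seed).shuffle(order)
        return order

    def __iter__(self):
        order = self._order()
        while self.position < self.num_samples:
            index = order[self.position]
            self.position += 1
            yield index

    def state_dict(self):
        # Everything needed to rebuild the iteration point.
        return {"seed": self.seed, "position": self.position}

    def load_state_dict(self, state):
        self.seed = state["seed"]
        self.position = state["position"]
```

Saving `state_dict()` alongside a training checkpoint and passing it to `load_state_dict()` on restart continues the epoch from the next unseen sample, in the same order.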
**Growing ecosystem**: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).
## Quick Start
```python
import webshart

# Find your dataset
dataset = webshart.discover_dataset(
    source="laion/conceptual-captions-12m-webdataset",
    # Metadata can live in a separate repository, which reduces load on Hugging Face infrastructure.
    metadata="webshart/conceptual-captions-12m-webdataset-metadata",
)
print(f"Found {dataset.num_shards} shards")
```
## Common Patterns
For real-world, working examples:
- [Use as a DataLoader](/examples/dataloader.py)
- [Retrieve data subset/range](/examples/retrieve_range.py)
- [Get dataset statistics without downloading](/examples/dataset_stats.py)
## Creating Indices for / Converting Existing Datasets
Any tar-based webdataset can benefit from indexing! Webshart includes tools to generate indices:
A command-line tool that auto-discovers tars to process:
```bash
webshart extract-metadata \
    --source laion/conceptual-captions-12m-webdataset \
    --destination laion_output/ \
    --checkpoint-dir ./laion_output/checkpoints \
    --max-workers 2 \
    --include-image-geometry
```
Or, if you need to integrate directly into an existing Python application, [use the API](/examples/metadata_extractor.py).
### Uploading Indices to HuggingFace
Once you've generated indices, share them with the community:
```bash
# Upload all JSON files to your dataset
huggingface-cli upload --repo-type=dataset \
    username/dataset-name \
    ./indices/ \
    --include "*.json" \
    --path-in-repo "indices/"
```
Or, if you want to contribute indices for an existing dataset you don't own:
1. Create a community dataset with indices: `username/original-dataset-indices`
2. Upload the JSON files there
3. Open a discussion on the original dataset suggesting they add the indices
### Creating New Indexed Datasets
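If you're creating a new dataset, generate an index alongside each shard as you write it. Each index maps file names to the byte offset and length of their data inside the tar: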
```json
{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 102912, "length": 98304},
    ...
  }
}
```
The JSON index should have the same name as the tar file (e.g., `shard_0000.tar` → `shard_0000.json`).
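As a rough illustration of how such an index can be produced and used, the sketch below builds one with Python's standard `tarfile` module and then reads a single member with one seek. It assumes `offset` means the start of the file's data (just past its tar header) and covers only the `files` map; the helper names `build_index`, `write_index` and `read_member` are hypothetical and not part of webshart, which may record additional fields.

```python
import json
import os
import tarfile


def build_index(tar_path):
    """Map each regular file in an uncompressed tar to its data offset and length."""
    files = {}
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data points at the payload, just past the member's header.
                files[member.name] = {"offset": member.offset_data, "length": member.size}
    return {"files": files}


def write_index(tar_path):
    """Save the index next to the shard, e.g. shard_0000.tar -> shard_0000.json."""
    index_path = os.path.splitext(tar_path)[0] + ".json"
    with open(index_path, "w") as f:
        json.dump(build_index(tar_path), f)
    return index_path


def read_member(tar_path, index, name):
    """Fetch one file with a seek and a read, without scanning the archive."""
    entry = index["files"][name]
    with open(tar_path, "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])
```

For real datasets, the `webshart extract-metadata` command shown earlier is the supported route; this sketch only demonstrates why an offset table makes random access possible.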
## Why is it fast?
**Problem**: Standard tar files require sequential reading. To get file #10,000, you must read through files #1-9,999 first.
**Solution**: The indexed format stores byte offsets and sample metadata in a separate JSON file, enabling:
- HTTP range requests for any file
- True random access over network
- Parallel reads from multiple shards
- Large scale, aspect-bucketed datasets
- No wasted bandwidth
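To make the range-request idea concrete, here is a small sketch that turns an index entry into an HTTP `Range` header and fetches just those bytes with the standard library. The function names are hypothetical, and this is not how webshart's Rust core issues requests; it only shows the arithmetic involved.

```python
import urllib.request


def range_header(offset, length):
    """Build the HTTP Range header covering `length` bytes starting at `offset`."""
    if length <= 0:
        raise ValueError("length must be positive")
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def fetch_range(url, offset, length):
    """Download only the requested bytes from a server that supports range requests."""
    request = urllib.request.Request(url, headers=range_header(offset, length))
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A server that honours the header replies with `206 Partial Content` and transfers only the requested span, so the cost of fetching one file does not depend on where it sits in the shard.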
The Rust implementation provides:
- Real parallelism (no Python GIL)
- Zero-copy operations where possible
- Efficient HTTP connection pooling
- Optimized tokio async runtime
- Optional local caching for metadata and shards
- Fast aspect bucketing for image data
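Aspect bucketing itself is simple once geometry is available in the index. The pure-Python sketch below groups samples by rounded aspect ratio; it assumes each entry carries `width` and `height` (and optionally a precomputed `aspect`), and the `bucket_by_aspect` helper is hypothetical rather than webshart's API.

```python
from collections import defaultdict


def bucket_by_aspect(samples, precision=1):
    """Group sample names by aspect ratio rounded to `precision` decimal places."""
    buckets = defaultdict(list)
    for name, meta in samples.items():
        # Prefer a stored aspect hint; otherwise derive it from width and height.
        aspect = meta.get("aspect") or meta["width"] / meta["height"]
        buckets[round(aspect, precision)].append(name)
    return dict(buckets)
```

Because the geometry comes from the index, buckets like these can be formed without downloading or decoding a single image.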
## Datasets Using This Format
I discovered after creating this library that [cheesechaser](https://github.com/deepghs/cheesechaser) is the origin of the indexed tar format, which webshart has formalised and extended to include aspect bucketing support.
- `NebulaeWis/e621-2024-webp-4Mpixel`
- `picollect/danbooru2` (subfolder: `images`)
- Many picollect image datasets
- Your dataset could be next! See "Creating Indices" above
## Requirements
- Python 3.8+
- Linux/macOS/Windows
## Roadmap
- Image decoding is not yet handled by this library; zero-copy decoding is planned.
- A more informative API for caching and other Rust implementation details
- A multi-GPU/multi-node friendly dataloader
## Projects using webshart
- [CaptionFlow](https://github.com/bghira/CaptionFlow) uses this library to solve the memory use and seek performance issues typical of webdatasets
## License
MIT
Raw data
{
"_id": null,
"home_page": null,
"name": "webshart",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "dataset, shards, tar, webdataset, machine-learning",
"author": null,
"author_email": "bghira <bghira@users.github.com>",
"download_url": "https://files.pythonhosted.org/packages/e0/03/e393e401c39c0610ed8e94e30280db6d13cfa790ad9ffd17c63e2121bba3/webshart-0.4.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Fast and memory-efficient webdataset shard reader",
"version": "0.4.3",
"project_urls": {
"Homepage": "https://github.com/bghira/webshart",
"Issues": "https://github.com/bghira/webshart/issues",
"Repository": "https://github.com/bghira/webshart"
},
"split_keywords": [
"dataset",
" shards",
" tar",
" webdataset",
" machine-learning"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "3b88971e59b3ed9ad7e2c8d192ecdb976619fba0c65042cc6a0a705cf2a61b00",
"md5": "ff735bc5130ec39045f6fbcb4cdc66ba",
"sha256": "b5ff5b932110a22be9e2f699ccb6052499b5f142883ff815c5271cebd0b43322"
},
"downloads": -1,
"filename": "webshart-0.4.3-cp310-cp310-win_amd64.whl",
"has_sig": false,
"md5_digest": "ff735bc5130ec39045f6fbcb4cdc66ba",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.8",
"size": 2700504,
"upload_time": "2025-09-06T15:31:41",
"upload_time_iso_8601": "2025-09-06T15:31:41.287004Z",
"url": "https://files.pythonhosted.org/packages/3b/88/971e59b3ed9ad7e2c8d192ecdb976619fba0c65042cc6a0a705cf2a61b00/webshart-0.4.3-cp310-cp310-win_amd64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "2ebe897d5a23c1ee4a8d98b72350f5cb0a9326d6a4226c8ea4e5a0c4832ff8e7",
"md5": "fde7fdde48609636536851569143aa47",
"sha256": "9017f4a3613476d46bba97b164d80b48312eb4017ca86758ed186d481b773549"
},
"downloads": -1,
"filename": "webshart-0.4.3-cp311-cp311-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "fde7fdde48609636536851569143aa47",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.8",
"size": 2969755,
"upload_time": "2025-09-06T15:31:42",
"upload_time_iso_8601": "2025-09-06T15:31:42.613230Z",
"url": "https://files.pythonhosted.org/packages/2e/be/897d5a23c1ee4a8d98b72350f5cb0a9326d6a4226c8ea4e5a0c4832ff8e7/webshart-0.4.3-cp311-cp311-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "11f0c4fc1e1c9f343cbc846c6c24d3f9ff1451c1ba9123c35fa6c18664f95360",
"md5": "7e46978e7a004ef50dab4878e78c1fe7",
"sha256": "36517c15f0c1da841f4e53bc60667c0468acbe22b30f3fdb9f86dd843646b1c3"
},
"downloads": -1,
"filename": "webshart-0.4.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "7e46978e7a004ef50dab4878e78c1fe7",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.8",
"size": 4768501,
"upload_time": "2025-09-06T15:31:44",
"upload_time_iso_8601": "2025-09-06T15:31:44.282617Z",
"url": "https://files.pythonhosted.org/packages/11/f0/c4fc1e1c9f343cbc846c6c24d3f9ff1451c1ba9123c35fa6c18664f95360/webshart-0.4.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e9a400a4144006591324b9ae1c82fac3b7b0c4c045dfcfe01aa321bf1d22fa55",
"md5": "d0b07201507d15361e08c9b645d29119",
"sha256": "60b8cb63b68b0e0432af39f3436b676ea838b9a8d97b86b7c739a94a073ecdcc"
},
"downloads": -1,
"filename": "webshart-0.4.3-cp311-cp311-win_amd64.whl",
"has_sig": false,
"md5_digest": "d0b07201507d15361e08c9b645d29119",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.8",
"size": 2699004,
"upload_time": "2025-09-06T15:31:45",
"upload_time_iso_8601": "2025-09-06T15:31:45.584762Z",
"url": "https://files.pythonhosted.org/packages/e9/a4/00a4144006591324b9ae1c82fac3b7b0c4c045dfcfe01aa321bf1d22fa55/webshart-0.4.3-cp311-cp311-win_amd64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ac05c5d3a24cc1963b5cfac1cf7b6eef66d3717f5b84cd565278fc8b868d68e0",
"md5": "239f8305b026b19a53c7a90eb3aed211",
"sha256": "2198f1d7fd12b5534de823a8f94b2e0557eb3137bcf18dd68f2d370f4692e28d"
},
"downloads": -1,
"filename": "webshart-0.4.3-cp312-cp312-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "239f8305b026b19a53c7a90eb3aed211",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.8",
"size": 2968625,
"upload_time": "2025-09-06T15:31:47",
"upload_time_iso_8601": "2025-09-06T15:31:47.180222Z",
"url": "https://files.pythonhosted.org/packages/ac/05/c5d3a24cc1963b5cfac1cf7b6eef66d3717f5b84cd565278fc8b868d68e0/webshart-0.4.3-cp312-cp312-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "daf618689718812553f2a6a0720f3db807ba9e315bde72c4d0398e008da47081",
"md5": "d29b404f3616ae87835381f7e45a73d2",
"sha256": "91a204bf9a1394b79d967e52f43120176ac28a0ab1b6669653bbfa6b0f015b3b"
},
"downloads": -1,
"filename": "webshart-0.4.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "d29b404f3616ae87835381f7e45a73d2",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.8",
"size": 4770550,
"upload_time": "2025-09-06T15:31:48",
"upload_time_iso_8601": "2025-09-06T15:31:48.855410Z",
"url": "https://files.pythonhosted.org/packages/da/f6/18689718812553f2a6a0720f3db807ba9e315bde72c4d0398e008da47081/webshart-0.4.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "dde04f92f800403c1be9f6d3a06705c15fa60b1691085303db9b479eeb6aa109",
"md5": "6eea5771d47e4b95f71f50435d90b45c",
"sha256": "e957c871df26cadc3e61c8355a3cf77bcdda6607750ad4b49667639b30b9835c"
},
"downloads": -1,
"filename": "webshart-0.4.3-cp312-cp312-win_amd64.whl",
"has_sig": false,
"md5_digest": "6eea5771d47e4b95f71f50435d90b45c",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.8",
"size": 2699971,
"upload_time": "2025-09-06T15:31:50",
"upload_time_iso_8601": "2025-09-06T15:31:50.132391Z",
"url": "https://files.pythonhosted.org/packages/dd/e0/4f92f800403c1be9f6d3a06705c15fa60b1691085303db9b479eeb6aa109/webshart-0.4.3-cp312-cp312-win_amd64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "a6788116054ac8499633f1892abd9e4b32fd1b67c02e3066447f6bbfeee02d41",
"md5": "8ab5efb16b7120ddd39d1993b28661df",
"sha256": "5f7f1ba9dd2c1dfc8ef236b4288e09dd22b79a8e4561a42c0b7c64790f402956"
},
"downloads": -1,
"filename": "webshart-0.4.3-cp313-cp313-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "8ab5efb16b7120ddd39d1993b28661df",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.8",
"size": 2968620,
"upload_time": "2025-09-06T15:31:51",
"upload_time_iso_8601": "2025-09-06T15:31:51.890368Z",
"url": "https://files.pythonhosted.org/packages/a6/78/8116054ac8499633f1892abd9e4b32fd1b67c02e3066447f6bbfeee02d41/webshart-0.4.3-cp313-cp313-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e16ecf592599fb0d9126f961d304dbbe7f678b3096cf91ec51b9ebfc602f01a3",
"md5": "aa8c11eed53726acdb880805c86404fb",
"sha256": "666bddd1374c78cb0ddef79f6085e4b54252476041e5b971a3b22136b31d9e73"
},
"downloads": -1,
"filename": "webshart-0.4.3-cp313-cp313-win_amd64.whl",
"has_sig": false,
"md5_digest": "aa8c11eed53726acdb880805c86404fb",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.8",
"size": 2699963,
"upload_time": "2025-09-06T15:31:53",
"upload_time_iso_8601": "2025-09-06T15:31:53.459182Z",
"url": "https://files.pythonhosted.org/packages/e1/6e/cf592599fb0d9126f961d304dbbe7f678b3096cf91ec51b9ebfc602f01a3/webshart-0.4.3-cp313-cp313-win_amd64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "773eec72be9aa800c4c9be5aec27b741cc9cf2fbaa91ead4a5059b3b30bf4d3d",
"md5": "1d795a98b1d38337e6f3445c8c32b000",
"sha256": "5755355aab4ca18878da8795fd8afa8bb2d47a8147cf07f038df8eed06e6d223"
},
"downloads": -1,
"filename": "webshart-0.4.3-cp39-cp39-win_amd64.whl",
"has_sig": false,
"md5_digest": "1d795a98b1d38337e6f3445c8c32b000",
"packagetype": "bdist_wheel",
"python_version": "cp39",
"requires_python": ">=3.8",
"size": 2696663,
"upload_time": "2025-09-06T15:31:55",
"upload_time_iso_8601": "2025-09-06T15:31:55.139106Z",
"url": "https://files.pythonhosted.org/packages/77/3e/ec72be9aa800c4c9be5aec27b741cc9cf2fbaa91ead4a5059b3b30bf4d3d/webshart-0.4.3-cp39-cp39-win_amd64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e003e393e401c39c0610ed8e94e30280db6d13cfa790ad9ffd17c63e2121bba3",
"md5": "d16644fa10e74577dcf42815e3ca5036",
"sha256": "1dd585bb07acc29460b156e87be5b1f6e1d3ebfb35f9390840ccdd59255026ef"
},
"downloads": -1,
"filename": "webshart-0.4.3.tar.gz",
"has_sig": false,
"md5_digest": "d16644fa10e74577dcf42815e3ca5036",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 92776,
"upload_time": "2025-09-06T15:31:56",
"upload_time_iso_8601": "2025-09-06T15:31:56.262861Z",
"url": "https://files.pythonhosted.org/packages/e0/03/e393e401c39c0610ed8e94e30280db6d13cfa790ad9ffd17c63e2121bba3/webshart-0.4.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-06 15:31:56",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "bghira",
"github_project": "webshart",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "webshart"
}