<img width="1530" height="492" alt="image" src="https://github.com/user-attachments/assets/ebf0d101-eae7-4908-bb73-a264bf89a479" />
Fast parallel reader for webdataset tar shards. Rust core with Python bindings. Built for streaming large video and image datasets, but handles any byte data.
## Install
```bash
pip install webshart
```
## What is this?
Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.
**The indexed format** provides massive performance benefits:
- **Random access**: Jump to any file instantly
- **Selective downloads**: Only fetch the files you need
- **True parallelism**: Read from multiple shards simultaneously
- **Cloud-optimized**: Works efficiently with HTTP range requests
- **Aspect bucketing**: Optionally include image geometry hints (`width`, `height`, `aspect`) so samples can be bucketed by shape
- **Custom DataLoader**: Includes state dict methods on the DataLoader so that you can resume training deterministically
**Performance**: 10-20x faster for random access, 5-10x faster for batch reads compared to standard tar extraction.
**Growing ecosystem**: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).
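As a sketch of how the optional geometry hints could be used, the snippet below groups filenames into aspect-ratio buckets. The entry layout and bucket values here are illustrative assumptions, not webshart's API:

```python
from collections import defaultdict

def bucket_by_aspect(entries, buckets=(0.5, 0.75, 1.0, 1.33, 2.0)):
    """Assign each entry to the nearest aspect-ratio bucket.

    `entries` mimics what an index built with geometry hints might
    contain; the field names are assumptions for this sketch.
    """
    grouped = defaultdict(list)
    for name, meta in entries.items():
        aspect = meta["width"] / meta["height"]
        # Pick the bucket whose ratio is closest to this image's ratio.
        nearest = min(buckets, key=lambda b: abs(b - aspect))
        grouped[nearest].append(name)
    return grouped

entries = {
    "a.webp": {"width": 1024, "height": 1024},  # square
    "b.webp": {"width": 1920, "height": 1080},  # wide
    "c.webp": {"width": 768, "height": 1024},   # tall
}
print(bucket_by_aspect(entries))
```

Batches drawn from a single bucket can then be resized to a shared resolution without distorting aspect ratios.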
## Quick Start
```python
import webshart

# Find your dataset
dataset = webshart.discover_dataset("NebulaeWis/e621-2024-webp-4Mpixel", subfolder="original")
print(f"Found {dataset.num_shards} shards")

# Read a single file
shard = dataset.open_shard(0)
data = shard.read_file(42)  # -> bytes

# Read many files at once (fast)
byte_list = webshart.read_files_batch(dataset, [
    (0, 0),   # shard 0, file 0
    (0, 1),   # shard 0, file 1
    (1, 0),   # shard 1, file 0
    (10, 5),  # shard 10, file 5
])

# Save the files
for i, data in enumerate(byte_list):
    if data:  # skip failed reads
        with open(f"image_{i}.webp", "wb") as f:
            f.write(data)
```
## Common Patterns
- [Use as a DataLoader](/examples/dataloader.py)
- [Retrieve data subset/range](/examples/retrieve_range.py)
- [Get dataset statistics without downloading](/examples/dataset_stats.py)
## Creating Indices for Existing Datasets
Any tar-based webdataset can benefit from indexing, and webshart includes tools to generate the indices. The `extract-metadata` command auto-discovers tars to process:
```bash
webshart extract-metadata \
  --source laion/conceptual-captions-12m-webdataset \
  --destination laion_output/ \
  --checkpoint-dir ./laion_output/checkpoints \
  --max-workers 2 \
  --include-image-geometry
```
Or, if you prefer or need direct integration with an existing Python application, [use the API](/examples/metadata_extractor.py).
### Uploading Indices to HuggingFace
Once you've generated indices, share them with the community:
```bash
# Upload all JSON files to your dataset
huggingface-cli upload --repo-type=dataset \
  username/dataset-name \
  ./indices/ \
  --include "*.json" \
  --path-in-repo "indices/"
```
Or if you want to contribute to an existing dataset you don't own:
1. Create a community dataset with indices: `username/original-dataset-indices`
2. Upload the JSON files there
3. Open a discussion on the original dataset suggesting they add the indices
### Creating New Indexed Datasets
If you're creating a new dataset, generate indices during creation:
```json
{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 102912, "length": 98304},
    ...
  }
}
```
The JSON index should have the same name as the tar file (e.g., `shard_0000.tar` → `shard_0000.json`).
## Batch Operations
```python
# Discover multiple datasets in parallel
datasets = webshart.discover_datasets_batch([
    "NebulaeWis/e621-2024-webp-4Mpixel",
    "picollect/danbooru2",
    "/local/path/to/dataset"
], subfolders=["original", "images", None])

# Process large dataset in chunks
processor = webshart.BatchProcessor()
results = processor.process_dataset(
    "NebulaeWis/e621-2024-webp-4Mpixel",
    batch_size=100,
    callback=lambda data: len(data)  # process each file
)
```
## Why is it fast?
**Problem**: Standard tar files require sequential reading. To get file #10,000, you must read through files #1-9,999 first.
**Solution**: The indexed format stores byte offsets in a separate JSON file, enabling:
- HTTP range requests for any file
- True random access over network
- Parallel reads from multiple shards
- No wasted bandwidth
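The range-request read that the index enables can be sketched with only the standard library. Given an entry's `offset` and `length`, one request fetches exactly that file's bytes; the shard URL below is a placeholder:

```python
import urllib.request

def range_header(offset: int, length: int) -> dict:
    # HTTP Range is inclusive on both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

def read_file_via_range(shard_url: str, offset: int, length: int) -> bytes:
    """Fetch a single file's bytes out of a remote tar shard."""
    req = urllib.request.Request(shard_url, headers=range_header(offset, length))
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage, with offsets taken from the JSON index:
# data = read_file_via_range("https://example.com/shard_0000.tar", 512, 102400)
```

Webshart's Rust core does the equivalent with pooled connections and parallel requests rather than one blocking call at a time.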
The Rust implementation provides:
- Real parallelism (no Python GIL)
- Zero-copy operations where possible
- Efficient HTTP connection pooling
- Optimized tokio async runtime
## Datasets Using This Format
After creating this library, I discovered that [cheesechaser](https://github.com/deepghs/cheesechaser) originated the indexed tar format; webshart formalises it and extends it with aspect bucketing support.
- `NebulaeWis/e621-2024-webp-4Mpixel`
- `picollect/danbooru2` (subfolder: `images`)
- Many picollect image datasets
- Your dataset could be next! See "Creating Indices" above
## Requirements
- Python 3.8+
- Linux/macOS/Windows
## License
MIT