webdataset

Name	webdataset
Version	0.2.111
home_page	None
Summary	High performance storage and I/O for deep learning and data processing.
upload_time	2025-02-12 20:12:15
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	BSD-3-Clause
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.

[![Test](https://github.com/tmbdev/webdataset/workflows/CI/badge.svg)](https://github.com/tmbdev/webdataset/actions?query=workflow%3ACI)
[![DeepSource](https://static.deepsource.io/deepsource-badge-light-mini.svg)](https://deepsource.io/gh/tmbdev/webdataset/?ref=repository-badge)

# REFACTORING

The WebDataset library is being refactored into three separate libraries:

- webdataset: traditional, streaming webdataset processing
- wids: indexed datasets using webdataset format (also useful for distributed training)
- wsds: a new streaming dataset library with a wids-compatible API

The new packages also have simpler packaging and installation.

You can install the new packages with `pip install webdatasetng wids`. This should give you
effectively the same behavior and functionality as the current library. The new libraries
will be switched over some time in March 2025, at which point you simply have to install
`wids` explicitly.


```python
%matplotlib inline
import matplotlib.pyplot as plt
import torch.utils.data
import torch.nn
from random import randrange
import os
os.environ["WDS_VERBOSE_CACHE"] = "1"
os.environ["GOPEN_VERBOSE"] = "0"
```

# The WebDataset Format

WebDataset format files are tar files, with two conventions:

- within each tar file, files that belong together and make up a training sample share the same basename when stripped of all filename extensions
- the shards of a dataset are numbered like `something-000000.tar` to `something-012345.tar`, usually specified using brace notation `something-{000000..012345}.tar`

You can find a longer, more detailed specification of the WebDataset format in the [WebDataset Format Specification](https://docs.google.com/document/d/18OdLjruFNX74ILmgrdiCI9J1fQZuhzzRBCHV9URWto0/edit?usp=sharing).
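
To make the brace notation concrete, here is a minimal sketch that expands a single numeric `{start..end}` range into shard names. The helper `expand_shards` is hypothetical and written only for illustration; the library itself relies on the `braceexpand` package, which handles more general patterns.

```python
import re


def expand_shards(pattern: str) -> list[str]:
    """Expand one numeric `{start..end}` range, preserving zero padding."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]  # no brace range: the pattern names a single shard
    start, end = match.group(1), match.group(2)
    width = len(start)  # the padding width is taken from the first bound
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        prefix + str(i).zfill(width) + suffix
        for i in range(int(start), int(end) + 1)
    ]


shards = expand_shards("something-{000000..000003}.tar")
print(shards)
```

Both bounds are inclusive, so the pattern above names four shards.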

WebDataset can read files from local disk or from any pipe, which allows it to access files using common cloud object stores. WebDataset can also read concatenated MsgPack and CBOR sources.

The WebDataset representation allows writing purely sequential I/O pipelines for large scale deep learning. This is important for achieving high I/O rates from local storage (3x-10x for local drives compared to random access) and for using object stores and cloud storage for training.

The WebDataset format represents images, movies, audio, etc. in their native file formats, making the creation of WebDataset format data as easy as just creating a tar archive. Because tar archives align data on predictable block boundaries, WebDataset also works well with block deduplication.

Standard tools can be used for accessing and processing WebDataset-format files.


```python
bucket = "https://storage.googleapis.com/webdataset/testdata/"
dataset = "publaynet-train-{000000..000009}.tar"

url = bucket + dataset
!curl -s {bucket}publaynet-train-000000.tar | dd count=5000 2> /dev/null | tar tf - 2> /dev/null | sed 10q
```

    PMC4991227_00003.json
    PMC4991227_00003.png
    PMC4537884_00002.json
    PMC4537884_00002.png
    PMC4323233_00003.json
    PMC4323233_00003.png
    PMC5429906_00004.json
    PMC5429906_00004.png
    PMC5592712_00002.json
    PMC5592712_00002.png


Note that in these `.tar` files, we have pairs of `.json` and `.png` files; each such pair makes up a training sample.
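
The grouping convention can be illustrated with nothing but the standard library. The following is a rough sketch, not the library's reader: `write_shard` and `read_samples` are hypothetical helpers, the sample names are made up, and the key is assumed to be the file name up to its first dot.

```python
import io
import json
import tarfile


def write_shard(samples: dict[str, dict[str, bytes]]) -> bytes:
    """Write samples to an in-memory tar; files of one sample are adjacent."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, fields in samples.items():
            for ext, data in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_samples(shard: bytes):
    """Yield one dict per run of consecutive files that share a key."""
    current_key, sample = None, {}
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, _, filename = member.name.rpartition("/")
            base, _, ext = filename.partition(".")  # split at the first dot
            key = f"{dirname}/{base}" if dirname else base
            if key != current_key:
                if sample:
                    yield sample
                current_key, sample = key, {"__key__": key}
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


shard = write_shard({
    "doc_00001": {"json": json.dumps({"label": 1}).encode(), "png": b"fake image bytes"},
    "doc_00002": {"json": json.dumps({"label": 0}).encode(), "png": b"fake image bytes"},
})
samples = list(read_samples(shard))
print([sorted(s) for s in samples])
```

Because the reader only looks at consecutive entries, it never needs to seek, which is what keeps the I/O purely sequential.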

# WebDataset Libraries

There are several libraries supporting the WebDataset format:

- `webdataset` for Python3 (includes the `wids` library), this repository
- [Webdataset.jl](https://github.com/webdataset/WebDataset.jl), a Julia implementation
- [tarp](https://github.com/webdataset/tarp), a Golang implementation and command line tool
- Ray Data sources and sinks

The `webdataset` library can be used with PyTorch, TensorFlow, and JAX.

# The `webdataset` Library

The `webdataset` library is an implementation of PyTorch `IterableDataset` (or a mock implementation thereof if you aren't using PyTorch). It implements a form of stream processing. Some of its features are:

- large scale parallel data access through sharding
- high performance disk I/O due to purely sequential reads
- latency insensitive due to big fat pipes
- no local storage required
- instant startup for training jobs
- only requires reading from file descriptors/network streams, no special APIs
- its API encourages high performance I/O pipelines
- scalable from tiny desktop datasets to petascale datasets
- provides local caching if desired
- requires no dataset metadata; any collection of shards can be read and used instantly

The main limitations people run into are that `IterableDataset` is less commonly used in PyTorch, so some existing code may not support it as well, and that achieving an exactly balanced number of training samples across many compute nodes for a fixed epoch size is tricky. For multinode training, `webdataset` is usually used with shard resampling.
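
As a rough illustration of the idea behind shard resampling, the sketch below draws shards with replacement from an endless stream, so every node can take exactly the same number of shards per epoch. The function `resampled_shards`, the shard names, and the per-rank seeding are assumptions made for this example and are not the library's implementation; in `webdataset` itself, resampling is enabled through options such as `resampled=True`.

```python
import random
from itertools import islice


def resampled_shards(shards, seed):
    """Yield shard names forever, drawn uniformly with replacement."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(shards)


shards = [f"something-{i:06d}.tar" for i in range(10)]
shards_per_epoch = 4  # identical on every node, regardless of the shard count

epochs = {}
for rank in range(2):  # two simulated nodes, each with its own seed
    stream = resampled_shards(shards, seed=rank)
    epochs[rank] = list(islice(stream, shards_per_epoch))
    print(rank, epochs[rank])
```

The trade-off is that an epoch no longer visits every sample exactly once: some shards may repeat and others may be skipped within a given epoch.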

There are two interfaces, the concise "fluid" interface and a longer "pipeline" interface. We'll show examples using the fluid interface, which is usually what you want.


```python
import webdataset as wds
shuffle_buffer = 10  # usually, pick something bigger, like 1000
pil_dataset = wds.WebDataset(url).shuffle(shuffle_buffer).decode("pil").to_tuple("png", "json")
```

    /home/tmb/proj/webdataset/webdataset/compat.py:389: UserWarning: WebDataset(shardshuffle=...) is None; set explicitly to False or a number
      warnings.warn(


The resulting datasets are standard PyTorch `IterableDataset` instances.


```python
isinstance(pil_dataset, torch.utils.data.IterableDataset)
```




    True




```python
for image, json in pil_dataset:
    break
plt.imshow(image)
```




    <matplotlib.image.AxesImage at 0x71a449ffb620>




    
![png](readme_files/readme_12_1.png)
    


We can add onto the existing pipeline for augmentation and data preparation.


```python
import torchvision.transforms as transforms
from PIL import Image

preproc = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    lambda x: 1-x,
])

def preprocess(sample):
    image, json = sample
    try:
        label = json["annotations"][0]["category_id"]
    except Exception:
        label = 0
    return preproc(image), label

dataset = pil_dataset.map(preprocess)

for image, label in dataset:
    break
plt.imshow(image.numpy().transpose(1, 2, 0))
```




    <matplotlib.image.AxesImage at 0x71a48434e840>




    
![png](readme_files/readme_14_1.png)
    


`WebDataset` is just an instance of a standard `IterableDataset`. It's a single-threaded way of iterating over a dataset. Since image decompression and data augmentation can be compute intensive, PyTorch usually uses the `DataLoader` class to parallelize data loading and preprocessing. `WebDataset` is fully compatible with the standard `DataLoader`.

Here are a number of notebooks showing how to use WebDataset for image classification and LLM training:

- [train-resnet50-wds](examples/out/train-resnet50-wds.ipynb) -- simple, single GPU training from Imagenet
- [train-resnet50-multiray-wds](examples/out/train-resnet50-multiray-wds.ipynb) -- multinode training using webdataset
- [generate-text-dataset](examples/out/generate-text-dataset.ipynb) -- initial dataset generation
- [tesseract-wds](examples/out/tesseract-wds.ipynb) -- shard-to-shard transformations, here for OCR running over large datasets
- [train-ocr-errors-hf](examples/out/train-ocr-errors-hf.ipynb) -- an example of LLM fine tuning using a dataset in webdataset format

The [wds-notes](examples/wds-notes.ipynb) notebook contains some additional documentation and information about the library.

# The `webdataset` Pipeline API

The `wds.WebDataset` fluid interface is just a convenient shorthand for writing down pipelines. The underlying pipeline is an instance of the `wds.DataPipeline` class, and you can construct data pipelines explicitly, similar to the way you use `nn.Sequential` inside models.


```python
dataset = wds.DataPipeline(
    wds.SimpleShardList(url),
    # at this point we have an iterator over all the shards

    # this shuffles the shards
    wds.shuffle(100),

    # add wds.split_by_node here if you are using multiple nodes
    wds.split_by_worker,

    # at this point, we have an iterator over the shards assigned to each worker
    wds.tarfile_to_samples(),

    # this shuffles the samples in memory
    wds.shuffle(shuffle_buffer),

    # this decodes the images and json
    wds.decode("pil"),
    wds.to_tuple("png", "json"),
    wds.map(preprocess),
    wds.batched(16)
)

batch = next(iter(dataset))
batch[0].shape, batch[1].shape
```




    (torch.Size([16, 3, 224, 224]), (16,))
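
To show what such a pipeline does conceptually, here is a pure-Python sketch in which each stage is a generator that consumes the previous one. All of the functions below are simplified stand-ins written for this example (the fake shard reader in particular), so treat them as an approximation of the stages above rather than the library's code.

```python
import random
from itertools import islice


def split_by_worker(shards, worker_id, num_workers):
    """Give each worker every num_workers-th shard."""
    return islice(shards, worker_id, None, num_workers)


def fake_shard_to_samples(shards, samples_per_shard=4):
    """Stand-in for tar reading: each shard yields a few named samples."""
    for shard in shards:
        for i in range(samples_per_shard):
            yield f"{shard}/{i}"


def shuffled(samples, bufsize, rng):
    """Approximately shuffle a stream using a bounded in-memory buffer."""
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) > bufsize:
            # Move a random element to the end of the buffer and emit it.
            index = rng.randrange(len(buffer))
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    rng.shuffle(buffer)
    yield from buffer  # flush what is left once the input is exhausted


def batched(samples, batchsize):
    """Group consecutive samples into lists of at most batchsize items."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batchsize:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


rng = random.Random(0)
shards = [f"shard-{i:02d}" for i in range(6)]
stream = split_by_worker(shards, worker_id=0, num_workers=2)
stream = fake_shard_to_samples(stream)
stream = shuffled(stream, bufsize=5, rng=rng)
batches = list(batched(stream, batchsize=5))
print([len(batch) for batch in batches])
```

A small buffer only mixes samples that are close together in the stream, which is why a larger `shuffle_buffer` is usually preferable for training.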



# The `wids` Library for Indexed WebDatasets

Installing the `webdataset` library installs a second library called `wids`. This library provides fully indexed/random access to the same datasets that `webdataset` accesses using iterators/streaming.

Like the `webdataset` library, `wids` is highly scalable and provides efficient access to very large datasets. Being indexed, it is easily backwards compatible with existing data pipelines based on indexed datasets, including precise epochs for multinode training. The library comes with its own `ChunkedSampler` and `DistributedChunkedSampler` classes, which provide shuffling across nodes while still preserving enough locality of reference for efficient training.
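
One way to picture shuffling that preserves locality is to shuffle the order of fixed-size chunks and then shuffle within each chunk. The sketch below illustrates that general idea only; `chunked_shuffle` and its parameters are hypothetical and do not reproduce the behavior of the `ChunkedSampler` classes.

```python
import random


def chunked_shuffle(num_samples, chunk_size, seed):
    """Yield every index once: chunks in random order, shuffled internally."""
    rng = random.Random(seed)
    chunks = [
        list(range(start, min(start + chunk_size, num_samples)))
        for start in range(0, num_samples, chunk_size)
    ]
    rng.shuffle(chunks)  # randomize which chunk is visited next
    for chunk in chunks:
        rng.shuffle(chunk)  # randomize the order inside the chunk
        yield from chunk


order = list(chunked_shuffle(num_samples=10, chunk_size=4, seed=0))
print(order)
```

Indices from the same chunk stay adjacent in the output, so consecutive reads tend to hit the same shard.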

Internally, the library uses a `mmap`-based `tar` file reader implementation; this allows very fast access without precomputed indexes, and it also means that shards and the equivalent of "shuffle buffers" are shared in memory between workers on the same machine.

This additional power comes at some cost: the library requires a small metadata file that lists all the shards in a dataset and the number of samples contained in each; it requires local storage for as many shards as there are I/O workers on a node; it uses shared memory and `mmap`; and the availability of indexing makes it easy to accidentally use inefficient access patterns.
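
The per-shard sample counts are what make random access possible: a global index can be translated into a shard and a position within it. Here is a minimal sketch of that lookup, assuming a hypothetical in-memory list of `(shard name, sample count)` pairs rather than the actual metadata file format.

```python
import bisect
from itertools import accumulate

# Hypothetical metadata: shard names with their sample counts.
shard_counts = [
    ("shard-000000.tar", 100),
    ("shard-000001.tar", 100),
    ("shard-000002.tar", 50),
]
# Cumulative end offsets of each shard: [100, 200, 250].
ends = list(accumulate(count for _, count in shard_counts))


def locate(index):
    """Map a global sample index to (shard name, index within that shard)."""
    if not 0 <= index < ends[-1]:
        raise IndexError(index)
    shard_number = bisect.bisect_right(ends, index)
    start = ends[shard_number - 1] if shard_number > 0 else 0
    return shard_counts[shard_number][0], index - start


print(locate(0), locate(100), locate(249))
```

The lookup itself is cheap; the cost of random access comes from having to fetch and cache whichever shard the index lands in.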

Generally, the recommendation is to use `webdataset` for all data generation, data transformation, and training code, and to use `wids` only if you need fully random access to datasets (e.g., for browsing or sparse sampling), need an index-based sampler, or are converting tricky legacy code.



```python
import wids

train_url = "https://storage.googleapis.com/webdataset/fake-imagenet/imagenet-train.json"

dataset = wids.ShardListDataset(train_url)

sample = dataset[1900]

print(sample.keys())
print(sample[".txt"])
plt.imshow(sample[".jpg"])
```

    dict_keys(['.cls', '.jpg', '.txt', '__key__', '__dataset__', '__index__', '__shard__', '__shardindex__'])
    a high quality color photograph of a dog


    https://storage.googleapis.com/webdataset/fake-ima base: https://storage.googleapis.com/webdataset/fake-imagenet name: imagenet-train nfiles: 1282 nbytes: 31242280960 samples: 128200 cache: /tmp/_wids_cache
    /home/tmb/proj/webdataset/wids/wids.py:324: UserWarning: String specifications for transformations are deprecated. Use functions instead.
      warnings.warn(





    <matplotlib.image.AxesImage at 0x71a4843c3020>




    
![png](readme_files/readme_20_3.png)
    


There are several examples of how to use `wids` in the [examples](examples) directory.

- [train-resnet50-wids](examples/out/train-resnet50-wids.ipynb) shows how to train a ResNet-50 model on ImageNet using `wids`
- [train-resnet50-multiray-wids](examples/out/train-resnet50-multiray-wids.ipynb) shows how to train a ResNet-50 model on ImageNet using multiple nodes

Note that the APIs between `webdataset` and `wids` are not fully consistent:

- `wids` keeps the extension's "." in the keys, while `webdataset` removes it (".txt" vs "txt")
- `wids` doesn't have a fully fluid interface, and `add_transformation` just adds to a list of transformations
- `webdataset` currently can't read the `wids` JSON specifications
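
If you need to share preprocessing code between the two libraries, the key difference can be bridged with a small adapter. The helper `to_webdataset_keys` below is hypothetical and only renames keys; it assumes that metadata keys such as `__key__` should pass through unchanged.

```python
def to_webdataset_keys(sample: dict) -> dict:
    """Drop the leading "." from wids-style keys; leave other keys as they are."""
    return {
        key[1:] if key.startswith(".") else key: value
        for key, value in sample.items()
    }


wids_style = {".txt": "a caption", ".cls": 3, "__key__": "sample-000001"}
print(to_webdataset_keys(wids_style))
```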

# Installation and Documentation

    $ pip install webdataset

For the GitHub version:

    $ pip install git+https://github.com/tmbdev/webdataset.git

Here are some videos talking about WebDataset and large scale deep learning:

- [Introduction to Large Scale Deep Learning](https://www.youtube.com/watch?v=kNuA2wflygM)
- [Loading Training Data with WebDataset](https://www.youtube.com/watch?v=mTv_ePYeBhs)
- [Creating Datasets in WebDataset Format](https://www.youtube.com/watch?v=v_PacO-3OGQ)
- [Tools for Working with Large Datasets](https://www.youtube.com/watch?v=kIv8zDpRUec)

# Dependencies

The WebDataset library only requires PyTorch, NumPy, and a small library called `braceexpand`.

WebDataset loads a few additional libraries dynamically only when they are actually needed and only in the decoder:

- PIL/Pillow for image decoding
- `torchvision`, `torchvideo`, `torchaudio` for image/video/audio decoding
- `msgpack` for MessagePack decoding
- the `curl` command line tool for accessing HTTP servers
- the Google/Amazon/Azure command line tools for accessing cloud storage buckets

Loading of one of these libraries is triggered by configuring a decoder that attempts to decode content in the given format and encountering a file in that format during decoding. (Eventually, the torch... dependencies will be refactored into those libraries.)
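
The lazy-loading behavior described above can be sketched as a table of decoders whose modules are imported only when a matching file is first seen. This is an illustration of the pattern, not the library's decoder: the table `LAZY_DECODERS` and the function `decode` are hypothetical, and standard-library modules stand in for optional packages.

```python
import importlib

# Extension -> (module imported on first use, name of its decoding function).
# Standard-library modules stand in for optional dependencies here.
LAZY_DECODERS = {
    "json": ("json", "loads"),
    "b64": ("base64", "b64decode"),
}


def decode(extension, data):
    """Decode data by extension, importing the decoder module on demand."""
    if extension not in LAZY_DECODERS:
        return data  # unknown formats are passed through as raw bytes
    module_name, function_name = LAZY_DECODERS[extension]
    # import_module returns the cached module after the first call.
    module = importlib.import_module(module_name)
    return getattr(module, function_name)(data)


print(decode("json", b'{"a": 1}'))
print(decode("bin", b"raw"))
```

With this structure, a missing optional package only causes an error when a file in the corresponding format is actually decoded.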

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "webdataset",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": "Thomas Breuel <tmbdev+removeme@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/dc/13/27b4a05a01bcf96e451f624d36d3637101e92b25970295546f7d949b38e9/webdataset-0.2.111.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD-3-Clause",
    "summary": "High performance storage and I/O for deep learning and data processing.",
    "version": "0.2.111",
    "project_urls": {
        "homepage": "http://github.com/webdataset/webdataset",
        "repository": "http://github.com/webdataset/webdataset"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6ee1c1140ab6533668930895512ac5cbf07972fa41ebab275f5f5cdd432bc3c7",
                "md5": "84336f6e7e65b14078a62477667a557b",
                "sha256": "57a70eb5d7029303ce2262d900ee3f16443bb5e9cf25f634775ce972859bcee4"
            },
            "downloads": -1,
            "filename": "webdataset-0.2.111-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "84336f6e7e65b14078a62477667a557b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 85514,
            "upload_time": "2025-02-12T20:12:12",
            "upload_time_iso_8601": "2025-02-12T20:12:12.926228Z",
            "url": "https://files.pythonhosted.org/packages/6e/e1/c1140ab6533668930895512ac5cbf07972fa41ebab275f5f5cdd432bc3c7/webdataset-0.2.111-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "dc1327b4a05a01bcf96e451f624d36d3637101e92b25970295546f7d949b38e9",
                "md5": "11d72bc7849fd86274afcdc63f880042",
                "sha256": "5b2835386a25601307a9ded9bcc0dbd1e81a9eee017784152528e77dd8619511"
            },
            "downloads": -1,
            "filename": "webdataset-0.2.111.tar.gz",
            "has_sig": false,
            "md5_digest": "11d72bc7849fd86274afcdc63f880042",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 79970,
            "upload_time": "2025-02-12T20:12:15",
            "upload_time_iso_8601": "2025-02-12T20:12:15.577991Z",
            "url": "https://files.pythonhosted.org/packages/dc/13/27b4a05a01bcf96e451f624d36d3637101e92b25970295546f7d949b38e9/webdataset-0.2.111.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-12 20:12:15",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "webdataset",
    "github_project": "webdataset",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "webdataset"
}
