<div align="center">
<img alt="Lightning" src="https://pl-flash-data.s3.amazonaws.com/lightning_data_logo.png" width="800px" style="max-width: 100%;">
<br/>
<br/>
## Blazing fast, distributed streaming of training data from cloud storage
</div>
# ⚡ Welcome to Lightning Data
We developed `StreamingDataset` to optimize training on large datasets stored in the cloud while prioritizing speed, affordability, and scalability.
Specifically crafted for multi-node, distributed training with large models, it enhances accuracy, performance, and user-friendliness. Now, training efficiently is possible regardless of the data's location. Simply stream in the required data when needed.
The `StreamingDataset` is compatible with any data type, including **images, text, video, and multimodal data**, and it is a drop-in replacement for your PyTorch [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) class. For example, it is used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs.
Finally, the `StreamingDataset` is fast! Check out our [benchmark](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries).
Here is an illustration showing how the `StreamingDataset` works.
![An illustration showing how the Streaming Dataset works.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)
# 🎬 Getting Started
## 💾 Installation
Lightning Data can be installed with `pip`:
<!--pytest.mark.skip-->
```bash
pip install --no-cache-dir git+https://github.com/Lightning-AI/lit-data.git@master
```
## 🏁 Quick Start
### 1. Prepare Your Data
Convert your raw dataset into Lightning Streaming format using the `optimize` operator. More formats are coming...
<!--pytest.mark.skip-->
```python
import numpy as np
from lightning_data import optimize
from PIL import Image


# Store random images into the chunks
def random_images(index):
    data = {
        "index": index,
        "image": Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),
        "class": np.random.randint(10),
    }
    return data  # The data is serialized into bytes and stored into chunks by the optimize operator.


if __name__ == "__main__":
    optimize(
        fn=random_images,  # The function applied over each input.
        inputs=list(range(1000)),  # Provide any inputs. The fn is applied on each item.
        output_dir="my_dataset",  # The directory where the optimized data are stored.
        num_workers=4,  # The number of workers. The inputs are distributed among them.
        chunk_bytes="64MB",  # The maximum number of bytes to write into a chunk.
    )
```
The `optimize` operator supports any data structures and types. Serialize whatever you want.
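To build intuition for what "serialized into bytes and stored into chunks" means, here is a minimal sketch in plain Python. It is not how `optimize` is implemented: the helper `pack_into_chunks`, the use of `pickle`, and the 512-byte budget are all assumptions made for illustration.

```python
import pickle


def pack_into_chunks(items, max_chunk_bytes):
    """Group serialized items into chunks no larger than max_chunk_bytes.

    Hypothetical helper for illustration; an item larger than the budget
    still gets a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for item in items:
        payload = pickle.dumps(item)  # serialize the item into bytes
        if current and current_size + len(payload) > max_chunk_bytes:
            chunks.append(current)  # close the chunk that is full
            current, current_size = [], 0
        current.append(payload)
        current_size += len(payload)
    if current:
        chunks.append(current)  # flush the last, partially filled chunk
    return chunks


samples = [{"index": i, "class": i % 10} for i in range(100)]
chunks = pack_into_chunks(samples, max_chunk_bytes=512)

# Reading the chunks back in order restores the original items
restored = [pickle.loads(payload) for chunk in chunks for payload in chunk]
```

Grouping many small items into a few larger files is what makes streaming from object storage practical: one request fetches many samples at once.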
### 2. Upload Your Data to Cloud Storage
Cloud providers such as [AWS](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html), [Google Cloud](https://cloud.google.com/storage/docs/uploading-objects?hl=en#upload-object-cli), and [Azure](https://learn.microsoft.com/en-us/azure/import-export/storage-import-export-data-to-files?tabs=azure-portal-preview) provide command-line clients to upload your data to their storage.
Here is an example with [AWS S3](https://aws.amazon.com/s3).
```bash
aws s3 cp --recursive my_dataset s3://my-bucket/my_dataset
```
### 3. Use StreamingDataset and DataLoader
```python
from lightning_data import StreamingDataset
from torch.utils.data import DataLoader
# Remote path where full dataset is persistently stored
input_dir = 's3://pl-flash-data/my_dataset'
# Create streaming dataset
dataset = StreamingDataset(input_dir, shuffle=True)
# Check any elements
sample = dataset[50]
img = sample['image']
cls = sample['class']
# Create PyTorch DataLoader
dataloader = DataLoader(dataset)
```
## Transform data
Similar to `optimize`, the `map` operator can be used to transform data by applying a function over a list of items and persisting all the files written inside the output directory.
### 1. Put some images on cloud storage
We generate 1000 images and upload them to AWS S3.
```python
import os
from PIL import Image
import numpy as np

data_dir = "my_images"
os.makedirs(data_dir, exist_ok=True)

for i in range(1000):
    width = np.random.randint(224, 320)
    height = np.random.randint(224, 320)
    image_path = os.path.join(data_dir, f"{i}.JPEG")
    # NumPy image arrays are laid out as (height, width, channels)
    Image.fromarray(
        np.random.randint(0, 256, (height, width, 3), np.uint8)
    ).save(image_path, format="JPEG", quality=90)
```
```bash
aws s3 cp --recursive my_images s3://my-bucket/my_images
```
### 2. Resize the images
```python
import os
from lightning_data import map
from PIL import Image

input_dir = "s3://my-bucket/my_images"
# Note: os.listdir only lists local (or locally mounted) directories; for a raw
# s3:// URL, build the list of inputs with your storage client instead.
inputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]


def resize_image(image_path, output_dir):
    output_image_path = os.path.join(output_dir, os.path.basename(image_path))
    Image.open(image_path).resize((224, 224)).save(output_image_path)


if __name__ == "__main__":
    map(
        fn=resize_image,
        inputs=inputs,
        output_dir="s3://my-bucket/my_resized_images",
        num_workers=4,
    )
```
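The contract used above is that the function receives one input plus the output directory and writes its results there. The sketch below mimics that contract locally with a thread pool so it can be tried without cloud storage. `local_map` and `write_upper` are hypothetical names, and the real `map` operator may schedule work quite differently.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def local_map(fn, inputs, output_dir, num_workers=4):
    """Hypothetical local stand-in: call fn(item, output_dir) for every item."""
    os.makedirs(output_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # list() forces completion and re-raises any exception from a worker
        list(pool.map(lambda item: fn(item, output_dir), inputs))


def write_upper(text, output_dir):
    # Each call persists exactly one file inside the output directory
    with open(os.path.join(output_dir, f"{text}.txt"), "w") as f:
        f.write(text.upper())


out_dir = os.path.join(tempfile.mkdtemp(), "out")
local_map(write_upper, ["a", "b", "c"], out_dir, num_workers=2)
written = sorted(os.listdir(out_dir))
```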
# 📚 End-to-end Lightning Studio Templates
We have end-to-end free [Studios](https://lightning.ai) showing all the steps to prepare the following datasets:
| Dataset | Data type | Studio |
| -------------------------------------------------------------------------------------------------------------------------------------------- | :-----------------: | --------------------------------------------------------------------------------------------------------------------------------------: |
| [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) | Image & description | [Use or explore LAION-400MILLION dataset](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) |
| [Chesapeake Roads Spatial Context](https://github.com/isaaccorley/chesapeakersc) | Image & Mask | [Convert GeoSpatial data to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-spatial-data-to-lightning-streaming) |
| [Imagenet 1M](https://paperswithcode.com/sota/image-classification-on-imagenet?tag_filter=171) | Image & Label | [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries) |
| [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) | Text | [Prepare the TinyLlama 1T token dataset](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) |
| [English Wikipedia](https://huggingface.co/datasets/wikipedia) | Text | [Embed English Wikipedia under 5 dollars](https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars) |
| Generated | Parquet Files | [Convert parquets to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming) |
[Lightning Studios](https://lightning.ai) are fully reproducible cloud IDEs with data, code, dependencies, and more. Finally, reproducible science.
# 📈 Easily scale data processing
To scale data processing, create a free account on the [lightning.ai](https://lightning.ai/) platform. With the platform, `optimize` and `map` can start multiple machines to make data processing drastically faster, as follows:
```python
from lightning_data import optimize, Machine

optimize(
    ...
    num_nodes=32,
    machine=Machine.DATA_PREP,  # You can select between dozens of optimized machines
)
```
OR
```python
from lightning_data import map, Machine

map(
    ...
    num_nodes=32,
    machine=Machine.DATA_PREP,  # You can select between dozens of optimized machines
)
```
<div align="center">
<img alt="Lightning" src="https://pl-flash-data.s3.amazonaws.com/data-prep.jpg" width="800px" style="max-width: 100%;">
<br/>
The Data Prep Job UI from the [LAION 400M Studio](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) where we used 32 machines with 32 CPUs each to download 400 million images in only 2 hours.
</div>
# 🔑 Key Features
## 🚀 Multi-GPU / Multi-Node
The `StreamingDataset` and `StreamingDataLoader` take care of everything for you. They automatically make sure each rank receives a different batch of data. There is nothing for you to do if you use them.
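As a rough mental model of "each rank receives a different batch", the sketch below splits item indices across ranks with a simple stride. This is only an illustration: `indices_for_rank` is a hypothetical helper, and the library's actual distribution strategy (for example, assigning whole chunks to ranks) may differ.

```python
def indices_for_rank(num_items, rank, world_size):
    """Return the item indices a given rank would read under strided sharding."""
    return list(range(rank, num_items, world_size))


world_size = 4
shards = [indices_for_rank(10, rank, world_size) for rank in range(world_size)]

# Every index is served exactly once across all ranks
all_indices = sorted(index for shard in shards for index in shard)
```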
## 🎨 Easy data mixing
You can easily experiment with dataset mixtures using the `CombinedStreamingDataset`.
```python
from lightning_data import StreamingDataset, CombinedStreamingDataset
from lightning_data.streaming.item_loader import TokensLoader
from tqdm import tqdm
import os
from torch.utils.data import DataLoader

train_datasets = [
    StreamingDataset(
        input_dir="s3://tinyllama-template/slimpajama/train/",
        item_loader=TokensLoader(block_size=2048 + 1),  # Optimized loader for tokens used by LLMs
        shuffle=True,
        drop_last=True,
    ),
    StreamingDataset(
        input_dir="s3://tinyllama-template/starcoder/",
        item_loader=TokensLoader(block_size=2048 + 1),  # Optimized loader for tokens used by LLMs
        shuffle=True,
        drop_last=True,
    ),
]

# Mix SlimPajama data and Starcoder data with these proportions:
weights = (0.693584, 0.306416)
combined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights)

train_dataloader = DataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())

# Iterate over the combined datasets
for batch in tqdm(train_dataloader):
    pass
```
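To see what the `weights` and `seed` arguments express, here is a small standalone sketch of seeded, weighted mixing over plain Python iterables. It is not the `CombinedStreamingDataset` implementation: the `mix` function and its stopping rule (stop once the selected source runs dry) are assumptions for illustration.

```python
import random


def mix(iterables, weights, seed):
    """Yield (source_index, item), picking the source at random by weight.

    Stops as soon as the selected source is exhausted.
    """
    rng = random.Random(seed)  # a fixed seed makes the mixture reproducible
    iterators = [iter(it) for it in iterables]
    while True:
        source = rng.choices(range(len(iterators)), weights=weights, k=1)[0]
        try:
            yield source, next(iterators[source])
        except StopIteration:
            return


stream = list(mix([range(10_000), range(10_000)], weights=(0.7, 0.3), seed=42))

# Fraction of samples drawn from the first source; close to its weight
share_first = sum(1 for source, _ in stream if source == 0) / len(stream)
```

Because the random generator is seeded, running the same mixture twice yields the same sequence, which is what makes a mixed run resumable and comparable across experiments.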
## 🔘 Stateful StreamingDataLoader
Lightning Data provides a stateful `StreamingDataLoader`. This simplifies resuming training over large datasets.
Note: The `StreamingDataLoader` is used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs. The statefulness still works when using a mixture of datasets with the `CombinedStreamingDataset`.
```python
import os
import torch
from lightning_data import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("s3://my-bucket/my-data", shuffle=True)
dataloader = StreamingDataLoader(dataset, num_workers=os.cpu_count(), batch_size=64)

# Restore the dataloader state if it exists
if os.path.isfile("dataloader_state.pt"):
    state_dict = torch.load("dataloader_state.pt")
    dataloader.load_state_dict(state_dict)

# Iterate over the data
for batch_idx, batch in enumerate(dataloader):

    # Store the state every 1000 batches
    if batch_idx % 1000 == 0:
        torch.save(dataloader.state_dict(), "dataloader_state.pt")
```
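The `state_dict` / `load_state_dict` pattern can be illustrated with a toy loader that only remembers its position. This is a minimal sketch under simplifying assumptions (a single process, no shuffling, no workers); `ResumableBatches` is a hypothetical class, not part of Lightning Data.

```python
class ResumableBatches:
    """Hypothetical stand-in for a stateful loader over a list of items."""

    def __init__(self, items, batch_size):
        self.items = items
        self.batch_size = batch_size
        self.position = 0  # index of the next item to serve

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]

    def __iter__(self):
        while self.position < len(self.items):
            batch = self.items[self.position:self.position + self.batch_size]
            self.position += len(batch)
            yield batch


loader = ResumableBatches(list(range(10)), batch_size=4)
first = next(iter(loader))  # consume one batch
checkpoint = loader.state_dict()  # capture where we stopped

# A fresh loader picks up where the first one left off
resumed = ResumableBatches(list(range(10)), batch_size=4)
resumed.load_state_dict(checkpoint)
remaining = list(resumed)
```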
## 🎥 Profiling
The `StreamingDataLoader` supports profiling your data loading. Simply use the `profile_batches` argument as follows:
```python
from lightning_data import StreamingDataset, StreamingDataLoader
StreamingDataLoader(..., profile_batches=5)
```
This generates a Chrome trace called `result.json`. To visualize it, open `chrome://tracing` in the Chrome browser and load the trace file.
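If you prefer a quick textual summary over the trace viewer, a trace in the Chrome Trace Event Format can also be inspected with a few lines of Python. The sketch below assumes the file holds complete events (`"ph": "X"`) with a `dur` field in microseconds, either as a top-level list or under a `traceEvents` key. The event names are synthetic, and the actual contents of `result.json` may differ.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace):
    """Sum the duration (in microseconds) of complete events per event name."""
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":  # "X" marks a complete event with a duration
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


# Synthetic trace for illustration; for a real run you would instead do:
#     with open("result.json") as f:
#         trace = json.load(f)
trace = json.loads(
    '{"traceEvents": ['
    '{"name": "load", "ph": "X", "ts": 0, "dur": 120},'
    '{"name": "load", "ph": "X", "ts": 200, "dur": 80},'
    '{"name": "collate", "ph": "X", "ts": 300, "dur": 50}]}'
)
totals = total_duration_by_name(trace)
```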
## 🪇 Random access
Access the data you need when you need it.
```python
from lightning_data import StreamingDataset
dataset = StreamingDataset(...)
print(len(dataset)) # display the length of your data
print(dataset[42])  # show the item at index 42
```
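Random access over chunked storage boils down to translating a global index into a chunk and a position inside it. The sketch below shows one way to do that with cumulative sizes and a binary search. The `locate` helper and the chunk sizes are hypothetical and do not reflect the library's internal index format.

```python
import bisect
from itertools import accumulate


def locate(index, chunk_sizes):
    """Map a global item index to (chunk_id, offset_inside_chunk)."""
    ends = list(accumulate(chunk_sizes))  # cumulative item counts per chunk
    if not 0 <= index < ends[-1]:
        raise IndexError(index)
    chunk_id = bisect.bisect_right(ends, index)
    start = ends[chunk_id - 1] if chunk_id else 0
    return chunk_id, index - start


chunk_sizes = [100, 100, 50]  # hypothetical number of items in each chunk
position = locate(142, chunk_sizes)
```

Only the chunk that holds the requested item needs to be fetched, which is why indexing a single element does not require downloading the whole dataset.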
## ✢ Use data transforms
```python
from lightning_data import StreamingDataset, StreamingDataLoader
import torchvision.transforms.v2.functional as F


class ImagenetStreamingDataset(StreamingDataset):

    def __getitem__(self, index):
        image = super().__getitem__(index)
        return F.resize(image, (224, 224))


dataset = ImagenetStreamingDataset(...)
dataloader = StreamingDataLoader(dataset, batch_size=4)

for batch in dataloader:
    print(batch.shape)
    # Out: (4, 3, 224, 224)
```
## ⚙️ Disk usage limits
Limit the size of the cache holding the chunks.
```python
from lightning_data import StreamingDataset
dataset = StreamingDataset(..., max_cache_size="10GB")
```
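A bounded cache has to decide which chunks to drop once the limit is reached. The sketch below uses a least-recently-used policy as one plausible choice. Both the policy and the `ChunkCache` class are assumptions for illustration; the eviction strategy behind `max_cache_size` may be different.

```python
from collections import OrderedDict


class ChunkCache:
    """Hypothetical size-bounded cache that evicts the least recently used chunk."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.chunks = OrderedDict()  # chunk name -> size in bytes

    def touch(self, name, size):
        if name in self.chunks:
            self.chunks.move_to_end(name)  # mark as most recently used
            return
        self.chunks[name] = size
        self.used_bytes += size
        # Evict old chunks, but never the one that was just added
        while self.used_bytes > self.max_bytes and len(self.chunks) > 1:
            _, evicted_size = self.chunks.popitem(last=False)
            self.used_bytes -= evicted_size


cache = ChunkCache(max_bytes=100)
for name in ["chunk-0", "chunk-1", "chunk-0", "chunk-2"]:
    cache.touch(name, size=40)
kept = list(cache.chunks)
```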
## 💾 Support yield
When processing large files like compressed [parquet files](https://en.wikipedia.org/wiki/Apache_Parquet), you can use Python's `yield` to process and store one item at a time.
```python
from pathlib import Path
import pyarrow.parquet as pq
from lightning_data import optimize
from tokenizer import Tokenizer
from functools import partial


# 1. Define a function to convert the text within the parquet files into tokens
def tokenize_fn(filepath, tokenizer=None):
    parquet_file = pq.ParquetFile(filepath)
    # Process per batch to reduce RAM usage
    for batch in parquet_file.iter_batches(batch_size=8192, columns=["content"]):
        for text in batch.to_pandas()["content"]:
            yield tokenizer.encode(text, bos=False, eos=True)


# 2. Generate the inputs
input_dir = "/teamspace/s3_connections/tinyllama-template"
inputs = [str(file) for file in Path(f"{input_dir}/starcoderdata").rglob("*.parquet")]

# 3. Store the optimized data wherever you want under "/teamspace/datasets" or "/teamspace/s3_connections"
outputs = optimize(
    fn=partial(tokenize_fn, tokenizer=Tokenizer(f"{input_dir}/checkpoints/Llama-2-7b-hf")),  # Note: You can use HF tokenizer or any others
    inputs=inputs,
    output_dir="/teamspace/datasets/starcoderdata",
    chunk_size=(2049 * 8012),
)
```
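The key point of the generator form is that one input can produce any number of stored items without materializing them all in memory. The toy example below shows that shape with plain Python. `split_words` and `collect` are hypothetical names, and `collect` only imitates how a consumer might drain a generator function.

```python
def split_words(line):
    """Generator: one input can produce any number of stored items."""
    for word in line.split():
        yield word


def collect(fn, inputs):
    """Hypothetical consumer: store every item a generator function yields."""
    stored = []
    for item in inputs:
        for produced in fn(item):  # items arrive one at a time
            stored.append(produced)
    return stored


stored = collect(split_words, ["stream the data", "when needed"])
```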
# ⚡ Contributors
We welcome any contributions, pull requests, or issues. If you use the Streaming Dataset for your own project, please reach out to us on Slack or Discord.