litdata


| Field | Value |
|---|---|
| Name | litdata |
| Version | 0.2.36 |
| Home page | https://github.com/Lightning-AI/litdata |
| Summary | The Deep Learning framework to train, deploy, and ship AI products Lightning fast. |
| Upload time | 2025-01-14 23:02:59 |
| Maintainer | None |
| Docs URL | None |
| Author | Lightning AI et al. |
| Requires Python | >=3.8 |
| License | Apache-2.0 |
| Keywords | deep learning pytorch ai streaming cloud data processing |
| Requirements | torch lightning-utilities filelock numpy boto3 requests tifffile |

<div align="center">
<img src="https://pl-flash-data.s3.amazonaws.com/lit_data_logo.webp" alt="LitData" width="800px"/>

&nbsp;
&nbsp;

**Transform datasets at scale.    
Optimize data for fast AI model training.**


<pre>
Transform                              Optimize
  
✅ Parallelize data processing       ✅ Stream large cloud datasets          
✅ Create vector embeddings          ✅ Accelerate training by 20x           
✅ Run distributed inference         ✅ Pause and resume data streaming      
✅ Scrape websites at scale          ✅ Use remote data without local loading
</pre>

---

![PyPI](https://img.shields.io/pypi/v/litdata)
![Downloads](https://img.shields.io/pypi/dm/litdata)
![License](https://img.shields.io/github/license/Lightning-AI/litdata)
[![Discord](https://img.shields.io/discord/1077906959069626439?label=Get%20Help%20on%20Discord)](https://discord.gg/VptPCZkGNa)

<p align="center">
  <a href="https://lightning.ai/">Lightning AI</a> •
  <a href="#quick-start">Quick start</a> •
  <a href="#speed-up-model-training">Optimize data</a> •
  <a href="#transform-datasets">Transform data</a> •
  <a href="#key-features">Features</a> •
  <a href="#benchmarks">Benchmarks</a> •
  <a href="#start-from-a-template">Templates</a> •
  <a href="#community">Community</a>
</p>

&nbsp;

<a target="_blank" href="https://lightning.ai/docs/overview/prep-data/optimize-datasets-for-model-training-speed">
  <img src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/app-2/get-started-badge.svg" height="36px" alt="Get started"/>
</a>

</div>

&nbsp;

# Transform data at scale. Optimize for fast model training.
LitData scales [data processing tasks](#transform-datasets) (data scraping, image resizing, distributed inference, embedding creation) on local or cloud machines. It also enables [optimizing datasets](#speed-up-model-training) to accelerate AI model training and work with large remote datasets without local loading.

&nbsp;

# Quick start
First, install LitData:

```bash
pip install litdata
```

Choose your workflow:

🚀 [Speed up model training](#speed-up-model-training)    
🚀 [Transform datasets](#transform-datasets)

&nbsp;

<details>
  <summary>Advanced install</summary>

Install all the extras:
```bash
pip install 'litdata[extras]'
```

</details>

&nbsp;

----

# Speed up model training
Accelerate model training (20x faster) by optimizing datasets for streaming directly from cloud storage. Work with remote data without downloading it locally, using features such as loading data subsets, accessing individual samples, and resumable streaming.

**Step 1: Optimize the data**

This step formats the dataset for fast loading by writing it in a chunked binary format.

```python
import numpy as np
from PIL import Image
import litdata as ld

def random_images(index):
    fake_images = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
    fake_labels = np.random.randint(10)

    # You can use any key:value pairs. Note that their types must not change between samples, and Python lists must
    # always contain the same number of elements with the same types.
    data = {"index": index, "image": fake_images, "class": fake_labels}

    return data

if __name__ == "__main__":
    # The optimize function writes data in an optimized format.
    ld.optimize(
        fn=random_images,                   # the function applied to each input
        inputs=list(range(1000)),           # the inputs to the function (here it's a list of numbers)
        output_dir="fast_data",             # optimized data is stored here
        num_workers=4,                      # The number of workers on the same machine
        chunk_bytes="64MB"                  # size of each chunk
    )
```
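
To build intuition for the chunked layout, here is a minimal sketch of greedy packing: samples are appended to the current chunk until a byte budget would be exceeded. It is an illustration only, not LitData's writer; `pack_into_chunks`, the sample sizes, and the tiny 100-byte budget are all hypothetical.

```python
from typing import List


def pack_into_chunks(sample_sizes: List[int], chunk_bytes: int) -> List[List[int]]:
    """Group sample indices into chunks whose total size stays within chunk_bytes.

    Hypothetical helper for illustration; a sample larger than the budget
    gets a chunk of its own.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, size in enumerate(sample_sizes):
        # Start a new chunk when adding this sample would exceed the budget.
        if current and used + size > chunk_bytes:
            chunks.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        chunks.append(current)
    return chunks


# Ten samples of 30 bytes each with a 100-byte budget: three samples fit per chunk.
chunks = pack_into_chunks([30] * 10, chunk_bytes=100)
print(chunks)
```

Larger budgets mean fewer, bigger files to fetch, which is the trade-off the `chunk_bytes` argument controls.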

**Step 2: Put the data on the cloud**

Upload the data to a [Lightning Studio](https://lightning.ai) (backed by S3) or your own S3 bucket:
```bash
aws s3 cp --recursive fast_data s3://my-bucket/fast_data
```

**Step 3: Stream the data during training**

Load the data by replacing the PyTorch `Dataset` and `DataLoader` with the `StreamingDataset` and `StreamingDataLoader`:

```python
import litdata as ld

train_dataset = ld.StreamingDataset('s3://my-bucket/fast_data', shuffle=True, drop_last=True)
train_dataloader = ld.StreamingDataLoader(train_dataset)

for sample in train_dataloader:
    img, cls = sample['image'], sample['class']
```

**Key benefits:**

✅ Accelerate training:       Optimized datasets load 20x faster.      
✅ Stream cloud datasets:     Work with cloud data without downloading it.    
✅ PyTorch-first:             Works with PyTorch libraries like PyTorch Lightning, Lightning Fabric, Hugging Face.    
✅ Easy collaboration:        Share and access datasets in the cloud, streamlining team projects.     
✅ Scale across GPUs:         Streamed data automatically scales to all GPUs.      
✅ Flexible storage:          Use S3, GCS, Azure, or your own cloud account for data storage.    
✅ Compression:               Reduce your data footprint by using advanced compression algorithms.  
✅ Run local or cloud:        Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.         
✅ Enterprise security:       Self host or process data on your cloud account with Lightning Studios.  

&nbsp;

----

# Transform datasets
Accelerate data processing tasks (data scraping, image resizing, embedding creation, distributed inference) by parallelizing (map) the work across many machines at once.

Here's an example that resizes and crops a large image dataset:

```python
import os

from PIL import Image
import litdata as ld

# use a local or S3 folder
input_dir = "my_large_images"     # or "s3://my-bucket/my_large_images"
output_dir = "my_resized_images"  # or "s3://my-bucket/my_resized_images"

inputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]

# resize the input image
def resize_image(image_path, output_dir):
  output_image_path = os.path.join(output_dir, os.path.basename(image_path))
  Image.open(image_path).resize((224, 224)).save(output_image_path)

ld.map(
    fn=resize_image,
    inputs=inputs,
    output_dir=output_dir,  # pass the variable defined above, not the literal string
)
```

**Key benefits:**

✅ Parallelize processing:    Reduce processing time by transforming data across multiple machines simultaneously.    
✅ Scale to large data:       Increase the size of datasets you can efficiently handle.    
✅ Flexible use cases:        Resize images, create embeddings, scrape the internet, etc.    
✅ Run local or cloud:        Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.         
✅ Enterprise security:       Self host or process data on your cloud account with Lightning Studios.  

&nbsp;

----

# Key Features

## Features for optimizing and streaming datasets for model training

<details>
  <summary> ✅ Stream large cloud datasets</summary>
&nbsp;

Use data stored on the cloud without needing to download it all to your computer, saving time and space.

Imagine you're working on a project with a huge amount of data stored online. Instead of waiting hours to download it all, you can start working with the data almost immediately by streaming it.

Once you've optimized the dataset with LitData, stream it as follows:
```python
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset('s3://my-bucket/my-data', shuffle=True)
dataloader = StreamingDataLoader(dataset, batch_size=64)

for batch in dataloader:
    process(batch)  # Replace with your data processing logic

```


Additionally, you can inject client connection settings for [S3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html#boto3.session.Session.client) or GCP when initializing your dataset. This is useful for specifying custom endpoints and credentials per dataset.

```python
from litdata import StreamingDataset

storage_options = {
    "endpoint_url": "your_endpoint_url",
    "aws_access_key_id": "your_access_key_id",
    "aws_secret_access_key": "your_secret_access_key",
}

dataset = StreamingDataset('s3://my-bucket/my-data', storage_options=storage_options)
```


Also, you can specify a custom cache directory when initializing your dataset. This is useful when you want to store the cache in a specific location.
```python
from litdata import StreamingDataset

# Initialize the StreamingDataset with the custom cache directory
dataset = StreamingDataset('s3://my-bucket/my-data', cache_dir="/path/to/cache")
```

</details>

<details>
  <summary> ✅ Streams on multi-GPU, multi-node</summary>

&nbsp;

Data optimized and loaded with LitData streams efficiently in distributed training across multiple GPUs and multiple nodes.

The `StreamingDataset` and `StreamingDataLoader` ensure that each rank receives the same number of batches, each containing different data, so they work out of the box with your favorite frameworks ([PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/), [Lightning Fabric](https://lightning.ai/docs/fabric/stable/), or [PyTorch](https://pytorch.org/docs/stable/index.html)) for distributed training.

The illustration at the end of this section shows how the `StreamingDataset` works under the hood with multiple nodes and multiple GPUs.

```python
from litdata import StreamingDataset, StreamingDataLoader

# For the training dataset, remember to enable shuffle and drop_last.
train_dataset = StreamingDataset('s3://my-bucket/my-train-data', shuffle=True, drop_last=True)
train_dataloader = StreamingDataLoader(train_dataset, batch_size=64)

for batch in train_dataloader:
    process(batch)  # Replace with your data processing logic

val_dataset = StreamingDataset('s3://my-bucket/my-val-data', shuffle=False, drop_last=False)
val_dataloader = StreamingDataLoader(val_dataset, batch_size=64)

for batch in val_dataloader:
    process(batch)  # Replace with your data processing logic
```

![An illustration showing how the Streaming Dataset works with multi node.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)
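
As a rough mental model of "each rank receives the same quantity of data", the sketch below assigns sample indices to ranks in a strided fashion and drops the remainder. This is an assumption-laden illustration, not LitData's actual distribution logic; `indices_for_rank` is a hypothetical helper.

```python
from typing import List


def indices_for_rank(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Return the sample indices a given rank would read (hypothetical helper).

    Trailing samples that cannot be shared evenly are dropped, mirroring the
    intent of drop_last=True.
    """
    per_rank = num_samples // world_size
    usable = per_rank * world_size
    # Strided assignment: rank r reads r, r + world_size, r + 2 * world_size, ...
    return list(range(rank, usable, world_size))


shards = [indices_for_rank(num_samples=10, world_size=4, rank=r) for r in range(4)]
print(shards)
```

Every rank ends up with the same number of samples and no sample is read twice, which is what keeps distributed training steps in lockstep.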

</details>

<details>
  <summary> ✅ Stream from multiple cloud providers</summary>

&nbsp;

The StreamingDataset supports reading optimized datasets from common cloud providers. 

```python
import os
import litdata as ld

# Read data from AWS S3
aws_storage_options={
    "AWS_ACCESS_KEY_ID": os.environ['AWS_ACCESS_KEY_ID'],
    "AWS_SECRET_ACCESS_KEY": os.environ['AWS_SECRET_ACCESS_KEY'],
}
dataset = ld.StreamingDataset("s3://my-bucket/my-data", storage_options=aws_storage_options)

# Read data from GCS
gcp_storage_options={
    "project": os.environ['PROJECT_ID'],
}
dataset = ld.StreamingDataset("gs://my-bucket/my-data", storage_options=gcp_storage_options)

# Read data from Azure
azure_storage_options={
    "account_url": f"https://{os.environ['AZURE_ACCOUNT_NAME']}.blob.core.windows.net",
    "credential": os.environ['AZURE_ACCOUNT_ACCESS_KEY']
}
dataset = ld.StreamingDataset("azure://my-bucket/my-data", storage_options=azure_storage_options)
```

</details>  

<details>
  <summary> ✅ Pause, resume data streaming</summary>
&nbsp;

Stream data during long training runs and, if interrupted, pick up right where you left off.

LitData provides a stateful `StreamingDataLoader`, so you can pause and resume your training whenever you want.

Info: The `StreamingDataLoader` was used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs. Restarting from an older checkpoint was critical to pretraining the full model because of several failures (network, CUDA errors, etc.).

```python
import os
import torch
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("s3://my-bucket/my-data", shuffle=True)
dataloader = StreamingDataLoader(dataset, num_workers=os.cpu_count(), batch_size=64)

# Restore the dataLoader state if it exists
if os.path.isfile("dataloader_state.pt"):
    state_dict = torch.load("dataloader_state.pt")
    dataloader.load_state_dict(state_dict)

# Iterate over the data
for batch_idx, batch in enumerate(dataloader):

    # Store the state every 1000 batches
    if batch_idx % 1000 == 0:
        torch.save(dataloader.state_dict(), "dataloader_state.pt")
```
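
The save and restore pattern above can be tried without any cloud data. The following toy class is a minimal sketch of a stateful iterator exposing `state_dict` and `load_state_dict`; `ResumableIterator` is hypothetical and much simpler than the real `StreamingDataLoader`, which also has to track workers and shuffling.

```python
from typing import Dict, Iterator, List


class ResumableIterator:
    """Toy iterator that remembers how many items it has already yielded."""

    def __init__(self, items: List[int]) -> None:
        self.items = items
        self.position = 0

    def __iter__(self) -> Iterator[int]:
        while self.position < len(self.items):
            item = self.items[self.position]
            self.position += 1
            yield item

    def state_dict(self) -> Dict[str, int]:
        return {"position": self.position}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self.position = state["position"]


first_run = ResumableIterator(list(range(10)))
seen = []
for item in first_run:
    seen.append(item)
    if len(seen) == 4:
        break  # simulate an interruption after four items
checkpoint = first_run.state_dict()

# A fresh object restored from the checkpoint continues where the first stopped.
second_run = ResumableIterator(list(range(10)))
second_run.load_state_dict(checkpoint)
remaining = list(second_run)
print(seen, remaining)
```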

</details>


<details>
  <summary> ✅ LLM Pre-training </summary>
&nbsp;

LitData is highly optimized for LLM pre-training. First, we need to tokenize the entire dataset and then we can consume it.

```python
import json
from pathlib import Path
import zstandard as zstd
from litdata import optimize, TokensLoader
from tokenizer import Tokenizer
from functools import partial

# 1. Define a function to convert the text within the jsonl files into tokens
def tokenize_fn(filepath, tokenizer=None):
    with zstd.open(open(filepath, "rb"), "rt", encoding="utf-8") as f:
        for row in f:
            record = json.loads(row)  # parse each JSON line once
            if record["meta"]["redpajama_set_name"] == "RedPajamaGithub":
                continue  # exclude the GitHub data since it overlaps with starcoder
            text_ids = tokenizer.encode(record["text"], bos=False, eos=True)
            yield text_ids

if __name__ == "__main__":
    # 2. Generate the inputs (we are going to optimize all the compressed json files from SlimPajama dataset )
    input_dir = "./slimpajama-raw"
    inputs = [str(file) for file in Path(f"{input_dir}/SlimPajama-627B/train").rglob("*.zst")]

    # 3. Store the optimized data wherever you want under "/teamspace/datasets" or "/teamspace/s3_connections"
    outputs = optimize(
        fn=partial(tokenize_fn, tokenizer=Tokenizer(f"{input_dir}/checkpoints/Llama-2-7b-hf")), # Note: You can use HF tokenizer or any others
        inputs=inputs,
        output_dir="./slimpajama-optimized",
        chunk_size=(2049 * 8012),
        # This is important to inform LitData that we are encoding contiguous 1D array (tokens). 
        # LitData skips storing metadata for each sample e.g all the tokens are concatenated to form one large tensor.
        item_loader=TokensLoader(),
    )
```

```python
import os
from litdata import StreamingDataset, StreamingDataLoader, TokensLoader
from tqdm import tqdm

# Increase by one because we need the next word as well
dataset = StreamingDataset(
  input_dir="./slimpajama-optimized",  # must match the output_dir passed to optimize
  item_loader=TokensLoader(block_size=2048 + 1),
  shuffle=True,
  drop_last=True,
)

train_dataloader = StreamingDataLoader(dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())

# Iterate over the SlimPajama dataset
for batch in tqdm(train_dataloader):
    pass
```
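
The `2048 + 1` block size exists because next-token prediction needs targets that are the inputs shifted by one position. The sketch below shows that relationship on a plain Python list; `split_blocks` is a hypothetical helper and does not reflect how `TokensLoader` is implemented.

```python
from typing import List, Tuple


def split_blocks(tokens: List[int], block_size: int) -> List[Tuple[List[int], List[int]]]:
    """Cut a flat token stream into (inputs, targets) pairs (hypothetical helper).

    Each block holds block_size tokens; the targets are the inputs shifted by one,
    which is why one extra token per block is needed.
    """
    pairs = []
    # Only full blocks are kept, similar in spirit to drop_last=True.
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        pairs.append((block[:-1], block[1:]))
    return pairs


# A context length of 4 needs blocks of 4 + 1 tokens.
pairs = split_blocks(list(range(12)), block_size=4 + 1)
print(pairs)
```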

</details>

<details>
  <summary> ✅ Filter illegal data </summary>
&nbsp;

Sometimes you have bad data that you don't want to include in the optimized dataset. With LitData, yield only the good samples.


```python
from litdata import optimize, StreamingDataset

def should_keep(index) -> bool:
  # Replace with your own logic
  return index % 2 == 0


def fn(data):
    if should_keep(data):
        yield data

if __name__ == "__main__":
    optimize(
        fn=fn,
        inputs=list(range(1000)),
        output_dir="only_even_index_optimized",
        chunk_bytes="64MB",
        num_workers=1
    )

    dataset = StreamingDataset("only_even_index_optimized")
    data = list(dataset)
    print(data)
    # [0, 2, 4, 6, 8, 10, ..., 992, 994, 996, 998]
```

You can even use try/except.

```python
from litdata import optimize, StreamingDataset

def fn(data):
    try:
        yield 1 / data 
    except ZeroDivisionError:
        pass

if __name__ == "__main__":
    optimize(
        fn=fn,
        inputs=[0, 0, 0, 1, 2, 4, 0],
        output_dir="only_defined_ratio_optimized",
        chunk_bytes="64MB",
        num_workers=1
    )

    dataset = StreamingDataset("only_defined_ratio_optimized")
    data = list(dataset)
    # The zeros are filtered out because dividing by them raises ZeroDivisionError
    print(data)
    # [1.0, 0.5, 0.25] 
```
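
The filtering works because a generator that yields nothing contributes no samples. Here is a plain-Python sketch of that behavior, with no LitData involved; `safe_inverse` is a hypothetical name, and the flattening step only stands in for whatever collects the yielded items.

```python
from itertools import chain


def safe_inverse(value):
    """Yield 1 / value, or nothing at all when the division is undefined."""
    try:
        yield 1 / value
    except ZeroDivisionError:
        pass


inputs = [0, 0, 0, 1, 2, 4, 0]
# Flatten the per-input generators; inputs that yield nothing simply vanish.
kept = list(chain.from_iterable(safe_inverse(v) for v in inputs))
print(kept)
```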
</details>

<details>
  <summary> ✅ Combine datasets</summary>
&nbsp;

Mix and match different sets of data to experiment and create better models.

Combine datasets with `CombinedStreamingDataset`. As an example, this mixture of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) and [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) was used in the [TinyLlama](https://github.com/jzhang38/TinyLlama) project to pretrain a 1.1B Llama model on 3 trillion tokens.

```python
from litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader, TokensLoader
from tqdm import tqdm
import os

train_datasets = [
    StreamingDataset(
        input_dir="s3://tinyllama-template/slimpajama/train/",
        item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs
        shuffle=True,
        drop_last=True,
    ),
    StreamingDataset(
        input_dir="s3://tinyllama-template/starcoder/",
        item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs
        shuffle=True,
        drop_last=True,
    ),
]

# Mix SlimPajama data and Starcoder data with these proportions:
weights = (0.693584, 0.306416)
combined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights, iterate_over_all=False)

train_dataloader = StreamingDataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())

# Iterate over the combined datasets
for batch in tqdm(train_dataloader):
    pass
```
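
To see what the weights mean in practice, the sketch below draws a dataset index for each sample using the same proportions. It uses only the standard library and is an illustration of weighted sampling in general, not the sampling code inside `CombinedStreamingDataset`.

```python
import random
from collections import Counter

weights = (0.693584, 0.306416)
rng = random.Random(42)

# For each draw, pick which dataset (0 or 1) supplies the next sample.
draws = rng.choices(population=[0, 1], weights=weights, k=10_000)
counts = Counter(draws)
fraction_first = counts[0] / len(draws)
print(f"share drawn from dataset 0: {fraction_first:.3f}")
```

Over many draws the observed share approaches the configured weight, so the mixture is controlled statistically rather than sample by sample.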
</details>

<details>
  <summary> ✅ Merge datasets</summary>
&nbsp;

Merge multiple optimized datasets into one.

```python
import numpy as np
from PIL import Image

from litdata import StreamingDataset, merge_datasets, optimize


def random_images(index):
    return {
        "index": index,
        "image": Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)),
        "class": np.random.randint(10),
    }


if __name__ == "__main__":
    out_dirs = ["fast_data_1", "fast_data_2", "fast_data_3", "fast_data_4"]  # or ["s3://my-bucket/fast_data_1", etc.]
    for out_dir in out_dirs:
        optimize(fn=random_images, inputs=list(range(250)), output_dir=out_dir, num_workers=4, chunk_bytes="64MB")

    merged_out_dir = "merged_fast_data" # or "s3://my-bucket/merged_fast_data"
    merge_datasets(input_dirs=out_dirs, output_dir=merged_out_dir)

    dataset = StreamingDataset(merged_out_dir)
    print(len(dataset))
    # out: 1000
```
</details>

<details>
  <summary> ✅ Split datasets for train, val, test</summary>

&nbsp;

Split a dataset into train, val, test splits with `train_test_split`.

```python
from litdata import StreamingDataset, train_test_split

dataset = StreamingDataset("s3://my-bucket/my-data") # data are stored in the cloud

print(len(dataset)) # display the length of your data
# out: 100,000

train_dataset, val_dataset, test_dataset = train_test_split(dataset, splits=[0.3, 0.2, 0.5])

print(len(train_dataset))
# out: 30,000

print(len(val_dataset))
# out: 20,000

print(len(test_dataset))
# out: 50,000
```
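
The split sizes follow directly from the fractions. The helper below is a small arithmetic sketch; `split_sizes` is hypothetical and makes no claim about how `train_test_split` assigns individual samples.

```python
from typing import List, Sequence


def split_sizes(total: int, splits: Sequence[float]) -> List[int]:
    """Number of samples per split for the given fractions (hypothetical helper)."""
    if sum(splits) > 1.0 + 1e-9:
        raise ValueError("split fractions must not sum to more than 1")
    # round() guards against floating-point noise such as 0.3 * 100_000.
    return [round(total * fraction) for fraction in splits]


sizes = split_sizes(100_000, [0.3, 0.2, 0.5])
print(sizes)
```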

</details>

<details>
  <summary> ✅ Load a subset of the remote dataset</summary>

&nbsp;

Work on a smaller, manageable portion of your data to save time and resources.


```python
from litdata import StreamingDataset

dataset = StreamingDataset("s3://my-bucket/my-data", subsample=0.01) # data are stored in the cloud

print(len(dataset)) # display the length of your data
# out: 1000
```

</details>

<details>
  <summary> ✅ Easily modify optimized cloud datasets</summary>
&nbsp;

Add new data to an existing dataset or start fresh if needed, providing flexibility in data management.

LitData optimized datasets are assumed to be immutable. However, you can modify them by setting `mode` to either `append` or `overwrite`.

```python
from litdata import optimize, StreamingDataset

def compress(index):
    return index, index**2

if __name__ == "__main__":
    # Add some data
    optimize(
        fn=compress,
        inputs=list(range(100)),
        output_dir="./my_optimized_dataset",
        chunk_bytes="64MB",
    )

    # Later on, you add more data
    optimize(
        fn=compress,
        inputs=list(range(100, 200)),
        output_dir="./my_optimized_dataset",
        chunk_bytes="64MB",
        mode="append",
    )

    ds = StreamingDataset("./my_optimized_dataset")
    assert len(ds) == 200
    assert ds[:] == [(i, i**2) for i in range(200)]
```

The `overwrite` mode deletes the existing data and starts fresh.

</details>

<details>
  <summary> ✅ Use compression</summary>
&nbsp;

Reduce your data footprint by using advanced compression algorithms.

```python
import litdata as ld

def compress(index):
    return index, index**2

if __name__ == "__main__":
    # Add some data
    ld.optimize(
        fn=compress,
        inputs=list(range(100)),
        output_dir="./my_optimized_dataset",
        chunk_bytes="64MB",
        num_workers=1,
        compression="zstd"
    )
```

Using [zstd](https://github.com/facebook/zstd), you can achieve a high compression ratio, such as 4.34x for this simple example.

| Without compression | With compression |
| -------- | -------- |
| 2.8 kB | 646 B |
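
A compression ratio is the uncompressed size divided by the compressed size. The sketch below measures one on similar `(index, index**2)` data, using `zlib` from the standard library as a stand-in for zstd and an assumed fixed-width binary encoding, so the exact numbers will differ from the table above.

```python
import struct
import zlib

# Pack (index, index**2) pairs as 64-bit little-endian integers.
raw = b"".join(struct.pack("<qq", i, i**2) for i in range(100))
compressed = zlib.compress(raw, level=9)

ratio = len(raw) / len(compressed)
print(f"{len(raw)} B -> {len(compressed)} B ({ratio:.2f}x)")
```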


</details>

<details>
  <summary> ✅ Access samples without full data download</summary>
&nbsp;

Look at specific parts of a large dataset without downloading the whole thing or loading it on a local machine.

```python
from litdata import StreamingDataset

dataset = StreamingDataset("s3://my-bucket/my-data") # data are stored in the cloud

print(len(dataset)) # display the length of your data

print(dataset[42]) # show the element at index 42 of the dataset
```

</details>

<details>
  <summary> ✅ Use any data transforms</summary>
&nbsp;

Customize how your data is processed to better fit your needs.

Subclass the `StreamingDataset` and override its `__getitem__` method to add any extra data transformations.

```python
from litdata import StreamingDataset, StreamingDataLoader
import torchvision.transforms.v2.functional as F

class ImagenetStreamingDataset(StreamingDataset):

    def __getitem__(self, index):
        image = super().__getitem__(index)
        return F.resize(image, (224, 224))

dataset = ImagenetStreamingDataset(...)
dataloader = StreamingDataLoader(dataset, batch_size=4)

for batch in dataloader:
    print(batch.shape)
    # Out: (4, 3, 224, 224)
```

</details>

<details>
  <summary> ✅ Profile data loading speed</summary>
&nbsp;

Measure and optimize how fast your data is being loaded, improving efficiency.

The `StreamingDataLoader` supports profiling of your data loading process. Simply use the `profile_batches` argument to specify the number of batches you want to profile:

```python
from litdata import StreamingDataset, StreamingDataLoader

StreamingDataLoader(..., profile_batches=5)
```

This generates a Chrome trace called `result.json`. To visualize it, open `chrome://tracing` in the Chrome browser and load the file.

</details>

<details>
  <summary> ✅ Reduce memory use for large files</summary>
&nbsp;

Handle large data files efficiently without using too much of your computer's memory.

When processing large files like compressed [parquet files](https://en.wikipedia.org/wiki/Apache_Parquet), use the Python `yield` keyword to process and store one item at a time, reducing the memory footprint of the entire program.

```python
from pathlib import Path
import pyarrow.parquet as pq
from litdata import optimize
from tokenizer import Tokenizer
from functools import partial

# 1. Define a function to convert the text within the parquet files into tokens
def tokenize_fn(filepath, tokenizer=None):
    parquet_file = pq.ParquetFile(filepath)
    # Process per batch to reduce RAM usage
    for batch in parquet_file.iter_batches(batch_size=8192, columns=["content"]):
        for text in batch.to_pandas()["content"]:
            yield tokenizer.encode(text, bos=False, eos=True)

# 2. Generate the inputs
input_dir = "/teamspace/s3_connections/tinyllama-template"
inputs = [str(file) for file in Path(f"{input_dir}/starcoderdata").rglob("*.parquet")]

# 3. Store the optimized data wherever you want under "/teamspace/datasets" or "/teamspace/s3_connections"
outputs = optimize(
    fn=partial(tokenize_fn, tokenizer=Tokenizer(f"{input_dir}/checkpoints/Llama-2-7b-hf")), # Note: Use HF tokenizer or any others
    inputs=inputs,
    output_dir="/teamspace/datasets/starcoderdata",
    chunk_size=(2049 * 8012), # Number of tokens to store by chunks. This is roughly 64MB of tokens per chunk.
)
```

</details>

<details>
  <summary> ✅ Limit local cache space</summary>
&nbsp;

Limit the amount of disk space used by temporary files, preventing storage issues.

Adapt the local caching limit of the `StreamingDataset`. This is useful to make sure the downloaded data chunks are deleted when used and the disk usage stays low.

```python
from litdata import StreamingDataset

dataset = StreamingDataset(..., max_cache_size="10GB")
```

</details>

<details>
  <summary> ✅ Change cache directory path</summary>
&nbsp;

Specify the directory where cached files should be stored, ensuring efficient data retrieval and management. This is particularly useful for organizing your data storage and improving access times.

```python
from litdata import StreamingDataset
from litdata.streaming.cache import Dir

cache_dir = "/path/to/your/cache"
data_dir = "s3://my-bucket/my_optimized_dataset"

dataset = StreamingDataset(input_dir=Dir(path=cache_dir, url=data_dir))
```

</details>

<details>
  <summary> ✅ Optimize loading on networked drives</summary>
&nbsp;

Optimize data handling for computers on a local network to improve performance for on-site setups.

On-prem compute nodes can mount and use a network drive. A network drive is a shared storage device on a local area network. In order to reduce their network overload, the `StreamingDataset` supports `caching` the data chunks.

```python
from litdata import StreamingDataset

dataset = StreamingDataset(input_dir="local:/data/shared-drive/some-data")
```

</details>

<details>
  <summary> ✅ Optimize dataset in distributed environment</summary>
&nbsp;

Lightning can distribute large workloads across hundreds of machines in parallel. This can reduce the time to complete a data processing task from weeks to minutes by scaling to enough machines.

To apply the `optimize` operator across multiple machines, provide the `num_nodes` and `machine` arguments as follows:

```python
import os
from litdata import optimize, Machine

def compress(index):
    return (index, index ** 2)

optimize(
    fn=compress,
    inputs=list(range(100)),
    num_workers=2,
    output_dir="my_output",
    chunk_bytes="64MB",
    num_nodes=2,
    machine=Machine.DATA_PREP, # You can select between dozens of optimized machines
)
```

If the `output_dir` is a local path, the optimized dataset will be present in: `/teamspace/jobs/{job_name}/nodes-0/my_output`. Otherwise, it will be stored in the specified `output_dir`.

Read the optimized dataset:

```python
from litdata import StreamingDataset

output_dir = "/teamspace/jobs/litdata-optimize-2024-07-08/nodes.0/my_output"

dataset = StreamingDataset(output_dir)

print(dataset[:])
```

</details>

<details>
  <summary> ✅ Encrypt, decrypt data at chunk/sample level</summary>
&nbsp;

Secure data by applying encryption to individual samples or chunks, ensuring sensitive information is protected during storage.

This example shows how to use the `FernetEncryption` class for sample-level encryption with a data optimization function.

```python
from litdata import optimize
from litdata.utilities.encryption import FernetEncryption
import numpy as np
from PIL import Image

# Initialize FernetEncryption with a password for sample-level encryption
fernet = FernetEncryption(password="your_secure_password", level="sample")
data_dir = "s3://my-bucket/optimized_data"

def random_image(index):
    """Generate a random image for demonstration purposes."""
    fake_img = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
    return {"image": fake_img, "class": index}

# Optimize data while applying encryption
optimize(
    fn=random_image,
    inputs=list(range(5)),  # Example inputs: [0, 1, 2, 3, 4]
    num_workers=1,
    output_dir=data_dir,
    chunk_bytes="64MB",
    encryption=fernet,
)

# Save the encryption key to a file for later use
fernet.save("fernet.pem")
```

Load the encrypted data using the `StreamingDataset` class as follows:

```python
from litdata import StreamingDataset
from litdata.utilities.encryption import FernetEncryption

# Load the encryption key
fernet = FernetEncryption(password="your_secure_password", level="sample")
fernet.load("fernet.pem")

# Create a streaming dataset for reading the encrypted samples
data_dir = "s3://my-bucket/optimized_data"  # same location the encrypted data was written to
ds = StreamingDataset(input_dir=data_dir, encryption=fernet)
```

Implement your own encryption method: Subclass the `Encryption` class and define the necessary methods:

```python
from litdata.utilities.encryption import Encryption

class CustomEncryption(Encryption):
    def encrypt(self, data):
        # Implement your custom encryption logic here
        return data

    def decrypt(self, data):
        # Implement your custom decryption logic here
        return data
```

This allows the data to remain secure while maintaining flexibility in the encryption method.
</details>

&nbsp;

## Features for transforming datasets

<details>
  <summary> ✅ Parallelize data transformations (map)</summary>
&nbsp;

Apply the same change to different parts of the dataset at once to save time and effort.

The `map` operator can be used to apply a function over a list of inputs.

Here is an example where the `map` operator is used to apply a `resize_image` function over a folder of large images.

```python
import os

from litdata import map
from PIL import Image

# Note: Inputs could also refer to files on s3 directly.
input_dir = "my_large_images"
inputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]

# resize_image receives one input (image_path) and the output directory.
# Files written to output_dir are persisted.
def resize_image(image_path, output_dir):
  output_image_path = os.path.join(output_dir, os.path.basename(image_path))
  Image.open(image_path).resize((224, 224)).save(output_image_path)

map(
    fn=resize_image,
    inputs=inputs,
    output_dir="s3://my-bucket/my_resized_images",
)
```
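
The same "one function, many independent inputs" pattern can be tried locally with the standard library. The sketch below is a stand-in, not `litdata.map`: it uses a thread pool and a hypothetical `upper_case_file` function that, like `resize_image`, takes an input path and an output directory.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def upper_case_file(input_path: str, output_dir: str) -> str:
    """Write an upper-cased copy of input_path into output_dir."""
    output_path = os.path.join(output_dir, os.path.basename(input_path))
    with open(input_path) as src, open(output_path, "w") as dst:
        dst.write(src.read().upper())
    return output_path


with tempfile.TemporaryDirectory() as input_dir, tempfile.TemporaryDirectory() as output_dir:
    # Create three small input files to process.
    for name in ("a.txt", "b.txt", "c.txt"):
        with open(os.path.join(input_dir, name), "w") as f:
            f.write(f"contents of {name}")

    inputs = sorted(os.path.join(input_dir, f) for f in os.listdir(input_dir))
    # Each input is handled independently, so the calls can run in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        outputs = list(pool.map(partial(upper_case_file, output_dir=output_dir), inputs))
    written = sorted(os.listdir(output_dir))

print(written)
```

Because no call depends on another, the work can be spread over threads, processes, or machines without changing the function itself.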

</details>

&nbsp;

----

# Benchmarks
This section benchmarks how fast LitData optimizes a dataset and how fast the optimized data streams ([Reproduce the benchmark](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries)).

## Streaming speed

Data optimized and streamed with LitData achieves a 20x speedup over non-optimized data and up to a 2x speedup over other streaming solutions.

Speed to stream Imagenet 1.2M from AWS S3:

| Framework | Images / sec 1st Epoch (float32) | Images / sec 2nd Epoch (float32) | Images / sec 1st Epoch (torch16) | Images / sec 2nd Epoch (torch16) |
|---|---|---|---|---|
| LitData | **5800** | **6589**  | **6282**  | **7221**  |
| Web Dataset  | 3134 | 3924 | 3343 | 4424 |
| Mosaic ML  | 2898 | 5099 | 2809 | 5158 |

<details>
  <summary> Benchmark details</summary>
&nbsp;

- [Imagenet-1.2M dataset](https://www.image-net.org/) contains `1,281,167 images`.
- To align with other benchmarks, we measured the streaming speed (`images per second`) loaded from [AWS S3](https://aws.amazon.com/s3/) for several frameworks.

</details>
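
To put the throughput numbers in perspective, the sketch below converts the first-epoch float32 rates from the table above into an approximate wall-clock time per epoch. It assumes throughput stays constant for the whole epoch, which is a simplification, so treat the results as rough estimates rather than measured values.

```python
NUM_IMAGES = 1_281_167  # ImageNet-1.2M size quoted in the benchmark details

# First-epoch float32 throughput (images / sec), copied from the table above.
images_per_sec = {"LitData": 5800, "Web Dataset": 3134, "Mosaic ML": 2898}

# Assumption: throughput is constant over the epoch.
epoch_seconds = {name: NUM_IMAGES / rate for name, rate in images_per_sec.items()}

for name, seconds in epoch_seconds.items():
    print(f"{name}: ~{seconds / 60:.1f} min per epoch")
```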

&nbsp;

## Time to optimize data
LitData optimizes the ImageNet dataset 3-5x faster than other frameworks.

Time to optimize 1.2 million ImageNet images (lower is better):

| Framework | Train Conversion Time | Val Conversion Time | Dataset Size | # Files |
|---|---|---|---|---|
| LitData  |  **10:05 min** | **00:30 min** | **143.1 GB**  | 2.339  |
| Web Dataset  | 32:36 min | 01:22 min | 147.8 GB | 1.144 |
| Mosaic ML  | 49:49 min | 01:04 min | **143.1 GB** | 2.298 |
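
The "3-5x" figure can be checked against the train conversion times in the table. The following sketch parses the `MM:SS` values and computes each framework's time relative to LitData. The helper name `to_seconds` is introduced here for illustration only.

```python
def to_seconds(mm_ss):
    """Convert an 'MM:SS' string into a number of seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


# Train conversion times copied from the table above.
train_times = {"LitData": "10:05", "Web Dataset": "32:36", "Mosaic ML": "49:49"}

baseline = to_seconds(train_times["LitData"])
speedups = {
    name: to_seconds(t) / baseline
    for name, t in train_times.items()
    if name != "LitData"
}

for name, ratio in speedups.items():
    print(f"LitData is {ratio:.2f}x faster than {name} on the train split")
```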

&nbsp;

----

# Parallelize transforms and data optimization on cloud machines
<div align="center">
<img alt="Lightning" src="https://pl-flash-data.s3.amazonaws.com/data-prep.jpg" width="700px">
</div>

## Parallelize data transforms

Transformations with LitData are linearly parallelizable across machines.

For example, suppose it takes 56 hours to embed a dataset on a single A10G machine. With LitData,
this can be sped up by adding more machines in parallel:

| Number of machines | Hours |
|-----------------|--------------|
| 1               | 56           |
| 2               | 28           |
| 4               | 14           |
| ...               | ...            |
| 64              | 0.875        |
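
The table follows directly from dividing the single-machine time by the number of machines. A small sketch of that arithmetic, assuming ideal linear scaling with no coordination overhead (the function name `ideal_hours` is illustrative):

```python
def ideal_hours(single_machine_hours, num_machines):
    """Processing time under ideal linear scaling."""
    return single_machine_hours / num_machines


for n in (1, 2, 4, 64):
    print(f"{n:>2} machines -> {ideal_hours(56, n):g} hours")
```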

To scale the number of machines, run the processing script on [Lightning Studios](https://lightning.ai/):

```python
from litdata import map, Machine

map(
  ...
  num_nodes=32,
  machine=Machine.DATA_PREP, # Select between dozens of optimized machines
)
```

## Parallelize data optimization
To scale the number of machines for data optimization, use [Lightning Studios](https://lightning.ai/):

```python
from litdata import optimize, Machine

optimize(
  ...
  num_nodes=32,
  machine=Machine.DATA_PREP, # Select between dozens of optimized machines
)
```

&nbsp;

Example: [Process the LAION 400 million image dataset in 2 hours on 32 machines, each with 32 CPUs](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset).

&nbsp;

----

# Start from a template
Below are templates for real-world applications of LitData at scale.

## Templates: Transform datasets

| Studio | Data type | Time (minutes) | Machines | Dataset |
| ------------------------------------ | ----------------- | ----------------- | -------------- | -------------- |
| [Download LAION-400MILLION dataset](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) | Image & Text | 120 | 32 |[LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) |
| [Tokenize 2M Swedish Wikipedia Articles](https://lightning.ai/lightning-ai/studios/tokenize-2m-swedish-wikipedia-articles) | Text | 7 | 4 | [Swedish Wikipedia](https://huggingface.co/datasets/wikipedia) |
| [Embed English Wikipedia under 5 dollars](https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars) | Text | 15 | 3 | [English Wikipedia](https://huggingface.co/datasets/wikipedia) |

## Templates: Optimize + stream data

| Studio | Data type | Time (minutes) | Machines | Dataset |
| -------------------------------- | ----------------- | ----------------- | -------------- | -------------- |
| [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries) | Image & Label | 10 | 1 | [Imagenet 1M](https://paperswithcode.com/sota/image-classification-on-imagenet?tag_filter=171) |
| [Optimize GeoSpatial data for model training](https://lightning.ai/lightning-ai/studios/convert-spatial-data-to-lightning-streaming) | Image & Mask | 120 | 32 | [Chesapeake Roads Spatial Context](https://github.com/isaaccorley/chesapeakersc) |
| [Optimize TinyLlama 1T dataset for training](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) | Text | 240 | 32 | [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) |
| [Optimize parquet files for model training](https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming) | Parquet Files | 12 | 16 | Randomly Generated data |

&nbsp;

----

# Community
LitData is a community project that welcomes contributions. Let's build the world's most advanced AI data processing framework.

💬 [Get help on Discord](https://discord.com/invite/XncpTy7DSt)    
📋 [License: Apache 2.0](https://github.com/Lightning-AI/litdata/blob/main/LICENSE)

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Lightning-AI/litdata",
    "name": "litdata",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "deep learning, pytorch, AI, streaming, cloud, data processing",
    "author": "Lightning AI et al.",
    "author_email": "pytorch@lightning.ai",
    "download_url": "https://files.pythonhosted.org/packages/45/01/78f38702e66294e1db0ebb5d8cf3b78c960a40a32b6d6b229e1a58a2758f/litdata-0.2.36.tar.gz",
    "platform": null,
    "description": "<div align=\"center\">\n<img src=\"https://pl-flash-data.s3.amazonaws.com/lit_data_logo.webp\" alt=\"LitData\" width=\"800px\"/>\n\n&nbsp;\n&nbsp;\n\n**Transform datasets at scale.    \nOptimize data for fast AI model training.**\n\n\n<pre>\nTransform                              Optimize\n  \n\u2705 Parallelize data processing       \u2705 Stream large cloud datasets          \n\u2705 Create vector embeddings          \u2705 Accelerate training by 20x           \n\u2705 Run distributed inference         \u2705 Pause and resume data streaming      \n\u2705 Scrape websites at scale          \u2705 Use remote data without local loading\n</pre>\n\n---\n\n![PyPI](https://img.shields.io/pypi/v/litdata)\n![Downloads](https://img.shields.io/pypi/dm/litdata)\n![License](https://img.shields.io/github/license/Lightning-AI/litdata)\n[![Discord](https://img.shields.io/discord/1077906959069626439?label=Get%20Help%20on%20Discord)](https://discord.gg/VptPCZkGNa)\n\n<p align=\"center\">\n  <a href=\"https://lightning.ai/\">Lightning AI</a> \u2022\n  <a href=\"#quick-start\">Quick start</a> \u2022\n  <a href=\"#speed-up-model-training\">Optimize data</a> \u2022\n  <a href=\"#transform-datasets\">Transform data</a> \u2022\n  <a href=\"#key-features\">Features</a> \u2022\n  <a href=\"#benchmarks\">Benchmarks</a> \u2022\n  <a href=\"#start-from-a-template\">Templates</a> \u2022\n  <a href=\"#community\">Community</a>\n</p>\n\n&nbsp;\n\n<a target=\"_blank\" href=\"https://lightning.ai/docs/overview/prep-data/optimize-datasets-for-model-training-speed\">\n  <img src=\"https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/app-2/get-started-badge.svg\" height=\"36px\" alt=\"Get started\"/>\n</a>\n\n</div>\n\n&nbsp;\n\n# Transform data at scale. Optimize for fast model training.\nLitData scales [data processing tasks](#transform-datasets) (data scraping, image resizing, distributed inference, embedding creation) on local or cloud machines. 
It also enables [optimizing datasets](#speed-up-model-training) to accelerate AI model training and work with large remote datasets without local loading.\n\n&nbsp;\n\n# Quick start\nFirst, install LitData:\n\n```bash\npip install litdata\n```\n\nChoose your workflow:\n\n\ud83d\ude80 [Speed up model training](#speed-up-model-training)    \n\ud83d\ude80 [Transform datasets](#transform-datasets)\n\n&nbsp;\n\n<details>\n  <summary>Advanced install</summary>\n\nInstall all the extras\n```bash\npip install 'litdata[extras]'\n```\n\n</details>\n\n&nbsp;\n\n----\n\n# Speed up model training\nAccelerate model training (20x faster) by optimizing datasets for streaming directly from cloud storage. Work with remote data without local downloads with features like loading data subsets, accessing individual samples, and resumable streaming.\n\n**Step 1: Optimize the data**\nThis step will format the dataset for fast loading. The data will be written in a chunked binary format.\n\n```python\nimport numpy as np\nfrom PIL import Image\nimport litdata as ld\n\ndef random_images(index):\n    fake_images = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))\n    fake_labels = np.random.randint(10)\n\n    # You can use any key:value pairs. 
Note that their types must not change between samples, and Python lists must\n    # always contain the same number of elements with the same types.\n    data = {\"index\": index, \"image\": fake_images, \"class\": fake_labels}\n\n    return data\n\nif __name__ == \"__main__\":\n    # The optimize function writes data in an optimized format.\n    ld.optimize(\n        fn=random_images,                   # the function applied to each input\n        inputs=list(range(1000)),           # the inputs to the function (here it's a list of numbers)\n        output_dir=\"fast_data\",             # optimized data is stored here\n        num_workers=4,                      # The number of workers on the same machine\n        chunk_bytes=\"64MB\"                  # size of each chunk\n    )\n```\n\n**Step 2: Put the data on the cloud**\n\nUpload the data to a [Lightning Studio](https://lightning.ai) (backed by S3) or your own S3 bucket:\n```bash\naws s3 cp --recursive fast_data s3://my-bucket/fast_data\n```\n\n**Step 3: Stream the data during training**\n\nLoad the data by replacing the PyTorch DataSet and DataLoader with the StreamingDataset and StreamingDataloader\n\n```python\nimport litdata as ld\n\ntrain_dataset = ld.StreamingDataset('s3://my-bucket/fast_data', shuffle=True, drop_last=True)\ntrain_dataloader = ld.StreamingDataLoader(train_dataset)\n\nfor sample in train_dataloader:\n    img, cls = sample['image'], sample['class']\n```\n\n**Key benefits:**\n\n\u2705 Accelerate training:       Optimized datasets load 20x faster.      \n\u2705 Stream cloud datasets:     Work with cloud data without downloading it.    \n\u2705 Pytorch-first:             Works with PyTorch libraries like PyTorch Lightning, Lightning Fabric, Hugging Face.    \n\u2705 Easy collaboration:        Share and access datasets in the cloud, streamlining team projects.     \n\u2705 Scale across GPUs:         Streamed data automatically scales to all GPUs.      
\n\u2705 Flexible storage:          Use S3, GCS, Azure, or your own cloud account for data storage.    \n\u2705 Compression:               Reduce your data footprint by using advanced compression algorithms.  \n\u2705 Run local or cloud:        Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.         \n\u2705 Enterprise security:       Self host or process data on your cloud account with Lightning Studios.  \n\n&nbsp;\n\n----\n\n# Transform datasets\nAccelerate data processing tasks (data scraping, image resizing, embedding creation, distributed inference) by parallelizing (map) the work across many machines at once.\n\nHere's an example that resizes and crops a large image dataset:\n\n```python\nfrom PIL import Image\nimport litdata as ld\n\n# use a local or S3 folder\ninput_dir = \"my_large_images\"     # or \"s3://my-bucket/my_large_images\"\noutput_dir = \"my_resized_images\"  # or \"s3://my-bucket/my_resized_images\"\n\ninputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]\n\n# resize the input image\ndef resize_image(image_path, output_dir):\n  output_image_path = os.path.join(output_dir, os.path.basename(image_path))\n  Image.open(image_path).resize((224, 224)).save(output_image_path)\n\nld.map(\n    fn=resize_image,\n    inputs=inputs,\n    output_dir=\"output_dir\",\n)\n```\n\n**Key benefits:**\n\n\u2705 Parallelize processing:    Reduce processing time by transforming data across multiple machines simultaneously.    \n\u2705 Scale to large data:       Increase the size of datasets you can efficiently handle.    \n\u2705 Flexible usecases:         Resize images, create embeddings, scrape the internet, etc...    \n\u2705 Run local or cloud:        Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.         \n\u2705 Enterprise security:       Self host or process data on your cloud account with Lightning Studios.  
\n\n&nbsp;\n\n----\n\n# Key Features\n\n## Features for optimizing and streaming datasets for model training\n\n<details>\n  <summary> \u2705 Stream large cloud datasets</summary>\n&nbsp;\n\nUse data stored on the cloud without needing to download it all to your computer, saving time and space.\n\nImagine you're working on a project with a huge amount of data stored online. Instead of waiting hours to download it all, you can start working with the data almost immediately by streaming it.\n\nOnce you've optimized the dataset with LitData, stream it as follows:\n```python\nfrom litdata import StreamingDataset, StreamingDataLoader\n\ndataset = StreamingDataset('s3://my-bucket/my-data', shuffle=True)\ndataloader = StreamingDataLoader(dataset, batch_size=64)\n\nfor batch in dataloader:\n    process(batch)  # Replace with your data processing logic\n\n```\n\n\nAdditionally, you can inject client connection settings for [S3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html#boto3.session.Session.client) or GCP when initializing your dataset. This is useful for specifying custom endpoints and credentials per dataset.\n\n```python\nfrom litdata import StreamingDataset\n\nstorage_options = {\n    \"endpoint_url\": \"your_endpoint_url\",\n    \"aws_access_key_id\": \"your_access_key_id\",\n    \"aws_secret_access_key\": \"your_secret_access_key\",\n}\n\ndataset = StreamingDataset('s3://my-bucket/my-data', storage_options=storage_options)\n```\n\n\nAlso, you can specify a custom cache directory when initializing your dataset. 
This is useful when you want to store the cache in a specific location.\n```python\nfrom litdata import StreamingDataset\n\n# Initialize the StreamingDataset with the custom cache directory\ndataset = StreamingDataset('s3://my-bucket/my-data', cache_dir=\"/path/to/cache\")\n```\n\n</details>\n\n<details>\n  <summary> \u2705 Streams on multi-GPU, multi-node</summary>\n\n&nbsp;\n\nData optimized and loaded with Lightning automatically streams efficiently in distributed training across GPUs or multi-node.\n\nThe `StreamingDataset` and `StreamingDataLoader` automatically make sure each rank receives the same quantity of varied batches of data, so it works out of the box with your favorite frameworks ([PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/), [Lightning Fabric](https://lightning.ai/docs/fabric/stable/), or [PyTorch](https://pytorch.org/docs/stable/index.html)) to do distributed training.\n\nHere you can see an illustration showing how the Streaming Dataset works with multi node / multi gpu under the hood.\n\n```python\nfrom litdata import StreamingDataset, StreamingDataLoader\n\n# For the training dataset, don't forget to enable shuffle and drop_last !!! 
\ntrain_dataset = StreamingDataset('s3://my-bucket/my-train-data', shuffle=True, drop_last=True)\ntrain_dataloader = StreamingDataLoader(train_dataset, batch_size=64)\n\nfor batch in train_dataloader:\n    process(batch)  # Replace with your data processing logic\n\nval_dataset = StreamingDataset('s3://my-bucket/my-val-data', shuffle=False, drop_last=False)\nval_dataloader = StreamingDataLoader(val_dataset, batch_size=64)\n\nfor batch in val_dataloader:\n    process(batch)  # Replace with your data processing logic\n```\n\n![An illustration showing how the Streaming Dataset works with multi node.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)\n\n</details>\n\n<details>\n  <summary> \u2705 Stream from multiple cloud providers</summary>\n\n&nbsp;\n\nThe StreamingDataset supports reading optimized datasets from common cloud providers. \n\n```python\nimport os\nimport litdata as ld\n\n# Read data from AWS S3\naws_storage_options={\n    \"AWS_ACCESS_KEY_ID\": os.environ['AWS_ACCESS_KEY_ID'],\n    \"AWS_SECRET_ACCESS_KEY\": os.environ['AWS_SECRET_ACCESS_KEY'],\n}\ndataset = ld.StreamingDataset(\"s3://my-bucket/my-data\", storage_options=aws_storage_options)\n\n# Read data from GCS\ngcp_storage_options={\n    \"project\": os.environ['PROJECT_ID'],\n}\ndataset = ld.StreamingDataset(\"gs://my-bucket/my-data\", storage_options=gcp_storage_options)\n\n# Read data from Azure\nazure_storage_options={\n    \"account_url\": f\"https://{os.environ['AZURE_ACCOUNT_NAME']}.blob.core.windows.net\",\n    \"credential\": os.environ['AZURE_ACCOUNT_ACCESS_KEY']\n}\ndataset = ld.StreamingDataset(\"azure://my-bucket/my-data\", storage_options=azure_storage_options)\n```\n\n</details>  \n\n<details>\n  <summary> \u2705 Pause, resume data streaming</summary>\n&nbsp;\n\nStream data during long training, if interrupted, pick up right where you left off without any issues.\n\nLitData provides a stateful `Streaming DataLoader` e.g. 
you can `pause` and `resume` your training whenever you want.\n\nInfo: The `Streaming DataLoader` was used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs. Restarting from an older checkpoint was critical to get to pretrain the full model due to several failures (network, CUDA Errors, etc..).\n\n```python\nimport os\nimport torch\nfrom litdata import StreamingDataset, StreamingDataLoader\n\ndataset = StreamingDataset(\"s3://my-bucket/my-data\", shuffle=True)\ndataloader = StreamingDataLoader(dataset, num_workers=os.cpu_count(), batch_size=64)\n\n#\u00a0Restore the dataLoader state if it exists\nif os.path.isfile(\"dataloader_state.pt\"):\n    state_dict = torch.load(\"dataloader_state.pt\")\n    dataloader.load_state_dict(state_dict)\n\n# Iterate over the data\nfor batch_idx, batch in enumerate(dataloader):\n\n    # Store the state every 1000 batches\n    if batch_idx % 1000 == 0:\n        torch.save(dataloader.state_dict(), \"dataloader_state.pt\")\n```\n\n</details>\n\n\n<details>\n  <summary> \u2705 LLM Pre-training </summary>\n&nbsp;\n\nLitData is highly optimized for LLM pre-training. First, we need to tokenize the entire dataset and then we can consume it.\n\n```python\nimport json\nfrom pathlib import Path\nimport zstandard as zstd\nfrom litdata import optimize, TokensLoader\nfrom tokenizer import Tokenizer\nfrom functools import partial\n\n# 1. 
Define a function to convert the text within the jsonl files into tokens\ndef tokenize_fn(filepath, tokenizer=None):\n    with zstd.open(open(filepath, \"rb\"), \"rt\", encoding=\"utf-8\") as f:\n        for row in f:\n            text = json.loads(row)[\"text\"]\n            if json.loads(row)[\"meta\"][\"redpajama_set_name\"] == \"RedPajamaGithub\":\n                continue  # exclude the GitHub data since it overlaps with starcoder\n            text_ids = tokenizer.encode(text, bos=False, eos=True)\n            yield text_ids\n\nif __name__ == \"__main__\":\n    # 2. Generate the inputs (we are going to optimize all the compressed json files from SlimPajama dataset )\n    input_dir = \"./slimpajama-raw\"\n    inputs = [str(file) for file in Path(f\"{input_dir}/SlimPajama-627B/train\").rglob(\"*.zst\")]\n\n    # 3. Store the optimized data wherever you want under \"/teamspace/datasets\" or \"/teamspace/s3_connections\"\n    outputs = optimize(\n        fn=partial(tokenize_fn, tokenizer=Tokenizer(f\"{input_dir}/checkpoints/Llama-2-7b-hf\")), # Note: You can use HF tokenizer or any others\n        inputs=inputs,\n        output_dir=\"./slimpajama-optimized\",\n        chunk_size=(2049 * 8012),\n        # This is important to inform LitData that we are encoding contiguous 1D array (tokens). 
\n        # LitData skips storing metadata for each sample e.g all the tokens are concatenated to form one large tensor.\n        item_loader=TokensLoader(),\n    )\n```\n\n```python\nimport os\nfrom litdata import StreamingDataset, StreamingDataLoader, TokensLoader\nfrom tqdm import tqdm\n\n# Increase by one because we need the next word as well\ndataset = StreamingDataset(\n  input_dir=f\"./slimpajama-optimized/train\",\n  item_loader=TokensLoader(block_size=2048 + 1),\n  shuffle=True,\n  drop_last=True,\n)\n\ntrain_dataloader = StreamingDataLoader(dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())\n\n# Iterate over the SlimPajama dataset\nfor batch in tqdm(train_dataloader):\n    pass\n```\n\n</details>\n\n<details>\n  <summary> \u2705 Filter illegal data </summary>\n&nbsp;\n\nSometimes, you have bad data that you don't want to include in the optimized dataset. With LitData, yield only the good data sample to include. \n\n\n```python\nfrom litdata import optimize, StreamingDataset\n\ndef should_keep(index) -> bool:\n  #\u00a0Replace with your own logic\n  return index % 2 == 0\n\n\ndef fn(data):\n    if should_keep(data):\n        yield data\n\nif __name__ == \"__main__\":\n    optimize(\n        fn=fn,\n        inputs=list(range(1000)),\n        output_dir=\"only_even_index_optimized\",\n        chunk_bytes=\"64MB\",\n        num_workers=1\n    )\n\n    dataset = StreamingDataset(\"only_even_index_optimized\")\n    data = list(dataset)\n    print(data)\n    # [0, 2, 4, 6, 8, 10, ..., 992, 994, 996, 998]\n```\n\nYou can even use try/expect.  
\n\n```python\nfrom litdata import optimize, StreamingDataset\n\ndef fn(data):\n    try:\n        yield 1 / data \n    except:\n        pass\n\nif __name__ == \"__main__\":\n    optimize(\n        fn=fn,\n        inputs=[0, 0, 0, 1, 2, 4, 0],\n        output_dir=\"only_defined_ratio_optimized\",\n        chunk_bytes=\"64MB\",\n        num_workers=1\n    )\n\n    dataset = StreamingDataset(\"only_defined_ratio_optimized\")\n    data = list(dataset)\n    # The 0 are filtered out as they raise a division by zero \n    print(data)\n    # [1.0, 0.5, 0.25] \n```\n</details>\n\n<details>\n  <summary> \u2705 Combine datasets</summary>\n&nbsp;\n\nMix and match different sets of data to experiment and create better models.\n\nCombine datasets with `CombinedStreamingDataset`.  As an example, this mixture of [Slimpajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) was used in the [TinyLLAMA](https://github.com/jzhang38/TinyLlama) project to pretrain a 1.1B Llama model on 3 trillion tokens.\n\n```python\nfrom litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader, TokensLoader\nfrom tqdm import tqdm\nimport os\n\ntrain_datasets = [\n    StreamingDataset(\n        input_dir=\"s3://tinyllama-template/slimpajama/train/\",\n        item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs\n        shuffle=True,\n        drop_last=True,\n    ),\n    StreamingDataset(\n        input_dir=\"s3://tinyllama-template/starcoder/\",\n        item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs\n        shuffle=True,\n        drop_last=True,\n    ),\n]\n\n# Mix SlimPajama data and Starcoder data with these proportions:\nweights = (0.693584, 0.306416)\ncombined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights, iterate_over_all=False)\n\ntrain_dataloader = 
StreamingDataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())\n\n# Iterate over the combined datasets\nfor batch in tqdm(train_dataloader):\n    pass\n```\n</details>\n\n<details>\n  <summary> \u2705 Merge datasets</summary>\n&nbsp;\n\nMerge multiple optimized datasets into one.\n\n```python\nimport numpy as np\nfrom PIL import Image\n\nfrom litdata import StreamingDataset, merge_datasets, optimize\n\n\ndef random_images(index):\n    return {\n        \"index\": index,\n        \"image\": Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)),\n        \"class\": np.random.randint(10),\n    }\n\n\nif __name__ == \"__main__\":\n    out_dirs = [\"fast_data_1\", \"fast_data_2\", \"fast_data_3\", \"fast_data_4\"]  # or [\"s3://my-bucket/fast_data_1\", etc.]\"\n    for out_dir in out_dirs:\n        optimize(fn=random_images, inputs=list(range(250)), output_dir=out_dir, num_workers=4, chunk_bytes=\"64MB\")\n\n    merged_out_dir = \"merged_fast_data\" # or \"s3://my-bucket/merged_fast_data\"\n    merge_datasets(input_dirs=out_dirs, output_dir=merged_out_dir)\n\n    dataset = StreamingDataset(merged_out_dir)\n    print(len(dataset))\n    # out: 1000\n```\n</details>\n\n<details>\n  <summary> \u2705 Split datasets for train, val, test</summary>\n\n&nbsp;\n\nSplit a dataset into train, val, test splits with `train_test_split`.\n\n```python\nfrom litdata import StreamingDataset, train_test_split\n\ndataset = StreamingDataset(\"s3://my-bucket/my-data\") # data are stored in the cloud\n\nprint(len(dataset)) # display the length of your data\n#\u00a0out: 100,000\n\ntrain_dataset, val_dataset, test_dataset = train_test_split(dataset, splits=[0.3, 0.2, 0.5])\n\nprint(train_dataset)\n#\u00a0out: 30,000\n\nprint(val_dataset)\n#\u00a0out: 20,000\n\nprint(test_dataset)\n#\u00a0out: 50,000\n```\n\n</details>\n\n<details>\n  <summary> \u2705 Load a subset of the remote dataset</summary>\n\n&nbsp;\nWork on a smaller, manageable portion 
of your data to save time and resources.\n\n\n```python\nfrom litdata import StreamingDataset, train_test_split\n\ndataset = StreamingDataset(\"s3://my-bucket/my-data\", subsample=0.01) # data are stored in the cloud\n\nprint(len(dataset)) # display the length of your data\n#\u00a0out: 1000\n```\n\n</details>\n\n<details>\n  <summary> \u2705 Easily modify optimized cloud datasets</summary>\n&nbsp;\n\nAdd new data to an existing dataset or start fresh if needed, providing flexibility in data management.\n\nLitData optimized datasets are assumed to be immutable. However, you can make the decision to modify them by changing the mode to either `append` or `overwrite`.\n\n```python\nfrom litdata import optimize, StreamingDataset\n\ndef compress(index):\n    return index, index**2\n\nif __name__ == \"__main__\":\n    # Add some data\n    optimize(\n        fn=compress,\n        inputs=list(range(100)),\n        output_dir=\"./my_optimized_dataset\",\n        chunk_bytes=\"64MB\",\n    )\n\n    # Later on, you add more data\n    optimize(\n        fn=compress,\n        inputs=list(range(100, 200)),\n        output_dir=\"./my_optimized_dataset\",\n        chunk_bytes=\"64MB\",\n        mode=\"append\",\n    )\n\n    ds = StreamingDataset(\"./my_optimized_dataset\")\n    assert len(ds) == 200\n    assert ds[:] == [(i, i**2) for i in range(200)]\n```\n\nThe `overwrite` mode will delete the existing data and start from fresh.\n\n</details>\n\n<details>\n  <summary> \u2705 Use compression</summary>\n&nbsp;\n\nReduce your data footprint by using advanced compression algorithms.\n\n```python\nimport litdata as ld\n\ndef compress(index):\n    return index, index**2\n\nif __name__ == \"__main__\":\n    # Add some data\n    ld.optimize(\n        fn=compress,\n        inputs=list(range(100)),\n        output_dir=\"./my_optimized_dataset\",\n        chunk_bytes=\"64MB\",\n        num_workers=1,\n        compression=\"zstd\"\n    )\n```\n\nUsing 
[zstd](https://github.com/facebook/zstd), you can achieve high compression ratio like 4.34x for this simple example.\n\n| Without | With |\n| -------- | -------- | \n| 2.8kb | 646b |\n\n\n</details>\n\n<details>\n  <summary> \u2705 Access samples without full data download</summary>\n&nbsp;\n\nLook at specific parts of a large dataset without downloading the whole thing or loading it on a local machine.\n\n```python\nfrom litdata import StreamingDataset\n\ndataset = StreamingDataset(\"s3://my-bucket/my-data\") # data are stored in the cloud\n\nprint(len(dataset)) # display the length of your data\n\nprint(dataset[42]) # show the 42th element of the dataset\n```\n\n</details>\n\n<details>\n  <summary> \u2705 Use any data transforms</summary>\n&nbsp;\n\nCustomize how your data is processed to better fit your needs.\n\nSubclass the `StreamingDataset` and override its `__getitem__` method to add any extra data transformations.\n\n```python\nfrom litdata import StreamingDataset, StreamingDataLoader\nimport torchvision.transforms.v2.functional as F\n\nclass ImagenetStreamingDataset(StreamingDataset):\n\n    def __getitem__(self, index):\n        image = super().__getitem__(index)\n        return F.resize(image, (224, 224))\n\ndataset = ImagenetStreamingDataset(...)\ndataloader = StreamingDataLoader(dataset, batch_size=4)\n\nfor batch in dataloader:\n    print(batch.shape)\n    # Out: (4, 3, 224, 224)\n```\n\n</details>\n\n<details>\n  <summary> \u2705 Profile data loading speed</summary>\n&nbsp;\n\nMeasure and optimize how fast your data is being loaded, improving efficiency.\n\nThe `StreamingDataLoader` supports profiling of your data loading process. Simply use the `profile_batches` argument to specify the number of batches you want to profile:\n\n```python\nfrom litdata import StreamingDataset, StreamingDataLoader\n\nStreamingDataLoader(..., profile_batches=5)\n```\n\nThis generates a Chrome trace called `result.json`. 
Then, visualize this trace by opening Chrome browser at the `chrome://tracing` URL and load the trace inside.\n\n</details>\n\n<details>\n  <summary> \u2705 Reduce memory use for large files</summary>\n&nbsp;\n\nHandle large data files efficiently without using too much of your computer's memory.\n\nWhen processing large files like compressed [parquet files](https://en.wikipedia.org/wiki/Apache_Parquet), use the Python yield keyword to process and store one item at the time, reducing the memory footprint of the entire program.\n\n```python\nfrom pathlib import Path\nimport pyarrow.parquet as pq\nfrom litdata import optimize\nfrom tokenizer import Tokenizer\nfrom functools import partial\n\n# 1. Define a function to convert the text within the parquet files into tokens\ndef tokenize_fn(filepath, tokenizer=None):\n    parquet_file = pq.ParquetFile(filepath)\n    # Process per batch to reduce RAM usage\n    for batch in parquet_file.iter_batches(batch_size=8192, columns=[\"content\"]):\n        for text in batch.to_pandas()[\"content\"]:\n            yield tokenizer.encode(text, bos=False, eos=True)\n\n# 2. Generate the inputs\ninput_dir = \"/teamspace/s3_connections/tinyllama-template\"\ninputs = [str(file) for file in Path(f\"{input_dir}/starcoderdata\").rglob(\"*.parquet\")]\n\n# 3. Store the optimized data wherever you want under \"/teamspace/datasets\" or \"/teamspace/s3_connections\"\noutputs = optimize(\n    fn=partial(tokenize_fn, tokenizer=Tokenizer(f\"{input_dir}/checkpoints/Llama-2-7b-hf\")), # Note: Use HF tokenizer or any others\n    inputs=inputs,\n    output_dir=\"/teamspace/datasets/starcoderdata\",\n    chunk_size=(2049 * 8012), # Number of tokens to store by chunks. This is roughly 64MB of tokens per chunk.\n)\n```\n\n</details>\n\n<details>\n  <summary> \u2705 Limit local cache space</summary>\n&nbsp;\n\nLimit the amount of disk space used by temporary files, preventing storage issues.\n\nAdapt the local caching limit of the `StreamingDataset`. 
This is useful to make sure the downloaded data chunks are deleted when used and the disk usage stays low.\n\n```python\nfrom litdata import StreamingDataset\n\ndataset = StreamingDataset(..., max_cache_size=\"10GB\")\n```\n\n</details>\n\n<details>\n  <summary> \u2705 Change cache directory path</summary>\n&nbsp;\n\nSpecify the directory where cached files should be stored, ensuring efficient data retrieval and management. This is particularly useful for organizing your data storage and improving access times.\n\n```python\nfrom litdata import StreamingDataset\nfrom litdata.streaming.cache import Dir\n\ncache_dir = \"/path/to/your/cache\"\ndata_dir = \"s3://my-bucket/my_optimized_dataset\"\n\ndataset = StreamingDataset(input_dir=Dir(path=cache_dir, url=data_dir))\n```\n\n</details>\n\n<details>\n  <summary> \u2705 Optimize loading on networked drives</summary>\n&nbsp;\n\nOptimize data handling for computers on a local network to improve performance for on-site setups.\n\nOn-prem compute nodes can mount and use a network drive. A network drive is a shared storage device on a local area network. In order to reduce their network overload, the `StreamingDataset` supports `caching` the data chunks.\n\n```python\nfrom litdata import StreamingDataset\n\ndataset = StreamingDataset(input_dir=\"local:/data/shared-drive/some-data\")\n```\n\n</details>\n\n<details>\n  <summary> \u2705 Optimize dataset in distributed environment</summary>\n&nbsp;\n\nLightning can distribute large workloads across hundreds of machines in parallel. 
This can reduce the time to complete a data processing task from weeks to minutes by scaling to enough machines.\n\nTo apply the optimize operator across multiple machines, simply provide the num_nodes and machine arguments to it as follows:\n\n```python\nimport os\nfrom litdata import optimize, Machine\n\ndef compress(index):\n    return (index, index ** 2)\n\noptimize(\n    fn=compress,\n    inputs=list(range(100)),\n    num_workers=2,\n    output_dir=\"my_output\",\n    chunk_bytes=\"64MB\",\n    num_nodes=2,\n    machine=Machine.DATA_PREP, # You can select between dozens of optimized machines\n)\n```\n\nIf the `output_dir` is a local path, the optimized dataset will be present in: `/teamspace/jobs/{job_name}/nodes-0/my_output`. Otherwise, it will be stored in the specified `output_dir`.\n\nRead the optimized dataset:\n\n```python\nfrom litdata import StreamingDataset\n\noutput_dir = \"/teamspace/jobs/litdata-optimize-2024-07-08/nodes.0/my_output\"\n\ndataset = StreamingDataset(output_dir)\n\nprint(dataset[:])\n```\n\n</details>\n\n<details>\n  <summary> \u2705 Encrypt, decrypt data at chunk/sample level</summary>\n&nbsp;\n\nSecure data by applying encryption to individual samples or chunks, ensuring sensitive information is protected during storage.\n\nThis example shows how to use the `FernetEncryption` class for sample-level encryption with a data optimization function.\n\n```python\nfrom litdata import optimize\nfrom litdata.utilities.encryption import FernetEncryption\nimport numpy as np\nfrom PIL import Image\n\n# Initialize FernetEncryption with a password for sample-level encryption\nfernet = FernetEncryption(password=\"your_secure_password\", level=\"sample\")\ndata_dir = \"s3://my-bucket/optimized_data\"\n\ndef random_image(index):\n    \"\"\"Generate a random image for demonstration purposes.\"\"\"\n    fake_img = Image.fromarray(np.random.randint(0, 255, (32, 32, 3), dtype=np.uint8))\n    return {\"image\": fake_img, \"class\": index}\n\n# Optimize 
data while applying encryption\noptimize(\n    fn=random_image,\n    inputs=list(range(5)),  # Example inputs: [0, 1, 2, 3, 4]\n    num_workers=1,\n    output_dir=data_dir,\n    chunk_bytes=\"64MB\",\n    encryption=fernet,\n)\n\n# Save the encryption key to a file for later use\nfernet.save(\"fernet.pem\")\n```\n\nLoad the encrypted data using the `StreamingDataset` class as follows:\n\n```python\nfrom litdata import StreamingDataset\nfrom litdata.utilities.encryption import FernetEncryption\n\n# Same location that was used as output_dir when optimizing\ndata_dir = \"s3://my-bucket/optimized_data\"\n\n# Load the encryption key\nfernet = FernetEncryption(password=\"your_secure_password\", level=\"sample\")\nfernet.load(\"fernet.pem\")\n\n# Create a streaming dataset for reading the encrypted samples\nds = StreamingDataset(input_dir=data_dir, encryption=fernet)\n```\n\nTo implement your own encryption method, subclass the `Encryption` class and define the necessary methods:\n\n```python\nfrom litdata.utilities.encryption import Encryption\n\nclass CustomEncryption(Encryption):\n    def encrypt(self, data):\n        # Implement your custom encryption logic here\n        return data\n\n    def decrypt(self, data):\n        # Implement your custom decryption logic here\n        return data\n```\n\nThis allows the data to remain secure while maintaining flexibility in the encryption method.\n</details>\n\n&nbsp;\n\n## Features for transforming datasets\n\n<details>\n  <summary> \u2705 Parallelize data transformations (map)</summary>\n&nbsp;\n\nApply the same change to different parts of the dataset at once to save time and effort.\n\nThe `map` operator can be used to apply a function over a list of inputs.\n\nHere is an example where the `map` operator is used to apply a `resize_image` function over a folder of large images.\n\n```python\nimport os\nfrom litdata import map\nfrom PIL import Image\n\n# Note: Inputs could also refer to files on S3 directly.\ninput_dir = \"my_large_images\"\ninputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]\n\n# The resize_image function takes 
one of the inputs (image_path) and the output directory.\n# Files written to output_dir are persisted.\ndef resize_image(image_path, output_dir):\n  output_image_path = os.path.join(output_dir, os.path.basename(image_path))\n  Image.open(image_path).resize((224, 224)).save(output_image_path)\n\nmap(\n    fn=resize_image,\n    inputs=inputs,\n    output_dir=\"s3://my-bucket/my_resized_images\",\n)\n```\n\n</details>\n\n&nbsp;\n\n----\n\n# Benchmarks\nThis section shows benchmarks for the time to optimize a dataset and the resulting streaming speed ([Reproduce the benchmark](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries)).\n\n## Streaming speed\n\nData optimized and streamed with LitData achieves a 20x speedup over non-optimized data and a 2x speedup over other streaming solutions.\n\nSpeed to stream ImageNet 1.2M from AWS S3:\n\n| Framework | Images / sec, 1st Epoch (float32) | Images / sec, 2nd Epoch (float32) | Images / sec, 1st Epoch (torch16) | Images / sec, 2nd Epoch (torch16) |\n|---|---|---|---|---|\n| LitData | **5800** | **6589**  | **6282**  | **7221**  |\n| Web Dataset  | 3134 | 3924 | 3343 | 4424 |\n| Mosaic ML  | 2898 | 5099 | 2809 | 5158 |\n\n<details>\n  <summary> Benchmark details</summary>\n&nbsp;\n\n- The [Imagenet-1.2M dataset](https://www.image-net.org/) contains `1,281,167 images`.\n- To align with other benchmarks, we measured the streaming speed (`images per second`) when loading from [AWS S3](https://aws.amazon.com/s3/) for several frameworks.\n\n</details>\n\n&nbsp;\n\n## Time to optimize data\nLitData optimizes the ImageNet dataset 3-5x faster than other frameworks.\n\nTime to optimize 1.2 million ImageNet images (faster is better):\n\n| Framework | Train Conversion Time | Val Conversion Time | Dataset Size | # Files |\n|---|---|---|---|---|\n| LitData  |  **10:05 min** | **00:30 min** | **143.1 GB**  | 2.339  |\n| Web Dataset  | 32:36 min | 01:22 min | 147.8 GB | 1.144 |\n| Mosaic ML  | 49:49 min | 
01:04 min | **143.1 GB** | 2.298 |\n\n&nbsp;\n\n----\n\n# Parallelize transforms and data optimization on cloud machines\n<div align=\"center\">\n<img alt=\"Lightning\" src=\"https://pl-flash-data.s3.amazonaws.com/data-prep.jpg\" width=\"700px\">\n</div>\n\n## Parallelize data transforms\n\nTransformations with LitData are linearly parallelizable across machines.\n\nFor example, suppose it takes 56 hours to embed a dataset on a single A10G machine. With LitData,\nthis can be sped up by adding more machines in parallel:\n\n| Number of machines | Hours |\n|-----------------|--------------|\n| 1               | 56           |\n| 2               | 28           |\n| 4               | 14           |\n| ...               | ...            |\n| 64              | 0.875        |\n\nTo scale the number of machines, run the processing script on [Lightning Studios](https://lightning.ai/):\n\n```python\nfrom litdata import map, Machine\n\nmap(\n  ...,  # Placeholder for your usual arguments (fn, inputs, output_dir)\n  num_nodes=32,\n  machine=Machine.DATA_PREP, # Select between dozens of optimized machines\n)\n```\n\n## Parallelize data optimization\nTo scale the number of machines for data optimization, use [Lightning Studios](https://lightning.ai/):\n\n```python\nfrom litdata import optimize, Machine\n\noptimize(\n  ...,  # Placeholder for your usual arguments (fn, inputs, output_dir)\n  num_nodes=32,\n  machine=Machine.DATA_PREP, # Select between dozens of optimized machines\n)\n```\n\n&nbsp;\n\nExample: [Process the LAION 400 million image dataset in 2 hours on 32 machines, each with 32 CPUs](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset).\n\n&nbsp;\n\n----\n\n# Start from a template\nBelow are templates for real-world applications of LitData at scale.\n\n## Templates: Transform datasets\n\n| Studio | Data type | Time (minutes) | Machines | Dataset |\n| ------------------------------------ | ----------------- | ----------------- | -------------- | -------------- |\n| [Download LAION-400MILLION 
dataset](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) | Image & Text | 120 | 32 |[LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) |\n| [Tokenize 2M Swedish Wikipedia Articles](https://lightning.ai/lightning-ai/studios/tokenize-2m-swedish-wikipedia-articles) | Text | 7 | 4 | [Swedish Wikipedia](https://huggingface.co/datasets/wikipedia) |\n| [Embed English Wikipedia under 5 dollars](https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars) | Text | 15 | 3 | [English Wikipedia](https://huggingface.co/datasets/wikipedia) |\n\n## Templates: Optimize + stream data\n\n| Studio | Data type | Time (minutes) | Machines | Dataset |\n| -------------------------------- | ----------------- | ----------------- | -------------- | -------------- |\n| [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries) | Image & Label | 10 | 1 | [Imagenet 1M](https://paperswithcode.com/sota/image-classification-on-imagenet?tag_filter=171) |\n| [Optimize GeoSpatial data for model training](https://lightning.ai/lightning-ai/studios/convert-spatial-data-to-lightning-streaming) | Image & Mask | 120 | 32 | [Chesapeake Roads Spatial Context](https://github.com/isaaccorley/chesapeakersc) |\n| [Optimize TinyLlama 1T dataset for training](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) | Text | 240 | 32 | [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) |\n| [Optimize parquet files for model training](https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming) | Parquet Files | 12 | 16 | Randomly Generated data |\n\n&nbsp;\n\n----\n\n# Community\nLitData is a community project accepting contributions -  Let's make the world's most advanced AI data processing framework.\n\n\ud83d\udcac [Get help on 
Discord](https://discord.com/invite/XncpTy7DSt)    \n\ud83d\udccb [License: Apache 2.0](https://github.com/Lightning-AI/litdata/blob/main/LICENSE)\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "The Deep Learning framework to train, deploy, and ship AI products Lightning fast.",
    "version": "0.2.36",
    "project_urls": {
        "Bug Tracker": "https://github.com/Lightning-AI/litdata/issues",
        "Documentation": "https://lightning-ai.github.io/litdata/",
        "Download": "https://github.com/Lightning-AI/litdata",
        "Homepage": "https://github.com/Lightning-AI/litdata",
        "Source Code": "https://github.com/Lightning-AI/litdata"
    },
    "split_keywords": [
        "deep learning",
        " pytorch",
        " ai",
        " streaming",
        " cloud",
        " data processing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1d311e99e2c5a5e29f15edb5da8cf3905cff845eb8a0239bf41b70c9cb22764c",
                "md5": "e6f5d703ad8b01bfdde48c05886b37a6",
                "sha256": "44fb510f26bb6c1a0f6688ecb007d36c8050addb73f3094e935b59f2740ee9fa"
            },
            "downloads": -1,
            "filename": "litdata-0.2.36-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e6f5d703ad8b01bfdde48c05886b37a6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 131810,
            "upload_time": "2025-01-14T23:02:57",
            "upload_time_iso_8601": "2025-01-14T23:02:57.974880Z",
            "url": "https://files.pythonhosted.org/packages/1d/31/1e99e2c5a5e29f15edb5da8cf3905cff845eb8a0239bf41b70c9cb22764c/litdata-0.2.36-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "450178f38702e66294e1db0ebb5d8cf3b78c960a40a32b6d6b229e1a58a2758f",
                "md5": "34fb29e8d7e681604a9f59af27a877cc",
                "sha256": "3657256b8f99cdc8f852f3018d46a4072415acc1b63004d803fa284fc6b2961d"
            },
            "downloads": -1,
            "filename": "litdata-0.2.36.tar.gz",
            "has_sig": false,
            "md5_digest": "34fb29e8d7e681604a9f59af27a877cc",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 136092,
            "upload_time": "2025-01-14T23:02:59",
            "upload_time_iso_8601": "2025-01-14T23:02:59.409001Z",
            "url": "https://files.pythonhosted.org/packages/45/01/78f38702e66294e1db0ebb5d8cf3b78c960a40a32b6d6b229e1a58a2758f/litdata-0.2.36.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-14 23:02:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Lightning-AI",
    "github_project": "litdata",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "torch",
            "specs": []
        },
        {
            "name": "lightning-utilities",
            "specs": []
        },
        {
            "name": "filelock",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "boto3",
            "specs": []
        },
        {
            "name": "requests",
            "specs": []
        },
        {
            "name": "tifffile",
            "specs": []
        }
    ],
    "lcname": "litdata"
}
        