dextro

Name	dextro
Version	0.1.2
upload_time	2024-11-14 23:31:44
author	Kristian Klemon
requires_python	<4.0,>=3.10

# Dextro: Dataset Indexing for Blazing Fast Random Access

**Dextro** is a streamlined indexing toolkit for large, multi-file text datasets. It enables O(1) random access to any dataset sample through memory mapping, eliminating the need for preloading. It is aimed at researchers and developers who work with extensive language datasets and want more flexible processing and training without altering the original data format.

## Motivation

The ongoing revolution in artificial intelligence, particularly in large language models (LLMs), relies heavily on extensive language datasets. However, these datasets often come in simple, non-indexed formats like JSON Lines, which makes them awkward to handle: quick access requires loading entire datasets into RAM, streaming is limited to sequential order, and both restrict processing and training flexibility.

Dextro addresses these challenges by enabling the efficient indexing of large, multi-file datasets without altering the original data. The index tracks the start and end positions of each sample within its source file, along with optional metadata for enhanced filtering capabilities. Through memory mapping, Dextro achieves O(1) random access to any record across multiple files, significantly improving data handling efficiency.
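The idea described above can be illustrated with a minimal, standard-library-only sketch. It is not Dextro's implementation: `build_index` and `MappedReader` are hypothetical names, the index is a plain Python list rather than a Parquet file, and positions are assumed to be byte offsets.

```python
import json
import mmap
import tempfile
from pathlib import Path


def build_index(data_root: Path) -> list[tuple[str, int, int]]:
    """Scan each .jsonl file once and record (filename, start, end) byte offsets."""
    index = []
    for path in sorted(data_root.glob("*.jsonl")):
        offset = 0
        with path.open("rb") as f:
            for line in f:
                end = offset + len(line)
                if line.strip():  # skip blank lines
                    index.append((path.name, offset, end))
                offset = end
    return index


class MappedReader:
    """Memory-maps each file on first use and slices records by byte range."""

    def __init__(self, data_root: Path, index: list[tuple[str, int, int]]):
        self.data_root = data_root
        self.index = index
        self._files = {}
        self._maps = {}

    def __len__(self) -> int:
        return len(self.index)

    def __getitem__(self, i: int) -> dict:
        filename, start, end = self.index[i]
        if filename not in self._maps:
            f = (self.data_root / filename).open("rb")
            self._files[filename] = f
            self._maps[filename] = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # Only the bytes of this one record are read and parsed.
        return json.loads(self._maps[filename][start:end])

    def close(self) -> None:
        for m in self._maps.values():
            m.close()
        for f in self._files.values():
            f.close()
        self._maps.clear()
        self._files.clear()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "part001.jsonl").write_text(
            '{"text": "first item"}\n{"text": "second item"}\n', encoding="utf-8"
        )
        (root / "part002.jsonl").write_text('{"text": "third item"}\n', encoding="utf-8")
        reader = MappedReader(root, build_index(root))
        print(len(reader), reader[2]["text"])
        reader.close()
```

Looking up a record costs one list access plus one slice of the mapped file, independent of how many records precede it, which is where the O(1) behaviour comes from.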

## Getting Started

### Installation

Install Dextro easily via pip:

```bash
pip install dextro
```

Install with all dependencies:

```bash
pip install "dextro[all]"
```

### Index Your Dataset

Dextro works with datasets in JSON Lines format, split across multiple files. To index such a dataset, organize your files as follows:

```
dataset/
    part001.jsonl
    part002.jsonl
    ...
    part999.jsonl
```

Example content (`dataset/part001.jsonl`):
```json
{"text": "first item", ...}
{"text": "second item", ...}
```

Run the following command to index your dataset, creating an `index.parquet` file in the dataset folder:

```bash
dextro index-dataset dataset/
```

The index file records the source filename and the start and end positions of each sample, which is what makes direct lookups possible.

### Accessing Indexed Datasets

Dextro integrates with PyTorch's `Dataset` class, allowing for easy loading of indexed datasets. Here's how to sequentially iterate through your dataset:

```python
from tqdm import tqdm
from dextro.torch import IndexedDataset

dataset = IndexedDataset(data_root='dataset/')

for text in tqdm(dataset):
    pass
```

To demonstrate random access with shuffling, you can use a `DataLoader` as follows:

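```python
from torch.utils.data import DataLoader
from tqdm import tqdm

from dextro.torch import IndexedDataset

dataset = IndexedDataset(data_root='dataset/')

# shuffle=True draws sample indices in random order, exercising random access
loader = DataLoader(dataset, batch_size=128, shuffle=True)

for batch in tqdm(loader):
    pass
```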

Dextro's memory mapping ensures that only the accessed data is loaded into memory, optimizing resource usage.

## Performance

Thanks to its minimal overhead and efficient data access, Dextro can process large NLP datasets at speeds close to those of reading directly from SSDs. This capability makes it possible to navigate through terabytes of data within minutes, even on consumer-grade storage.

## Comparison to 🤗 Datasets

The [🤗 Datasets](https://huggingface.co/docs/datasets) library also features [memory-mapped loading of partitioned datasets](https://huggingface.co/learn/nlp-course/en/chapter5/4). However, as of February 2024, it lacks the capability for random access, and shuffled iteration across a dataset is confined to the limits of an item buffer. Moreover, 🤗 Datasets does not offer the functionality to pre-filter data through a lightweight dataset index.
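To make the difference between buffer-limited and full shuffling concrete, here is a small generic sketch. It is neither the 🤗 Datasets nor the Dextro implementation; `buffer_shuffle` and `full_shuffle` are hypothetical helper names used only for illustration.

```python
import random
from typing import Iterable, Iterator


def buffer_shuffle(items: Iterable[int], buffer_size: int, seed: int = 0) -> Iterator[int]:
    """Approximate shuffle for a sequential stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer: list[int] = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Emit a random element from the items currently held.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain whatever is left once the stream ends.
    rng.shuffle(buffer)
    yield from buffer


def full_shuffle(n: int, seed: int = 0) -> list[int]:
    """With random access, a shuffle is just a permutation of all indices."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


if __name__ == "__main__":
    streamed = list(buffer_shuffle(range(1000), buffer_size=10))
    # The k-th emitted item can never come from further ahead than k + buffer_size.
    print(max(item - k for k, item in enumerate(streamed)))
    print(full_shuffle(1000)[:5])
```

With a buffer, an item can only move forward in the output by at most the buffer size, so samples that are far apart in the file tend to stay far apart. An index-based permutation has no such constraint.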

## Advanced Features

### Index Enrichers

Dextro supports enrichers to augment index records with additional information, such as metadata derived from the source data or advanced operations like language detection. You can specify enrichers during indexing for enhanced functionality:

```bash
dextro create-index dataset/ --enrichers=detect_language
```

### Data Filtering

Dextro allows for advanced data filtering directly on the index, facilitating efficient data selection without explicit loading:

```python
import polars as pl
from dextro.dataset import IndexedDataset

# Example filter: Select texts within a specific character length range
# This assumes that the `TextLength` enricher has been used during indexing
dataset = IndexedDataset(
    data_root='dataset/',
    index_filter=(256 <= pl.col('meta_text_length')) & (pl.col('meta_text_length') <= 1024)
)
```
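Conceptually, filtering on the index means that only small metadata rows are inspected while the data files stay untouched. The following sketch shows that idea with plain Python objects standing in for the Parquet index and the polars expression; `IndexRow` and `filter_index` are hypothetical names, and `meta_text_length` mirrors the column used in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRow:
    filename: str
    start: int
    end: int
    meta_text_length: int  # assumed to be written by an enricher at indexing time


def filter_index(rows: list[IndexRow], min_len: int, max_len: int) -> list[IndexRow]:
    """Keep rows whose recorded length lies in [min_len, max_len]; no data file is opened."""
    return [row for row in rows if min_len <= row.meta_text_length <= max_len]


rows = [
    IndexRow("part001.jsonl", 0, 120, 100),
    IndexRow("part001.jsonl", 120, 520, 380),
    IndexRow("part002.jsonl", 0, 2100, 2050),
]
selected = filter_index(rows, 256, 1024)
print([(r.filename, r.start, r.end) for r in selected])
```

Only the rows that survive the filter need to be resolved to their positions in the source files.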

### Non-NLP Datasets

Dextro can in principle work with any data modality, as it makes no assumptions about the data representation.

### Other Data Formats

With the default settings, Dextro assumes that the dataset is in JSON Lines format. Other formats are supported via the `load_fn` option of the `FileIndexer` class. However, each record currently has to occupy its own line.
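As an illustration of what a line-based loader for another format might look like, here is a small parser for tab-separated records. The exact signature that `load_fn` expects is not documented here, so this sketch assumes it receives the raw bytes of one line and returns a dict; `load_tsv_record` and the `id`/`text` layout are hypothetical.

```python
def load_tsv_record(line: bytes) -> dict:
    """Parse one tab-separated line (id, then text) into a dict.

    Assumption: the loader is called with the raw bytes of a single line.
    """
    doc_id, text = line.decode("utf-8").rstrip("\r\n").split("\t", maxsplit=1)
    return {"id": doc_id, "text": text}


print(load_tsv_record(b"42\thello world\n"))
```

Splitting on the first tab only keeps any further tabs inside the text field intact.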

## Examples

COMING SOON

## Development

### Install Dev Dependencies

```bash
poetry install --all-extras --with=dev
```

### Run Tests

```bash
pytest tests
```

### Autoformat


```bash
ruff format .
```

## Why "Dextro"?

The name "Dextro" is inspired by dextrose, a historic term for glucose, which is associated with fast energy delivery. The name reflects the toolkit's aim to provide fast, low-overhead dataset handling, mirroring the quick energy boost dextrose is known for.

Dextro is designed to be the optimal solution for managing and accessing large language datasets, enabling rapid and flexible data handling to support the advancement of AI and machine learning research.

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "dextro",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "Kristian Klemon",
    "author_email": "kristian.klemon@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/17/49/e23c951b19eaddd61206078205bf721517a5f844f1ed30312d2e7dfd6a4c/dextro-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": null,
    "version": "0.1.2",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7031f430e8d81790f0fc551911fbe899a2e083df6c94aada6a677c5971c2869f",
                "md5": "6f079375277c7154664fb5d2bd9e4dc1",
                "sha256": "67db8cd45401c91cc0a14cefe985f29666cfec3d7dc4ffed7628aa649b01fdfc"
            },
            "downloads": -1,
            "filename": "dextro-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6f079375277c7154664fb5d2bd9e4dc1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 12453,
            "upload_time": "2024-11-14T23:31:43",
            "upload_time_iso_8601": "2024-11-14T23:31:43.828340Z",
            "url": "https://files.pythonhosted.org/packages/70/31/f430e8d81790f0fc551911fbe899a2e083df6c94aada6a677c5971c2869f/dextro-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1749e23c951b19eaddd61206078205bf721517a5f844f1ed30312d2e7dfd6a4c",
                "md5": "98cf20bac9aab26ad2c9b593555690e9",
                "sha256": "fb19817811362dfd9975496147b0e8ff2114496ff24f67b4c3a9e106e53be0c5"
            },
            "downloads": -1,
            "filename": "dextro-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "98cf20bac9aab26ad2c9b593555690e9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 11943,
            "upload_time": "2024-11-14T23:31:44",
            "upload_time_iso_8601": "2024-11-14T23:31:44.866083Z",
            "url": "https://files.pythonhosted.org/packages/17/49/e23c951b19eaddd61206078205bf721517a5f844f1ed30312d2e7dfd6a4c/dextro-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-14 23:31:44",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "dextro"
}
