epochraft

Name: epochraft
Version: 0.1.0.dev20231107
Summary: Supercharge Your LLM Training with Checkpointable Data Loading
Author: Takuya Akiba (takuya.akiba@stability.ai)
Repository: https://github.com/iwiwi/epochraft
Requires Python: >=3.8
License: MIT License (Copyright (c) 2023 Takuya Akiba)
Upload time: 2023-11-07 06:40:37
# Epochraft

[![Python](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10%20%7C%203.11-blue)](https://www.python.org)
[![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/iwiwi/epochraft)
[![Checks status](https://github.com/iwiwi/epochraft/actions/workflows/checks.yml/badge.svg?branch=main)](https://github.com/iwiwi/epochraft/actions)
[![Tests status](https://github.com/iwiwi/epochraft/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/iwiwi/epochraft/actions)
[![pypi](https://img.shields.io/pypi/v/epochraft.svg)](https://pypi.python.org/pypi/epochraft)



## Introduction

*Epochraft* is a data loader library optimized for the streamlined training of LLMs, featuring **streaming from cloud storage**, **on-the-fly tokenization**, and **iterator checkpointing**. The name comes from a fusion of "epoch" and "craft".


### Streaming from Cloud Storage

Storing the vast datasets required for pretraining LLMs on local disks can be daunting. Even when it is feasible, transferring the data prior to training is cumbersome and time-consuming.

Epochraft supports a wide array of storage backends, including S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, and the local filesystem (via [smart-open](https://github.com/RaRe-Technologies/smart_open/)). One of its salient features is the ability to train while concurrently downloading data. Because of its streaming-based architecture, a complete shuffle of the data isn't possible. However, Epochraft achieves a degree of shuffling by reading multiple data shards simultaneously, interleaving the incoming data, and then performing an additional shuffle within a buffer of predetermined size.
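To make the streaming-plus-buffering idea concrete, here is a minimal, library-agnostic sketch (not Epochraft's implementation; the bucket path is a placeholder, and a real loader would also read several shards concurrently) that streams JSONL shards with `smart-open` and applies a fixed-size shuffle buffer:

```python
import json
import random

from smart_open import open as sopen  # pip install smart_open[s3]


def stream_jsonl(urls):
    """Yield JSON records from remote shards line by line, without a local copy."""
    for url in urls:
        with sopen(url) as f:
            for line in f:
                yield json.loads(line)


def buffered_shuffle(samples, buffer_size=1000):
    """Approximate shuffle: fill a buffer, then emit a random element per step."""
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            yield buffer.pop(random.randrange(len(buffer)))
    while buffer:
        yield buffer.pop(random.randrange(len(buffer)))


# Placeholder shard URLs; shard-level shuffling happens here.
urls = [f"s3://my-bucket/corpus/shard_{i:02d}.jsonl" for i in range(100)]
random.shuffle(urls)

for record in buffered_shuffle(stream_jsonl(urls), buffer_size=1000):
    ...  # feed `record["text"]` into tokenization, etc.
```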

Epochraft also accepts Python sequences and iterables as data sources. For instance, it can consume [Hugging Face Datasets](https://github.com/huggingface/datasets) directly. While there may seem to be little benefit to using Epochraft with such small datasets, this makes it possible to use the same codebase for both SFT and pretraining.
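As a sketch of this use case (the constructor name `from_sequence` and its arguments are assumptions here, not confirmed API; check the Epochraft documentation for the actual entry point), a Hugging Face dataset could be fed through the same pipeline used for pretraining:

```python
# Hypothetical: `from_sequence` is an assumed constructor name for wrapping an
# in-memory / indexable sequence; verify against the Epochraft documentation.
from datasets import load_dataset
from epochraft import CheckpointableDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
hf_dataset = load_dataset("tatsu-lab/alpaca", split="train")  # any dataset with a "text" column

sft_dataset = (
    CheckpointableDataset
    .from_sequence(hf_dataset, repeat=True, shuffle=True)  # assumed signature
    .tokenize(tokenizer)
    .concat_chunk(1024)
    .batch(8)
)
```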




### On-the-Fly Tokenization

Some previous frameworks require pre-tokenization: the training data must be tokenized and stored before pretraining can begin. This is cumbersome. Training cannot start until this step is completed, and if the dataset or the tokenizer changes, the step has to be repeated. Furthermore, there is the added responsibility of managing the tokenized data.

Now, you might wonder, "Isn't on-the-fly tokenization too slow?" The answer is a resounding no.

For instance, training Llama2-7B proceeds at approximately 3K tokens/sec per GPU (see [Table 2](https://arxiv.org/abs/2307.09288)). The Llama2 tokenizer can process nearly 1M tokens/sec with a single CPU process. This means that even when tokenizing in real time, the GPUs can be fully utilized without a bottleneck. For larger models, the situation becomes even more favorable: a rate of 1.5K tokens/sec is sufficient to saturate each GPU for a 13B model, and only 300 tokens/sec for a 70B model.
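This arithmetic is easy to sanity-check on your own hardware. The snippet below (a rough, single-process measurement with a synthetic placeholder corpus) times a Hugging Face tokenizer and reports tokens per second:

```python
import time

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Synthetic placeholder corpus: 256 documents of a few thousand characters each.
docs = ["The quick brown fox jumps over the lazy dog. " * 100] * 256

start = time.perf_counter()
input_ids = tokenizer(docs)["input_ids"]
elapsed = time.perf_counter() - start

total_tokens = sum(len(ids) for ids in input_ids)
print(f"{total_tokens / elapsed:,.0f} tokens/sec in a single process")
```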



### Data Loader Checkpointing

Beyond the state_dicts of models and optimizers, shouldn't we consider saving the state_dict of the data loader as well?

Back when training ResNets for 90 epochs was the norm, this wasn't a concern: a checkpoint at the end of each epoch sufficed. In the current era of LLMs, however, training often revolves around a single epoch.

When training for just one epoch, it becomes crucial that the data loader can pick up from where it left off in the middle of an epoch. Upon resuming training, only the data that had not yet been consumed at the point of interruption should be processed. Given the vastness of the data, an efficient resumption mechanism is essential.
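To illustrate what such a mechanism needs to track (a toy, library-agnostic sketch, not Epochraft's implementation), a resumable JSONL iterator only has to record which shard it is in and how far into that shard it has read:

```python
import json

from smart_open import open as sopen


class ResumableJsonlIterator:
    """Toy example: iterate over JSONL shards while tracking (shard, line) position."""

    def __init__(self, urls, shard_index=0, line_index=0):
        self.urls = urls
        self.shard_index = shard_index
        self.line_index = line_index

    def __iter__(self):
        for si in range(self.shard_index, len(self.urls)):
            with sopen(self.urls[si]) as f:
                for li, line in enumerate(f):
                    if si == self.shard_index and li < self.line_index:
                        continue  # fast-forward only within the partially consumed shard
                    self.shard_index, self.line_index = si, li + 1
                    yield json.loads(line)
            self.line_index = 0

    def state_dict(self):
        return {"shard_index": self.shard_index, "line_index": self.line_index}


# Save `it.state_dict()` alongside the model checkpoint; to resume:
#   it = ResumableJsonlIterator(urls, **saved_state)
```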





## Quick Start

### Installation

```
pip install epochraft
```

### Example

This is an example of building a typical pretraining dataset. We will soon add other examples such as SFT.

```python
from epochraft import CheckpointableDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# `{00..99}` will be expanded (see `braceexpand`)
url = "s3://.../cc-100/cc-100_{00..99}.jsonl"

train_dataset = (
    CheckpointableDataset
    .from_files(url, repeat=True, shuffle_shards=True)
    .tokenize(tokenizer)        # Tokenize the texts
    .ensure_bos_eos(tokenizer)  # Add BOS and EOS tokens where necessary
    .concat_chunk(1024)         # Concatenate and chunk the tokens into a fixed length of 1024 tokens
    .shuffle(1000)              # Shuffle the sequences using a buffer of size 1000
    .batch(8)                   # Group the data into mini-batches with a batch size of 8
)

for batch in train_dataset:
    input_ids = batch["input_ids"]  # Input data for this iteration (torch.Tensor)

    # Implement the training iteration using `input_ids` here
    ...

```
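For reference, the `{00..99}` pattern in `url` above is expanded by the [`braceexpand`](https://pypi.org/project/braceexpand/) package, roughly like this:

```python
from braceexpand import braceexpand

# `{00..99}` expands to 100 zero-padded shard names; first three shown.
print(list(braceexpand("cc-100_{00..99}.jsonl"))[:3])
# ['cc-100_00.jsonl', 'cc-100_01.jsonl', 'cc-100_02.jsonl']
```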

### Checkpointing

Normally, you would obtain and save the `state_dict` of the model and optimizer. In addition, please also obtain and save the `state_dict` of the iterator:

```python
train_iter = train_dataset.iter()  # Same meaning as `iter(train_dataset)`

for batch in train_iter:
    step = batch["step"]
    ...

    if step % ckpt_freq == 0:
        state_dict = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "iter": train_iter.state_dict(),
        }
        torch.save(state_dict, ckpt_path)
```

### Resumption

You can restore the state of the iterator by passing the `state_dict` to the `iter` method of the `CheckpointableDataset` instance.


```python
state_dict = torch.load(ckpt_path)
train_iter = train_dataset.iter(state_dict=state_dict["iter"])
```
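Put together, a typical resumption step restores the model and optimizer from the same checkpoint (assuming `model`, `optimizer`, `train_dataset`, and `ckpt_path` are set up as in the previous sections):

```python
state_dict = torch.load(ckpt_path)

model.load_state_dict(state_dict["model"])
optimizer.load_state_dict(state_dict["optimizer"])
train_iter = train_dataset.iter(state_dict=state_dict["iter"])

for batch in train_iter:
    ...  # training resumes from the first batch that had not yet been consumed
```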

## Development


```
pip install -e .[development]
mypy .; black .; flake8 .; isort .
pytest tests
```

            
