hakkero-dataloader
------------------
A general dataloader built on top of the PyTorch DataLoader.
## 1. How to use
### 1.1 Build Index
Install via `pip install hakkero-dataloader`, then run the following command to build the index.
```shell
hakkero -h
usage: hakkero [-h] --filename FILENAME [--output OUTPUT]
build index for dataset
options:
  -h, --help           show this help message and exit
  --filename FILENAME  full filename of jsonl file
  --output OUTPUT      output path for saving data.jsonl and index.h5
```
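For example, to index a JSONL file (paths are placeholders):
```shell
hakkero --filename /path/to/dataset.jsonl --output /path/to/output
```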
### 1.2 Use In Training
```python
from hakkero.dataset import get_dataset
# pretrain or sft
from hakkero.dataset import PadLoader
from hakkero.dataset import UnpadLoader
# preference
from hakkero.dataset import PreferencePadLoader
from hakkero.dataset import PreferenceUnpadLoader
dp_world_size, dp_rank = 1, 0
tokenizer = ...
batch_size = 4
max_length = 4096
n_workers = 2
dataset = get_dataset(
    config="/path/to/dataset",
    tokenizer=tokenizer,
    num_epochs=-1,
    max_length=max_length,
    homogeneous=True,
    seed=9527,
    rank=dp_rank,
    world_size=dp_world_size,
    n_workers=n_workers,
    # set the segment and tokenize strategies here, or configure them per
    # dataset in `config` and leave strategy_segment=None and strategy_tokenize=None
    strategy_segment="naive",
    strategy_tokenize="legacy",
    # add bos/eos tokens for the legacy tokenize strategy
    add_bos_token=True,
    add_eos_token=True,
)
dataloader = UnpadLoader(dataset, max_total_length=batch_size * max_length)
prefetcher = dataloader.prefetch(n_workers)
for step, batch in enumerate(prefetcher, start=0):
    print(batch)
```
Example of `config`:
```json
{
    "hermes25_1": {
        "group": "en",
        "name": "hermes25_1",
        "epoch": 1,
        "path": "hermes25",
        "strategy": {
            "st_segment": "integrous",
            "st_tokenize": "hg"
        },
        "weight": 0.5
    },
    "hermes25_2": {
        "group": "en",
        "name": "hermes25_2",
        "epoch": 1,
        "path": "hermes25",
        "strategy": {
            "st_segment": "integrous",
            "st_tokenize": "hg"
        },
        "weight": 0.5
    }
}
```
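Since the config is plain JSON, it can also be generated programmatically; a minimal sketch (the output file name `dataset.json` is an assumption; point the `config` argument of `get_dataset` at wherever you write it):
```python
import json

# hypothetical script generating the example config above
config = {
    name: {
        "group": "en",
        "name": name,
        "epoch": 1,
        "path": "hermes25",  # dataset indexed in step 1.1
        "strategy": {"st_segment": "integrous", "st_tokenize": "hg"},
        "weight": 0.5,
    }
    for name in ("hermes25_1", "hermes25_2")
}

with open("dataset.json", "w") as f:
    json.dump(config, f, indent=4)
```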
## 2. Supported Strategies
See [segmentation.py](./hakkero/dataset/segmentation.py) and [tokenization.py](./hakkero/dataset/tokenization.py) for more details.
### 2.1 Segmentation Strategies
- `integrous`: discard samples that are too long, i.e. exceed `max_length`
- `concat`: split long samples, concatenate the pieces with the preceding segments, and shuffle all segments
  - does not support preference data
- `naive`: split long samples at random lengths and shuffle all segments
  - does not support preference data
- `unbiased`: split only samples that exceed `max_length`, at random lengths, and shuffle all segments (see the sketch after this list)
  - does not support preference data
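A conceptual sketch of how these strategies treat a single tokenized sample; this is an illustration only, not the library's implementation (the real logic, including `concat`'s cross-sample packing, lives in [segmentation.py](./hakkero/dataset/segmentation.py)):
```python
import random


def integrous(sample, max_length):
    # keep the sample whole, or discard it entirely if it is too long
    return [sample] if len(sample) <= max_length else []


def naive(sample, max_length):
    # cut the sample into chunks of random length (each at most max_length)
    segments, start = [], 0
    while start < len(sample):
        length = random.randint(1, max_length)
        segments.append(sample[start:start + length])
        start += length
    return segments


def unbiased(sample, max_length):
    # like `naive`, but samples that already fit are never split
    return [sample] if len(sample) <= max_length else naive(sample, max_length)
```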
### 2.2 Tokenization Strategies
- `legacy`: join text fields with `\n\n` as the delimiter and encode the result with `tokenizer.encode`.
- format of input data
  ```json
  {
      "uid": "xxx",
      "data": {
          "title": "xxx",
          "summary": "xxx",
          "abstract": "xxx",
          "text": "xxx",
          "question": "xxx",
          "answer": "xxx",
          "code": "xxx",
          "label": "xxx"
      }
  }
  ```
  - All fields except `label` are stripped and joined with `"\n\n"` to form the context.
  - `label` is the target to learn for finetuning (pretrain data should not have a `label` field).
  - See the `legacy` function in [tokenization.py](./hakkero/dataset/tokenization.py) for more details; a rough sketch follows below.
  - extra parameters: `add_bos_token`, `add_eos_token`
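  A rough sketch of this strategy, assuming a Hugging Face tokenizer (an illustration only; the real `legacy` function may differ in details such as field order and special-token handling):
  ```python
  def legacy_encode(tokenizer, data, add_bos_token=True, add_eos_token=True):
      # every field except "label" contributes to the context
      context = "\n\n".join(
          str(v).strip() for k, v in data.items() if k != "label" and v
      )
      ids = []
      if add_bos_token:
          ids.append(tokenizer.bos_token_id)
      ids += tokenizer.encode(context, add_special_tokens=False)
      if "label" in data:  # finetuning target; absent in pretrain data
          ids += tokenizer.encode(data["label"], add_special_tokens=False)
      if add_eos_token:
          ids.append(tokenizer.eos_token_id)
      return ids
  ```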
- `hg`: Hugging Face message data; uses `tokenizer.apply_chat_template` to encode the input.
- format of input data
  ```json
  {
      "uid": "xx",
      "data": [
          {"role": "user", "content": "xxx"},
          {"role": "assistant", "content": "xxx"},
          ...
      ]
  }
  ```
  See the `huggingface_message` function in [tokenization.py](./hakkero/dataset/tokenization.py) for more details.
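  For reference, `tokenizer.apply_chat_template` is the standard Hugging Face `transformers` API; a minimal usage sketch (the model name is only an example):
  ```python
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
  messages = [
      {"role": "user", "content": "xxx"},
      {"role": "assistant", "content": "xxx"},
  ]
  text = tokenizer.apply_chat_template(messages, tokenize=False)  # rendered string
  ids = tokenizer.apply_chat_template(messages, tokenize=True)    # token ids
  ```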
- `chatml`: chat message data; uses the ChatML format to encode the input.
- format of input data
  ```json
  {
      "uid": "xx",
      "data": [
          {"role": "user", "content": "xxx"},
          {"role": "assistant", "content": "xxx"},
          ...
      ]
  }
  ```
  See the `chatml_message` function in [tokenization.py](./hakkero/dataset/tokenization.py) for more details.
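  For reference, such a message list renders roughly as follows in ChatML before tokenization (the exact template the library uses is defined in `chatml_message`):
  ```text
  <|im_start|>user
  xxx<|im_end|>
  <|im_start|>assistant
  xxx<|im_end|>
  ```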
- `hg_preference`: preference data; uses `tokenizer.apply_chat_template` to encode the input.
- format of input data
  ```json
  {
      "uid": "xx",
      "data": {
          "context": [
              {"role": "user", "content": "xxx"},
              {"role": "assistant", "content": "xxx"},
              ...
              {"role": "user", "content": "xxx"}
          ],
          "chosen": "chosen response",
          "rejected": "rejected response"
      }
  }
  ```
  See the `huggingface_preference` function in [tokenization.py](./hakkero/dataset/tokenization.py) for more details.
- `chatml_preference`: preference data; uses the ChatML format to encode the input.
- format of input data
  ```json
  {
      "uid": "xx",
      "data": {
          "context": [
              {"role": "user", "content": "xxx"},
              {"role": "assistant", "content": "xxx"},
              ...
              {"role": "user", "content": "xxx"}
          ],
          "chosen": "chosen response",
          "rejected": "rejected response"
      }
  }
  ```
  See the `chatml_preference` function in [tokenization.py](./hakkero/dataset/tokenization.py) for more details.
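Both preference strategies produce the same shape: the shared `context` is encoded once and each response is appended to it separately. A rough sketch assuming a Hugging Face tokenizer (an illustration, not the library's exact code):
```python
def preference_encode(tokenizer, data):
    # render the shared conversation prefix, ending with a generation
    # prompt so each candidate response can be appended directly
    context_ids = tokenizer.apply_chat_template(
        data["context"], tokenize=True, add_generation_prompt=True
    )
    chosen = tokenizer.encode(data["chosen"], add_special_tokens=False)
    rejected = tokenizer.encode(data["rejected"], add_special_tokens=False)
    return context_ids + chosen, context_ids + rejected
```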