hakkero-dataloader
------------------
A general dataloader built on top of the PyTorch DataLoader.
## 1. How to use
### 1.1 Build Index
Install the package with `pip install hakkero-dataloader`, then use the `hakkero` command to build the index.
```shell
hakkero -h
usage: hakkero [-h] [--version] [--filename FILENAME] [--output OUTPUT] --dtype {legacy,message,preference} [--num_workers NUM_WORKERS] [--not_shuf]
build index for dataset
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--filename FILENAME full filename of jsonl file
--output OUTPUT output path for saving data.jsonl and index.h5
--dtype {legacy,message,preference}
data type
--num_workers NUM_WORKERS
number of workers
--not_shuf not shuf data
```
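As a minimal sketch of preparing input for the indexer, the snippet below writes a small JSONL file (one JSON object per line) in the `legacy` layout described in section 2.2. The helper `write_jsonl`, the file name `data.jsonl`, and the sample records are illustrative assumptions, not part of the package.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line to `path` and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Hypothetical sample records in the `legacy` format.
records = [
    {"uid": "0", "data": {"title": "Hello", "text": "First document."}},
    {"uid": "1", "data": {"question": "1 + 1 = ?", "label": "2"}},
]

jsonl_path = write_jsonl(records, os.path.join(tempfile.mkdtemp(), "data.jsonl"))
print(jsonl_path)
```

Going by the help text above, such a file could then be indexed with something like `hakkero --filename /path/to/data.jsonl --output /path/to/dataset --dtype legacy`.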
### 1.2 Use In Training
```python
from hakkero.dataset import get_dataset
# pretrain or sft
from hakkero.dataset import PadLoader
from hakkero.dataset import UnpadLoader
# preference
from hakkero.dataset import PreferencePadLoader
from hakkero.dataset import PreferenceUnpadLoader
dp_world_size, dp_rank = 1, 0
tokenizer = ...
batch_size = 4
max_length = 4096
n_workers = 2
dataset = get_dataset(
config="/path/to/dataset",
tokenizer=tokenizer,
num_epochs=-1,
max_length=max_length,
homogeneous=True,
seed=9527,
rank=dp_rank,
world_size=dp_world_size,
n_workers=n_workers,
# segment and tokenize strategies; alternatively set them per dataset in `config` and pass st_segment=None and st_tokenize=None here
st_segment="naive",
st_tokenize="legacy",
# add bos/eos token for legacy tokenize strategy
add_bos_token=True,
add_eos_token=True,
# normalize dataset weights by the number of target tokens
norm_weight_with_n_targets=False,
)
dataloader = UnpadLoader(dataset, max_total_length=batch_size * max_length)
prefetcher = dataloader.prefetch(n_workers)
for step, batch in enumerate(prefetcher, start=0):
print(batch)
```
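The training example hard-codes `dp_world_size, dp_rank = 1, 0`. In a multi-process job these values usually come from the launcher. The sketch below reads them from the `WORLD_SIZE` and `RANK` environment variables that `torchrun` sets; the helper name `data_parallel_info` is hypothetical, and it assumes pure data parallelism (with tensor or pipeline parallelism the data-parallel rank generally differs from the global rank).

```python
import os


def data_parallel_info(env=None):
    """Read (world_size, rank) from torchrun-style environment variables.

    Falls back to a single-process setup when the variables are absent.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", 1))
    rank = int(env.get("RANK", 0))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world size {world_size}")
    return world_size, rank


# Explicit mapping used here so the example runs without a launcher.
dp_world_size, dp_rank = data_parallel_info({"WORLD_SIZE": "4", "RANK": "2"})
print(dp_world_size, dp_rank)
```

The resulting pair can be passed as `world_size` and `rank` to `get_dataset`.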
Example of `config`:
```json
{
"hermes25_1":
{
"group": "en",
"name": "hermes25_1",
"epoch": 1,
"path": "hermes25",
"strategy":
{
"st_segment": "integrous",
"st_tokenize": "hg"
},
"weight": 0.5
},
"hermes25_2":
{
"group": "en",
"name": "hermes25_2",
"epoch": 1,
"path": "hermes25",
"strategy":
{
"st_segment": "integrous",
"st_tokenize": "hg"
},
"weight": 0.5
}
}
```
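Before training, it can help to sanity-check a config like the one above. The sketch below verifies that each entry has the keys used in the example and computes normalized sampling weights. Whether the library requires all of these keys, or normalizes weights this way, is not stated here; `REQUIRED_KEYS` and `check_config` are hypothetical names for illustration only.

```python
# Keys taken from the example config; treating them as required is an assumption.
REQUIRED_KEYS = {"group", "name", "epoch", "path", "strategy", "weight"}


def check_config(config):
    """Validate entries and return weights normalized to sum to 1."""
    for key, entry in config.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"dataset {key!r} is missing keys: {sorted(missing)}")
    total = sum(entry["weight"] for entry in config.values())
    if total <= 0:
        raise ValueError("the sum of dataset weights must be positive")
    return {key: entry["weight"] / total for key, entry in config.items()}


strategy = {"st_segment": "integrous", "st_tokenize": "hg"}
config = {
    "hermes25_1": {"group": "en", "name": "hermes25_1", "epoch": 1,
                   "path": "hermes25", "strategy": strategy, "weight": 1},
    "hermes25_2": {"group": "en", "name": "hermes25_2", "epoch": 1,
                   "path": "hermes25", "strategy": strategy, "weight": 3},
}
print(check_config(config))
```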
## 2. Supported Strategies
See [segmentation.py](./hakkero/dataset/strategy/segmentation.py) and [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
### 2.1 Segmentation Strategies
- `integrous`: discard samples that are too long, i.e. that exceed `max_length`.
- `concat`: split long samples, concatenate each with the previous segment, and shuffle all segments.
  - Does not support preference data.
- `naive`: split long samples into segments of random length and shuffle all segments.
  - Does not support preference data.
- `unbiased`: split long samples that exceed `max_length` into segments of random length and shuffle all segments.
  - Does not support preference data.
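To make the difference between discarding and splitting concrete, here is a rough sketch over plain token lists. It is not the code in `segmentation.py`: the function names are hypothetical, and the reading of "random length" (a uniform draw between 1 and `max_length`) is an assumption.

```python
import random


def integrous(samples, max_length):
    """Keep only samples that fit within max_length tokens."""
    return [s for s in samples if len(s) <= max_length]


def random_split(sample, max_length, rng):
    """Split one token list into consecutive segments of random length
    drawn uniformly from [1, max_length]."""
    segments, start = [], 0
    while start < len(sample):
        size = rng.randint(1, max_length)
        segments.append(sample[start:start + size])
        start += size
    return segments


def split_and_shuffle(samples, max_length, seed=0):
    """Split over-long samples, keep short ones whole, then shuffle all segments."""
    rng = random.Random(seed)
    segments = []
    for sample in samples:
        if len(sample) <= max_length:
            segments.append(sample)
        else:
            segments.extend(random_split(sample, max_length, rng))
    rng.shuffle(segments)
    return segments


samples = [list(range(0, 2)), list(range(10, 20))]
print(integrous(samples, max_length=3))
print(split_and_shuffle(samples, max_length=3))
```

The first call drops the ten-token sample entirely, while the second keeps every token but spreads it over several shorter segments.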
### 2.2 Tokenization Strategies
- `legacy`: joins the text fields with `\n\n` as the delimiter and encodes the input with `tokenizer.encode`.
  - Format of input data:
```json
{
"uid": "xxx",
"data":
{
"title": "xxx",
"summary": "xxx",
"abstract": "xxx",
"text": "xxx",
"question": "xxx",
"answer": "xxx",
"code": "xxx",
"label": "xxx"
}
}
```
  - All fields except `label` are stripped and joined with `"\n\n"` to form the context.
  - `label` is the target to learn during fine-tuning (pretraining data should not have a `label` field).
  - See the function `legacy` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
  - Extra parameters: `add_bos_token`, `add_eos_token`.
- `hg`: Hugging Face message data; uses `tokenizer.apply_chat_template` to encode the input.
  - Format of input data:
```json
{
"uid": "xx",
"data": [
{"role": "user", "content": "xxx"},
{"role": "assistant", "content": "xxx"},
...
]
}
```
See func `huggingface_message` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
- `chatml`: chat message data; uses ChatML to encode the input.
  - Format of input data:
```json
{
"uid": "xx",
"data": [
{"role": "user", "content": "xxx"},
{"role": "assistant", "content": "xxx"},
...
]
}
```
See func `chatml_message` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
- `chatml_qwen2_vl_message`: vision-language chat message data; uses ChatML to encode the input.
  - Format of input data:
```json
{
"uid": "xx",
"data": [
{
"role": "user",
"content": [
{
"type": "image",
"image": "images/2.jpg"
},
{
"type": "text",
"text": "Who is he?"
}
]
},
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "He is Thomas Müller from Bayern Munich."
}
]
},
...
]
}
```
See func `chatml_qwen2_vl_message` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
Only the `integrous` segmentation strategy is supported.
- `hg_preference`: preference data; uses `tokenizer.apply_chat_template` to encode the input.
  - Format of input data:
```json
{
"uid": "xx",
"data": {
"context": [
{"role": "user", "content": "xxx"},
{"role": "assistant", "content": "xxx"},
...
{"role": "user", "content": "xxx"}
],
"chosen": "chosen response",
"rejected": "rejected response"
}
}
```
See func `huggingface_preference` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
- `chatml_preference`: preference data; uses ChatML to encode the input.
  - Format of input data:
```json
{
"uid": "xx",
"data": {
"context": [
{"role": "user", "content": "xxx"},
{"role": "assistant", "content": "xxx"},
...
{"role": "user", "content": "xxx"}
],
"chosen": "chosen response",
"rejected": "rejected response"
}
}
```
See func `chatml_preference` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
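Returning to the `legacy` strategy at the top of this list, the field-joining rule can be sketched in a few lines. This is not the library's `legacy` function: `build_legacy_text` is a hypothetical helper, and both the field order (the record's own key order) and the skipping of empty fields are assumptions.

```python
def build_legacy_text(record, delimiter="\n\n"):
    """Return (context, target) for a record in the `legacy` format.

    The context joins every stripped field except `label`; the target is
    the `label` value, or None for pretraining data without a label.
    """
    data = record["data"]
    parts = [str(value).strip() for key, value in data.items() if key != "label"]
    context = delimiter.join(part for part in parts if part)  # skip empty fields (assumption)
    return context, data.get("label")


sft_record = {"uid": "0", "data": {"title": " T ", "text": "body", "label": "y"}}
pretrain_record = {"uid": "1", "data": {"text": "just text"}}

print(build_legacy_text(sft_record))
print(build_legacy_text(pretrain_record))
```

In the real strategy the resulting text is then encoded with `tokenizer.encode`, optionally adding BOS/EOS tokens according to `add_bos_token` and `add_eos_token`.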