- Name: hakkero-dataloader
- Version: 1.2.16
- Home page: https://github.com/ericxsun/hakkero-dataloader
- Upload time: 2024-12-04 03:11:16
- Keywords: pytorch, lm, dataloader
- Requirements: numpy, torch, h5py, bitarray, tabulate, scipy, msgspec, msgpack, Pillow

hakkero-dataloader
------------------

A general-purpose dataloader built on top of the PyTorch DataLoader.


## 1. How to use

### 1.1 Build Index

Install the package with `pip install hakkero-dataloader`, then use the `hakkero` command to build an index for your dataset. Its options are listed below.

```shell
hakkero -h

usage: hakkero [-h] [--version] [--filename FILENAME] [--output OUTPUT] --dtype {legacy,message,preference} [--num_workers NUM_WORKERS] [--not_shuf]

build index for dataset

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --filename FILENAME   full filename of jsonl file
  --output OUTPUT       output path for saving data.jsonl and index.h5
  --dtype {legacy,message,preference}
                        data type
  --num_workers NUM_WORKERS
                        number of workers
  --not_shuf            not shuf data
```
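
As a rough illustration of the indexing step, the sketch below writes a small JSONL file and assembles a `hakkero` command line from the flags shown in the help text. The helper names (`write_jsonl`, `build_index_command`), the sample records, and the temporary paths are hypothetical; only the flags themselves come from the help output, and the record layout follows the `legacy` format described in section 2.2.

```python
import json
import shlex
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (UTF-8, non-ASCII kept readable)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def build_index_command(filename, output, dtype, num_workers=None, shuffle=True):
    """Assemble the `hakkero` argument list from the options in the help text."""
    cmd = ["hakkero", "--filename", str(filename), "--output", str(output), "--dtype", dtype]
    if num_workers is not None:
        cmd += ["--num_workers", str(num_workers)]
    if not shuffle:
        cmd.append("--not_shuf")
    return cmd


# Hypothetical sample records in the `legacy` layout (uid + data fields).
records = [
    {"uid": "0", "data": {"title": "Greeting", "text": "Hello world."}},
    {"uid": "1", "data": {"question": "1 + 1 = ?", "label": "2"}},
]

workdir = Path(tempfile.mkdtemp())
jsonl_path = workdir / "data.jsonl"
write_jsonl(records, jsonl_path)

cmd = build_index_command(jsonl_path, workdir / "index_out", dtype="legacy", num_workers=2)
print(shlex.join(cmd))  # to actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps paths with spaces intact when it is later passed to `subprocess.run`.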

### 1.2 Use In Training

```python
from hakkero.dataset import get_dataset

# pretrain or sft
from hakkero.dataset import PadLoader
from hakkero.dataset import UnpadLoader

# preference
from hakkero.dataset import PreferencePadLoader
from hakkero.dataset import PreferenceUnpadLoader

dp_world_size, dp_rank = 1, 0
tokenizer = ...
batch_size = 4
max_length = 4096
n_workers = 2

dataset = get_dataset(
    config="/path/to/dataset",
    tokenizer=tokenizer,
    num_epochs=-1,
    max_length=max_length,
    homogeneous=True,
    seed=9527,
    rank=dp_rank,
    world_size=dp_world_size,
    n_workers=n_workers,
    # segment and tokenize strategy; alternatively, set them in `config` and pass st_segment=None and st_tokenize=None
    st_segment="naive",
    st_tokenize="legacy",
    # add bos/eos token for legacy tokenize strategy
    add_bos_token=True,
    add_eos_token=True,
    # norm dataset weight with tokens of target
    norm_weight_with_n_targets=False,
)

dataloader = UnpadLoader(dataset, max_total_length=batch_size * max_length)
prefetcher = dataloader.prefetch(n_workers)

for step, batch in enumerate(prefetcher, start=0):
    print(batch)
```
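
The `max_total_length=batch_size * max_length` argument suggests that an unpadded batch is bounded by a total token count rather than by a fixed number of samples. The sketch below shows one simple way such a budget could be applied, by greedily grouping samples by length. It is an assumption made for illustration, not the packing logic of `UnpadLoader`; `batch_by_token_budget` and the sample lengths are hypothetical.

```python
def batch_by_token_budget(lengths, max_total_length):
    """Greedily group sample indices so each batch holds at most max_total_length tokens."""
    batches, current, used = [], [], 0
    for idx, n_tokens in enumerate(lengths):
        if n_tokens > max_total_length:
            raise ValueError(f"sample {idx} alone exceeds the token budget")
        # Start a new batch when adding this sample would overflow the budget.
        if current and used + n_tokens > max_total_length:
            batches.append(current)
            current, used = [], 0
        current.append(idx)
        used += n_tokens
    if current:
        batches.append(current)
    return batches


batch_size, max_length = 4, 4096
budget = batch_size * max_length  # total tokens allowed per batch

# Hypothetical token counts of eight samples.
lengths = [4096, 1000, 3000, 4096, 4096, 200, 4096, 512]
batches = batch_by_token_budget(lengths, budget)

for batch in batches:
    print(batch, sum(lengths[i] for i in batch))
```

With a token budget, a batch of short samples can contain more than `batch_size` items, which is the usual motivation for unpadded batching.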

Example of `config`:
```json
{
    "hermes25_1":
    {
        "group": "en",
        "name": "hermes25_1",
        "epoch": 1,
        "path": "hermes25",
        "strategy":
        {
            "st_segment": "integrous",
            "st_tokenize": "hg"
        },
        "weight": 0.5
    },
    "hermes25_2":
    {
        "group": "en",
        "name": "hermes25_2",
        "epoch": 1,
        "path": "hermes25",
        "strategy":
        {
            "st_segment": "integrous",
            "st_tokenize": "hg"
        },
        "weight": 0.5
    }
}
```
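
Before training, it can help to sanity-check a config of this shape. The sketch below parses a config with the same fields as the example, reports missing fields, and normalizes the weights so they sum to 1. The set of required fields is inferred from the example only, and the helpers `missing_fields` and `normalized_weights` are hypothetical; hakkero may validate or weight datasets differently.

```python
import json

# Fields that appear in the example config; whether all are mandatory is an assumption.
EXPECTED_FIELDS = {"group", "name", "epoch", "path", "strategy", "weight"}

CONFIG_TEXT = """
{
    "hermes25_1": {
        "group": "en", "name": "hermes25_1", "epoch": 1, "path": "hermes25",
        "strategy": {"st_segment": "integrous", "st_tokenize": "hg"},
        "weight": 0.5
    },
    "hermes25_2": {
        "group": "en", "name": "hermes25_2", "epoch": 1, "path": "hermes25",
        "strategy": {"st_segment": "integrous", "st_tokenize": "hg"},
        "weight": 0.5
    }
}
"""


def missing_fields(config):
    """Map each dataset key to the expected fields it lacks (empty dict if none)."""
    report = {}
    for key, entry in config.items():
        missing = EXPECTED_FIELDS - entry.keys()
        if missing:
            report[key] = sorted(missing)
    return report


def normalized_weights(config):
    """Scale dataset weights so that they sum to 1."""
    total = sum(entry["weight"] for entry in config.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {key: entry["weight"] / total for key, entry in config.items()}


config = json.loads(CONFIG_TEXT)
print(missing_fields(config))
print(normalized_weights(config))
```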

## 2. Supported Strategies

See [segmentation.py](./hakkero/dataset/strategy/segmentation.py) and [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.

### 2.1 Segmentation Strategies

- `integrous`: discards any sample that exceeds `max_length`.
- `concat`: splits long samples, concatenates each piece with the previous segment, and shuffles all segments.
  - Does not support preference data.
- `naive`: splits long samples at random lengths and shuffles all segments.
  - Does not support preference data.
- `unbiased`: splits samples that exceed `max_length` at random lengths and shuffles all segments.
  - Does not support preference data.
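
To make the descriptions above concrete, here is a minimal sketch of two of the ideas: dropping over-length samples (as `integrous` is described) and splitting over-length samples into random-length pieces before shuffling (close to the description of `unbiased`). This is an illustration written from the one-line descriptions, not the code in `segmentation.py`; the function names and the choice of piece lengths in `[1, max_length]` are assumptions.

```python
import random


def keep_fitting(samples, max_length):
    """Keep only samples that fit within max_length tokens."""
    return [tokens for tokens in samples if len(tokens) <= max_length]


def split_random(tokens, max_length, rng):
    """Cut one sample into consecutive pieces of random length in [1, max_length]."""
    pieces, start = [], 0
    while start < len(tokens):
        size = rng.randint(1, max_length)
        pieces.append(tokens[start:start + size])
        start += size
    return pieces


def segment_and_shuffle(samples, max_length, seed=9527):
    """Split over-length samples at random lengths, then shuffle all segments."""
    rng = random.Random(seed)
    segments = []
    for tokens in samples:
        if len(tokens) > max_length:
            segments.extend(split_random(tokens, max_length, rng))
        else:
            segments.append(tokens)
    rng.shuffle(segments)
    return segments


# Hypothetical token-id sequences of length 5, 12 and 3.
samples = [list(range(0, 5)), list(range(100, 112)), list(range(200, 203))]

print(keep_fitting(samples, max_length=5))
print(segment_and_shuffle(samples, max_length=5))
```

Dropping samples loses data but keeps every sample whole, while splitting keeps all tokens at the cost of breaking long samples apart; that trade-off is presumably why both kinds of strategy are offered.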

### 2.2 Tokenization Strategies

- `legacy`: joins the text fields with `\n\n` as the delimiter and encodes the result with `tokenizer.encode`.
  - Format of input data:
    ```json
    {
      "uid": "xxx",
      "data":
      {
          "title": "xxx",
          "summary": "xxx",
          "abstract": "xxx",
          "text": "xxx",
          "question": "xxx",
          "answer": "xxx",
          "code": "xxx",
          "label": "xxx"
      }
    }
    ```

    - All fields except `label` are stripped and joined with "\n\n" to form the context.
    - `label` is the target to learn during finetuning (pretraining data should not have a `label` field).
    - See the function `legacy` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
  - Extra parameters: `add_bos_token`, `add_eos_token`

- `hg`: Hugging Face message data; uses `tokenizer.apply_chat_template` to encode the input.
  - Format of input data:
    ```json
    {
      "uid": "xx",
      "data": [
        {"role": "user", "content": "xxx"},
        {"role": "assistant", "content": "xxx"},
         ...
      ]
    }
    ```

    See the function `huggingface_message` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.

- `chatml`: chat message data; uses ChatML to encode the input.
  - Format of input data:
    ```json
    {
      "uid": "xx",
      "data": [
        {"role": "user", "content": "xxx"},
        {"role": "assistant", "content": "xxx"},
         ...
      ]
    }
    ```

    See the function `chatml_message` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
- `chatml_qwen2_vl_message`: chat message data with vision-language (VL) content; uses ChatML to encode the input.
  - Format of input data:
    ```json
    {
      "uid": "xx",
      "data": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "images/2.jpg"
                },
                {
                    "type": "text",
                    "text": "Who is he?"
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "He is Thomas Müller from Bayern Munich."
                }
            ]
        },
         ...
      ]
    }
    ```

    See the function `chatml_qwen2_vl_message` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
    This strategy supports only the `integrous` segmentation strategy.

- `hg_preference`: preference data; uses `tokenizer.apply_chat_template` to encode the input.
  - Format of input data:
    ```json
    {
      "uid": "xx",
      "data": {
        "context": [
          {"role": "user", "content": "xxx"},
          {"role": "assistant", "content": "xxx"},
          ...
          {"role": "user", "content": "xxx"}
        ],
        "chosen": "chosen response",
        "rejected": "rejected response"
      }
    }
    ```
    
    See the function `huggingface_preference` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.

- `chatml_preference`: preference data; uses ChatML to encode the input.
  - Format of input data:
    ```json
    {
      "uid": "xx",
      "data": {
        "context": [
          {"role": "user", "content": "xxx"},
          {"role": "assistant", "content": "xxx"},
          ...
          {"role": "user", "content": "xxx"}
        ],
        "chosen": "chosen response",
        "rejected": "rejected response"
      }
    }
    ```
    
    See the function `chatml_preference` in [tokenization.py](./hakkero/dataset/strategy/tokenization.py) for more details.
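
In the preference formats above, `context` ends with a user turn and `chosen` / `rejected` are two candidate replies to it. The sketch below expands one such record into two full conversations that share the same context. Appending each reply as an `assistant` message is an assumption made for illustration, and `preference_to_conversations` is a hypothetical helper, not part of hakkero.

```python
def preference_to_conversations(record):
    """Expand a preference record into (chosen, rejected) message lists."""
    data = record["data"]
    context = data["context"]
    if not context or context[-1]["role"] != "user":
        raise ValueError("context is expected to end with a user turn")
    # Both conversations share the context and differ only in the final reply.
    chosen = context + [{"role": "assistant", "content": data["chosen"]}]
    rejected = context + [{"role": "assistant", "content": data["rejected"]}]
    return chosen, rejected


# Hypothetical record following the documented layout.
record = {
    "uid": "demo-0",
    "data": {
        "context": [
            {"role": "user", "content": "Hi"},
            {"role": "assistant", "content": "Hello, how can I help?"},
            {"role": "user", "content": "Name a prime number."},
        ],
        "chosen": "7 is a prime number.",
        "rejected": "8 is a prime number.",
    },
}

chosen, rejected = preference_to_conversations(record)
print(chosen[-1])
print(rejected[-1])
```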

            
