# QuickBPE
This is a much faster version of Andrej Karpathy's MinBPE tokenizer. The core functions are implemented in C++ and exposed to Python via ctypes, so you can still call them conveniently. I have already tokenized the entire TinyStories dataset (~3.5 GB of text) in around 8 minutes. The main bottleneck is now the regex splitting, which is hard to optimize since I decided to keep it in Python (so that the split pattern remains easy to change).
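For illustration, the splitting step on its own looks roughly like this. This is a minimal sketch using the third-party `regex` module, not the package's internal code; the pattern is the same GPT-4 split pattern used in the Quickstart below:
```python
import regex  # third-party "regex" module, needed for \p{...} classes and possessive quantifiers

# GPT-4 style split pattern (same one used in the Quickstart below)
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

chunks = regex.findall(GPT4_SPLIT_PATTERN, "Hello world!")
print(chunks)  # ['Hello', ' world', '!']
# BPE merges are applied within each chunk and never across chunk boundaries.
```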
For training I implemented the linear-time algorithm described in the Wikipedia article on [Re-Pair](https://en.wikipedia.org/wiki/Re-Pair). Training takes about 2 minutes on ~100 MB of text, which seems decent, but there is probably still a lot of room for improvement. Also, the encode function is much slower than the encode_ordinary function when special tokens are spread throughout the text, because of the extra splitting around them. As far as I can see, the only way to fix this is to reimplement that function in C++ as well.
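To make clear what the training step computes, here is a naive reference implementation of the merge loop. It is deliberately simple and quadratic; the package's C++ core implements the linear-time Re-Pair variant instead, so treat this only as an illustration:
```python
from collections import Counter

def naive_bpe_train(chunks, num_merges):
    """Naive reference: repeatedly merge the most frequent adjacent pair.
    chunks: list of strings that have already been regex-split."""
    ids = [list(chunk.encode("utf-8")) for chunk in chunks]
    merges = {}  # (left_id, right_id) -> new token id
    for step in range(num_merges):
        # count adjacent pairs across all chunks
        counts = Counter()
        for seq in ids:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        pair = max(counts, key=counts.get)
        new_id = 256 + step
        merges[pair] = new_id
        # replace every occurrence of the pair with the new id
        for i, seq in enumerate(ids):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == pair:
                    out.append(new_id)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            ids[i] = out
    return merges

print(naive_bpe_train(["hello", " hello", " world"], 3))
```
The naive version rescans all sequences after every merge; the linear-time approach instead keeps track of where each pair occurs and only updates the counts of the pairs adjacent to a merge.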
# Operating systems
Right now only Linux and Windows are supported. Other operating systems might work, but I haven't tested them.
# Quickstart
You can use this repo in the same way as the MinBPE repo. Make sure to use `RegexTokenizerFast` and `encode_ordinary` (the `encode` function is sometimes not as fast, but still faster than the Python version).
Here is production-ready code that trains a tokenizer on ~50 MB of web text (make sure to `pip install datasets` first):
```python
from datasets import load_dataset
from QuickBPE import RegexTokenizerFast
print("loading dataset...")
output_file = "openwebtext.txt"
dataset = load_dataset("stas/openwebtext-10k")
current_size = 0
s = "".join([st["text"] for st in dataset["train"]])
# gpt-4 regex splitting:
regex_pattern = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
print("training tokenizer (takes ~2 Minutes)...")
tokenizer = RegexTokenizerFast(regex_pattern)
k = 10000 # steps of BPE
_, _, time_taken = tokenizer.train(s, 256 + k)
print("time taken:", time_taken)
tokenizer.save("webtokenizer")
sample_text = "This is a sample text"
encoded = tokenizer.encode_ordinary(sample_text)
decoded = tokenizer.decode(encoded)
print("encoded:", encoded)
print("decoded:", decoded)
print(f"Compression ratio: {len(encoded) / len(decoded)}")
```
The tokenizer can be loaded as follows:
```python
from QuickBPE import RegexTokenizerFast
tokenizer = RegexTokenizerFast()
tokenizer.load("webtokenizer.model") # loads the model back from disk
sample_text = "This is a sample text!"
encoded = tokenizer.encode(sample_text, allowed_special="all")
tokens = [tokenizer.decode([tid]) for tid in encoded] # Decode each token
print("tokenization: ", "|".join(tokens))
decoded = tokenizer.decode(encoded)
print("encoded:", encoded)
print("decoded:", decoded)
print(f"Compression ratio: {len(encoded) / len(decoded)}")
```
You can register special tokens in the following way:
```python
from QuickBPE import RegexTokenizerFast
tokenizer = RegexTokenizerFast()
s = "hello world"
tokenizer.train(s, 256 + 3)
tokenizer.register_special_tokens({"<|endoftext|>": 256 + 3}) # register a new special token with id 259
res = tokenizer.encode("<|endoftext|>hello world", allowed_special="all")
print(res) # the first token of the result should be 259
```
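To get a feel for the `encode` vs `encode_ordinary` gap mentioned above, you can time both on text that contains many special tokens. This is only a rough benchmarking sketch: it assumes a tokenizer trained and saved as in the Quickstart (hence the `webtokenizer.model` file and the vocabulary size of 256 + 10000), and the exact numbers depend on your machine:
```python
import time
from QuickBPE import RegexTokenizerFast

tokenizer = RegexTokenizerFast()
tokenizer.load("webtokenizer.model")  # trained and saved as in the Quickstart above
tokenizer.register_special_tokens({"<|endoftext|>": 256 + 10000})  # first free id after the trained vocab

# text with special tokens sprinkled throughout, plus a plain version of similar length
plain = "This is a sample sentence. " * 10_000
with_specials = "This is a sample sentence. <|endoftext|>" * 10_000

t0 = time.perf_counter()
tokenizer.encode_ordinary(plain)
t1 = time.perf_counter()
tokenizer.encode(with_specials, allowed_special="all")
t2 = time.perf_counter()

print(f"encode_ordinary:              {t1 - t0:.3f} s")
print(f"encode (many special tokens): {t2 - t1:.3f} s")
```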
## License
MIT