# QuickBPE
This is a much faster version of Andrej Karpathy's MinBPE tokenizer. The core functions are implemented in C++ and exposed to Python via ctypes, so you can still call them conveniently. I have already tokenized the entire TinyStories dataset (~3.5 GB of text) in around 8 minutes. The main bottleneck is now the regex splitting, which is hard to optimize since I decided to keep it in Python (so that the split pattern remains easy to change).
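For illustration, the splitting step on its own looks roughly like this. This is a minimal sketch using the third-party `regex` module, not the package's internal code; the pattern is the same GPT-4 split pattern used in the Quickstart below:
```python
import regex  # third-party "regex" module, needed for \p{...} classes and possessive quantifiers

# GPT-4 style split pattern (same one used in the Quickstart below)
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

chunks = regex.findall(GPT4_SPLIT_PATTERN, "Hello world!")
print(chunks)  # ['Hello', ' world', '!']
# BPE merges are applied within each chunk and never across chunk boundaries.
```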
For training I implemented the linear-time algorithm described in the Wikipedia article on [Re-Pair](https://en.wikipedia.org/wiki/Re-Pair). Training takes about 2 minutes on ~100 MB of text, which seems decent, but there is probably still a lot of room for improvement. Also, the encode function is much slower than the encode_ordinary function when special tokens are spread throughout the text, because of the extra splitting around them. As far as I can see, the only way to fix this is to reimplement that function in C++ as well.
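To make clear what the training step computes, here is a naive reference implementation of the merge loop. It is deliberately simple and quadratic; the package's C++ core implements the linear-time Re-Pair variant instead, so treat this only as an illustration:
```python
from collections import Counter

def naive_bpe_train(chunks, num_merges):
    """Naive reference: repeatedly merge the most frequent adjacent pair.
    chunks: list of strings that have already been regex-split."""
    ids = [list(chunk.encode("utf-8")) for chunk in chunks]
    merges = {}  # (left_id, right_id) -> new token id
    for step in range(num_merges):
        # count adjacent pairs across all chunks
        counts = Counter()
        for seq in ids:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        pair = max(counts, key=counts.get)
        new_id = 256 + step
        merges[pair] = new_id
        # replace every occurrence of the pair with the new id
        for i, seq in enumerate(ids):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == pair:
                    out.append(new_id)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            ids[i] = out
    return merges

print(naive_bpe_train(["hello", " hello", " world"], 3))
```
The naive version rescans all sequences after every merge; the linear-time approach instead keeps track of where each pair occurs and only updates the counts of the pairs adjacent to a merge.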
# Operating systems
Right now only Linux and Windows are supported. Other operating systems might work, but I haven't tested them.
# Quickstart
You can use this repo in the same way as the MinBPE repo. Make sure to use `RegexTokenizerFast` and `encode_ordinary` (the `encode` function is sometimes not as fast, but still faster than the Python version).
Here is production-ready code that trains a tokenizer on ~50 MB of web text (make sure to `pip install datasets` first):
```python
from datasets import load_dataset
from QuickBPE import RegexTokenizerFast
print("loading dataset...")
output_file = "openwebtext.txt"
dataset = load_dataset("stas/openwebtext-10k")
current_size = 0
s = "".join([st["text"] for st in dataset["train"]])
# gpt-4 regex splitting:
regex_pattern = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
print("training tokenizer (takes ~2 Minutes)...")
tokenizer = RegexTokenizerFast(regex_pattern)
k = 10000 # steps of BPE
_, _, time_taken = tokenizer.train(s, 256 + k)
print("time taken:", time_taken)
tokenizer.save("webtokenizer")
sample_text = "This is a sample text"
encoded = tokenizer.encode_ordinary(sample_text)
decoded = tokenizer.decode(encoded)
print("encoded:", encoded)
print("decoded:", decoded)
print(f"Compression ratio: {len(encoded) / len(decoded)}")
```
The tokenizer can be loaded as follows:
```python
from QuickBPE import RegexTokenizerFast
tokenizer = RegexTokenizerFast()
tokenizer.load("webtokenizer.model") # loads the model back from disk
sample_text = "This is a sample text!"
encoded = tokenizer.encode(sample_text, allowed_special="all")
tokens = [tokenizer.decode([tid]) for tid in encoded] # Decode each token
print("tokenization: ", "|".join(tokens))
decoded = tokenizer.decode(encoded)
print("encoded:", encoded)
print("decoded:", decoded)
print(f"Compression ratio: {len(encoded) / len(decoded)}")
```
You can register special tokens in the following way:
```python
from QuickBPE import RegexTokenizerFast
tokenizer = RegexTokenizerFast()
s = "hello world"
tokenizer.train(s, 256 + 3)
tokenizer.register_special_tokens({"<|endoftext|>": 256 + 3}) # register a new special token with id 259
res = tokenizer.encode("<|endoftext|>hello world", allowed_special="all")
print(res) # the first token of the result should be 259
```
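To get a feel for the `encode` vs `encode_ordinary` gap mentioned above, you can time both on text that contains many special tokens. This is only a rough benchmarking sketch: it assumes a tokenizer trained and saved as in the Quickstart (hence the `webtokenizer.model` file and the vocabulary size of 256 + 10000), and the exact numbers depend on your machine:
```python
import time
from QuickBPE import RegexTokenizerFast

tokenizer = RegexTokenizerFast()
tokenizer.load("webtokenizer.model")  # trained and saved as in the Quickstart above
tokenizer.register_special_tokens({"<|endoftext|>": 256 + 10000})  # first free id after the trained vocab

# text with special tokens sprinkled throughout, plus a plain version of similar length
plain = "This is a sample sentence. " * 10_000
with_specials = "This is a sample sentence. <|endoftext|>" * 10_000

t0 = time.perf_counter()
tokenizer.encode_ordinary(plain)
t1 = time.perf_counter()
tokenizer.encode(with_specials, allowed_special="all")
t2 = time.perf_counter()

print(f"encode_ordinary:              {t1 - t0:.3f} s")
print(f"encode (many special tokens): {t2 - t1:.3f} s")
```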
## License
MIT