autotiktokenizer

Name: autotiktokenizer
Version: 0.2.1
Summary: 🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨
Author: Bhavnick Minhas
Upload time: 2024-11-11 21:16:02
Requires Python: >=3.8
License: MIT
Keywords: tokenizer, transformers, huggingface, tiktoken
<div align="center">
  
![AutoTikTokenizer Logo](./assets/AutoTikTokenizer%20Logo.png)

# AutoTikTokenizer

[![PyPI version](https://img.shields.io/pypi/v/autotiktokenizer.svg)](https://pypi.org/project/autotiktokenizer/)
[![Downloads](https://static.pepy.tech/badge/autotiktokenizer)](https://pepy.tech/project/autotiktokenizer)
![Package size](https://img.shields.io/badge/size-9.7MB-blue)
[![License](https://img.shields.io/github/license/bhavnicksm/autotiktokenizer)](https://github.com/bhavnicksm/autotiktokenizer/blob/main/LICENSE)
[![Documentation](https://img.shields.io/badge/docs-available-brightgreen.svg)](https://github.com/bhavnicksm/autotiktokenizer#readme)
[![Last Commit](https://img.shields.io/github/last-commit/bhavnicksm/autotiktokenizer)](https://github.com/bhavnicksm/autotiktokenizer/commits/main)
[![GitHub Stars](https://img.shields.io/github/stars/bhavnicksm/autotiktokenizer?style=social)](https://github.com/bhavnicksm/autotiktokenizer/stargazers)

🚀 Accelerate your HuggingFace tokenizers by converting them to TikToken format with AutoTikTokenizer - get TikToken's speed while keeping HuggingFace's flexibility.

[Features](#key-features) •
[Installation](#installation) •
[Examples](#examples) •
[Supported Models](#supported-models) •
[Benchmarks](#benchmarks) •
[Sharp Bits](#sharp-bits) •
[Citation](#citation)

</div>

# Key Features

- 🚀 **High Performance** - Built on TikToken's efficient tokenization engine
- 🔄 **HuggingFace Compatible** - Seamless integration with the HuggingFace ecosystem
- 📦 **Lightweight** - Minimal dependencies, just TikToken and Huggingface-hub
- 🎯 **Easy to Use** - Simple, intuitive API that works out of the box
- 💻 **Well Tested** - Comprehensive test suite across supported models

# Installation

Install `autotiktokenizer` from PyPI via the following command:

```bash
pip install autotiktokenizer
```

You can also install it from _source_ with the following command:

```bash
pip install git+https://github.com/bhavnicksm/autotiktokenizer
```

# Examples

This section provides a basic usage example of the project. Follow these simple steps to get started quickly.

```python
# step 1: Import the library
from autotiktokenizer import AutoTikTokenizer

# step 2: Load the tokenizer
tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# step 3: Enjoy the inference speed 🏎️
text = "Wow! I never thought I'd be able to use Llama on TikToken"
encodings = tokenizer.encode(text)

# (Optional) step 4: Decode the outputs
text = tokenizer.decode(encodings)
```
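
The object returned by `from_pretrained` behaves like a standard `tiktoken.Encoding` (an assumption worth verifying on your version), so TikToken's multi-threaded batch encoding comes along for free. A minimal sketch:

```python
from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "TikToken-style encoders shine on large batches.",
]

# encode_batch is part of tiktoken's Encoding API; num_threads controls
# how many threads TikToken spreads the batch across.
batch_encodings = tokenizer.encode_batch(texts, num_threads=8)

for ids in batch_encodings:
    print(len(ids), tokenizer.decode(ids))
```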

# Supported Models

AutoTikTokenizer should ideally support ALL models on the HF Hub, but given the vast diversity of models out there, we _cannot_ test every single one. The models below have already been validated, and AutoTikTokenizer is known to work well for them. If there is a model you wish to see here, raise an issue and we will validate it and add it to the list. Thanks :)

- [x] GPT2
- [x] GPT-J Family
- [x] SmolLM Family: SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B, etc.
- [x] Llama 3 Family: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, etc.
- [x] DeepSeek Family: DeepSeek-V2.5, etc.
- [x] Gemma 2 Family: Gemma-2-2b-it, Gemma-2-9b-it, etc.
- [x] Mistral Family: Mistral-7B-Instruct-v0.3, etc.
- [x] Aya Family: Aya-23, Aya Expanse, etc.
- [x] BERT Family: BERT, RoBERTa, MiniLM, TinyBERT, DeBERTa etc.

**NOTE:** Some models use _unigram_ tokenizers, which TikToken does not support, so 🧰 AutoTikTokenizer cannot convert the tokenizers for those models. Models that use _unigram_ tokenizers include T5, ALBERT, Marian, and XLNet.
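
If you are unsure whether a given model is convertible, you can inspect its tokenizer type up front. A small sketch, assuming the repo ships a standard fast-tokenizer `tokenizer.json` (true for most modern checkpoints):

```python
import json

from huggingface_hub import hf_hub_download

# "model.type" in tokenizer.json is "BPE", "Unigram", "WordPiece", or
# "WordLevel"; unigram tokenizers cannot be converted to TikToken.
path = hf_hub_download("meta-llama/Llama-3.2-3B-Instruct", "tokenizer.json")
with open(path) as f:
    model_type = json.load(f)["model"]["type"]

if model_type == "Unigram":
    print("Unigram tokenizer -- TikToken conversion will not work.")
else:
    print(f"{model_type} tokenizer -- worth trying AutoTikTokenizer.")
```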

# Benchmarks

Benchmarking results for tokenizing **1 billion tokens** from the fineweb-edu dataset with the **Llama 3.2 tokenizer** on CPU (Google Colab):

| Configuration | Processing Type | AutoTikTokenizer | HuggingFace | Speed Ratio | 
|--------------|-----------------|------------------|--------------|-------------|
| Single Thread | Sequential | **14:58** (898s) | 40:43 (2443s) | 2.72x faster |
| Batch x1 | Batched | 15:58 (958s) | **10:30** (630s) | 0.66x (slower) |
| Batch x4 | Batched | **8:00** (480s) | 10:30 (630s) | 1.31x faster |
| Batch x8 | Batched | **6:32** (392s) | 10:30 (630s) | 1.62x faster |
| 4 Processes | Parallel | **2:34** (154s) | 8:59 (539s) | 3.50x faster |

The table above shows that AutoTikTokenizer's converted tokenizer (TikToken) is 1.3-3.5x faster than HuggingFace's tokenizer in every configuration except single-item batches, under a fair comparison! While it does not make the most optimal use of TikToken (yet), it is still much faster than the stock solutions you might otherwise be using.
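
Exact numbers will vary with hardware and data, but the single-thread comparison is easy to reproduce. A rough sketch, assuming `transformers` is installed and substituting a small in-memory sample for the full 1B-token fineweb-edu run:

```python
import time

from autotiktokenizer import AutoTikTokenizer
from transformers import AutoTokenizer

MODEL = "meta-llama/Llama-3.2-3B-Instruct"
docs = ["A stand-in document for a fineweb-edu sample. " * 50] * 5_000

tt_tokenizer = AutoTikTokenizer.from_pretrained(MODEL)
hf_tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Time sequential encoding with the TikToken-converted tokenizer.
start = time.perf_counter()
for doc in docs:
    tt_tokenizer.encode(doc)
tt_time = time.perf_counter() - start

# add_special_tokens=False keeps the comparison to raw tokenization.
start = time.perf_counter()
for doc in docs:
    hf_tokenizer.encode(doc, add_special_tokens=False)
hf_time = time.perf_counter() - start

print(f"TikToken: {tt_time:.1f}s | HuggingFace: {hf_time:.1f}s | "
      f"ratio: {hf_time / tt_time:.2f}x")
```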

# Sharp Bits

A known limitation of the repository is that it does not perform any pre-processing or post-processing. If a tokenizer (like `minilm`) expects lower-case input only, you need to lower-case the text yourself. Similarly, any spaces added during encoding are not removed during decoding, so you have to handle them on your own.
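
For instance, a sketch of handling the missing normalisation by hand (the MiniLM checkpoint name here is illustrative):

```python
from autotiktokenizer import AutoTikTokenizer

# An uncased (lower-case-only) tokenizer; the exact checkpoint is illustrative.
tokenizer = AutoTikTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# No pre-processing is applied for you, so lower-case the input yourself.
text = "Hello World! TikToken Meets MiniLM."
ids = tokenizer.encode(text.lower())

# No post-processing either: decoding may keep artefacts (e.g. extra spaces)
# that the original tokenizer's decoder would have cleaned up.
roundtrip = tokenizer.decode(ids).strip()
print(roundtrip)
```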

There may be more sharp bits in the repository that are unknown at the moment; please raise an issue if you encounter any!

# Acknowledgement

Special thanks to HuggingFace and OpenAI for the open-source libraries that make this work possible. I hope they continue to support the developer ecosystem for LLMs in the future!

**If you found this repository useful, give it a ⭐️! Thank You :)**

# Citation

If you use `autotiktokenizer` in your research, please cite it as follows:

```bibtex
@misc{autotiktokenizer,
    author = {Bhavnick Minhas},
    title = {AutoTikTokenizer},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/bhavnicksm/autotiktokenizer}},
}
```

            
