<h1 align="center">Toke(n)icer</h1>
<p align="center">A (nicer) tokenizer you want to use for model inference and training: with all known preventable gotchas normalized or auto-fixed.</p>
<p align="center">
<a href="https://github.com/ModelCloud/Tokenicer/releases" style="text-decoration:none;"><img alt="GitHub release" src="https://img.shields.io/github/release/ModelCloud/Tokenicer.svg"></a>
<a href="https://pypi.org/project/tokenicer/" style="text-decoration:none;"><img alt="PyPI - Version" src="https://img.shields.io/pypi/v/tokenicer"></a>
<a href="https://pepy.tech/projects/tokenicer" style="text-decoration:none;"><img src="https://static.pepy.tech/badge/tokenicer" alt="PyPI Downloads"></a>
<a href="https://github.com/ModelCloud/tokenicer/blob/main/LICENSE"><img src="https://img.shields.io/pypi/l/tokenicer"></a>
<a href="https://huggingface.co/modelcloud/"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-ModelCloud-%23ff8811.svg"></a>
</p>
## News
* 02/21/2025 [0.0.4](https://github.com/ModelCloud/Tokenicer/releases/tag/v0.0.4): A `Tokenicer` instance now dynamically inherits the native `tokenizer.__class__` of the tokenizer that is passed in or loaded via the `Tokenicer.load()` API.
* 02/10/2025 [0.0.2](https://github.com/ModelCloud/Tokenicer/releases/tag/v0.0.2): 🤗 Initial release!
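
The dynamic inheritance mentioned in the 0.0.4 note can be pictured with plain Python. The sketch below is only an illustration of the general technique (building a subclass at runtime and re-pointing `__class__`); it is not taken from the Tokenicer source, and `NativeTokenizer`, `NicerMixin`, and `wrap` are hypothetical names.

```python
class NativeTokenizer:
    """Stand-in for a real tokenizer class such as `Qwen2TokenizerFast`."""

    def __init__(self, pad_token=None):
        self.pad_token = pad_token

    def tokenize(self, text):
        return text.split()


class NicerMixin:
    """Hypothetical mixin that carries the extra (nicer) behavior."""

    def ensure_pad_token(self, fallback="<pad>"):
        # Only fill in a pad token when the native tokenizer has none.
        if self.pad_token is None:
            self.pad_token = fallback
        return self.pad_token


def wrap(tokenizer):
    """Give an existing instance a runtime subclass of its own class."""
    native_cls = type(tokenizer)
    # Create `Nicer<NativeName>` that inherits from both the mixin and the native class.
    wrapped_cls = type(f"Nicer{native_cls.__name__}", (NicerMixin, native_cls), {})
    # Re-point the instance at the new class; its existing state is kept.
    tokenizer.__class__ = wrapped_cls
    return tokenizer


tok = wrap(NativeTokenizer())
print(type(tok).__name__)                 # NicerNativeTokenizer
print(isinstance(tok, NativeTokenizer))   # True
print(tok.ensure_pad_token())             # <pad>
```

Because the wrapped object is still an instance of its native class, code that checks `isinstance(tokenizer, NativeTokenizer)` keeps working while the extra methods become available.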
## Features:
* Compatible with all HF `Transformers` recognized tokenizers
* Auto-fix models that do not set a `pad_token`
* Auto-fix models released with the wrong `pad_token`: many models incorrectly use `eos_token` as `pad_token`, which leads to subtle, hidden errors in post-training and inference whenever batching is used, which is almost always.
* No external dependencies other than `Transformers`
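
To make the `eos_token`-as-`pad_token` problem concrete, here is a minimal pure-Python sketch of one common failure mode: a training pipeline that masks padding in the labels by comparing against the pad id. The token ids and helper names are hypothetical, and `-100` is used as the ignore value only because it is the usual convention for loss masking.

```python
IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


def mask_padding(batch, pad_id):
    """Build labels by replacing every position equal to pad_id with IGNORE_INDEX."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in row] for row in batch]


EOS_ID = 2            # hypothetical end-of-sequence id
DEDICATED_PAD_ID = 0  # hypothetical dedicated padding id

sequences = [[11, 12, 13, EOS_ID], [21, EOS_ID]]

# Case 1: pad token is the EOS token. Every real EOS is masked out too,
# so the model never gets a training signal for when to stop.
labels_shared = mask_padding(pad_batch(sequences, EOS_ID), EOS_ID)

# Case 2: dedicated pad token. Only the padding is masked; the real EOS survives.
labels_dedicated = mask_padding(pad_batch(sequences, DEDICATED_PAD_ID), DEDICATED_PAD_ID)

print(labels_shared)     # [[11, 12, 13, -100], [21, -100, -100, -100]]
print(labels_dedicated)  # [[11, 12, 13, 2], [21, 2, -100, -100]]
```

In the first case the end-of-sequence labels disappear along with the padding, which is the kind of silent error a distinct `pad_token` avoids.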
## Upcoming Features:
* Add automatic tokenizer validation to model training and subsequent inference, so that not only the tokenizer config but also the actual `encode`/`decode` behavior is fully re-validated on model load. Inference and training engines often modify the original tokenizer, which causes subtle inaccuracies in the output when inference runs on a platform that is separate from the trainer.
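
Since this feature is not released yet, the following is only a rough sketch of what such a check could look like, not the planned implementation: encode a set of probe strings with the trainer-side and inference-side tokenizers and report any difference. The toy vocabulary, both `encode` functions, and `find_encode_mismatches` are hypothetical.

```python
def find_encode_mismatches(reference_encode, candidate_encode, probes):
    """Return (text, expected_ids, actual_ids) for every probe that encodes differently."""
    mismatches = []
    for text in probes:
        expected = reference_encode(text)
        actual = candidate_encode(text)
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches


# Toy whitespace tokenizers standing in for the trainer and the inference engine.
VOCAB = {"hello": 1, "world": 2, "Hello": 3}
UNK_ID = 0


def trainer_encode(text):
    return [VOCAB.get(word, UNK_ID) for word in text.split()]


def inference_encode(text):
    # This engine silently lowercases its input, an example of an unnoticed modification.
    return [VOCAB.get(word, UNK_ID) for word in text.lower().split()]


probes = ["hello world", "Hello world"]
mismatches = find_encode_mismatches(trainer_encode, inference_encode, probes)
print(mismatches)  # [('Hello world', [3, 2], [1, 2])]
```

A real check would pass the `encode` methods of the two actual tokenizers and a broader probe set, and would also compare `decode` output in the same way.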
## Install
### PIP/UV
```bash
pip install -v tokenicer
uv pip install -v tokenicer
```
### Install from source
```bash
# clone repo
git clone https://github.com/ModelCloud/Tokenicer.git && cd Tokenicer
# install from the local checkout
pip install -v .
```
## Usage
* Replace all calls to `AutoTokenizer.from_pretrained()` with `Tokenicer.load()`: args are 100% compatible with `AutoTokenizer`
```py
# Replace `AutoTokenizer.from_pretrained()`
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')
# With `Tokenicer.load()`
from tokenicer import Tokenicer
# Returns a `Tokenicer` instance that inherits the original `Qwen2TokenizerFast` type.
tokenizer = Tokenicer.load('Qwen/Qwen2.5-0.5B-Instruct')
# That's it! Toke(n)icer has auto-fixed Qwen2.5-0.5B-Instruct's incorrect `pad_token`.
# This model can now be trained and run for inference correctly with batching and attention masks.
# Use the new tokenizer like any normal HF PreTrainedTokenizer(Fast).
print(f"pad_token: `{tokenizer.pad_token}`")
```
## Citation
```
@misc{gptqmodel,
author = {ModelCloud.ai and qubitium@modelcloud.ai},
title = {Toke(n)icer},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/tokenicer}},
note = {Contact: qubitium@modelcloud.ai}
}
```