| Field | Value |
| --- | --- |
| Name | kitoken |
| Version | 0.10.1 |
| home_page | None |
| Summary | Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization |
| upload_time | 2024-12-20 03:07:14 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.10 |
| license | BSD-2-Clause |
| keywords | tokenizer, nlp, bpe, unigram, wordpiece |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# kitoken
**Tokenizer for language models.**
```py
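from kitoken import Kitoken

# Load a tokenizer definition from a model file.
encoder = Kitoken.from_file("models/llama3.3.model")

tokens = encoder.encode("hello world!", True)
# decode() yields bytes, hence the explicit UTF-8 decoding into a str.
string = encoder.decode(tokens).decode("utf-8")
assert string == "hello world!"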
```
## Features
- **Fast encoding and decoding**\
Faster than most other tokenizers in both common and uncommon scenarios.
- **Support for a wide variety of tokenizer formats and tokenization strategies**\
Including support for Tokenizers, SentencePiece, Tiktoken and more.
- **Compatible with many systems and platforms**\
Runs on Windows, Linux, macOS and embedded platforms, and comes with bindings for Web, Node and Python.
- **Compact data format**\
Definitions are stored in an efficient binary format, without a merge list.
- **Support for normalization and pre-tokenization**\
Including Unicode normalization, whitespace normalization, and many others.
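
To make the last feature concrete, here is a minimal sketch of what Unicode and whitespace normalization typically mean before tokenization. It uses only the Python standard library; the `normalize` helper is a hypothetical name for illustration and is not part of the kitoken API, and the choice of NFC is an assumption rather than a statement about what any particular tokenizer definition applies.

```py
import re
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical helper: NFC-normalize, then collapse whitespace runs."""
    # NFC composes a base letter plus combining mark into one code point where possible.
    composed = unicodedata.normalize("NFC", text)
    # Collapse any run of whitespace into a single space and trim both ends.
    return re.sub(r"\s+", " ", composed).strip()


# "e" followed by U+0301 (combining acute accent) composes to U+00E9.
assert normalize("cafe\u0301   au\tlait ") == "caf\u00e9 au lait"
```

Normalizing first means that visually identical inputs map to the same code points, so they also map to the same tokens.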
## Overview
Kitoken is a fast and versatile tokenizer for language models with support for multiple tokenization algorithms:
- **BytePair**: A variation of the BPE algorithm, merging byte or character pairs.
- **Unigram**: The Unigram subword algorithm.
- **WordPiece**: The WordPiece subword algorithm.
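
As an illustration of the pair-merging idea behind BytePair, the following is a minimal textbook-style sketch of a single BPE merge step over bytes. It is not Kitoken's implementation (which, per the feature list, does not store a merge list); `most_frequent_pair` and `merge_pair` are hypothetical helper names chosen for this example.

```py
from __future__ import annotations

from collections import Counter


def most_frequent_pair(tokens: list[bytes]) -> tuple[bytes, bytes] | None:
    """Return the most frequent adjacent pair, or None for fewer than two tokens."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return max(pairs, key=pairs.get)


def merge_pair(tokens: list[bytes], pair: tuple[bytes, bytes]) -> list[bytes]:
    """Replace every non-overlapping occurrence of `pair` with its concatenation."""
    merged: list[bytes] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Start from single bytes; (a, b) occurs three times and is merged first.
tokens = [bytes([b]) for b in b"abababc"]
pair = most_frequent_pair(tokens)
assert pair == (b"a", b"b")
assert merge_pair(tokens, pair) == [b"ab", b"ab", b"ab", b"c"]
```

Repeating this step until no pair repeats (or a vocabulary size is reached) yields the merged vocabulary that a BPE encoder works with.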
Kitoken is compatible with many existing tokenizers, including [SentencePiece](https://github.com/google/sentencepiece), [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers), [OpenAI Tiktoken](https://github.com/openai/tiktoken) and [Mistral Tekken](https://docs.mistral.ai/guides/tokenization).
See the main [README](https://github.com/Systemcluster/kitoken) for more information.
Raw data
{
"_id": null,
"home_page": null,
"name": "kitoken",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "tokenizer, nlp, bpe, unigram, wordpiece",
"author": null,
"author_email": "Christian Sdunek <me@systemcluster.me>",
"download_url": "https://files.pythonhosted.org/packages/e6/6c/807e82084c691b0a6b5a659a49761709c2a0adfe1cb5bf77708eed158c70/kitoken-0.10.1.tar.gz",
"platform": null,
"description": "# kitoken\n\n**Tokenizer for language models.**\n\n```py\nfrom kitoken import Kitoken\n\nencoder = Kitoken.from_file(\"models/llama3.3.model\")\n\ntokens = encoder.encode(\"hello world!\", True)\nstring = encoder.decode(tokens).decode(\"utf-8\")\n\nassert string == \"hello world!\"\n```\n\n## Features\n\n- **Fast encoding and decoding**\\\n Faster than most other tokenizers in both common and uncommon scenarios.\n- **Support for a wide variety of tokenizer formats and tokenization strategies**\\\n Including support for Tokenizers, SentencePiece, Tiktoken and more.\n- **Compatible with many systems and platforms**\\\n Runs on Windows, Linux, macOS and embedded, and comes with bindings for Web, Node and Python.\n- **Compact data format**\\\n Definitions are stored in an efficient binary format and without merge list.\n- **Support for normalization and pre-tokenization**\\\n Including unicode normalization, whitespace normalization, and many others.\n\n## Overview\n\nKitoken is a fast and versatile tokenizer for language models with support for multiple tokenization algorithms:\n\n- **BytePair**: A variation of the BPE algorithm, merging byte or character pairs.\n- **Unigram**: The Unigram subword algorithm.\n- **WordPiece**: The WordPiece subword algorithm.\n\nKitoken is compatible with many existing tokenizers,\nincluding [SentencePiece](https://github.com/google/sentencepiece), [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers), [OpenAI Tiktoken](https://github.com/openai/tiktoken) and [Mistral Tekken](https://docs.mistral.ai/guides/tokenization).\n\nSee the main [README](//github.com/Systemcluster/kitoken) for more information.\n\n",
"bugtrack_url": null,
"license": "BSD-2-Clause",
"summary": "Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization",
"version": "0.10.1",
"project_urls": {
"Homepage": "https://kitoken.dev",
"Repository": "https://github.com/Systemcluster/kitoken"
},
"split_keywords": [
"tokenizer",
" nlp",
" bpe",
" unigram",
" wordpiece"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "46ef9d4447541f23a71c49c65c83ebffe034e468d2c431b359934b2685d35ab9",
"md5": "8cf043402d474dde8e2084c353d5d12b",
"sha256": "e8535845d96746951c105a3ce1ddba0f7a2a70c39aa4405261dc321d06c953c3"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-macosx_10_12_x86_64.whl",
"has_sig": false,
"md5_digest": "8cf043402d474dde8e2084c353d5d12b",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1381993,
"upload_time": "2024-12-20T03:07:04",
"upload_time_iso_8601": "2024-12-20T03:07:04.920928Z",
"url": "https://files.pythonhosted.org/packages/46/ef/9d4447541f23a71c49c65c83ebffe034e468d2c431b359934b2685d35ab9/kitoken-0.10.1-cp310-abi3-macosx_10_12_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "7a9bda0915143d3b45ee8331e20dd00b9924e1413e3bcca731b3a2953fdf1eae",
"md5": "e1c3705a6632fcf3f42287a761a90b43",
"sha256": "ead6759319e6d799eaa39b2d4a88042e55f8068e256772be9a05c99a0772a896"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "e1c3705a6632fcf3f42287a761a90b43",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1251260,
"upload_time": "2024-12-20T03:07:00",
"upload_time_iso_8601": "2024-12-20T03:07:00.414606Z",
"url": "https://files.pythonhosted.org/packages/7a/9b/da0915143d3b45ee8331e20dd00b9924e1413e3bcca731b3a2953fdf1eae/kitoken-0.10.1-cp310-abi3-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "d26dadb3a6b2b363033d27bf7560d63e62efcc0c3589a9ee91ed341290e1be47",
"md5": "8602229357137c4154936cb977b7522b",
"sha256": "7a6a1d7b9220ae09d6022f5b408f55cb5e2fe2fa7984cb993461a7db1cce3d70"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-manylinux_2_28_aarch64.whl",
"has_sig": false,
"md5_digest": "8602229357137c4154936cb977b7522b",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1331069,
"upload_time": "2024-12-20T03:06:49",
"upload_time_iso_8601": "2024-12-20T03:06:49.980416Z",
"url": "https://files.pythonhosted.org/packages/d2/6d/adb3a6b2b363033d27bf7560d63e62efcc0c3589a9ee91ed341290e1be47/kitoken-0.10.1-cp310-abi3-manylinux_2_28_aarch64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "7fafd23e0b03a836c8b7a61d029a27aa98a09b28abf20b44d3a9dc55af4f6322",
"md5": "7e1593473521991d705c917bdf2bc37c",
"sha256": "617201015aec6d3d76bf7cd26a8b6de997e59ca1d6a0b5fd24578e406a36828f"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-manylinux_2_28_armv7l.whl",
"has_sig": false,
"md5_digest": "7e1593473521991d705c917bdf2bc37c",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1332397,
"upload_time": "2024-12-20T03:06:52",
"upload_time_iso_8601": "2024-12-20T03:06:52.817329Z",
"url": "https://files.pythonhosted.org/packages/7f/af/d23e0b03a836c8b7a61d029a27aa98a09b28abf20b44d3a9dc55af4f6322/kitoken-0.10.1-cp310-abi3-manylinux_2_28_armv7l.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "1f10628393bd6ba59a4959c8df0fafb5d5ba3d0a90163018c7398e6e89225acd",
"md5": "cf4670a72e3fc835fe6ae9d7f57276fe",
"sha256": "3dbda305b095e1b896a1145fad1a6739cfd83c974a8eebbfa1fe4484cd4ae412"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-manylinux_2_28_i686.whl",
"has_sig": false,
"md5_digest": "cf4670a72e3fc835fe6ae9d7f57276fe",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1490170,
"upload_time": "2024-12-20T03:06:57",
"upload_time_iso_8601": "2024-12-20T03:06:57.221651Z",
"url": "https://files.pythonhosted.org/packages/1f/10/628393bd6ba59a4959c8df0fafb5d5ba3d0a90163018c7398e6e89225acd/kitoken-0.10.1-cp310-abi3-manylinux_2_28_i686.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b4bbc5c0092b219f5750f9ed5c2c27640c4765794b211304c21f03937ef84da3",
"md5": "34d87a92bc4f766b33b9b019affd4806",
"sha256": "82346b7aedbf6b976cd2ee6eaca8777136ed2831e333b90800724086ebaf56b8"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-manylinux_2_28_ppc64le.whl",
"has_sig": false,
"md5_digest": "34d87a92bc4f766b33b9b019affd4806",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1492945,
"upload_time": "2024-12-20T03:06:54",
"upload_time_iso_8601": "2024-12-20T03:06:54.488716Z",
"url": "https://files.pythonhosted.org/packages/b4/bb/c5c0092b219f5750f9ed5c2c27640c4765794b211304c21f03937ef84da3/kitoken-0.10.1-cp310-abi3-manylinux_2_28_ppc64le.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "343e15a79166cf69cad15696404696b70e878aa45e21ab87b1efa91a29321452",
"md5": "3197042919ebe4ef3d1ee4e4e9d42a5b",
"sha256": "8c781e0a07ae7fe9a4bb633bda705fcb323454f694dbee12d08bca898d4ae3bb"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-manylinux_2_28_x86_64.whl",
"has_sig": false,
"md5_digest": "3197042919ebe4ef3d1ee4e4e9d42a5b",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1464504,
"upload_time": "2024-12-20T03:06:58",
"upload_time_iso_8601": "2024-12-20T03:06:58.898027Z",
"url": "https://files.pythonhosted.org/packages/34/3e/15a79166cf69cad15696404696b70e878aa45e21ab87b1efa91a29321452/kitoken-0.10.1-cp310-abi3-manylinux_2_28_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "93695c7a83a2da768836c0a37e254d116f859f6ac2ceab344a1902021ee8de78",
"md5": "9e0f428c9a624f0c0e3dd5f7e449a4ae",
"sha256": "1d43d595ca405a9ea85ee532c09bd615f7745fff762d61eadbe3c249da52e05e"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-musllinux_1_2_aarch64.whl",
"has_sig": false,
"md5_digest": "9e0f428c9a624f0c0e3dd5f7e449a4ae",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1328149,
"upload_time": "2024-12-20T03:07:06",
"upload_time_iso_8601": "2024-12-20T03:07:06.456032Z",
"url": "https://files.pythonhosted.org/packages/93/69/5c7a83a2da768836c0a37e254d116f859f6ac2ceab344a1902021ee8de78/kitoken-0.10.1-cp310-abi3-musllinux_1_2_aarch64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "8d25992213aefcba4f289e5f00d0dd6c2234bf5a7b5096ad3a5e8759ef3c22e3",
"md5": "8771e9caf7277ec60ca1f09ca46b81d1",
"sha256": "e98044200b329ed9e1a59dcb031e00d013b838786331c4fbed1916828a121b1b"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-musllinux_1_2_armv7l.whl",
"has_sig": false,
"md5_digest": "8771e9caf7277ec60ca1f09ca46b81d1",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1330461,
"upload_time": "2024-12-20T03:07:07",
"upload_time_iso_8601": "2024-12-20T03:07:07.863017Z",
"url": "https://files.pythonhosted.org/packages/8d/25/992213aefcba4f289e5f00d0dd6c2234bf5a7b5096ad3a5e8759ef3c22e3/kitoken-0.10.1-cp310-abi3-musllinux_1_2_armv7l.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "466322ba3ee485dfe542d5bdc20792e5ec2fba4c48f866d64ab3243b1c7bad5e",
"md5": "1876fe65b374bd8caf335c0cb38a2c13",
"sha256": "84fde9161ede36fcf000178eaaa0c501e6a28fe9be8d9a489de9533b7035ea34"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-musllinux_1_2_i686.whl",
"has_sig": false,
"md5_digest": "1876fe65b374bd8caf335c0cb38a2c13",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1449885,
"upload_time": "2024-12-20T03:07:09",
"upload_time_iso_8601": "2024-12-20T03:07:09.402055Z",
"url": "https://files.pythonhosted.org/packages/46/63/22ba3ee485dfe542d5bdc20792e5ec2fba4c48f866d64ab3243b1c7bad5e/kitoken-0.10.1-cp310-abi3-musllinux_1_2_i686.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ce3a13f22496bad7f4ec4a3cb8333bc211b031386716697356374a31998f83e1",
"md5": "bd8e910d6a1a4c31f7f6a38c5b88b33d",
"sha256": "e9474e6e3fbd1c71a6d76f21433247c27e6c2d92400bb10708766cb6eaea8171"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-musllinux_1_2_x86_64.whl",
"has_sig": false,
"md5_digest": "bd8e910d6a1a4c31f7f6a38c5b88b33d",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1461828,
"upload_time": "2024-12-20T03:07:12",
"upload_time_iso_8601": "2024-12-20T03:07:12.099180Z",
"url": "https://files.pythonhosted.org/packages/ce/3a/13f22496bad7f4ec4a3cb8333bc211b031386716697356374a31998f83e1/kitoken-0.10.1-cp310-abi3-musllinux_1_2_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "cf2b6ac41cfcd2d3348dce3c941b09828d6cc04ede398c0a62882ad50cae9d04",
"md5": "bc440b33f8cc8ef2129bd14daf3306e1",
"sha256": "8c2243783ef68a5422daaadea7c364f3b8e26e922fab84f5c87fa245d2949293"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-win32.whl",
"has_sig": false,
"md5_digest": "bc440b33f8cc8ef2129bd14daf3306e1",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1250855,
"upload_time": "2024-12-20T03:07:18",
"upload_time_iso_8601": "2024-12-20T03:07:18.746134Z",
"url": "https://files.pythonhosted.org/packages/cf/2b/6ac41cfcd2d3348dce3c941b09828d6cc04ede398c0a62882ad50cae9d04/kitoken-0.10.1-cp310-abi3-win32.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "1da4da01d3e81fad306f20c68a3e1bce4e905df67ee16c29aee4cdec9accf55d",
"md5": "ab685d6855d0ef4dd14e001a99ad460d",
"sha256": "0f255791d9bc73228683df853693362d6100cc61986dc01c1dd299ec9e9c6eb8"
},
"downloads": -1,
"filename": "kitoken-0.10.1-cp310-abi3-win_amd64.whl",
"has_sig": false,
"md5_digest": "ab685d6855d0ef4dd14e001a99ad460d",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 1273075,
"upload_time": "2024-12-20T03:07:15",
"upload_time_iso_8601": "2024-12-20T03:07:15.721505Z",
"url": "https://files.pythonhosted.org/packages/1d/a4/da01d3e81fad306f20c68a3e1bce4e905df67ee16c29aee4cdec9accf55d/kitoken-0.10.1-cp310-abi3-win_amd64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e66c807e82084c691b0a6b5a659a49761709c2a0adfe1cb5bf77708eed158c70",
"md5": "e9e66323d9cf407d970823aac2bf5075",
"sha256": "ecca8a68a63e11e048f8f8a6e3dabe32e914466d9f59d26ce664681c9c1f0cc5"
},
"downloads": -1,
"filename": "kitoken-0.10.1.tar.gz",
"has_sig": false,
"md5_digest": "e9e66323d9cf407d970823aac2bf5075",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 59995,
"upload_time": "2024-12-20T03:07:14",
"upload_time_iso_8601": "2024-12-20T03:07:14.686210Z",
"url": "https://files.pythonhosted.org/packages/e6/6c/807e82084c691b0a6b5a659a49761709c2a0adfe1cb5bf77708eed158c70/kitoken-0.10.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-20 03:07:14",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Systemcluster",
"github_project": "kitoken",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "kitoken"
}
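
The `digests` entries in the raw data above can be used to check a downloaded file. The following is a minimal sketch using only the standard library; `sha256_matches` is a hypothetical helper name, and the local path `kitoken-0.10.1.tar.gz` is an assumption about where the sdist was saved. The expected digest is copied from the `sha256` field of the sdist entry.

```py
import hashlib
from pathlib import Path


def sha256_matches(path: Path, expected_hex: str) -> bool:
    """Hash the file in 1 MiB chunks and compare against the expected hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


# Copied from the "sha256" field of the kitoken-0.10.1.tar.gz entry above.
SDIST_SHA256 = "ecca8a68a63e11e048f8f8a6e3dabe32e914466d9f59d26ce664681c9c1f0cc5"

archive = Path("kitoken-0.10.1.tar.gz")  # assumed local download location
if archive.exists():
    print("digest ok" if sha256_matches(archive, SDIST_SHA256) else "digest mismatch")
```

Reading in chunks keeps memory use constant regardless of file size, which matters little for a 59995-byte sdist but is a reasonable habit for the larger wheels.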