kitoken


Name: kitoken
Version: 0.10.1
Summary: Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization
Upload time: 2024-12-20 03:07:14
Requires Python: >=3.10
License: BSD-2-Clause
Keywords: tokenizer, nlp, bpe, unigram, wordpiece
Requirements: No requirements were recorded.

# kitoken

**Tokenizer for language models.**

```py
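from kitoken import Kitoken

# Load a tokenizer definition from a file.
encoder = Kitoken.from_file("models/llama3.3.model")

# Encode the text into tokens, then decode the tokens back into bytes.
tokens = encoder.encode("hello world!", True)
# decode() yields bytes here, hence the explicit UTF-8 decoding to str.
string = encoder.decode(tokens).decode("utf-8")

# The round trip reproduces the input text.
assert string == "hello world!"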
```
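The `.decode("utf-8")` call in the example indicates that `Kitoken.decode` returns bytes rather than a `str`. One plausible reason, which is an assumption here and not something this page states, is that byte-level tokens need not end on a character boundary, so a partial token sequence may not be valid UTF-8. The sketch below uses only the Python standard library (it does not call Kitoken), with a hypothetical helper `to_text`, to show lenient decoding for that case:

```python
def to_text(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for incomplete sequences."""
    return raw.decode("utf-8", errors="replace")


full = "héllo".encode("utf-8")  # 'é' occupies two bytes in UTF-8
cut = full[:2]                  # ends in the middle of 'é'

try:
    cut.decode("utf-8")         # strict decoding rejects the truncated character
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(strict_ok, repr(to_text(cut)), to_text(full))
```

Strict decoding is the right choice for complete outputs, as in the example above; the lenient form is mainly useful when displaying partial results.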

## Features

- **Fast encoding and decoding**\
  Faster than most other tokenizers in both common and uncommon scenarios.
- **Support for a wide variety of tokenizer formats and tokenization strategies**\
  Including support for Tokenizers, SentencePiece, Tiktoken and more.
- **Compatible with many systems and platforms**\
  Runs on Windows, Linux, macOS and embedded systems, and comes with bindings for Web, Node and Python.
- **Compact data format**\
  Definitions are stored in an efficient binary format and without a merge list.
- **Support for normalization and pre-tokenization**\
  Including Unicode normalization, whitespace normalization, and many others.
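
To illustrate what Unicode and whitespace normalization mean in general, here is a minimal sketch using only the Python standard library. It is not Kitoken's implementation, and the helper name `normalize` and the choice of NFC are assumptions made for this example:

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply NFC normalization, then collapse whitespace runs to one space."""
    composed = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", composed).strip()


# "e" followed by a combining acute accent becomes the single code point "é".
print(normalize("cafe\u0301   au \t lait"))  # prints: café au lait
```

Normalizing before tokenization means that visually identical inputs map to the same token sequence.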

## Overview

Kitoken is a fast and versatile tokenizer for language models with support for multiple tokenization algorithms:

- **BytePair**: A variation of the BPE algorithm, merging byte or character pairs.
- **Unigram**: The Unigram subword algorithm.
- **WordPiece**: The WordPiece subword algorithm.
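
As a rough illustration of two of these strategies, the sketch below shows textbook BPE merging and greedy WordPiece matching on a single word. It does not use Kitoken and does not reflect how Kitoken implements them. In particular, the explicit merge ranks are the classic formulation, whereas Kitoken stores definitions without a merge list. The function names, the `##` continuation prefix (the BERT convention) and the `[UNK]` token are assumptions made for this example:

```python
def bpe_encode(word: str, ranks: dict[tuple[str, str], int]) -> list[str]:
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    parts = list(word)
    while len(parts) > 1:
        # Rank every adjacent pair; pairs without a rank can never be merged.
        candidates = [
            (ranks.get(pair, float("inf")), i)
            for i, pair in enumerate(zip(parts, parts[1:]))
        ]
        rank, i = min(candidates)
        if rank == float("inf"):
            break
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts


def wordpiece_encode(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # No vocabulary entry matches at this position.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_encode("lower", ranks))  # prints: ['low', 'er']
print(wordpiece_encode("unaffable", {"un", "##aff", "##able"}))
```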

Kitoken is compatible with many existing tokenizers,
including [SentencePiece](https://github.com/google/sentencepiece), [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers), [OpenAI Tiktoken](https://github.com/openai/tiktoken) and [Mistral Tekken](https://docs.mistral.ai/guides/tokenization).

See the main [README](https://github.com/Systemcluster/kitoken) for more information.


            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "kitoken",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "tokenizer, nlp, bpe, unigram, wordpiece",
    "author": null,
    "author_email": "Christian Sdunek <me@systemcluster.me>",
    "download_url": "https://files.pythonhosted.org/packages/e6/6c/807e82084c691b0a6b5a659a49761709c2a0adfe1cb5bf77708eed158c70/kitoken-0.10.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD-2-Clause",
    "summary": "Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization",
    "version": "0.10.1",
    "project_urls": {
        "Homepage": "https://kitoken.dev",
        "Repository": "https://github.com/Systemcluster/kitoken"
    },
    "split_keywords": [
        "tokenizer",
        " nlp",
        " bpe",
        " unigram",
        " wordpiece"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "46ef9d4447541f23a71c49c65c83ebffe034e468d2c431b359934b2685d35ab9",
                "md5": "8cf043402d474dde8e2084c353d5d12b",
                "sha256": "e8535845d96746951c105a3ce1ddba0f7a2a70c39aa4405261dc321d06c953c3"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-macosx_10_12_x86_64.whl",
            "has_sig": false,
            "md5_digest": "8cf043402d474dde8e2084c353d5d12b",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1381993,
            "upload_time": "2024-12-20T03:07:04",
            "upload_time_iso_8601": "2024-12-20T03:07:04.920928Z",
            "url": "https://files.pythonhosted.org/packages/46/ef/9d4447541f23a71c49c65c83ebffe034e468d2c431b359934b2685d35ab9/kitoken-0.10.1-cp310-abi3-macosx_10_12_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7a9bda0915143d3b45ee8331e20dd00b9924e1413e3bcca731b3a2953fdf1eae",
                "md5": "e1c3705a6632fcf3f42287a761a90b43",
                "sha256": "ead6759319e6d799eaa39b2d4a88042e55f8068e256772be9a05c99a0772a896"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "e1c3705a6632fcf3f42287a761a90b43",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1251260,
            "upload_time": "2024-12-20T03:07:00",
            "upload_time_iso_8601": "2024-12-20T03:07:00.414606Z",
            "url": "https://files.pythonhosted.org/packages/7a/9b/da0915143d3b45ee8331e20dd00b9924e1413e3bcca731b3a2953fdf1eae/kitoken-0.10.1-cp310-abi3-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d26dadb3a6b2b363033d27bf7560d63e62efcc0c3589a9ee91ed341290e1be47",
                "md5": "8602229357137c4154936cb977b7522b",
                "sha256": "7a6a1d7b9220ae09d6022f5b408f55cb5e2fe2fa7984cb993461a7db1cce3d70"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-manylinux_2_28_aarch64.whl",
            "has_sig": false,
            "md5_digest": "8602229357137c4154936cb977b7522b",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1331069,
            "upload_time": "2024-12-20T03:06:49",
            "upload_time_iso_8601": "2024-12-20T03:06:49.980416Z",
            "url": "https://files.pythonhosted.org/packages/d2/6d/adb3a6b2b363033d27bf7560d63e62efcc0c3589a9ee91ed341290e1be47/kitoken-0.10.1-cp310-abi3-manylinux_2_28_aarch64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7fafd23e0b03a836c8b7a61d029a27aa98a09b28abf20b44d3a9dc55af4f6322",
                "md5": "7e1593473521991d705c917bdf2bc37c",
                "sha256": "617201015aec6d3d76bf7cd26a8b6de997e59ca1d6a0b5fd24578e406a36828f"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-manylinux_2_28_armv7l.whl",
            "has_sig": false,
            "md5_digest": "7e1593473521991d705c917bdf2bc37c",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1332397,
            "upload_time": "2024-12-20T03:06:52",
            "upload_time_iso_8601": "2024-12-20T03:06:52.817329Z",
            "url": "https://files.pythonhosted.org/packages/7f/af/d23e0b03a836c8b7a61d029a27aa98a09b28abf20b44d3a9dc55af4f6322/kitoken-0.10.1-cp310-abi3-manylinux_2_28_armv7l.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1f10628393bd6ba59a4959c8df0fafb5d5ba3d0a90163018c7398e6e89225acd",
                "md5": "cf4670a72e3fc835fe6ae9d7f57276fe",
                "sha256": "3dbda305b095e1b896a1145fad1a6739cfd83c974a8eebbfa1fe4484cd4ae412"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-manylinux_2_28_i686.whl",
            "has_sig": false,
            "md5_digest": "cf4670a72e3fc835fe6ae9d7f57276fe",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1490170,
            "upload_time": "2024-12-20T03:06:57",
            "upload_time_iso_8601": "2024-12-20T03:06:57.221651Z",
            "url": "https://files.pythonhosted.org/packages/1f/10/628393bd6ba59a4959c8df0fafb5d5ba3d0a90163018c7398e6e89225acd/kitoken-0.10.1-cp310-abi3-manylinux_2_28_i686.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b4bbc5c0092b219f5750f9ed5c2c27640c4765794b211304c21f03937ef84da3",
                "md5": "34d87a92bc4f766b33b9b019affd4806",
                "sha256": "82346b7aedbf6b976cd2ee6eaca8777136ed2831e333b90800724086ebaf56b8"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-manylinux_2_28_ppc64le.whl",
            "has_sig": false,
            "md5_digest": "34d87a92bc4f766b33b9b019affd4806",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1492945,
            "upload_time": "2024-12-20T03:06:54",
            "upload_time_iso_8601": "2024-12-20T03:06:54.488716Z",
            "url": "https://files.pythonhosted.org/packages/b4/bb/c5c0092b219f5750f9ed5c2c27640c4765794b211304c21f03937ef84da3/kitoken-0.10.1-cp310-abi3-manylinux_2_28_ppc64le.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "343e15a79166cf69cad15696404696b70e878aa45e21ab87b1efa91a29321452",
                "md5": "3197042919ebe4ef3d1ee4e4e9d42a5b",
                "sha256": "8c781e0a07ae7fe9a4bb633bda705fcb323454f694dbee12d08bca898d4ae3bb"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-manylinux_2_28_x86_64.whl",
            "has_sig": false,
            "md5_digest": "3197042919ebe4ef3d1ee4e4e9d42a5b",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1464504,
            "upload_time": "2024-12-20T03:06:58",
            "upload_time_iso_8601": "2024-12-20T03:06:58.898027Z",
            "url": "https://files.pythonhosted.org/packages/34/3e/15a79166cf69cad15696404696b70e878aa45e21ab87b1efa91a29321452/kitoken-0.10.1-cp310-abi3-manylinux_2_28_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "93695c7a83a2da768836c0a37e254d116f859f6ac2ceab344a1902021ee8de78",
                "md5": "9e0f428c9a624f0c0e3dd5f7e449a4ae",
                "sha256": "1d43d595ca405a9ea85ee532c09bd615f7745fff762d61eadbe3c249da52e05e"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-musllinux_1_2_aarch64.whl",
            "has_sig": false,
            "md5_digest": "9e0f428c9a624f0c0e3dd5f7e449a4ae",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1328149,
            "upload_time": "2024-12-20T03:07:06",
            "upload_time_iso_8601": "2024-12-20T03:07:06.456032Z",
            "url": "https://files.pythonhosted.org/packages/93/69/5c7a83a2da768836c0a37e254d116f859f6ac2ceab344a1902021ee8de78/kitoken-0.10.1-cp310-abi3-musllinux_1_2_aarch64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8d25992213aefcba4f289e5f00d0dd6c2234bf5a7b5096ad3a5e8759ef3c22e3",
                "md5": "8771e9caf7277ec60ca1f09ca46b81d1",
                "sha256": "e98044200b329ed9e1a59dcb031e00d013b838786331c4fbed1916828a121b1b"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-musllinux_1_2_armv7l.whl",
            "has_sig": false,
            "md5_digest": "8771e9caf7277ec60ca1f09ca46b81d1",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1330461,
            "upload_time": "2024-12-20T03:07:07",
            "upload_time_iso_8601": "2024-12-20T03:07:07.863017Z",
            "url": "https://files.pythonhosted.org/packages/8d/25/992213aefcba4f289e5f00d0dd6c2234bf5a7b5096ad3a5e8759ef3c22e3/kitoken-0.10.1-cp310-abi3-musllinux_1_2_armv7l.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "466322ba3ee485dfe542d5bdc20792e5ec2fba4c48f866d64ab3243b1c7bad5e",
                "md5": "1876fe65b374bd8caf335c0cb38a2c13",
                "sha256": "84fde9161ede36fcf000178eaaa0c501e6a28fe9be8d9a489de9533b7035ea34"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-musllinux_1_2_i686.whl",
            "has_sig": false,
            "md5_digest": "1876fe65b374bd8caf335c0cb38a2c13",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1449885,
            "upload_time": "2024-12-20T03:07:09",
            "upload_time_iso_8601": "2024-12-20T03:07:09.402055Z",
            "url": "https://files.pythonhosted.org/packages/46/63/22ba3ee485dfe542d5bdc20792e5ec2fba4c48f866d64ab3243b1c7bad5e/kitoken-0.10.1-cp310-abi3-musllinux_1_2_i686.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ce3a13f22496bad7f4ec4a3cb8333bc211b031386716697356374a31998f83e1",
                "md5": "bd8e910d6a1a4c31f7f6a38c5b88b33d",
                "sha256": "e9474e6e3fbd1c71a6d76f21433247c27e6c2d92400bb10708766cb6eaea8171"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-musllinux_1_2_x86_64.whl",
            "has_sig": false,
            "md5_digest": "bd8e910d6a1a4c31f7f6a38c5b88b33d",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1461828,
            "upload_time": "2024-12-20T03:07:12",
            "upload_time_iso_8601": "2024-12-20T03:07:12.099180Z",
            "url": "https://files.pythonhosted.org/packages/ce/3a/13f22496bad7f4ec4a3cb8333bc211b031386716697356374a31998f83e1/kitoken-0.10.1-cp310-abi3-musllinux_1_2_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "cf2b6ac41cfcd2d3348dce3c941b09828d6cc04ede398c0a62882ad50cae9d04",
                "md5": "bc440b33f8cc8ef2129bd14daf3306e1",
                "sha256": "8c2243783ef68a5422daaadea7c364f3b8e26e922fab84f5c87fa245d2949293"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-win32.whl",
            "has_sig": false,
            "md5_digest": "bc440b33f8cc8ef2129bd14daf3306e1",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1250855,
            "upload_time": "2024-12-20T03:07:18",
            "upload_time_iso_8601": "2024-12-20T03:07:18.746134Z",
            "url": "https://files.pythonhosted.org/packages/cf/2b/6ac41cfcd2d3348dce3c941b09828d6cc04ede398c0a62882ad50cae9d04/kitoken-0.10.1-cp310-abi3-win32.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1da4da01d3e81fad306f20c68a3e1bce4e905df67ee16c29aee4cdec9accf55d",
                "md5": "ab685d6855d0ef4dd14e001a99ad460d",
                "sha256": "0f255791d9bc73228683df853693362d6100cc61986dc01c1dd299ec9e9c6eb8"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1-cp310-abi3-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "ab685d6855d0ef4dd14e001a99ad460d",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1273075,
            "upload_time": "2024-12-20T03:07:15",
            "upload_time_iso_8601": "2024-12-20T03:07:15.721505Z",
            "url": "https://files.pythonhosted.org/packages/1d/a4/da01d3e81fad306f20c68a3e1bce4e905df67ee16c29aee4cdec9accf55d/kitoken-0.10.1-cp310-abi3-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e66c807e82084c691b0a6b5a659a49761709c2a0adfe1cb5bf77708eed158c70",
                "md5": "e9e66323d9cf407d970823aac2bf5075",
                "sha256": "ecca8a68a63e11e048f8f8a6e3dabe32e914466d9f59d26ce664681c9c1f0cc5"
            },
            "downloads": -1,
            "filename": "kitoken-0.10.1.tar.gz",
            "has_sig": false,
            "md5_digest": "e9e66323d9cf407d970823aac2bf5075",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 59995,
            "upload_time": "2024-12-20T03:07:14",
            "upload_time_iso_8601": "2024-12-20T03:07:14.686210Z",
            "url": "https://files.pythonhosted.org/packages/e6/6c/807e82084c691b0a6b5a659a49761709c2a0adfe1cb5bf77708eed158c70/kitoken-0.10.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-20 03:07:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Systemcluster",
    "github_project": "kitoken",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "kitoken"
}
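
The `filename` and `digests` fields in the record above can be used to choose and check a download. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the filename parser assumes the five-part wheel naming scheme without an optional build tag, which matches every wheel listed here:

```python
import hashlib


def split_wheel_filename(filename: str) -> dict[str, str]:
    """Split a wheel filename into its dash-separated components."""
    stem = filename.removesuffix(".whl")
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def sha256_matches(data: bytes, expected_hex: str) -> bool:
    """Compare the SHA-256 of downloaded bytes with the recorded digest."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


tags = split_wheel_filename("kitoken-0.10.1-cp310-abi3-manylinux_2_28_x86_64.whl")
print(tags["platform"])  # prints: manylinux_2_28_x86_64
```

In practice, `data` would be the bytes of the downloaded file and `expected_hex` the `sha256` value from the matching entry in `urls`.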
        