pytorch-tokenizers


Name: pytorch-tokenizers
Version: 1.0.1
home_page: https://github.com/meta-pytorch/tokenizers
Summary: A package with common tokenizers in Python and C++
upload_time: 2025-10-21 17:18:40
maintainer: None
docs_url: None
author: None
requires_python: >=3.10
license: BSD 3-Clause License, Copyright (c) 2024 Meta (full text reproduced in the raw data below)
keywords: pytorch, machine learning, llm
requirements: No requirements were recorded.
Travis-CI: No Travis.
coveralls test coverage: No coveralls.
# tokenizers
C++ implementations of various tokenizers (SentencePiece, Tiktoken, etc.). Useful for other PyTorch repos such as torchchat and ExecuTorch to build LLM runners on the ExecuTorch or AOT Inductor stacks.

## Installation (from source)
```
git clone git@github.com:meta-pytorch/tokenizers.git
cd tokenizers
git submodule update --init --recursive
pip install -e .
```
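
As a quick smoke test, the package should now be importable. This is a minimal sketch; the import name `pytorch_tokenizers` is inferred from the wheel filenames on PyPI, so adjust it if the project documents a different module name.
```
# Minimal install check (assumes the import name is `pytorch_tokenizers`,
# matching the wheel filenames on PyPI).
import pytorch_tokenizers
print(pytorch_tokenizers.__file__)  # shows where the editable install resolved to
```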

## SentencePiece tokenizer
Depends on https://github.com/google/sentencepiece from Google.
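
For reference, the upstream dependency's Python bindings behave roughly as follows. This is a sketch of Google's `sentencepiece` package, not this repo's C++ API, and `tokenizer.model` is a placeholder path to any SentencePiece model file.
```
# Sketch of the upstream SentencePiece Python API (pip install sentencepiece).
# "tokenizer.model" is a placeholder for any SentencePiece model file.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
ids = sp.encode("Hello world", out_type=int)      # token ids
pieces = sp.encode("Hello world", out_type=str)   # subword pieces
print(ids, pieces)
print(sp.decode(ids))  # round-trips back to the original text
```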

## Tiktoken tokenizer
Adapted from https://github.com/sewenew/tokenizer.
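
The behavior being ported is tiktoken-style byte-pair encoding. OpenAI's Python `tiktoken` package (a separate library, used here only for illustration) shows the expected encode/decode round trip:
```
# Illustration using OpenAI's Python tiktoken package (pip install tiktoken);
# this repo provides a C++ implementation of the same BPE scheme.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Hello world")
print(ids)
print(enc.decode(ids))  # "Hello world"
```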

## Huggingface tokenizer
Compatible with https://github.com/huggingface/tokenizers/.
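
Compatibility here refers to the `tokenizer.json` artifacts produced by HuggingFace's `tokenizers` library. With the upstream Python package, loading and using such a file looks roughly like this; `tokenizer.json` is a placeholder for any exported tokenizer file.
```
# Sketch using HuggingFace's Python tokenizers package (pip install tokenizers).
# "tokenizer.json" is a placeholder for any exported HuggingFace tokenizer file.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
enc = tok.encode("Hello world")
print(enc.ids, enc.tokens)   # token ids and string pieces
print(tok.decode(enc.ids))   # decode back to text
```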

## Llama2.c tokenizer
Adapted from https://github.com/karpathy/llama2.c.
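
llama2.c stores its vocabulary in a compact `tokenizer.bin` file. The sketch below is based on how karpathy's `run.c` reads that file and assumes (without confirming) that this repo uses the same layout; the default `vocab_size` of 32000 matches the Llama 2 models llama2.c ships with.
```
# Hedged sketch of the llama2.c tokenizer.bin layout (per karpathy's run.c):
# an int32 max_token_length, then for each vocabulary entry a float32 score,
# an int32 byte length, and that many raw bytes. vocab_size is not stored in
# the file, so it must be supplied by the caller.
import struct

def read_llama2c_vocab(path: str, vocab_size: int = 32000):
    scores, tokens = [], []
    with open(path, "rb") as f:
        (max_token_length,) = struct.unpack("<i", f.read(4))
        for _ in range(vocab_size):
            score, length = struct.unpack("<fi", f.read(8))
            scores.append(score)
            tokens.append(f.read(length))
    return max_token_length, scores, tokens
```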

## Tekken tokenizer
Mistral's Tekken tokenizer (v7), with full support for special tokens, multilingual text, and instruction-tuned conversations. It provides significant efficiency gains for AI workloads (see the sketch after this list):
- **Special token recognition**: [INST], [/INST], [AVAILABLE_TOOLS], etc. as single tokens
- **Multilingual support**: Complete Unicode handling including emojis and complex scripts
- **Production-ready**: 100% decode accuracy with comprehensive test coverage
- **Python bindings**: Full compatibility with mistral-common ecosystem
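
A hypothetical usage sketch: the class name `TekkenTokenizer`, its import path, and the method names below are assumptions made for illustration, not the package's confirmed API, and `tekken.json` is a placeholder for Mistral's Tekken tokenizer file.
```
# Hypothetical sketch only: TekkenTokenizer, from_file, encode, and decode are
# assumed names, not confirmed pytorch_tokenizers API. "tekken.json" stands in
# for Mistral's Tekken v7 tokenizer file.
from pytorch_tokenizers import TekkenTokenizer  # assumed import path

tok = TekkenTokenizer.from_file("tekken.json")
prompt = "[INST] Translate 'bonjour' to English. [/INST]"
ids = tok.encode(prompt)          # [INST] and [/INST] should map to single ids
print(ids)
print(tok.decode(ids) == prompt)  # decoding is expected to round-trip exactly
```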

## License

tokenizers is released under the [BSD 3-Clause license](LICENSE). (Additional
code in this distribution is covered by the MIT and Apache open source
licenses.) However, you may have other legal obligations that govern
your use of content, such as the terms of service for third-party
models.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/meta-pytorch/tokenizers",
    "name": "pytorch-tokenizers",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "pytorch, machine learning, llm",
    "author": null,
    "author_email": "PyTorch Team <packages@pytorch.org>",
    "download_url": "https://files.pythonhosted.org/packages/0c/a7/d07ebde1011d704e4ab5c8fcb46fc71cb72fdf2c393f85202f7dfb8fd669/pytorch_tokenizers-1.0.1.tar.gz",
    "platform": null,
    "description": "# tokenizers\nC++ implementations for various tokenizers (sentencepiece, tiktoken etc). Useful for other PyTorch repos such as torchchat, ExecuTorch to build LLM runners using ExecuTorch stack or AOT Inductor stack.\n\n## Installation (from source)\n```\ngit clone git@github.com:meta-pytorch/tokenizers.git\ncd ~/tokenizers\ngit submodule update --init --recursive\npip install -e .\n```\n\n## SentencePiece tokenizer\nDepend on https://github.com/google/sentencepiece from Google.\n\n## Tiktoken tokenizer\nAdapted from https://github.com/sewenew/tokenizer.\n\n## Huggingface tokenizer\nCompatible with https://github.com/huggingface/tokenizers/.\n\n## Llama2.c tokenizer\nAdapted from https://github.com/karpathy/llama2.c.\n\n## Tekken tokenizer\nMistral's Tekken tokenizer (v7) with full support for special tokens, multilingual text, and instruction-tuned conversations. Provides significant efficiency gains for AI workloads:\n- **Special token recognition**: [INST], [/INST], [AVAILABLE_TOOLS], etc. as single tokens\n- **Multilingual support**: Complete Unicode handling including emojis and complex scripts\n- **Production-ready**: 100% decode accuracy with comprehensive test coverage\n- **Python bindings**: Full compatibility with mistral-common ecosystem\n\n## License\n\ntokenizers is released under the [BSD 3 license](LICENSE). (Additional\ncode in this distribution is covered by the MIT and Apache Open Source\nlicenses.) However you may have other legal obligations that govern\nyour use of content, such as the terms of service for third-party\nmodels.\n",
    "bugtrack_url": null,
    "license": "BSD 3-Clause License\n        \n        Copyright (c) 2024 Meta\n        \n        Redistribution and use in source and binary forms, with or without\n        modification, are permitted provided that the following conditions are met:\n        \n        1. Redistributions of source code must retain the above copyright notice, this\n           list of conditions and the following disclaimer.\n        \n        2. Redistributions in binary form must reproduce the above copyright notice,\n           this list of conditions and the following disclaimer in the documentation\n           and/or other materials provided with the distribution.\n        \n        3. Neither the name of the copyright holder nor the names of its\n           contributors may be used to endorse or promote products derived from\n           this software without specific prior written permission.\n        \n        THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\n        AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\n        IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\n        DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE\n        FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\n        DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR\n        SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER\n        CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,\n        OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE\n        OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n        ",
    "summary": "A package with common tokenizers in Python and C++",
    "version": "1.0.1",
    "project_urls": {
        "Changelog": "https://github.com/pytorch/executorch/releases",
        "Homepage": "https://pytorch.org/executorch/",
        "Issues": "https://github.com/pytorch/executorch/issues",
        "Repository": "https://github.com/pytorch/executorch"
    },
    "split_keywords": [
        "pytorch",
        " machine learning",
        " llm"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7cee16fdccaa4edae631b703a420b14a2b31963b29d6b4e1fe61d66bf5ecc9b4",
                "md5": "e789e77790538cd88e3c0e65fc43f713",
                "sha256": "5634e989b8dfbc04d00eb5cbbf5f6da3b9857004908e56e39f324868177aac5e"
            },
            "downloads": -1,
            "filename": "pytorch_tokenizers-1.0.1-cp310-cp310-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "e789e77790538cd88e3c0e65fc43f713",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1052127,
            "upload_time": "2025-10-21T17:18:34",
            "upload_time_iso_8601": "2025-10-21T17:18:34.547961Z",
            "url": "https://files.pythonhosted.org/packages/7c/ee/16fdccaa4edae631b703a420b14a2b31963b29d6b4e1fe61d66bf5ecc9b4/pytorch_tokenizers-1.0.1-cp310-cp310-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "85b3ee16ffd37f905e12bd0161d4ddf2e817cd44887d539156ffe65b9ec95442",
                "md5": "905d12116fd51743a228cd6c6e66e3d6",
                "sha256": "08bce8d4c59baa18417bad6dbc008313ddc6aeeceddd9de6ae82fbeff3f1ac26"
            },
            "downloads": -1,
            "filename": "pytorch_tokenizers-1.0.1-cp310-cp310-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "905d12116fd51743a228cd6c6e66e3d6",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 1375306,
            "upload_time": "2025-10-21T17:46:28",
            "upload_time_iso_8601": "2025-10-21T17:46:28.243181Z",
            "url": "https://files.pythonhosted.org/packages/85/b3/ee16ffd37f905e12bd0161d4ddf2e817cd44887d539156ffe65b9ec95442/pytorch_tokenizers-1.0.1-cp310-cp310-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3ff13aad1fe2583ce30b4868bfb0d79ab131a97f60aeb7f6b4d5f51d0bf1a54a",
                "md5": "4992a64c7adc82aae296e80c0023da50",
                "sha256": "eed5d45e27858fca4fb7656b48cfb51095cf1b62743e80aa42009ce1d11ede77"
            },
            "downloads": -1,
            "filename": "pytorch_tokenizers-1.0.1-cp311-cp311-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "4992a64c7adc82aae296e80c0023da50",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.10",
            "size": 1053197,
            "upload_time": "2025-10-21T17:18:35",
            "upload_time_iso_8601": "2025-10-21T17:18:35.918050Z",
            "url": "https://files.pythonhosted.org/packages/3f/f1/3aad1fe2583ce30b4868bfb0d79ab131a97f60aeb7f6b4d5f51d0bf1a54a/pytorch_tokenizers-1.0.1-cp311-cp311-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8653757aa3b8d46f43d83cc07aa4de5f0a087a0c1351cc20ba92eb857561513e",
                "md5": "0fb38d0c89af18ddff454e35cf5c09f0",
                "sha256": "7b1980e23ead9f744e420f350c08b31436192e20bae9f52edb222aabd2c408b3"
            },
            "downloads": -1,
            "filename": "pytorch_tokenizers-1.0.1-cp311-cp311-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "0fb38d0c89af18ddff454e35cf5c09f0",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.10",
            "size": 1376268,
            "upload_time": "2025-10-21T17:46:29",
            "upload_time_iso_8601": "2025-10-21T17:46:29.901614Z",
            "url": "https://files.pythonhosted.org/packages/86/53/757aa3b8d46f43d83cc07aa4de5f0a087a0c1351cc20ba92eb857561513e/pytorch_tokenizers-1.0.1-cp311-cp311-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ec13a70d8d636c129d6e8b450ed3df746de69a55a9b51f5fb0229107815b1b96",
                "md5": "377c79df4ee4ddc0a3df87890aca79ef",
                "sha256": "63cf5b728e5c80bf52a256d883386650865af51e549167a81f9f40018566fa69"
            },
            "downloads": -1,
            "filename": "pytorch_tokenizers-1.0.1-cp312-cp312-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "377c79df4ee4ddc0a3df87890aca79ef",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.10",
            "size": 1052632,
            "upload_time": "2025-10-21T17:18:37",
            "upload_time_iso_8601": "2025-10-21T17:18:37.188212Z",
            "url": "https://files.pythonhosted.org/packages/ec/13/a70d8d636c129d6e8b450ed3df746de69a55a9b51f5fb0229107815b1b96/pytorch_tokenizers-1.0.1-cp312-cp312-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "63860db131a810ec7a3237ae356a0c34d0e08b94f56368263ed8e62b313a4eda",
                "md5": "bbdcef81a65971794ef2ce6597b5603a",
                "sha256": "704e16aafe5fe90f660296b8bf0c20738c44aba9ab675f0999750ff2eca47c49"
            },
            "downloads": -1,
            "filename": "pytorch_tokenizers-1.0.1-cp312-cp312-macosx_15_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "bbdcef81a65971794ef2ce6597b5603a",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.10",
            "size": 1047765,
            "upload_time": "2025-10-21T17:18:38",
            "upload_time_iso_8601": "2025-10-21T17:18:38.150710Z",
            "url": "https://files.pythonhosted.org/packages/63/86/0db131a810ec7a3237ae356a0c34d0e08b94f56368263ed8e62b313a4eda/pytorch_tokenizers-1.0.1-cp312-cp312-macosx_15_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d6128559a343e9f8416ee580a455d2c6c43cec3ac60539b6523ace39abe462cd",
                "md5": "ed8e1b4b9de6754c6abccda636ee72ba",
                "sha256": "8dce52ac39ad8f652a83febdc5215bc2049fc1288f72ad84cf93763781e6a92c"
            },
            "downloads": -1,
            "filename": "pytorch_tokenizers-1.0.1-cp312-cp312-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "ed8e1b4b9de6754c6abccda636ee72ba",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.10",
            "size": 1375204,
            "upload_time": "2025-10-21T17:46:31",
            "upload_time_iso_8601": "2025-10-21T17:46:31.232292Z",
            "url": "https://files.pythonhosted.org/packages/d6/12/8559a343e9f8416ee580a455d2c6c43cec3ac60539b6523ace39abe462cd/pytorch_tokenizers-1.0.1-cp312-cp312-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "0ca7d07ebde1011d704e4ab5c8fcb46fc71cb72fdf2c393f85202f7dfb8fd669",
                "md5": "0e60b69227f8d14bb227cb41547fc8cd",
                "sha256": "19938c0efd5348fb0589610094c1c6bf6e055b20889f6250068d2b552a0b137b"
            },
            "downloads": -1,
            "filename": "pytorch_tokenizers-1.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "0e60b69227f8d14bb227cb41547fc8cd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 28542924,
            "upload_time": "2025-10-21T17:18:40",
            "upload_time_iso_8601": "2025-10-21T17:18:40.041356Z",
            "url": "https://files.pythonhosted.org/packages/0c/a7/d07ebde1011d704e4ab5c8fcb46fc71cb72fdf2c393f85202f7dfb8fd669/pytorch_tokenizers-1.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-21 17:18:40",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "meta-pytorch",
    "github_project": "tokenizers",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pytorch-tokenizers"
}
        