| Field | Value |
| --- | --- |
| Name | smoltoken |
| Version | 0.1.3 |
| home_page | None |
| Summary | A light-weight & fast library for Byte Pair Encoding (BPE) tokenization. |
| upload_time | 2025-02-06 04:43:52 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.11 |
| license | None |
| keywords | tokenizer, bpe, ai |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# SmolToken
SmolToken is a fast library for tokenizing text using the Byte Pair Encoding (BPE) algorithm. Inspired by OpenAI's [`tiktoken`](https://github.com/openai/tiktoken), SmolToken is designed to fill a critical gap by enabling BPE training from scratch while maintaining high performance for encoding and decoding tasks.
Unlike `tiktoken`, SmolToken supports training tokenizers on custom data. Training is up to **~4x faster** than a Rust port of the unoptimized educational implementation [`_educational.py`](https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py).
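For readers new to BPE, the following is a minimal pure-Python sketch of what "training from scratch" means at the byte level. It is not SmolToken's API or implementation: the function name `train_bpe` and its parameters are hypothetical, and the sketch skips the regex pre-splitting that practical tokenizers usually apply before counting pairs.

```python
from collections import Counter


def train_bpe(text: str, vocab_size: int) -> dict[bytes, int]:
    """Learn a byte-level BPE vocabulary with at most `vocab_size` tokens.

    Hypothetical helper for illustration only; not part of smoltoken.
    """
    if vocab_size < 256:
        raise ValueError("vocab_size must be at least 256 to cover every byte")

    # Start with one token per possible byte value (ranks 0..255).
    ranks: dict[bytes, int] = {bytes([i]): i for i in range(256)}
    # Represent the corpus as a sequence of single-byte tokens.
    tokens: list[bytes] = [bytes([b]) for b in text.encode("utf-8")]

    while len(ranks) < vocab_size:
        # Count every adjacent pair of tokens in the corpus.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # fewer than two tokens left, nothing to merge
        # Pick the most frequent pair.
        (left, right), _ = pair_counts.most_common(1)[0]
        merged = left + right
        # Two different pairs can concatenate to the same bytes; keep the first rank.
        if merged not in ranks:
            ranks[merged] = len(ranks)

        # Replace every occurrence of the pair with the merged token.
        out: list[bytes] = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out

    return ranks
```

Each loop iteration rescans the whole corpus, which is why an unoptimized trainer is slow and why incremental pair counting and parallelism are the natural places to look for speedups.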
## Benchmark Results
SmolToken is already faster than the baseline educational implementation of BPE training:
| Implementation | Runtime (sec) |
| ------------------------------ | ------------- |
| **Unoptimized Implementation** | 36.94385 |
| **SmolToken Optimized** | 17.63223 |
| **SmolToken (with rayon)** | 7.489850 |
Tested on:
- Vocabulary size: **500**
- Dataset: **Tiny Stories (~18 MB)**
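The speedup factors implied by the table can be checked with a few lines of Python. The runtimes are copied from the table; the variable names are only illustrative.

```python
# Runtimes in seconds, copied from the benchmark table.
runtimes = {
    "Unoptimized Implementation": 36.94385,
    "SmolToken Optimized": 17.63223,
    "SmolToken (with rayon)": 7.489850,
}

baseline = runtimes["Unoptimized Implementation"]
# Speedup = baseline runtime divided by the runtime of each implementation.
speedups = {name: baseline / seconds for name, seconds in runtimes.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

On these numbers the optimized build is roughly 2.1x faster than the baseline, and the rayon build roughly 4.9x faster.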
## Installation
Add smoltoken to your Rust project via [crates.io](https://crates.io/):
```bash
cargo add smoltoken
```
Or add smoltoken to your Python project via [PyPI](https://pypi.org/):
```bash
pip install smoltoken
```
## Roadmap
- [x] **Concurrency**: Add multi-threading support using `rayon` for faster training, encoding, and decoding.
- [x] **Python Bindings**: Integrate with Python using `PyO3` to make it accessible for Python developers.
- [x] **Serialization**: Add serialization support to save/load trained tokenizer vocabulary.
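As an illustration of what saving and loading a trained vocabulary can involve, here is a small Python sketch that writes one base64-encoded token and its rank per line, similar in spirit to the text files that `tiktoken` loads. The on-disk format SmolToken actually uses is not described here, so the functions `save_vocab` and `load_vocab` and the file layout are assumptions made for the example.

```python
import base64
from pathlib import Path


def save_vocab(ranks: dict[bytes, int], path: Path) -> None:
    """Write one '<base64 token> <rank>' line per token, ordered by rank.

    Hypothetical helper; not smoltoken's actual serialization format.
    """
    lines = [
        f"{base64.b64encode(token).decode('ascii')} {rank}"
        for token, rank in sorted(ranks.items(), key=lambda item: item[1])
    ]
    path.write_text("\n".join(lines) + "\n", encoding="ascii")


def load_vocab(path: Path) -> dict[bytes, int]:
    """Read a file written by `save_vocab` back into a token-to-rank mapping."""
    ranks: dict[bytes, int] = {}
    for line in path.read_text(encoding="ascii").splitlines():
        if not line:
            continue  # tolerate blank lines
        encoded, rank = line.split()
        ranks[base64.b64decode(encoded)] = int(rank)
    return ranks
```

Base64 keeps arbitrary byte sequences, including whitespace and invalid UTF-8, safe inside a plain-text file.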
## Contributing
We welcome contributions that make SmolToken fast, robust, and efficient. Fork the repository, create a feature branch if needed, and submit your pull request. Since the library is still in its early release stage, community feedback is also very welcome. Just raise an issue in the repository and we will fix it promptly.
## License
SmolToken is open source and licensed under the [MIT License](LICENSE).
## Acknowledgements
Special thanks to OpenAI's [`tiktoken`](https://github.com/openai/tiktoken) for inspiration and foundational ideas.
Raw data
{
"_id": null,
"home_page": null,
"name": "smoltoken",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": null,
"keywords": "tokenizer, bpe, ai",
"author": null,
"author_email": "Arun S V <svarunid@gmail.com>",
"download_url": null,
"platform": null,
"description": "# SmolToken\r\n\r\nSmolToken is a fast library for tokenizing text using the Byte Pair Encoding (BPE) algorithm. Inspired by OpenAI's [`tiktoken`](https://github.com/openai/tiktoken), SmolToken is designed to fill a critical gap by enabling BPE training from scratch while maintaining high performance for encoding and decoding tasks.\r\n\r\nUnlike `tiktoken`, SmolToken supports training tokenizers on custom data. Up to **~4x faster** than the port of unoptimized educational implementation [`_educational.py`](https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py) in rust.\r\n\r\n## Benchmark Results\r\n\r\nSmolToken is already faster than baseline educational implementation of BPE training:\r\n\r\n| Implementation | Runtime (sec) |\r\n| ------------------------------ | ------------- |\r\n| **Unoptimized Implementation** | 36.94385 |\r\n| **SmolToken Optimized** | 17.63223 |\r\n| **SmolToken (with rayon)** | 7.489850 |\r\n\r\nTested on:\r\n\r\n- Vocabulary size: **500**\r\n- Dataset: **Tiny Stories (~18 MB)**\r\n\r\n## Installation\r\n\r\nAdd smoltoken to your Rust project via [crates.io](https://crates.io/):\r\n\r\n```bash\r\ncargo add smoltoken\r\n```\r\n\r\nOr add smoltoken to your Python project via [PyPI](https://pypi.org/):\r\n\r\n```bash\r\npip install smoltoken\r\n```\r\n\r\n## Roadmap\r\n\r\n- [x] **Concurrency**: Add multi-threading support using `rayon` for faster training, encoding, and decoding.\r\n- [x] **Python Bindings**: Integrate with Python using `PyO3` to make it accessible for Python developers.\r\n- [x] **Serialization**: Add serialization support to save/load trained tokenizer vocabulary.\r\n\r\n## Contributing\r\n\r\nWe very much welcome contributions to make Smoltoken fast, robust and efficient. Make a fork, create a feature branch if needed and sumbit your pull request. Since, the library itself is in its early release stage, I also expect to get community feedback to improve on. 
Just raise an issue here and we will fix them promptly.\r\n\r\n## License\r\n\r\nSmolToken is open source and licensed under the [MIT License](LICENSE).\r\n\r\n## Acknowledgements\r\n\r\nSpecial thanks to OpenAI's [`tiktoken`](https://github.com/openai/tiktoken) for inspiration and foundational ideas.\r\n\n",
"bugtrack_url": null,
"license": null,
"summary": "A light-weight & fast library for Byte Pair Encoding (BPE) tokenization.",
"version": "0.1.3",
"project_urls": {
"repository": "https://github.com/svarunid/smoltoken"
},
"split_keywords": [
"tokenizer",
" bpe",
" ai"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "40fbbc81eb75cb9304b37b8adfebd97cb16638e2829d40bc53afffa8726fa7fa",
"md5": "8954f8bd7d8e4c13e603f5d9c02007ff",
"sha256": "a619140d5b2564abb0d4cfa86f9b7953c8c8520517bd365ea9f7e75df97866aa"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp311-cp311-macosx_10_12_x86_64.whl",
"has_sig": false,
"md5_digest": "8954f8bd7d8e4c13e603f5d9c02007ff",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.11",
"size": 1111620,
"upload_time": "2025-02-06T04:43:52",
"upload_time_iso_8601": "2025-02-06T04:43:52.940326Z",
"url": "https://files.pythonhosted.org/packages/40/fb/bc81eb75cb9304b37b8adfebd97cb16638e2829d40bc53afffa8726fa7fa/smoltoken-0.1.3-cp311-cp311-macosx_10_12_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "498d32cd2026903c99a6aa8c05e7809eb087987fb781bc3b3399c5479e20e9a3",
"md5": "4fe906a9081d4f5afb92c5f0ca702f8e",
"sha256": "9246fa0727c5bbb5c1cf20909619b26cb0d327f40165bdbc1e77a14f73a282df"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp311-cp311-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "4fe906a9081d4f5afb92c5f0ca702f8e",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.11",
"size": 1054741,
"upload_time": "2025-02-06T04:43:58",
"upload_time_iso_8601": "2025-02-06T04:43:58.353006Z",
"url": "https://files.pythonhosted.org/packages/49/8d/32cd2026903c99a6aa8c05e7809eb087987fb781bc3b3399c5479e20e9a3/smoltoken-0.1.3-cp311-cp311-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "8d138c3cbfa93a8dd6c3a194fd6217c7fd7a645053ec51d73403c2812f477087",
"md5": "e578edcebd19d1d5188ab3cc528a632b",
"sha256": "14bfb56bc56ec4dee42b92134b9703ecbcb80ec96e11184548167de7e5366593"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "e578edcebd19d1d5188ab3cc528a632b",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.11",
"size": 1257147,
"upload_time": "2025-02-06T04:44:03",
"upload_time_iso_8601": "2025-02-06T04:44:03.276863Z",
"url": "https://files.pythonhosted.org/packages/8d/13/8c3cbfa93a8dd6c3a194fd6217c7fd7a645053ec51d73403c2812f477087/smoltoken-0.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c5546b90c684c53858f06d11267c6aa4ec061907a18db62a457e8554987273fd",
"md5": "7762d1c51bcf4efd3a690bd802a4777c",
"sha256": "bdb63667e7d37e2eeeef6eb460c48c10b7f154cc79898dd258092fcfc6617421"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp311-cp311-musllinux_1_2_x86_64.whl",
"has_sig": false,
"md5_digest": "7762d1c51bcf4efd3a690bd802a4777c",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.11",
"size": 1319673,
"upload_time": "2025-02-06T04:44:08",
"upload_time_iso_8601": "2025-02-06T04:44:08.187272Z",
"url": "https://files.pythonhosted.org/packages/c5/54/6b90c684c53858f06d11267c6aa4ec061907a18db62a457e8554987273fd/smoltoken-0.1.3-cp311-cp311-musllinux_1_2_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "980660e21f1fca7ce9ce149e1998b461489c4cafe1da2e6457415be4940345dd",
"md5": "94afab0bb065afa249f4f09e204a9bd9",
"sha256": "28bf54dd5b9a24eb1124965e086f627f7a37799bb981561f2f379b53d0dbb1d8"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp311-cp311-win_amd64.whl",
"has_sig": false,
"md5_digest": "94afab0bb065afa249f4f09e204a9bd9",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.11",
"size": 935807,
"upload_time": "2025-02-06T04:44:11",
"upload_time_iso_8601": "2025-02-06T04:44:11.438845Z",
"url": "https://files.pythonhosted.org/packages/98/06/60e21f1fca7ce9ce149e1998b461489c4cafe1da2e6457415be4940345dd/smoltoken-0.1.3-cp311-cp311-win_amd64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a6eaa1e3f27eede9e74a2e1ec4c329fb92703db7b9f4a714150b1a150a96768e",
"md5": "fca5f9c5ee8bb7f5ac08c084d78b9b66",
"sha256": "51d668add51822e946a2406902b1dc48c5122d0e513383b18e9b0cd03f3ff1f5"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp312-cp312-macosx_10_13_x86_64.whl",
"has_sig": false,
"md5_digest": "fca5f9c5ee8bb7f5ac08c084d78b9b66",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.11",
"size": 1111658,
"upload_time": "2025-02-06T04:44:14",
"upload_time_iso_8601": "2025-02-06T04:44:14.340178Z",
"url": "https://files.pythonhosted.org/packages/a6/ea/a1e3f27eede9e74a2e1ec4c329fb92703db7b9f4a714150b1a150a96768e/smoltoken-0.1.3-cp312-cp312-macosx_10_13_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "18453faa630759df47e42208c5a620f1125b48718b6f5c7d15cd7c5f39dfec21",
"md5": "6faaa2b013bafb23512c19e43d323aff",
"sha256": "fd5aef7b703b2519da17e8b80e4040b05f52885f28b6de2996e504b3f52b4e3d"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp312-cp312-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "6faaa2b013bafb23512c19e43d323aff",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.11",
"size": 1052980,
"upload_time": "2025-02-06T04:44:19",
"upload_time_iso_8601": "2025-02-06T04:44:19.174145Z",
"url": "https://files.pythonhosted.org/packages/18/45/3faa630759df47e42208c5a620f1125b48718b6f5c7d15cd7c5f39dfec21/smoltoken-0.1.3-cp312-cp312-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "6ae9016d0f65b98181e81bff560d6e6f332bdec4fc3f2d907ae81c254e072cd8",
"md5": "e20b00904e04809187b4a5ec409e2cb5",
"sha256": "e99babc15a7bd4abed870f09cef034f5b545d57b1191b863e85099efcc5010d6"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "e20b00904e04809187b4a5ec409e2cb5",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.11",
"size": 1256807,
"upload_time": "2025-02-06T04:44:22",
"upload_time_iso_8601": "2025-02-06T04:44:22.567875Z",
"url": "https://files.pythonhosted.org/packages/6a/e9/016d0f65b98181e81bff560d6e6f332bdec4fc3f2d907ae81c254e072cd8/smoltoken-0.1.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "22dedb30b53179cf2a8d3381271f86243ea91e3530e8467b54f75e8e7c8d3ba8",
"md5": "d182059f6785198abfe7a911c9cb748d",
"sha256": "6bd791226c2f0bea40445e9fadea2bd7b377e9d035c61401aad188d46c8e7ac0"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp312-cp312-musllinux_1_2_x86_64.whl",
"has_sig": false,
"md5_digest": "d182059f6785198abfe7a911c9cb748d",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.11",
"size": 1318516,
"upload_time": "2025-02-06T04:44:25",
"upload_time_iso_8601": "2025-02-06T04:44:25.134472Z",
"url": "https://files.pythonhosted.org/packages/22/de/db30b53179cf2a8d3381271f86243ea91e3530e8467b54f75e8e7c8d3ba8/smoltoken-0.1.3-cp312-cp312-musllinux_1_2_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "9e0c58ad3200306767ac85e9a902fcb9e6347cab48cb5e2e8a96c71a614f258b",
"md5": "43f266e6b3caf1b38290011553cd3f73",
"sha256": "a4b2cabb21bb068f9076daa0f48d88ae7a186e1169e51d471a7f386997ae577c"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp312-cp312-win_amd64.whl",
"has_sig": false,
"md5_digest": "43f266e6b3caf1b38290011553cd3f73",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.11",
"size": 935842,
"upload_time": "2025-02-06T04:44:27",
"upload_time_iso_8601": "2025-02-06T04:44:27.128702Z",
"url": "https://files.pythonhosted.org/packages/9e/0c/58ad3200306767ac85e9a902fcb9e6347cab48cb5e2e8a96c71a614f258b/smoltoken-0.1.3-cp312-cp312-win_amd64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "92cace04be73ebdf4089e22a975c340ad49cc3cfba6b0d1b23c10fc23b8b9b4e",
"md5": "1eeb5bf110a8722f0dab2415f97dd2b8",
"sha256": "d3870a0e10178d0df4a5091de0ad56c34878abfc108b66d28f8fd152a68bf8e7"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp313-cp313-macosx_10_13_x86_64.whl",
"has_sig": false,
"md5_digest": "1eeb5bf110a8722f0dab2415f97dd2b8",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.11",
"size": 1111469,
"upload_time": "2025-02-06T04:44:29",
"upload_time_iso_8601": "2025-02-06T04:44:29.245650Z",
"url": "https://files.pythonhosted.org/packages/92/ca/ce04be73ebdf4089e22a975c340ad49cc3cfba6b0d1b23c10fc23b8b9b4e/smoltoken-0.1.3-cp313-cp313-macosx_10_13_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "4538bd682573f58800bb98a01d5926c3d9caacfc7e54d078ebc4835a1e3afb1d",
"md5": "545f7452a382f005b71323a4d849dc66",
"sha256": "14cc353b7cf83af957927ad231a138e76ad5598dd9d3977d7711003f1f742e4c"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp313-cp313-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "545f7452a382f005b71323a4d849dc66",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.11",
"size": 1052586,
"upload_time": "2025-02-06T04:44:32",
"upload_time_iso_8601": "2025-02-06T04:44:32.737623Z",
"url": "https://files.pythonhosted.org/packages/45/38/bd682573f58800bb98a01d5926c3d9caacfc7e54d078ebc4835a1e3afb1d/smoltoken-0.1.3-cp313-cp313-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "d25c1792f4c113b112dcf6aff53425d78259fe2a58f8a28c44f7eb5d1bf522df",
"md5": "389d1eee6b4188cba773c814e9a730b9",
"sha256": "917d38e3f353c2d126c5d411db1afa28c4b752e78167886f7245c056497bfee3"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "389d1eee6b4188cba773c814e9a730b9",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.11",
"size": 1256231,
"upload_time": "2025-02-06T04:44:34",
"upload_time_iso_8601": "2025-02-06T04:44:34.781618Z",
"url": "https://files.pythonhosted.org/packages/d2/5c/1792f4c113b112dcf6aff53425d78259fe2a58f8a28c44f7eb5d1bf522df/smoltoken-0.1.3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f39d2d71324c8c5ff0587a1072a7593623edea5a067399c3cf2a4107cbdb1bc2",
"md5": "d96a4964b87f930a5e0de970e2643941",
"sha256": "2ab56ffe9542877dc71ddcb68047e3296eec3611dbd9cc5b7e92080a93f4f38a"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp313-cp313-musllinux_1_2_x86_64.whl",
"has_sig": false,
"md5_digest": "d96a4964b87f930a5e0de970e2643941",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.11",
"size": 1318360,
"upload_time": "2025-02-06T04:44:36",
"upload_time_iso_8601": "2025-02-06T04:44:36.967687Z",
"url": "https://files.pythonhosted.org/packages/f3/9d/2d71324c8c5ff0587a1072a7593623edea5a067399c3cf2a4107cbdb1bc2/smoltoken-0.1.3-cp313-cp313-musllinux_1_2_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "196bc957ce058c376057a22fcceb6dd3901ab3f2e1a0adad0ac2603516b25811",
"md5": "5dfca59731be51b7f2eb198fb467d2fd",
"sha256": "5c4534ae0b41f988a25e094b43019b8c66c5954e8b021831b4e6c0e2a1ebab7a"
},
"downloads": -1,
"filename": "smoltoken-0.1.3-cp313-cp313-win_amd64.whl",
"has_sig": false,
"md5_digest": "5dfca59731be51b7f2eb198fb467d2fd",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.11",
"size": 935778,
"upload_time": "2025-02-06T04:44:40",
"upload_time_iso_8601": "2025-02-06T04:44:40.110279Z",
"url": "https://files.pythonhosted.org/packages/19/6b/c957ce058c376057a22fcceb6dd3901ab3f2e1a0adad0ac2603516b25811/smoltoken-0.1.3-cp313-cp313-win_amd64.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-06 04:43:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "svarunid",
"github_project": "smoltoken",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "smoltoken"
}