Name | Version | Summary | Date |
turkish-tokenizer | 0.2.26 | Turkish tokenizer for Turkish language processing | 2025-09-03 10:32:03 |
tocount | 0.1 | ToCount: Lightweight Token Estimator | 2025-08-30 16:21:30 |
turbotok | 0.2.0 | High-performance NumPy-based tokenizer library | 2025-08-17 04:27:30 |
ultra-tokenizer | 0.1.3 | Advanced tokenizer with support for BPE, WordPiece, and Unigram algorithms | 2025-08-16 09:58:51 |
smoltoken | 0.1.3 | A light-weight & fast library for Byte Pair Encoding (BPE) tokenization. | 2025-02-06 04:43:52 |
alta-tokenizer | 1.1 | ALTA tokenizer for encoding and decoding Kinyarwanda language text | 2025-01-26 13:59:56 |
text2text | 1.9.4 | Text2Text Language Modeling Toolkit | 2025-01-12 22:34:27 |
tokenizerchanger | 1.0.1 | Library for manipulating the existing tokenizer. | 2024-12-28 00:02:35 |
kitoken | 0.10.1 | Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization | 2024-12-20 03:07:14 |
python-ucto | 0.6.9 | This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression based, extensible, and advanced tokeniser written in C++ (https://languagemachines.github.io/ucto). | 2024-12-17 11:56:39 |
whitespacetokenizer | 1.0.3 | Fast Python whitespace tokenizer written in Cython. | 2024-11-28 13:08:56 |
autotiktokenizer | 0.2.1 | 🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨ | 2024-11-11 21:16:02 |
StrTokenizer | 1.1.0 | A Python equivalent of Java's StringTokenizer with some added functionality | 2024-10-14 18:37:19 |
tiniestsegmenter | 0.3.0 | Compact Japanese segmenter | 2024-09-24 13:39:24 |
TokenizerChanger | 0.3.4 | Library for manipulating the existing tokenizer. | 2024-08-27 23:47:57 |
bpeasy | 0.1.3 | Fast bare-bones BPE for modern tokenizer training | 2024-08-23 10:47:52 |
kin-tokenizer | 3.3.1 | Kinyarwanda tokenizer for encoding and decoding Kinyarwanda language text | 2024-08-16 10:14:03 |
dtokenizer | 0.0.6 | None | 2024-08-13 10:26:00 |
SoMaJo | 2.4.3 | A tokenizer and sentence splitter for German and English web and social media texts. | 2024-08-05 06:41:55 |
cpp-protein-encoders | 0.0.1 | Fast Python-wrapped C++ basic encoders for protein sequences | 2024-07-22 04:15:17 |