| Name | Version | Summary | Date |
|------|---------|---------|------|
| kitoken | 0.10.1 | Fast and versatile tokenizer for language models, supporting BPE, Unigram, and WordPiece tokenization | 2024-12-20 03:07:14 |
| python-ucto | 0.6.9 | Python binding to the Ucto tokenizer. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression based, extensible, and advanced tokeniser written in C++ (https://languagemachines.github.io/ucto). | 2024-12-17 11:56:39 |
| text2text | 1.8.5 | Text2Text Language Modeling Toolkit | 2024-12-08 21:33:57 |
| ts-tokenizer | 0.1.17 | TS Tokenizer is a hybrid (lexicon-based and rule-based) tokenizer designed specifically for tokenizing Turkish texts. | 2024-12-03 09:31:39 |
| whitespacetokenizer | 1.0.3 | Fast Python whitespace tokenizer written in Cython. | 2024-11-28 13:08:56 |
| autotiktokenizer | 0.2.1 | 🧰 The AutoTokenizer that TikToken always needed -- load any tokenizer with TikToken now! ✨ | 2024-11-11 21:16:02 |
| StrTokenizer | 1.1.0 | A Python equivalent of Java's StringTokenizer with some added functionality | 2024-10-14 18:37:19 |
| tiniestsegmenter | 0.3.0 | Compact Japanese segmenter | 2024-09-24 13:39:24 |
| TokenizerChanger | 0.3.4 | Library for manipulating existing tokenizers. | 2024-08-27 23:47:57 |
| bpeasy | 0.1.3 | Fast bare-bones BPE for modern tokenizer training | 2024-08-23 10:47:52 |
| kin-tokenizer | 3.3.1 | Kinyarwanda tokenizer for encoding and decoding Kinyarwanda text | 2024-08-16 10:14:03 |
| dtokenizer | 0.0.6 | None | 2024-08-13 10:26:00 |
| SoMaJo | 2.4.3 | A tokenizer and sentence splitter for German and English web and social media texts. | 2024-08-05 06:41:55 |
| cpp-protein-encoders | 0.0.1 | Fast Python-wrapped C++ basic encoders for protein sequences | 2024-07-22 04:15:17 |
| tokengeex | 1.1.0 | TokenGeeX is an efficient tokenizer for code based on UnigramLM and TokenMonster. | 2024-06-03 03:21:47 |
| kotokenizer | 0.1.1 | Korean tokenizer, sentence classification, and spacing model. | 2024-03-16 07:07:35 |
| tokenizers-gt | 0.15.2.post0 | None | 2024-02-19 01:41:23 |
| sumire | 1.0.2 | Scikit-learn-compatible Japanese text vectorizer for CPU-based Japanese natural language processing. | 2024-01-31 14:38:04 |
| mwtokenizer | 0.2.0 | Wikipedia Tokenizer Utility | 2023-12-22 16:24:45 |
| sentencex | 0.6.1 | Sentence segmenter that supports ~300 languages | 2023-11-14 06:58:58 |
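
To give a feel for how packages of this kind are used, here is a minimal sketch with SoMaJo, the German/English tokenizer and sentence splitter listed above. It follows the API shown in SoMaJo's own documentation (`SoMaJo(...)` and `tokenize_text(...)`); the model name `"en_PTB"` and the sample text are illustrative, and details may differ between versions.

```python
# Minimal sketch: sentence splitting and tokenization with SoMaJo.
# Assumes SoMaJo's documented API; version-specific details may vary.
from somajo import SoMaJo

# "en_PTB" selects the English model; "de_CMC" would select the
# German web/social-media model.
tokenizer = SoMaJo("en_PTB")

paragraphs = ["Tokenization is harder than it looks :) Don't you think?"]

# tokenize_text yields one list of Token objects per detected sentence.
for sentence in tokenizer.tokenize_text(paragraphs):
    print(" ".join(token.text for token in sentence))
```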