PyDigger - unearthing stuff about Python


Name    Version    Summary    Date
kitoken 0.10.1 Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization 2024-12-20 03:07:14
python-ucto 0.6.9 This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression based, extensible, and advanced tokeniser written in C++ (https://languagemachines.github.io/ucto). 2024-12-17 11:56:39
text2text 1.8.5 Text2Text Language Modeling Toolkit 2024-12-08 21:33:57
ts-tokenizer 0.1.17 TS Tokenizer is a hybrid (lexicon-based and rule-based) tokenizer designed specifically for tokenizing Turkish texts. 2024-12-03 09:31:39
whitespacetokenizer 1.0.3 Fast Python whitespace tokenizer written in Cython. 2024-11-28 13:08:56
autotiktokenizer 0.2.1 🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨ 2024-11-11 21:16:02
StrTokenizer 1.1.0 A Python equivalent of Java's StringTokenizer with some added functionality 2024-10-14 18:37:19
tiniestsegmenter 0.3.0 Compact Japanese segmenter 2024-09-24 13:39:24
TokenizerChanger 0.3.4 Library for manipulating existing tokenizers. 2024-08-27 23:47:57
bpeasy 0.1.3 Fast bare-bones BPE for modern tokenizer training 2024-08-23 10:47:52
kin-tokenizer 3.3.1 Kinyarwanda tokenizer for encoding and decoding Kinyarwanda language text 2024-08-16 10:14:03
dtokenizer 0.0.6 (no summary provided) 2024-08-13 10:26:00
SoMaJo 2.4.3 A tokenizer and sentence splitter for German and English web and social media texts. 2024-08-05 06:41:55
cpp-protein-encoders 0.0.1 Fast Python-wrapped C++ basic encoders for protein sequences 2024-07-22 04:15:17
tokengeex 1.1.0 TokenGeeX is an efficient tokenizer for code based on UnigramLM and TokenMonster. 2024-06-03 03:21:47
kotokenizer 0.1.1 Korean tokenizer, sentence classification, and spacing model. 2024-03-16 07:07:35
tokenizers-gt 0.15.2.post0 (no summary provided) 2024-02-19 01:41:23
sumire 1.0.2 Scikit-learn compatible Japanese text vectorizer for CPU-based Japanese natural language processing. 2024-01-31 14:38:04
mwtokenizer 0.2.0 Wikipedia Tokenizer Utility 2023-12-22 16:24:45
sentencex 0.6.1 Sentence segmenter that supports ~300 languages 2023-11-14 06:58:58