PyDigger - unearthing stuff about Python


Name Version Summary Date
turkish-tokenizer 0.2.26 Turkish tokenizer for Turkish language processing 2025-09-03 10:32:03
tocount 0.1 ToCount: Lightweight Token Estimator 2025-08-30 16:21:30
turbotok 0.2.0 High-performance NumPy-based tokenizer library 2025-08-17 04:27:30
ultra-tokenizer 0.1.3 Advanced tokenizer with support for BPE, WordPiece, and Unigram algorithms 2025-08-16 09:58:51
smoltoken 0.1.3 A light-weight & fast library for Byte Pair Encoding (BPE) tokenization. 2025-02-06 04:43:52
alta-tokenizer 1.1 ALTA tokenizer for encoding and decoding Kinyarwanda language text 2025-01-26 13:59:56
text2text 1.9.4 Text2Text Language Modeling Toolkit 2025-01-12 22:34:27
tokenizerchanger 1.0.1 Library for manipulating the existing tokenizer. 2024-12-28 00:02:35
kitoken 0.10.1 Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization 2024-12-20 03:07:14
python-ucto 0.6.9 This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression based, extensible, and advanced tokeniser written in C++ (https://languagemachines.github.io/ucto). 2024-12-17 11:56:39
whitespacetokenizer 1.0.3 Fast Python whitespace tokenizer written in Cython. 2024-11-28 13:08:56
autotiktokenizer 0.2.1 🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨ 2024-11-11 21:16:02
StrTokenizer 1.1.0 A Python equivalent of Java's StringTokenizer with some added functionality 2024-10-14 18:37:19
tiniestsegmenter 0.3.0 Compact Japanese segmenter 2024-09-24 13:39:24
TokenizerChanger 0.3.4 Library for manipulating the existing tokenizer. 2024-08-27 23:47:57
bpeasy 0.1.3 Fast bare-bones BPE for modern tokenizer training 2024-08-23 10:47:52
kin-tokenizer 3.3.1 Kinyarwanda tokenizer for encoding and decoding Kinyarwanda language text 2024-08-16 10:14:03
dtokenizer 0.0.6 None 2024-08-13 10:26:00
SoMaJo 2.4.3 A tokenizer and sentence splitter for German and English web and social media texts. 2024-08-05 06:41:55
cpp-protein-encoders 0.0.1 Fast Python-wrapped C++ basic encoders for protein sequences 2024-07-22 04:15:17