PyDigger - unearthing stuff about Python


Name Version Summary Date
tokengeex 0.6.2 TokenGeeX is an efficient tokenizer for code based on UnigramLM and TokenMonster. 2024-03-22 03:34:44
kotokenizer 0.1.1 Korean tokenizer, sentence classification, and spacing model. 2024-03-16 07:07:35
text2text 1.4.4 Text2Text: Crosslingual NLP/G toolkit 2024-02-19 19:21:15
SoMaJo 2.4.2 A tokenizer and sentence splitter for German and English web and social media texts. 2024-02-19 12:23:57
tokenizers-gt 0.15.2.post0 None 2024-02-19 01:41:23
tokenizers 0.15.2 2024-02-12 02:28:50
sumire 1.0.2 Scikit-learn compatible Japanese text vectorizer for CPU-based Japanese natural language processing. 2024-01-31 14:38:04
optimal-data-selector 1.2.1 A package to optimize models, copy files from one directory to another, handle short words in NLP, choose optimal data for ML models, scrape images, and split time-series data into train and test sets. It also handles emojis and emoticons in NLP, tokenizes words, lists punctuation marks and English pronouns, and can read text files. 2023-12-25 11:38:35
mwtokenizer 0.2.0 Wikipedia Tokenizer Utility 2023-12-22 16:24:45
bpeasy 0.1.2 Fast bare-bones BPE for modern tokenizer training 2023-12-19 10:55:51
sentencex 0.6.1 Sentence segmenter that supports ~300 languages 2023-11-14 06:58:58
rs-bytepiece 0.2.2 bytepiece-rs Python binding 2023-11-12 08:52:37
count-tokens 0.7.0 Count the number of tokens in a text file using the tiktoken tokenizer from OpenAI. 2023-09-26 11:16:08
UnicodeTokenizer 0.2.1 UnicodeTokenizer: tokenize all Unicode text 2023-09-20 21:46:37
python-ucto 0.6.6 This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression based, extensible, and advanced tokeniser written in C++ (https://languagemachines.github.io/ucto). 2023-09-13 09:57:41
taibun 1.0.0 Taiwanese Hokkien Transliterator and Tokeniser 2023-08-31 11:29:11
semiformal 0.7.0 Tokenizer for semiformal unicode text using TR-29 segmentation 2023-08-20 08:00:32
tokenizer 3.4.3 A tokenizer for Icelandic text 2023-08-11 15:09:13
tokenstream 1.6.0 A versatile token stream for handwritten parsers 2023-08-02 18:52:57
Texo 0.0.4 Sentiment analysis for multiple languages and for all products 2023-07-09 15:34:19