Name | Version | Summary | Date |
tokengeex | 0.6.2 | TokenGeeX is an efficient tokenizer for code based on UnigramLM and TokenMonster. | 2024-03-22 03:34:44 |
kotokenizer | 0.1.1 | Korean tokenizer, sentence classification, and spacing model. | 2024-03-16 07:07:35 |
text2text | 1.4.4 | Text2Text: Crosslingual NLP/G toolkit | 2024-02-19 19:21:15 |
SoMaJo | 2.4.2 | A tokenizer and sentence splitter for German and English web and social media texts. | 2024-02-19 12:23:57 |
tokenizers-gt | 0.15.2.post0 | None | 2024-02-19 01:41:23 |
tokenizers | 0.15.2 | | 2024-02-12 02:28:50 |
sumire | 1.0.2 | Scikit-learn compatible Japanese text vectorizer for CPU-based Japanese natural language processing. | 2024-01-31 14:38:04 |
optimal-data-selector | 1.2.1 | A package to optimize models, copy files from one directory to another, handle short-word treatment in NLP, choose optimal data for ML models, scrape images, and split time-series data into train and test sets. It also handles emojis and emoticons in NLP, word tokenization, lists of punctuation marks and English pronouns, and reading text files. | 2023-12-25 11:38:35 |
mwtokenizer | 0.2.0 | Wikipedia Tokenizer Utility | 2023-12-22 16:24:45 |
bpeasy | 0.1.2 | Fast bare-bones BPE for modern tokenizer training | 2023-12-19 10:55:51 |
sentencex | 0.6.1 | Sentence segmenter that supports ~300 languages | 2023-11-14 06:58:58 |
rs-bytepiece | 0.2.2 | bytepiece-rs Python binding | 2023-11-12 08:52:37 |
count-tokens | 0.7.0 | Count the number of tokens in a text file using the tiktoken tokenizer from OpenAI. | 2023-09-26 11:16:08 |
UnicodeTokenizer | 0.2.1 | UnicodeTokenizer: tokenize all Unicode text | 2023-09-20 21:46:37 |
python-ucto | 0.6.6 | This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (https://languagemachines.github.io/ucto). | 2023-09-13 09:57:41 |
taibun | 1.0.0 | Taiwanese Hokkien Transliterator and Tokeniser | 2023-08-31 11:29:11 |
semiformal | 0.7.0 | Tokenizer for semiformal Unicode text using TR-29 segmentation | 2023-08-20 08:00:32 |
tokenizer | 3.4.3 | A tokenizer for Icelandic text | 2023-08-11 15:09:13 |
tokenstream | 1.6.0 | A versatile token stream for handwritten parsers | 2023-08-02 18:52:57 |
Texo | 0.0.4 | Sentiment analysis for multiple languages and all products | 2023-07-09 15:34:19 |