**TS Tokenizer** is a hybrid tokenizer designed specifically for Turkish texts.
It uses a combined lexicon-based and rule-based approach to split text into tokens.
### Key Features:
- **Hybrid Tokenization**: Combines lexicon-based and rule-based techniques to tokenize complex Turkish texts with precision.
- **Special Token Handling**: Detects and processes mentions, hashtags, emails, URLs, dates, numbers, smileys, emoticons, and more.
- **Configurable Outputs**: Offers multiple output formats, including plain tokens, tagged tokens, tokenized lines, and tagged lines, to suit diverse NLP workflows.
- **Multi-core Processing**: Speeds up tokenization for large files with parallel processing.
- **Preprocess Handling**: Handles corrupted Turkish text and punctuation gracefully using built-in fixes.
- **Command-Line Friendly**: Use it directly from the terminal for file-based or piped input workflows.
Whether you are working on natural language processing (NLP), information retrieval, or text mining for Turkish, **TS Tokenizer** offers
a reliable solution for tokenization.
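A minimal Python example (the API is covered in more detail in the Classes section below; the sample sentence is only illustrative):
```python
from ts_tokenizer.tokenizer import TSTokenizer

# Tokenize a single Turkish sentence and print one token per line.
tokens = TSTokenizer.ts_tokenize("Bugün hava çok güzel.", output_format="tokenized")
if tokens is not None:
    for token in tokens:
        print(token)
```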
---
## Installation
You can install the ts-tokenizer package using pip.
```bash
pip install ts-tokenizer
```
Ensure you have Python 3.9 or higher installed on your system.
You can update to the current version using pip:
```bash
pip install --upgrade ts-tokenizer
```
---
To remove the package, use:
```bash
pip uninstall ts-tokenizer
```
You can also clone the repository and install it locally:
```bash
git clone https://github.com/tanerim/ts_tokenizer.git
cd ts_tokenizer
pip install -e .
```
---
## License
This project is licensed under the MIT License. See the [LICENSE](https://github.com/tanerim/ts_tokenizer/blob/main/LICENSE) file for details.
## Command line tool
You can use TS Tokenizer directly from the command line for both file inputs and pipeline processing:
## Tokenize from a File:
```bash
ts-tokenizer input.txt
```
or
```bash
cat input.txt | ts-tokenizer
```
or
```bash
zcat input.txt.gz | ts-tokenizer
```
---
## Help
Get detailed help for available options using `ts-tokenizer --help`:
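```bash
ts-tokenizer --help
```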
| Argument | Short | Description | Default |
|----------------------|-------|-------------------------------------------------------------------------------------------------|---------------|
| `--output` | `-o` | Specify the output format: `tokenized`, `lines`, `tagged`, `tagged_lines`. | `tokenized` |
| `--num-workers` | `-n` | Set the number of parallel workers for processing. | `CPU cores-1` |
| `--verbose` | `-v` | Enable verbose mode to display additional processing details. | Disabled |
| `--version` | `-V` | Display the current version of `ts-tokenizer`. | N/A |
| `--help`              | `-h`  | Show the help message and exit.                                                                   | N/A           |
---
## CLI Arguments
You can specify the output format using the `-o` option:
- **tokenized (default):** Returns plain tokens, one per line.
- **tagged:** Returns tokens with their tags.
- **lines:** Returns tokenized lines as lists.
- **tagged_lines:** Returns tokenized lines as a list of tuples (token, tag).
```bash
$ cat input.txt
Queen , 31.10.1975 tarihinde çıkardıðı A Night at the Opera albümüyle dünya müziðini deðiåÿtirdi .

$ ts-tokenizer input.txt
Queen
,
31.10.1975
tarihinde
çıkardığı
A
Night
at
the
Opera
albümüyle
dünya
müziğini
değiştirdi
.
```
Note that **tags are not part-of-speech tags**; they describe the form of the given string.
```bash
$ ts-tokenizer -o tagged input.txt
Queen English_Word
, Punc
31.10.1975 Date
tarihinde Valid_Word
çıkardığı Valid_Word
A OOV
Night English_Word
at Valid_Word
the English_Word
Opera Valid_Word
albümüyle Valid_Word
dünya Valid_Word
müziğini Valid_Word
değiştirdi Valid_Word
. Punc
```
The other two output formats are `lines` and `tagged_lines`.
The `lines` format reads the input file line by line and returns a list of tokens for each line. Note that each line is defined by end-of-line markers in the given text.
```bash
$ ts-tokenizer -o lines input.txt
['Queen', ',', '31.10.1975', 'tarihinde', 'çıkardığı', 'A', 'Night', 'at', 'the', 'Opera', 'albümüyle', 'dünya', 'müziğini', 'değiştirdi', '.']
```
The "tagged_lines" parameter reads input file line-by-line and returns a list of tuples for each line. Note that each line is defined by end-of-line markers in the given text.
```bash
$ ts-tokenizer -o tagged_lines input.txt
[('Queen', 'English_Word'), (',', 'Punc'), ('31.10.1975', 'Date'), ('tarihinde', 'Valid_Word'), ('çıkardığı', 'Valid_Word'), ('A', 'OOV'), ('Night', 'English_Word'), ('at', 'Valid_Word'), ('the', 'English_Word'), ('Opera', 'Valid_Word'), ('albümüyle', 'Valid_Word'), ('dünya', 'Valid_Word'), ('müziğini', 'Valid_Word'), ('değiştirdi', 'Valid_Word'), ('.', 'Punc')]
```
---
## Parallel Processing
Use the `-n` option to set the number of parallel workers:
```bash
$ ts-tokenizer -n 2 -o tagged input_file
```
By default, TS Tokenizer uses (number of CPU cores - 1) workers.
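The default corresponds to something like the following (an illustrative sketch only; the package may compute it slightly differently):
```python
import multiprocessing

# Default number of workers: all available CPU cores minus one,
# keeping at least one worker.
default_workers = max(1, multiprocessing.cpu_count() - 1)
print(default_workers)
```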
---
## Using CLI Arguments with pipelines
You can use TS Tokenizer in bash pipelines, such as counting word frequencies.
The following sample returns token frequencies for the given file:
```bash
$ ts-tokenizer input.txt | sort | uniq -c | sort -n
```
---
To count tags:
```bash
$ ts-tokenizer -o tagged input.txt | cut -f2 | sort | uniq -c
1 Date
3 English_Word
2 Hashtag
2 Mention
1 Multi_Hyphenated
3 Numbered_Title
1 OOV
9 Punc
1 Single_Hyphenated
17 Valid_Word
```
To filter tokens by a specific tag, a command like the following could be used:
```bash
$ ts-tokenizer -o tagged input.txt | cut -f1,2 | grep "Web_URL"
www.wikipedia.org Web_URL
www.wim-wenders.com Web_URL
www.winterwar.com. Web_URL
www.wissenschaft.de: Web_URL
www.wittingen.de Web_URL
www.wlmqradio.com Web_URL
www.worldstadiums.com Web_URL
www.worldstatesmen.org Web_URL
```
---
# Classes
Below are samples showing how to use ts-tokenizer in Python.
## TSTokenizer
**tokenized** : Outputs a list of plain tokens extracted from the input text.
```python
from ts_tokenizer.tokenizer import TSTokenizer
Single_Line_Sample = "Parça ve bütün iliåÿkisi her zaman iåÿlevsel deðildir."
simple_tokens = TSTokenizer.ts_tokenize(Single_Line_Sample, output_format="tokenized")
if simple_tokens is None:
    pass
else:
    for token in simple_tokens:
        print(token)
```
Generated output is as follows:
```bash
Parça
ve
bütün
ilişkisi
her
zaman
işlevsel
değildir
```
**tagged** : Outputs tokens with associated tags. Please note that these are not POS tags;
see the TokenHandler section below for the tag set.
```python
tagged_tokens = TSTokenizer.ts_tokenize(Single_Line_Sample, output_format="tagged")
if tagged_tokens is None:
    pass
else:
    for token in tagged_tokens:
        print(token)
```
Generated output is as follows:
```bash
Parça Valid_Word
ve Valid_Word
bütün Valid_Word
ilişkisi Valid_Word
her Valid_Word
zaman Valid_Word
işlevsel Valid_Word
değildir Valid_Word
. Punc
```
**lines** : Maintains the structure of the input text, with each line's tokens grouped together.
Please note that a "line" is defined by end-of-line markers.
```python
from ts_tokenizer.tokenizer import TSTokenizer
Multi_Line_Sample = """
ATATÜRK'ün GENÇLÝÐE HÝTABESÝ
Ey Türk gençliði! Birinci vazifen, Türk istiklâlini, Türk Cumhuriyet'ini, ilelebet, muhafaza ve müdafaa etmektir.
Mevcudiyetinin ve istikbalinin yegâne temeli budur.
Bu temel, senin, en kýymetli hazinendir.
Ýstikbalde dahi, seni bu hazineden mahrum etmek isteyecek, dahilî ve haricî bedhahlarýn olacaktýr.
Bir gün, istiklâl ve cumhuriyeti müdafaa mecburiyetine düþersen, vazifeye atýlmak için, içinde bulunacaðýn vaziyetin imkân ve þeraitini düþünmeyeceksin!
Bu imkân ve þerait, çok nâmüsait bir mahiyette tezahür edebilir.
Ýstiklâl ve cumhuriyetine kastedecek düþmanlar, bütün dünyada emsali görülmemiþ bir galibiyetin mümessili olabilirler.
Cebren ve hile ile aziz vatanýn, bütün kaleleri zaptedilmiþ, bütün tersanelerine girilmiþ, bütün ordularý daðýtýlmýþ ve memleketin her köþesi bilfiil iþgal edilmiþ olabilir.
Bütün bu þeraitten daha elîm ve daha vahim olmak üzere, memleketin dahilinde, iktidara sahip olanlar gaflet ve dalâlet ve hattâ hýyanet içinde bulunabilirler.
Hatta bu iktidar sahipleri þahsî menfaatlerini, müstevlilerin siyasî emelleriyle tevhit edebilirler.
Millet, fakruzaruret içinde harap ve bîtap düþmüþ olabilir. Ey Türk istikbalinin evladý! Ýþte, bu ahval ve þerait içinde dahi, vazifen; Türk istiklâl ve cumhuriyetini kurtarmaktýr!
Muhtaç olduðun kudret, damarlarýndaki asil kanda, mevcuttur!
"""
line_tokens = TSTokenizer.ts_tokenize(Multi_Line_Sample, output_format="lines")
if line_tokens is None:
    pass
else:
    for token in line_tokens:
        print(token)
```
Generated output is as follows:
```text
ATATÜRK'ün GENÇLİĞE HİTABESİ
Ey Türk gençliği ! Birinci vazifen , Türk istiklâlini , Türk Cumhuriyet'ini , ilelebet , muhafaza ve müdafaa etmektir .
Mevcudiyetinin ve istikbalinin yegâne temeli budur .
Bu temel , senin , en kıymetli hazinendir .
İstikbalde dahi , seni bu hazineden mahrum etmek isteyecek , dahilî ve haricî bedhahların olacaktır .
Bir gün , istiklâl ve cumhuriyeti müdafaa mecburiyetine düşersen , vazifeye atılmak için , içinde bulunacağın vaziyetin imkân ve şeraitini düşünmeyeceksin !
Bu imkân ve şerait , çok nâmüsait bir mahiyette tezahür edebilir .
İstiklâl ve cumhuriyetine kastedecek düşmanlar , bütün dünyada emsali görülmemiş bir galibiyetin mümessili olabilirler .
Cebren ve hile ile aziz vatanın , bütün kaleleri zaptedilmiş , bütün tersanelerine girilmiş , bütün orduları dağıtılmış ve memleketin her köşesi bilfiil işgal edilmiş olabilir .
Bütün bu şeraitten daha elîm ve daha vahim olmak üzere , memleketin dahilinde , iktidara sahip olanlar gaflet ve dalâlet ve hattâ hıyanet içinde bulunabilirler .
Hatta bu iktidar sahipleri şahsî menfaatlerini , müstevlilerin siyasî emelleriyle tevhit edebilirler .
Millet , fakruzaruret içinde harap ve bîtap düşmüş olabilir . Ey Türk istikbalinin evladı ! İşte , bu ahval ve şerait içinde dahi , vazifen ; Türk istiklâl ve cumhuriyetini kurtarmaktır !
Muhtaç olduğun kudret , damarlarındaki asil kanda , mevcuttur !
```
**tagged_lines**: Same as lines but includes tags for each token.
```python
tagged_line_tokens = TSTokenizer.ts_tokenize(Multi_Line_Sample, output_format="tagged_lines")
if tagged_line_tokens is None:
    pass
else:
    for token in tagged_line_tokens:
        print(token)
```
Generated output is as follows:
```bash
[("ATATÜRK'ün", 'Apostrophed'), ('GENÇLİĞE', 'Valid_Word'), ('HİTABESİ', 'Valid_Word')]
[('Ey', 'Valid_Word'), ('Türk', 'Valid_Word'), ('gençliği', 'Valid_Word'), ('!', 'Punc'), ('Birinci', 'Valid_Word'), ('vazifen', 'Valid_Word'), (',', 'Punc'), ('Türk', 'Valid_Word'), ('istiklâlini', 'Valid_Word'), (',', 'Punc'), ('Türk', 'Valid_Word'), ("Cumhuriyet'ini", 'Apostrophed'), (',', 'Punc'), ('ilelebet', 'Valid_Word'), (',', 'Punc'), ('muhafaza', 'Valid_Word'), ('ve', 'Valid_Word'), ('müdafaa', 'Valid_Word'), ('etmektir', 'Valid_Word'), ('.', 'Punc')]
[('Mevcudiyetinin', 'Valid_Word'), ('ve', 'Valid_Word'), ('istikbalinin', 'Valid_Word'), ('yegâne', 'Valid_Word'), ('temeli', 'Valid_Word'), ('budur', 'Valid_Word'), ('.', 'Punc')]
[('Bu', 'Valid_Word'), ('temel', 'Valid_Word'), (',', 'Punc'), ('senin', 'Valid_Word'), (',', 'Punc'), ('en', 'Valid_Word'), ('kıymetli', 'Valid_Word'), ('hazinendir', 'Valid_Word'), ('.', 'Punc')]
[('İstikbalde', 'Valid_Word'), ('dahi', 'Valid_Word'), (',', 'Punc'), ('seni', 'Valid_Word'), ('bu', 'Valid_Word'), ('hazineden', 'Valid_Word'), ('mahrum', 'Valid_Word'), ('etmek', 'Valid_Word'), ('isteyecek', 'Valid_Word'), (',', 'Punc'), ('dahilî', 'Valid_Word'), ('ve', 'Valid_Word'), ('haricî', 'Valid_Word'), ('bedhahların', 'Valid_Word'), ('olacaktır', 'Valid_Word'), ('.', 'Punc')]
[('Bir', 'Valid_Word'), ('gün', 'Valid_Word'), (',', 'Punc'), ('istiklâl', 'Valid_Word'), ('ve', 'Valid_Word'), ('cumhuriyeti', 'Valid_Word'), ('müdafaa', 'Valid_Word'), ('mecburiyetine', 'Valid_Word'), ('düşersen', 'Valid_Word'), (',', 'Punc'), ('vazifeye', 'Valid_Word'), ('atılmak', 'Valid_Word'), ('için', 'Valid_Word'), (',', 'Punc'), ('içinde', 'Valid_Word'), ('bulunacağın', 'Valid_Word'), ('vaziyetin', 'Valid_Word'), ('imkân', 'Valid_Word'), ('ve', 'Valid_Word'), ('şeraitini', 'Valid_Word'), ('düşünmeyeceksin', 'Valid_Word'), ('!', 'Punc')]
[('Bu', 'Valid_Word'), ('imkân', 'Valid_Word'), ('ve', 'Valid_Word'), ('şerait', 'Valid_Word'), (',', 'Punc'), ('çok', 'Valid_Word'), ('nâmüsait', 'Valid_Word'), ('bir', 'Valid_Word'), ('mahiyette', 'Valid_Word'), ('tezahür', 'Valid_Word'), ('edebilir', 'Valid_Word'), ('.', 'Punc')]
[('İstiklâl', 'Valid_Word'), ('ve', 'Valid_Word'), ('cumhuriyetine', 'Valid_Word'), ('kastedecek', 'Valid_Word'), ('düşmanlar', 'Valid_Word'), (',', 'Punc'), ('bütün', 'Valid_Word'), ('dünyada', 'Valid_Word'), ('emsali', 'Valid_Word'), ('görülmemiş', 'Valid_Word'), ('bir', 'Valid_Word'), ('galibiyetin', 'Valid_Word'), ('mümessili', 'Valid_Word'), ('olabilirler', 'Valid_Word'), ('.', 'Punc')]
[('Cebren', 'Valid_Word'), ('ve', 'Valid_Word'), ('hile', 'Valid_Word'), ('ile', 'Valid_Word'), ('aziz', 'Valid_Word'), ('vatanın', 'Valid_Word'), (',', 'Punc'), ('bütün', 'Valid_Word'), ('kaleleri', 'Valid_Word'), ('zaptedilmiş', 'OOV'), (',', 'Punc'), ('bütün', 'Valid_Word'), ('tersanelerine', 'Valid_Word'), ('girilmiş', 'Valid_Word'), (',', 'Punc'), ('bütün', 'Valid_Word'), ('orduları', 'Valid_Word'), ('dağıtılmış', 'Valid_Word'), ('ve', 'Valid_Word'), ('memleketin', 'Valid_Word'), ('her', 'Valid_Word'), ('köşesi', 'Valid_Word'), ('bilfiil', 'Valid_Word'), ('işgal', 'Valid_Word'), ('edilmiş', 'Valid_Word'), ('olabilir', 'Valid_Word'), ('.', 'Punc')]
[('Bütün', 'Valid_Word'), ('bu', 'Valid_Word'), ('şeraitten', 'Valid_Word'), ('daha', 'Valid_Word'), ('elîm', 'Valid_Word'), ('ve', 'Valid_Word'), ('daha', 'Valid_Word'), ('vahim', 'Valid_Word'), ('olmak', 'Valid_Word'), ('üzere', 'Valid_Word'), (',', 'Punc'), ('memleketin', 'Valid_Word'), ('dahilinde', 'Valid_Word'), (',', 'Punc'), ('iktidara', 'Valid_Word'), ('sahip', 'Valid_Word'), ('olanlar', 'Valid_Word'), ('gaflet', 'Valid_Word'), ('ve', 'Valid_Word'), ('dalâlet', 'Valid_Word'), ('ve', 'Valid_Word'), ('hattâ', 'Valid_Word'), ('hıyanet', 'Valid_Word'), ('içinde', 'Valid_Word'), ('bulunabilirler', 'Valid_Word'), ('.', 'Punc')]
[('Hatta', 'Valid_Word'), ('bu', 'Valid_Word'), ('iktidar', 'Valid_Word'), ('sahipleri', 'Valid_Word'), ('şahsî', 'Valid_Word'), ('menfaatlerini', 'Valid_Word'), (',', 'Punc'), ('müstevlilerin', 'Valid_Word'), ('siyasî', 'Valid_Word'), ('emelleriyle', 'Valid_Word'), ('tevhit', 'Valid_Word'), ('edebilirler', 'Valid_Word'), ('.', 'Punc')]
[('Millet', 'Valid_Word'), (',', 'Punc'), ('fakruzaruret', 'Valid_Word'), ('içinde', 'Valid_Word'), ('harap', 'Valid_Word'), ('ve', 'Valid_Word'), ('bîtap', 'Valid_Word'), ('düşmüş', 'Valid_Word'), ('olabilir', 'Valid_Word'), ('.', 'Punc'), ('Ey', 'Valid_Word'), ('Türk', 'Valid_Word'), ('istikbalinin', 'Valid_Word'), ('evladı', 'Valid_Word'), ('!', 'Punc'), ('İşte', 'Valid_Word'), (',', 'Punc'), ('bu', 'Valid_Word'), ('ahval', 'Valid_Word'), ('ve', 'Valid_Word'), ('şerait', 'Valid_Word'), ('içinde', 'Valid_Word'), ('dahi', 'Valid_Word'), (',', 'Punc'), ('vazifen', 'Valid_Word'), (';', 'Punc'), ('Türk', 'Valid_Word'), ('istiklâl', 'Valid_Word'), ('ve', 'Valid_Word'), ('cumhuriyetini', 'Valid_Word'), ('kurtarmaktır', 'Valid_Word'), ('!', 'Punc')]
[('Muhtaç', 'Valid_Word'), ('olduğun', 'Valid_Word'), ('kudret', 'Valid_Word'), (',', 'Punc'), ('damarlarındaki', 'Valid_Word'), ('asil', 'Valid_Word'), ('kanda', 'Valid_Word'), (',', 'Punc'), ('mevcuttur', 'Valid_Word'), ('!', 'Punc')]
```
## CharFix
CharFix offers methods to correct corrupted Turkish text:
### Fix Characters
```python
from ts_tokenizer.char_fix import CharFix
line = "Parça ve bütün iliåÿkisi her zaman iåÿlevsel deðildir."
print(CharFix.fix(line)) # Fixes corrupted characters
```
```bash
Parça ve bütün ilişkisi her zaman işlevsel değildir.
```
### Lowercase
```python
from ts_tokenizer.char_fix import CharFix
line = "İstanbul ve Iğdır ''arası'' 1528 km'dir."
print(CharFix.tr_lowercase(line))
```
```bash
istanbul ve ığdır ''arası'' 1528 km'dir.
```
### Fix Quotes
```python
from ts_tokenizer.char_fix import CharFix
line = "İstanbul ve Iğdır ''arası'' 1528 km'dir."
print(CharFix.fix_quote(line))
```
```bash
İstanbul ve Iğdır "arası" 1528 km'dir.
```
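The CharFix methods can also be chained; a minimal sketch reusing the corrupted sample from above:
```python
from ts_tokenizer.char_fix import CharFix

# Fix corrupted characters first, then apply Turkish-aware lowercasing.
line = "Parça ve bütün iliåÿkisi her zaman iåÿlevsel deðildir."
print(CharFix.tr_lowercase(CharFix.fix(line)))
```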
---
## TokenHandler
TokenHandler takes each given string and processes it using the methods defined under the TokenPreProcess class.
This process follows a strictly defined order and is recursive.
Each method can also be called individually:
```python
from ts_tokenizer.token_handler import TokenPreProcess
```
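For example, an individual check can be called on a single token. The method names below come from the tag table that follows, but the exact call style and return value are assumptions; check the package source for the actual signatures:
```python
from ts_tokenizer.token_handler import TokenPreProcess

# Hypothetical usage: assume each is_* check takes a token string and
# returns a truthy result (e.g. a (token, tag) pair) when it matches.
print(TokenPreProcess.is_mention("@ts-tokenizer"))
print(TokenPreProcess.is_date("22.02.2016"))
```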
### Below is the list of tags generated by the tokenizer to process tokens:
| # | Function | Sample | Used As Output | Tag |
|----|----------------------------|------------------------------|----------------|--------------------|
| 01 | is_mention | @ts-tokenizer | Yes | Mention |
| 02 | is_hashtag | #ts-tokenizer | Yes | Hashtag |
| 03 | is_in_quotes | "ts-tokenizer" | No | ----- |
| 04 | is_numbered_title | (1) | Yes | Numbered_Title |
| 05 | is_in_paranthesis | (bilgisayar) | No | ----- |
| 06 | is_date_range | 01.01.2024-01.01.2025 | Yes | Date_Range |
| 07 | is_complex_punc | -yeniden,sonradan.. | No | ----- |
| 08 | is_date | 22.02.2016 | Yes | Date |
| 09 | is_hour | 14.05 | Yes | Hour |
| 10 | is_percentage_numbers | %75 | Yes | Percentage_Numbers |
| 11 | is_percentage_numbers_chars | %75'lik | Yes | Percentage_Numbers |
| 12 | is_roman_number | XI | Yes | Roman_Number |
| 13 | is_bullet_list | •Giriş | Yes | Bullet_List |
| 14 | is_email | tanersezerr@gmail.com | Yes | Email |
| 15 | is_email_punc | tanersezerr@gmail.com. | No | ----- |
| 16 | is_full_url | https://tscorpus.com | Yes | Full_URL |
| 17 | is_web_url | www.tscorpus.com | Yes | Web_URL |
| -- | is_full_url | www.example.com'un | Yes | URL_Suffix |
| 18 | is_copyright | ©tscorpus | Yes | Copyright |
| 19 | is_registered | tscorpus® | Yes | Registered |
| 20 | is_trademark | tscorpus™ | Yes | Trademark |
| 21 | is_currency | 100$ | Yes | Currency |
| 22 | is_num_char_sequence | 380A | No | ----- |
| 23 | is_abbr | TBMM | Yes | Abbr |
| 24 | is_in_lexicon | bilgisayar | Yes | Valid_Word |
| 25 | is_in_exceptions | e-mail | Yes | Exception |
| 26 | is_in_eng_words | computer | Yes | English_Word |
| 27 | is_smiley | :) | Yes | Smiley |
| 28 | is_multiple_smiley | :):) | No | ----- |
| 29 | is_emoticon | 🍻 | Yes | Emoticon |
| 30 | is_multiple_emoticon | 🍻🍻 | No | ----- |
| 31 | is_multiple_smiley_in | hey:):) | No | ----- |
| 32 | is_number | 175.01 | Yes | Number |
| 33 | is_apostrophed | Türkiye'nin | Yes | Apostrophed |
| 34 | is_single_punc | ! | Yes | Punc |
| 35 | is_multi_punc | !! | No | ----- |
| 36 | is_single_hyphenated | sabah-akşam | Yes | Single_Hyphenated |
| 37 | is_multi_hyphenated        | çay-su-kahve                 | Yes            | Multi_Hyphenated   |
| 38 | is_single_underscored | Gel_Git | Yes | Single_Underscored |
| 39 | is_multi_underscored | Yarı_Yapılandırılmış_Mülakat | Yes | Multi_Underscored |
| 40 | is_one_char_fixable | bilgisa¬yar | Yes | One_Char_Fixed |
| 42 | is_three_or_more | heyyyyy | No | ----- |
| 43 | is_fsp | bilgisayar. | No | ----- |
| 44 | is_isp | .bilgisayar | No | ----- |
| 45 | is_fmp | bilgisayar.. | No | ----- |
| 46 | is_imp | ..bilgisayar | No | ----- |
| 47 | is_msp | --bilgisayar-- | No | ----- |
| 48 | is_mssp | -bilgisayar- | No | ----- |
| 49 | is_midsp | okul,öğrenci | No | ----- |
| 50 | is_midmp | okul,öğrenci, öğretmen | No | ----- |
| 51 | is_non_latin | 한국드 | No | Non_Latin |
---
## Performance
ts-tokenizer is optimized for efficient tokenization and takes advantage of multi-core processing for large-scale text. By default, the script utilizes all available CPU cores minus one, ensuring your system remains responsive while processing large datasets.
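For example, a large corpus can be processed with an explicit worker count (the file names below are just placeholders):
```bash
ts-tokenizer -n 8 -o tagged large_corpus.txt > large_corpus_tagged.txt
```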
### Performance Benchmarks:
The following benchmarks were conducted on two machines with the specifications below:
| **Processor** | **Cores** | **RAM** | **1 Million Tokens (Multi-Core)** | **Throughput (Multi-Core)** | **1 Million Tokens (Single-Core)** | **Throughput (Single-Core)** |
|--------------------------------------------------------------------------|---------------------------|--------------|-----------------------------------|-----------------------------|------------------------------------|------------------------------|
| AMD Ryzen 7 5800H with Radeon Graphics (Laptop) <br/>3.2 GHz / 4.4 GHz | 8 physical cores (16 threads) | 16 GB DDR4 | ~170 seconds | ~5,800 tokens/second | ~715 seconds | ~1,400 tokens/second |
| AMD Ryzen 9 7950X3D with Radeon Graphics (Desktop)<br/>4.2 GHz / 5.7 GHz | 16 physical cores (32 threads) | 96 GB DDR5 | ~14 seconds | ~71,500 tokens/second | ~110 seconds | ~9,090 tokens/second |