# TS Tokenizer
**TS Tokenizer** is a hybrid tokenizer designed specifically for tokenizing Turkish texts.
It uses a hybrid (lexicon-based and rule-based) approach to split text into tokens.
### Key Features:
- **Hybrid Tokenization**: Combines lexicon-based and rule-based techniques to tokenize complex Turkish texts with precision.
- **Special Token Handling**: Detects and processes mentions, hashtags, emails, URLs, dates, numbers, smileys, emoticons, and more.
- **Configurable Outputs**: Offers multiple output formats, including plain tokens, tagged tokens, tokenized lines, and tagged lines, to suit diverse NLP workflows.
- **Multi-core Processing**: Speeds up tokenization for large files with parallel processing.
- **Preprocess Handling**: Handles corrupted Turkish text and punctuation gracefully using built-in fixes.
- **Command-Line Friendly**: Use it directly from the terminal for file-based or piped input workflows.
For natural language processing (NLP), information retrieval, or text mining work on Turkish, **TS Tokenizer** offers
a reliable tokenization solution.
---
## Installation
You can install the `ts-tokenizer` package using pip:
```bash
pip install ts-tokenizer
```
Ensure you have Python 3.9 or higher installed on your system.
You can update to the latest version using pip:
```bash
pip install --upgrade ts-tokenizer
```
---
To remove the package, use:
```bash
pip uninstall ts-tokenizer
```
---
## License
This project is licensed under the MIT License. See the [LICENSE](https://github.com/tanerim/ts_tokenizer/blob/main/LICENSE) file for details.
## Command line tool
You can use TS Tokenizer directly from the command line for both file inputs and pipeline processing:
### Tokenize from a File:
```bash
ts-tokenizer input.txt
```
or
```bash
cat input.txt | ts-tokenizer
```
or
```bash
zcat input.txt.gz | ts-tokenizer
```
---
## Help
Get detailed help for the available options with `ts-tokenizer --help` (or `-h`). The options are summarized below:
| Argument | Short | Description | Default |
|----------------------|-------|-------------------------------------------------------------------------------------------------|---------------|
| `--output` | `-o` | Specify the output format: `tokenized`, `lines`, `tagged`, `tagged_lines`. | `tokenized` |
| `--num-workers` | `-n` | Set the number of parallel workers for processing. | `CPU cores-1` |
| `--verbose` | `-v` | Enable verbose mode to display additional processing details. | Disabled |
| `--version` | `-V` | Display the current version of `ts-tokenizer`. | N/A |
| `--help`              | `-h`  | Show the help message and exit.                                                                   | N/A           |
---
## CLI Arguments
You can specify the output format using the `-o` option:
- **tokenized (default):** Returns plain tokens, one per line.
- **tagged:** Returns tokens with their tags.
- **lines:** Returns tokenized lines as lists.
- **tagged_lines:** Returns tokenized lines as a list of tuples (token, tag).
```bash
$ cat input.txt
Queen , 31.10.1975 tarihinde çıkardıðı A Night at the Opera albümüyle dünya müziðini deðiåÿtirdi .

$ ts-tokenizer input.txt
Queen
,
31.10.1975
tarihinde
çıkardığı
A
Night
at
the
Opera
albümüyle
dünya
müziğini
değiştirdi
.
```
Note that the **tags are not part-of-speech tags**; they classify the form of the given string.
```bash
$ ts-tokenizer -o tagged input.txt
Queen English_Word
, Punc
31.10.1975 Date
tarihinde Valid_Word
çıkardığı Valid_Word
A OOV
Night English_Word
at Valid_Word
the English_Word
Opera Valid_Word
albümüyle Valid_Word
dünya Valid_Word
müziğini Valid_Word
değiştirdi Valid_Word
. Punc
```
The other two output formats are `lines` and `tagged_lines`.
The `lines` option reads the input file line by line and returns a list of tokens for each line. Note that each line is defined by the end-of-line markers in the given text.
```bash
$ ts-tokenizer -o lines input.txt
['Queen', ',', '31.10.1975', 'tarihinde', 'çıkardığı', 'A', 'Night', 'at', 'the', 'Opera', 'albümüyle', 'dünya', 'müziğini', 'değiştirdi', '.']
```
The "tagged_lines" parameter reads input file line-by-line and returns a list of tuples for each line. Note that each line is defined by end-of-line markers in the given text.
```bash
$ ts-tokenizer -o tagged_lines input.txt
[('Queen', 'English_Word'), (',', 'Punc'), ('31.10.1975', 'Date'), ('tarihinde', 'Valid_Word'), ('çıkardığı', 'Valid_Word'), ('A', 'OOV'), ('Night', 'English_Word'), ('at', 'Valid_Word'), ('the', 'English_Word'), ('Opera', 'Valid_Word'), ('albümüyle', 'Valid_Word'), ('dünya', 'Valid_Word'), ('müziğini', 'Valid_Word'), ('değiştirdi', 'Valid_Word'), ('.', 'Punc')]
```
---
## Parallel Processing
Use the `-n` option to set the number of parallel workers:
```bash
$ ts-tokenizer -n 2 -o tagged input_file
```
By default, TS Tokenizer uses [number of CPU cores - 1] workers, as sketched below.
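The worker count can also be computed in Python before passing it explicitly with `-n`; a minimal sketch, assuming the documented default corresponds to `os.cpu_count() - 1`:
```python
import os

# Interpretation of the documented default: CPU cores minus one
# (fall back to 1 if the core count cannot be determined).
workers = max(1, (os.cpu_count() or 2) - 1)
print(workers)  # pass explicitly with: ts-tokenizer -n <workers> input.txt
```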
---
## Using CLI Arguments with pipelines
You can use TS Tokenizer in bash pipelines, for example to count word frequencies.
The following command returns token frequencies for the given file:
```bash
$ ts-tokenizer input.txt | sort | uniq -c | sort -n
```
---
To count tags:
```bash
$ ts-tokenizer -o tagged input.txt | cut -f2 | sort | uniq -c
1 Date
3 English_Word
2 Hashtag
2 Mention
1 Multi_Hyphenated
3 Numbered_Title
1 OOV
9 Punc
1 Single_Hyphenated
17 Valid_Word
```
To find tokens with a specific tag, the following command could be used:
```bash
$ ts-tokenizer -o tagged input.txt | cut -f1,2 | grep "Web_URL"
www.wikipedia.org Web_URL
www.wim-wenders.com Web_URL
www.winterwar.com. Web_URL
www.wissenschaft.de: Web_URL
www.wittingen.de Web_URL
www.wlmqradio.com Web_URL
www.worldstadiums.com Web_URL
www.worldstatesmen.org Web_URL
```
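The same pipelines can be reproduced from Python by calling the CLI and post-processing its output; a minimal sketch using only the standard library (it assumes `ts-tokenizer` is on `PATH`, that `input.txt` exists, and that tagged output is tab-separated as in the `cut` examples above):
```python
import subprocess
from collections import Counter

# Run the CLI in tagged mode; each output line is "token<TAB>tag".
result = subprocess.run(
    ["ts-tokenizer", "-o", "tagged", "input.txt"],
    capture_output=True, text=True, check=True,
)

# Count how often each tag occurs, mirroring: cut -f2 | sort | uniq -c
tag_counts = Counter(
    line.split("\t")[1] for line in result.stdout.splitlines() if "\t" in line
)
for tag, count in tag_counts.most_common():
    print(count, tag)
```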
---
# Classes
## CharFix
CharFix offers methods to correct corrupted Turkish text:
### Fix Characters
```python
from ts_tokenizer.char_fix import CharFix
line = "Parça ve bütün iliåÿkisi her zaman iåÿlevsel deðildir."
print(CharFix.fix(line)) # Fixes corrupted characters
```
```bash
Parça ve bütün ilişkisi her zaman işlevsel değildir.
```
### Lowercase
```python
from ts_tokenizer.char_fix import CharFix
line = "İstanbul ve Iğdır ''arası'' 1528 km'dir."
print(CharFix.tr_lowercase(line))
```
```bash
istanbul ve ığdır ''arası'' 1528 km'dir.
```
### Fix Quotes
```python
from ts_tokenizer.char_fix import CharFix
line = "İstanbul ve Iğdır ''arası'' 1528 km'dir."
print(CharFix.fix_quote(line))
```
```bash
İstanbul ve Iğdır "arası" 1528 km'dir.
```
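The three helpers can be combined into a simple cleanup pass; a minimal sketch (the order shown here is an assumption, not a documented pipeline):
```python
from ts_tokenizer.char_fix import CharFix

def clean(line: str) -> str:
    # Repair corrupted characters, normalize doubled quotes, then lowercase
    # using Turkish casing rules.
    line = CharFix.fix(line)
    line = CharFix.fix_quote(line)
    return CharFix.tr_lowercase(line)

print(clean("Parça ve bütün iliåÿkisi her zaman iåÿlevsel deðildir."))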
---
## TokenHandler
TokenHandler takes each input string and processes it with the methods defined in the TokenPreProcess class.
This process follows a strictly defined order and is applied recursively.
Each method can also be called individually:
```python
from ts_tokenizer.token_handler import TokenPreProcess
```
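A minimal sketch of calling a few checks directly; the exact call signatures and return values are not documented above, so the single-string-argument form used here is an assumption:
```python
from ts_tokenizer.token_handler import TokenPreProcess

# Assumed usage: each is_* check receives a single token string.
# The return type (tag, boolean, or tuple) is not specified above.
print(TokenPreProcess.is_mention("@ts-tokenizer"))
print(TokenPreProcess.is_date("22.02.2016"))
print(TokenPreProcess.is_in_lexicon("bilgisayar"))
```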
### Below is the list of checks and tags the tokenizer uses to process tokens:
| # | Function | Sample | Used As Output | Tag |
|----|----------------------------|------------------------------|----------------|--------------------|
| 01 | is_mention | @ts-tokenizer | Yes | Mention |
| 02 | is_hashtag | #ts-tokenizer | Yes | Hashtag |
| 03 | is_in_quotes | "ts-tokenizer" | No | ----- |
| 04 | is_numbered_title | (1) | Yes | Numbered_Title |
| 05 | is_in_paranthesis | (bilgisayar) | No | ----- |
| 06 | is_date_range | 01.01.2024-01.01.2025 | Yes | Date_Range |
| 07 | is_complex_punc | -yeniden,sonradan.. | No | ----- |
| 08 | is_date | 22.02.2016 | Yes | Date |
| 09 | is_hour | 14.05 | Yes | Hour |
| 10 | is_percentage_numbers | %75 | Yes | Percentage_Numbers |
| 11 | is_percentage_numbers_chars | %75'lik | Yes | Percentage_Numbers |
| 12 | is_roman_number | XI | Yes | Roman_Number |
| 13 | is_bullet_list | •Giriş | Yes | Bullet_List |
| 14 | is_email | tanersezerr@gmail.com | Yes | Email |
| 15 | is_email_punc | tanersezerr@gmail.com. | No | ----- |
| 16 | is_full_url | https://tscorpus.com | Yes | Full_URL |
| 17 | is_web_url | www.tscorpus.com | Yes | Web_URL |
| -- | is_full_url | www.example.com'un | Yes | URL_Suffix |
| 18 | is_copyright | ©tscorpus | Yes | Copyright |
| 19 | is_registered | tscorpus® | Yes | Registered |
| 20 | is_trademark | tscorpus™ | Yes | Trademark |
| 21 | is_currency | 100$ | Yes | Currency |
| 22 | is_num_char_sequence | 380A | No | ----- |
| 23 | is_abbr | TBMM | Yes | Abbr |
| 24 | is_in_lexicon | bilgisayar | Yes | Valid_Word |
| 25 | is_in_exceptions | e-mail | Yes | Exception |
| 26 | is_in_eng_words | computer | Yes | English_Word |
| 27 | is_smiley | :) | Yes | Smiley |
| 28 | is_multiple_smiley | :):) | No | ----- |
| 29 | is_emoticon | 🍻 | Yes | Emoticon |
| 30 | is_multiple_emoticon | 🍻🍻 | No | ----- |
| 31 | is_multiple_smiley_in | hey:):) | No | ----- |
| 32 | is_number | 175.01 | Yes | Number |
| 33 | is_apostrophed | Türkiye'nin | Yes | Apostrophed |
| 34 | is_single_punc | ! | Yes | Punc |
| 35 | is_multi_punc | !! | No | ----- |
| 36 | is_single_hyphenated | sabah-akşam | Yes | Single_Hyphenated |
| 37 | is_multi_hyphenated | çay-su-kahve | Yes | Multi-Hyphenated |
| 38 | is_single_underscored | Gel_Git | Yes | Single_Underscored |
| 39 | is_multi_underscored | Yarı_Yapılandırılmış_Mülakat | Yes | Multi_Underscored |
| 40 | is_one_char_fixable | bilgisa¬yar | Yes | One_Char_Fixed |
| 42 | is_three_or_more | heyyyyy | No | ----- |
| 43 | is_fsp | bilgisayar. | No | ----- |
| 44 | is_isp | .bilgisayar | No | ----- |
| 45 | is_fmp | bilgisayar.. | No | ----- |
| 46 | is_imp | ..bilgisayar | No | ----- |
| 47 | is_msp | --bilgisayar-- | No | ----- |
| 48 | is_mssp | -bilgisayar- | No | ----- |
| 49 | is_midsp | okul,öğrenci | No | ----- |
| 50 | is_midmp | okul,öğrenci, öğretmen | No | ----- |
| 51 | is_non_latin | 한국드 | No | Non_Latin |
---
## Performance
ts-tokenizer is optimized for efficient tokenization and takes advantage of multi-core processing for large-scale text. By default, the script utilizes all available CPU cores minus one, ensuring your system remains responsive while processing large datasets.
### Performance Benchmarks:
The following benchmarks were conducted on a machine with these specifications:
- Processor: AMD Ryzen 7 5800H with Radeon Graphics
- Cores: 8 physical cores (16 threads)
- RAM: 16 GB DDR4
#### Multi-Core Performance:
- 1 million tokens: processed in approximately 170 seconds using multi-core processing.
- Throughput: ~5,800 tokens/second on average.
#### Single-Core Performance:
- 1 million tokens: processed in approximately 715 seconds on a single core.
- Throughput: ~1,400 tokens/second.
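The throughput figures follow directly from token count divided by wall-clock seconds; a minimal sketch of the arithmetic, using the numbers reported above:
```python
tokens = 1_000_000

# Approximate wall-clock times reported above.
multi_core_s = 170
single_core_s = 715

print(f"multi-core:  ~{tokens / multi_core_s:,.0f} tokens/s")   # ~5,882
print(f"single-core: ~{tokens / single_core_s:,.0f} tokens/s")  # ~1,399
```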