ts-tokenizer

- Name: ts-tokenizer
- Version: 0.1.17
- Home page: https://github.com/tanerim/ts_tokenizer
- Summary: TS Tokenizer is a hybrid (lexicon-based and rule-based) tokenizer designed specifically for tokenizing Turkish texts.
- Upload time: 2024-12-03 09:31:39
- Author: Taner Sezer
- Requires Python: >=3.9
- License: MIT
- Keywords: tokenizer, turkish, nlp, text-processing, language-processing
- Requirements: click, tqdm, setuptools
# TS Tokenizer

**TS Tokenizer** is a hybrid tokenizer designed specifically for tokenizing Turkish texts.
It uses a hybrid (lexicon-based and rule-based) approach to split text into tokens.


### Key Features:
- **Hybrid Tokenization**: Combines lexicon-based and rule-based techniques to tokenize complex Turkish texts with precision.
- **Special Token Handling**: Detects and processes mentions, hashtags, emails, URLs, dates, numbers, smileys, emoticons, and more.
- **Configurable Outputs**: Offers multiple output formats, including plain tokens, tagged tokens, tokenized lines, and tagged lines, to suit diverse NLP workflows.
- **Multi-core Processing**: Speeds up tokenization for large files with parallel processing.
- **Preprocess Handling**: Handles corrupted Turkish text and punctuation gracefully using built-in fixes.
- **Command-Line Friendly**: Use it directly from the terminal for file-based or piped input workflows.


For natural language processing (NLP), information retrieval, or text mining on Turkish text, **TS Tokenizer** offers
a reliable tokenization solution.

---

## Installation

You can install the `ts-tokenizer` package using pip:
```bash
pip install ts-tokenizer
```
Ensure you have Python 3.9 or higher installed on your system.
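To confirm which version is installed, you can query the package metadata from Python. Below is a minimal sketch using only the standard library; it assumes the pip install above has already been run:
```python
# Check the installed ts-tokenizer version via the standard library.
from importlib.metadata import version

print(version("ts-tokenizer"))  # e.g. 0.1.17
```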

You can update to the latest version using pip:
```bash
pip install --upgrade ts-tokenizer
```
---

To remove the package, use:
```bash
pip uninstall ts-tokenizer
```
---

## License
This project is licensed under the MIT License. See the [LICENSE](https://github.com/tanerim/ts_tokenizer/blob/main/LICENSE) file for details.


## Command line tool

You can use TS Tokenizer directly from the command line for both file inputs and pipeline processing:
### Tokenize from a File
```bash
ts-tokenizer input.txt
```
or
```bash
cat input.txt | ts-tokenizer
```
or
```bash
zcat input.txt.gz | ts-tokenizer
```
---

## Help

The available options are listed below; run `ts-tokenizer --help` for the same information from the terminal:

| Argument             | Short | Description                                                                                     | Default       |
|----------------------|-------|-------------------------------------------------------------------------------------------------|---------------|
| `--output`           | `-o`  | Specify the output format: `tokenized`, `lines`, `tagged`, `tagged_lines`.                      | `tokenized`   |
| `--num-workers`      | `-n`  | Set the number of parallel workers for processing.                                              | `CPU cores-1` |
| `--verbose`          | `-v`  | Enable verbose mode to display additional processing details.                                   | Disabled      |
| `--version`          | `-V`  | Display the current version of `ts-tokenizer`.                                                 | N/A           |
| `--help`             | `-h`  | Show the help message and exit.                                                                 | N/A           |

---

## CLI Arguments

You can specify the output format using the `-o` option:

- **tokenized (default):** Returns plain tokens, one per line.
- **tagged:** Returns tokens with their tags.
- **lines:** Returns tokenized lines as lists.
- **tagged_lines:** Returns tokenized lines as a list of tuples (token, tag).

```bash
$ cat input.txt
Queen , 31.10.1975 tarihinde çıkardıðı A Night at the Opera albümüyle dünya müziðini deðiåÿtirdi .

$ ts-tokenizer input.txt

Queen
,
31.10.1975
tarihinde
çıkardığı
A
Night
at
the
Opera
albümüyle
dünya
müziğini
değiştirdi
.
```

Note that the **tags are not part-of-speech tags**; they describe the form of the given string.
```bash
$ ts-tokenizer -o tagged input.txt

Queen	English_Word
,	Punc
31.10.1975	Date
tarihinde	Valid_Word
çıkardığı	Valid_Word
A	OOV
Night	English_Word
at	Valid_Word
the	English_Word
Opera	Valid_Word
albümüyle	Valid_Word
dünya	Valid_Word
müziğini	Valid_Word
değiştirdi	Valid_Word
.	Punc
```

The other two output options are `lines` and `tagged_lines`.
The `lines` option reads the input file line by line and returns a list of tokens for each line. Note that lines are delimited by the end-of-line markers in the input text.
```bash
$ ts-tokenizer -o lines input.txt

['Queen', ',', '31.10.1975', 'tarihinde', 'çıkardığı', 'A', 'Night', 'at', 'the', 'Opera', 'albümüyle', 'dünya', 'müziğini', 'değiştirdi', '.']
```

The "tagged_lines" parameter reads input file line-by-line and returns a list of tuples for each line. Note that each line is defined by end-of-line markers in the given text.
```bash
$ ts-tokenizer -o tagged_lines input.txt

 [('Queen', 'English_Word'), (',', 'Punc'), ('31.10.1975', 'Date'), ('tarihinde', 'Valid_Word'), ('çıkardığı', 'Valid_Word'), ('A', 'OOV'), ('Night', 'English_Word'), ('at', 'Valid_Word'), ('the', 'English_Word'), ('Opera', 'Valid_Word'), ('albümüyle', 'Valid_Word'), ('dünya', 'Valid_Word'), ('müziğini', 'Valid_Word'), ('değiştirdi', 'Valid_Word'), ('.', 'Punc')]
```
---

## Parallel Processing
Use the `-n` option to set the number of parallel workers:
```bash
$ ts-tokenizer -n 2 -o tagged input_file
```

By default, TS Tokenizer uses (number of CPU cores - 1) workers.

---

## Using CLI Arguments with pipelines

You can use TS Tokenizer in bash pipelines, such as counting word frequencies:

The following pipeline returns token frequencies for the given file:
```bash
$ ts-tokenizer input.txt | sort | uniq -c | sort -n
```

---

To count tags:
```bash
$ ts-tokenizer -o tagged input.txt | cut -f2 | sort | uniq -c

1 Date
3 English_Word
2 Hashtag
2 Mention
1 Multi_Hyphenated
3 Numbered_Title
1 OOV
9 Punc
1 Single_Hyphenated
17 Valid_Word
```

To filter for a specific tag, the following command can be used:
```bash
$ ts-tokenizer -o tagged input.txt | cut -f1,2 | grep "Web_URL"

www.wikipedia.org	Web_URL
www.wim-wenders.com	Web_URL
www.winterwar.com.	Web_URL
www.wissenschaft.de:	Web_URL
www.wittingen.de	Web_URL
www.wlmqradio.com	Web_URL
www.worldstadiums.com	Web_URL
www.worldstatesmen.org	Web_URL
```
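
The tagged output can also be consumed directly from Python rather than a shell pipeline. The sketch below re-implements the tag count above with the standard library only; it assumes the `ts-tokenizer` executable is available on your `PATH`:
```python
# Count tags by running the CLI and parsing its tab-separated output.
# Assumes the ts-tokenizer executable is available on PATH.
import subprocess
from collections import Counter

result = subprocess.run(
    ["ts-tokenizer", "-o", "tagged", "input.txt"],
    capture_output=True, text=True, check=True,
)

tag_counts = Counter(
    line.split("\t")[1]
    for line in result.stdout.splitlines()
    if "\t" in line
)

for tag, count in tag_counts.most_common():
    print(count, tag)
```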
---

# Classes

## CharFix

CharFix offers methods to correct corrupted Turkish text:

### Fix Characters

```python
from ts_tokenizer.char_fix import CharFix

line = "Parça ve bütün iliåÿkisi her zaman iåÿlevsel deðildir."
print(CharFix.fix(line))  # Fixes corrupted characters
```
```bash
Parça ve bütün ilişkisi her zaman işlevsel değildir.
```
### Lowercase

```python
from ts_tokenizer.char_fix import CharFix

line = "İstanbul ve Iğdır ''arası'' 1528 km'dir."
print(CharFix.tr_lowercase(line))
```
```bash
istanbul ve ığdır ''arası'' 1528 km'dir.
```
### Fix Quotes

```python
from ts_tokenizer.char_fix import CharFix

line = "İstanbul ve Iğdır ''arası'' 1528 km'dir."
print(CharFix.fix_quote(line))
```
```bash
İstanbul ve Iğdır "arası" 1528 km'dir.
```
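
The three helpers can also be chained. Below is a minimal sketch combining the calls shown above on a single corrupted line:
```python
from ts_tokenizer.char_fix import CharFix

line = "Parça ve bütün iliåÿkisi her zaman iåÿlevsel deðildir."

# Apply the documented CharFix helpers in sequence.
fixed = CharFix.fix(line)              # repair corrupted characters
fixed = CharFix.fix_quote(fixed)       # normalize quotation marks
lowered = CharFix.tr_lowercase(fixed)  # Turkish-aware lowercasing

print(lowered)
```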

---
## TokenHandler

TokenHandler processes each input string with the methods defined in the TokenPreProcess class.
The checks are applied in a strictly defined order, and the process is recursive.
Each method can also be called individually:
```python
from ts_tokenizer.token_handler import TokenPreProcess

```
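
A hypothetical calling sketch follows. The exact signatures and return values of the `TokenPreProcess` checks are not documented here, so the assumption that a check such as `is_email` can be invoked directly on a single candidate token (and is falsy when the token does not match) should be verified against the source:
```python
from ts_tokenizer.token_handler import TokenPreProcess

# Hypothetical usage: assumes each check listed in the table below can be
# called with one candidate token and returns a falsy value when the token
# does not match. Verify the real signatures in the ts_tokenizer source.
candidate = "tanersezerr@gmail.com"
if TokenPreProcess.is_email(candidate):
    print("matched: Email")
```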


### Below is the list of checks applied by the tokenizer and the tags they generate:


| #  | Function                   | Sample                       | Used As Output | Tag                |
|----|----------------------------|------------------------------|----------------|--------------------|
| 01 | is_mention                 | @ts-tokenizer                | Yes            | Mention            |
| 02 | is_hashtag                 | #ts-tokenizer                | Yes            | Hashtag            |
| 03 | is_in_quotes               | "ts-tokenizer"               | No             | -----              |
| 04 | is_numbered_title          | (1)                          | Yes            | Numbered_Title     |
| 05 | is_in_paranthesis          | (bilgisayar)                 | No             | -----              |
| 06 | is_date_range              | 01.01.2024-01.01.2025        | Yes            | Date_Range         |
| 07 | is_complex_punc            | -yeniden,sonradan..          | No             | -----              |
| 08 | is_date                    | 22.02.2016                   | Yes            | Date               |
| 09 | is_hour                    | 14.05                        | Yes            | Hour               |
| 10 | is_percentage_numbers      | %75                          | Yes            | Percentage_Numbers |
| 11 | is_percentage_numbers_chars | %75'lik                      | Yes            | Percentage_Numbers |
| 12 | is_roman_number            | XI                           | Yes            | Roman_Number       |
| 13 | is_bullet_list             | •Giriş                       | Yes            | Bullet_List        |
| 14 | is_email                   | tanersezerr@gmail.com        | Yes            | Email              |
| 15 | is_email_punc              | tanersezerr@gmail.com.       | No             | -----              |
| 16 | is_full_url                | https://tscorpus.com         | Yes            | Full_URL           |
| 17 | is_web_url                 | www.tscorpus.com             | Yes            | Web_URL            |
| -- | is_full_url                | www.example.com'un           | Yes            | URL_Suffix         |
| 18 | is_copyright               | ©tscorpus                    | Yes            | Copyright          |
| 19 | is_registered              | tscorpus®                    | Yes            | Registered         |
| 20 | is_trademark               | tscorpus™                    | Yes            | Trademark          |
| 21 | is_currency                | 100$                         | Yes            | Currency           |
| 22 | is_num_char_sequence       | 380A                         | No             | -----              |
| 23 | is_abbr                    | TBMM                         | Yes            | Abbr               |
| 24 | is_in_lexicon              | bilgisayar                   | Yes            | Valid_Word         |
| 25 | is_in_exceptions           | e-mail                       | Yes            | Exception          |
| 26 | is_in_eng_words            | computer                     | Yes            | English_Word       |
| 27 | is_smiley                  | :)                           | Yes            | Smiley             |
| 28 | is_multiple_smiley         | :):)                         | No             | -----              |
| 29 | is_emoticon                | 🍻                           | Yes            | Emoticon           |
| 30 | is_multiple_emoticon       | 🍻🍻                         | No             | -----              |
| 31 | is_multiple_smiley_in      | hey:):)                      | No             | -----              |
| 32 | is_number                  | 175.01                       | Yes            | Number             |
| 33 | is_apostrophed             | Türkiye'nin                  | Yes            | Apostrophed        |
| 34 | is_single_punc             | !                            | Yes            | Punc               |
| 35 | is_multi_punc              | !!                           | No             | -----              |
| 36 | is_single_hyphenated       | sabah-akşam                  | Yes            | Single_Hyphenated  |
| 37 | is_multi_hyphenated        | çay-su-kahve                 | Yes            | Multi-Hyphenated   |
| 38 | is_single_underscored      | Gel_Git                      | Yes            | Single_Underscored |
| 39 | is_multi_underscored       | Yarı_Yapılandırılmış_Mülakat | Yes            | Multi_Underscored  |
| 40 | is_one_char_fixable        | bilgisa¬yar                  | Yes            | One_Char_Fixed     |
| 42 | is_three_or_more           | heyyyyy                      | No             | -----              |
| 43 | is_fsp                     | bilgisayar.                  | No             | -----              |
| 44 | is_isp                     | .bilgisayar                  | No             | -----              |
| 45 | is_fmp                     | bilgisayar..                 | No             | -----              |
| 46 | is_imp                     | ..bilgisayar                 | No             | -----              |
| 47 | is_msp                     | --bilgisayar--               | No             | -----              |
| 48 | is_mssp                    | -bilgisayar-                 | No             | -----              |
| 49 | is_midsp                   | okul,öğrenci                 | No             | -----              |
| 50 | is_midmp                   | okul,öğrenci, öğretmen       | No             | -----              |
| 51 | is_non_latin               | 한국드                          | No             | Non_Latin          |

----------------------

## Performance

ts-tokenizer is optimized for efficient tokenization and takes advantage of multi-core processing for large-scale text. By default, the script utilizes all available CPU cores minus one, ensuring your system remains responsive while processing large datasets.

### Performance Benchmarks:

The following benchmarks were conducted on a machine with the following specifications:

    Processor: AMD Ryzen 7 5800H with Radeon Graphics
    Cores: 8 physical cores (16 threads)
    RAM: 16GB DDR4

#### Multi-Core Performance:

    1 Million Tokens: Processed in approximately 170 seconds using multi-core processing.
    Throughput: ~5,800 tokens/second (on average).

#### Single-Core Performance:

    1 Million Tokens: Processed in approximately 715 seconds on a single core.
    Throughput: ~1,400 tokens/second.
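
The quoted throughput figures follow directly from the token counts and wall-clock times above; a quick sanity check in Python:
```python
# Re-derive the quoted throughput from the benchmark figures above.
tokens = 1_000_000
print(round(tokens / 170))  # ~5882 tokens/second (multi-core)
print(round(tokens / 715))  # ~1399 tokens/second (single-core)
```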





            
