# TS Tokenizer
**TS Tokenizer** is a hybrid tokenizer designed for Turkish text: it combines a lexicon-based and a rule-based approach to split text into tokens.
The tokenizer leverages regular expressions to handle non-standard text elements like dates, percentages, URLs, and punctuation marks.
### Key Features:
- **Hybrid Approach**: Combines lexicon-based and rule-based methods for tokenization.
- **Handling of Special Tokens**: Recognizes special tokens such as mentions, hashtags, emails, URLs, numbers, smileys, and emoticons.
- **Highly Configurable**: Provides multiple output formats to suit different NLP processing needs, including plain tokens, tagged tokens, and token-tag pairs in list or line formats.
Whether you are working on natural language processing (NLP), information retrieval, or text mining for Turkish, **TS Tokenizer** offers
a versatile and reliable solution for tokenization.
# Installation
You can install the ts-tokenizer package using pip. Ensure you have Python 3.9 or higher installed on your system.

```bash
pip install ts-tokenizer
```
## Command line tool
Basic usage returns the tokenized output of a given text file.

```bash
$ ts-tokenizer input.txt
```
## CLI Arguments
The -o parameter takes one of four arguments. The 'tokenized' and 'tagged' options return output with one word per line; 'tokenized' is the default value and does not need to be declared.

input_text = "Queen , 31.10.1975 tarihinde çıkardıðı A Night at the Opera albümüyle dünya müziðini deðiåÿtirdi ."
```bash
$ ts-tokenizer input.txt
Queen
,
31.10.1975
tarihinde
çıkardığı
A
Night
at
the
Opera
albümüyle
dünya
müziğini
değiştirdi
.
```
Note that these tags are not part-of-speech tags; they describe the form of the given string.
```bash
$ ts-tokenizer -o tagged input.txt
Queen	English_Word
,	Punc
31.10.1975	Date
tarihinde	Valid_Word
çıkardığı	Valid_Word
A	OOV
Night	English_Word
at	Valid_Word
the	English_Word
Opera	Valid_Word
albümüyle	Valid_Word
dünya	Valid_Word
müziğini	Valid_Word
değiştirdi	Valid_Word
.	Punc
```
The other two arguments are "lines" and "tagged_lines". The "lines" option reads the input file line by line and returns a list for each line.
```bash
$ ts-tokenizer -o lines input.txt
['Queen', ',', '31.10.1975', 'tarihinde', 'çıkardığı', 'A', 'Night', 'at', 'the', 'Opera', 'albümüyle', 'dünya', 'müziğini', 'değiştirdi', '.']
```
The "tagged_lines" option reads the input file line by line and returns a list of tuples for each line.
```bash
$ ts-tokenizer -o tagged_lines input.txt
[('Queen', 'English_Word'), (',', 'Punc'), ('31.10.1975', 'Date'), ('tarihinde', 'Valid_Word'), ('çıkardığı', 'Valid_Word'), ('A', 'OOV'), ('Night', 'English_Word'), ('at', 'Valid_Word'), ('the', 'English_Word'), ('Opera', 'Valid_Word'), ('albümüyle', 'Valid_Word'), ('dünya', 'Valid_Word'), ('müziğini', 'Valid_Word'), ('değiştirdi', 'Valid_Word'), ('.', 'Punc')]
```
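The token-tag pairs returned by "tagged_lines" are easy to post-process in Python. For instance, tag frequencies can be counted with collections.Counter — a generic sketch using a few of the pairs shown above:

```python
from collections import Counter

# A handful of (token, tag) pairs taken from the example output above
tagged = [('Queen', 'English_Word'), (',', 'Punc'), ('31.10.1975', 'Date'),
          ('tarihinde', 'Valid_Word'), ('çıkardığı', 'Valid_Word'), ('A', 'OOV')]

# Count how often each tag occurs, ignoring the token itself
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts['Valid_Word'])  # 2
```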
The tokenizer is designed to take advantage of multiple cores; the default is [Total Number of Cores - 1]. The -j parameter sets the number of parallel workers.

```bash
$ ts-tokenizer -j 2 -o tagged input_file
```
## Using CLI Arguments with pipelines
ts-tokenizer can also be used in a bash pipeline. The following example computes token frequencies for the given file:

```bash
$ ts-tokenizer input.txt | sort | uniq -c | sort -n
```
For case-insensitive counts, tr is used to lowercase the output first:

```bash
$ ts-tokenizer input.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n
```
The sample below counts the tags in a given text:

```bash
$ ts-tokenizer -o tagged input.txt | cut -f3 | sort | uniq -c
1 Hyphen_In
1 Inner_Punc
2 FMP
8 ISP
8 Num_Char_Seq
12 Number
24 Apostrophe
25 OOV
69 FSP
515 Valid_Word
```
To find tokens with a specific tag, the following command can be used:

```bash
$ ts-tokenizer -o tagged input.txt | cut -f2,3 | grep "Num_Char_Seq"
40'ar	Num_Char_Seq
2.	Num_Char_Seq
24.	Num_Char_Seq
Num_Char_Seq
16'sı	Num_Char_Seq
8.	Num_Char_Seq
20'şer	Num_Char_Seq
40'ar	Num_Char_Seq
```
By combining sort and uniq, the frequency of words with the target tag can be found:

```bash
$ ts-tokenizer -o tagged Test_Text.txt | cut -f2,3 | grep "Num_Char_Seq" | sort | uniq -c | sort -n
1 16'sı	Num_Char_Seq
1 20'şer	Num_Char_Seq
1 2.	Num_Char_Seq
1 8.	Num_Char_Seq
2 24.	Num_Char_Seq
2 40'ar	Num_Char_Seq
```
--help prints usage information:

```bash
$ ts-tokenizer --help
usage: main.py [-h] [-o {tokenized,lines,tagged,tagged_lines}] [-w] [-v] [-j JOBS] filename

positional arguments:
  filename              Name of the file to process

options:
  -h, --help            show this help message and exit
  -o {tokenized,lines,tagged,tagged_lines}, --output {tokenized,lines,tagged,tagged_lines}
                        Specify the output format
  -v, --verbose         Enable verbose mode
  -j JOBS, --jobs JOBS  Number of parallel workers
```
## Classes
## CharFix
This class has four methods that are useful for fixing corrupted text.
### CharFix Class
```python
from ts_tokenizer.char_fix import CharFix
```
### Fix Characters
```python
line = "Parça ve bütün iliåÿkisi her zaman iåÿlevsel deðildir."
print(CharFix.fix(line))
# Parça ve bütün ilişkisi her zaman işlevsel değildir.
```
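The kind of corruption CharFix targets often comes from UTF-8 text that was mis-decoded with a single-byte encoding. One common pattern can be repaired with an encode/decode round trip — a minimal sketch of the idea (a hypothetical helper, not the library's implementation):

```python
def fix_utf8_mojibake(text: str) -> str:
    # Re-encode as Latin-1 to recover the original UTF-8 byte sequence,
    # then decode those bytes correctly as UTF-8.
    # Falls back to returning the input unchanged if the round trip fails.
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(fix_utf8_mojibake("albÃ¼mÃ¼yle"))  # albümüyle
```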
### Lowercase
```python
line = "İstanbul ve Iğdır ''arası'' 1528 km'dir."
print(CharFix.tr_lowercase(line))
# istanbul ve ığdır ''arası'' 1528 km'dir.
```
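Turkish needs a dedicated lowercasing routine because Python's built-in str.lower() maps 'I' to 'i', while Turkish pairs dotless 'I'/'ı' and dotted 'İ'/'i'. A minimal sketch of that mapping (illustrative only, not the library's code):

```python
def tr_lower(text: str) -> str:
    # Handle the Turkish i-pairs first: 'I' -> 'ı' (dotless), 'İ' -> 'i' (dotted),
    # then apply the generic lowercasing to everything else.
    return text.replace("I", "ı").replace("İ", "i").lower()

print(tr_lower("İstanbul ve Iğdır"))  # istanbul ve ığdır
```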
### Fix Quotes
```python
line = "İstanbul ve Iğdır ''arası'' 1528 km'dir."
print(CharFix.fix_quote(line))
# İstanbul ve Iğdır "arası" 1528 km'dir.
```
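As the example shows, fix_quote normalizes doubled apostrophes (a common plain-text stand-in for quotation marks) to standard double quotes while leaving single apostrophes, as in km'dir, untouched. A sketch of that normalization (illustrative only):

```python
def fix_double_apostrophes(text: str) -> str:
    # Replace the two-character sequence '' with a single " character;
    # lone apostrophes (e.g. km'dir) are left as they are.
    return text.replace("''", '"')

print(fix_double_apostrophes("İstanbul ve Iğdır ''arası'' 1528 km'dir."))
# İstanbul ve Iğdır "arası" 1528 km'dir.
```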
## TokenCheck
This class passes input tokens to the tokenizer for further analysis, but it can also be used on its own for various tasks.
The tags are "Valid_Word", "Exception_Word", "Eng_Word", "Date", "Hour", "In_Parenthesis", "In_Quotes", "Smiley", "Inner_Char", "Abbr", "Number", "Non_Prefix_URL", "Prefix_URL", "Emoticon", "Mention", "HashTag", "Percentage_Numbers", "Percentage_Number_Chars", "Num_Char_Seq", "Multiple_Smiley", "Punc", "Underscored", "Hyphenated", "Hyphen_In", "Multiple_Emoticon", "Copyright", "Email", "Registered", "Three_or_More".
### token_tagger
```python
from ts_tokenizer.token_check import TokenCheck
```
### Default Usage
```python
word = "Parça"
print(TokenCheck.token_tagger(word))
# Valid_Word

print(TokenCheck.token_tagger(word, output="all", output_format="tuple"))
# ('Parça', 'Parça', 'Valid_Word')

print(TokenCheck.token_tagger(word, output="all", output_format="list"))
# ['Parça', 'Parça', 'Valid_Word']

word = "#tstokenizer"
print(TokenCheck.token_tagger(word, output='all', output_format='tuple'))  # Returns a tuple
# ('#tstokenizer', '#tstokenizer', 'HashTag')

word = "@tanerim"
print(TokenCheck.token_tagger(word, output='all', output_format='list'))  # Returns a list
# ['@tanerim', '@tanerim', 'Mention']

word = ":):):)"
print(TokenCheck.token_tagger(word, output='all', output_format='string'))  # Returns a tab-separated string
# :):):)	:):):)	Multiple_Smiley
```
```python
line = "Queen , 31.10.1975 tarihinde çıkardıðı A Night at the Opera albümüyle dünya müziðini deðiåÿtirdi ."
for word in line.split(" "):
    token_tag = TokenCheck.token_tagger(word, output='all', output_format='list')
    print(token_tag)
# ['Queen', 'Queen', 'Eng_Word']
# [',', ',', 'Punc']
# ['31.10.1975', '31.10.1975', 'Date']
# ['tarihinde', 'tarihinde', 'Valid_Word']
# ['çıkardıðı', 'çıkardığı', 'Valid_Word']
# ['A', 'A', 'OOV']
# ['Night', 'Night', 'Eng_Word']
# ['at', 'at', 'Valid_Word']
# ['the', 'the', 'Eng_Word']
# ['Opera', 'Opera', 'Valid_Word']
# ['albümüyle', 'albümüyle', 'Valid_Word']
# ['dünya', 'dünya', 'Valid_Word']
# ['müziðini', 'müziğini', 'Valid_Word']
# ['deðiåÿtirdi', 'değiştirdi', 'Valid_Word']
# ['.', '.', 'Punc']
```