<p align="center"><img width="50%" src="https://github.com/taishi-i/toiro/blob/master/toiro/datadownloader/data/toiro.png" /></p>
toiro
-----
[Python package](https://github.com/taishi-i/toiro/actions/workflows/python-package.yml)
[PyPI](https://pypi.python.org/pypi/toiro)
[PRs welcome](https://github.com/taishi-i/toiro/pulls)
Toiro is a comparison tool for Japanese tokenizers.
- Compare the processing speed of tokenizers
- Compare the words segmented by tokenizers
- Compare the performance of tokenizers by benchmarking application tasks (e.g., text classification)

It also provides useful functions for natural language processing in Japanese.
- Data downloader for Japanese text corpora
- Preprocessors for these corpora
- Text classifiers for Japanese text (e.g., SVM, BERT)
<p align="center"><img width="90%" src="https://github.com/taishi-i/toiro/blob/master/toiro/datadownloader/data/toiro.gif" /></p>
Installation
------------
Python 3.10+ is required. You can install toiro with the following command.
[Janome](https://github.com/mocobeta/janome) is included in the default installation.
```bash
pip install toiro
```
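Since Janome ships with the default installation, you can verify the setup right away. A minimal check (with only the default install, `print_words` reports Janome alone):

```python
from toiro import tokenizers

# With only the default installation, Janome is the sole available tokenizer.
tokenizers.print_words("都庁所在地は新宿区。", delimiter="|")
#=> janome: 都庁|所在地|は|新宿|区|。
```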
Adding a tokenizer to toiro
---------------------------
If you want to add a tokenizer to toiro, please install it individually.
This is an example of adding [SudachiPy](https://github.com/WorksApplications/SudachiPy) and [nagisa](https://github.com/taishi-i/nagisa) to toiro.
```bash
pip install sudachipy sudachidict_core
pip install nagisa
```
<details>
<summary> How to install other tokenizers </summary>
<p>
[mecab-python3](https://github.com/SamuraiT/mecab-python3)
```
pip install mecab-python3
```
[GiNZA](https://github.com/megagonlabs/ginza)
```
pip install spacy ginza
```
[spaCy](https://github.com/explosion/spaCy)
```
pip install spacy[ja]
```
[KyTea](https://github.com/neubig/kytea)
You first need to install KyTea itself. Please refer to the [official instructions](http://www.phontron.com/kytea/index-ja.html).
```
pip install kytea
```
[Juman++ v2](https://github.com/ku-nlp/jumanpp)
You first need to install Juman++ v2 itself. Please refer to the [official instructions](http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN++).
```
pip install pyknp
```
[SentencePiece](https://github.com/google/sentencepiece)
```
pip install sentencepiece
```
[fugashi-ipadic](https://github.com/polm/fugashi)
```
pip install fugashi ipadic
```
[fugashi-unidic](https://github.com/polm/fugashi)
```
pip install fugashi unidic-lite
```
[tinysegmenter](https://github.com/SamuraiT/tinysegmenter)
```
pip install tinysegmenter3
```
[tiktoken](https://github.com/openai/tiktoken)
```
pip install tiktoken
```
</p>
</details>
If you want to install all the tokenizers at once, please use the following command.
```bash
pip install toiro[all_tokenizers]
```
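After a full install, a quick sanity check (a minimal sketch, assuming the `available_tokenizers()` dict structure shown in the next section) confirms that every tokenizer is usable:

```python
from toiro import tokenizers

# Every entry should report is_available == True after a full install.
status = tokenizers.available_tokenizers()
missing = [name for name, info in status.items() if not info["is_available"]]
print("missing:", missing)  # expected: []
```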
Getting started
---------------
You can check which tokenizers are available in your Python environment.
```python
from toiro import tokenizers
available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers)
```
Toiro supports 12 different Japanese tokenizers and 1 BPE tokenizer (tiktoken). The following is an example output after installing SudachiPy and nagisa, as in the section above.
```python
{'nagisa': {'is_available': True, 'version': '0.2.7'},
 'janome': {'is_available': True, 'version': '0.3.10'},
 'mecab-python3': {'is_available': False, 'version': False},
 'sudachipy': {'is_available': True, 'version': '0.4.9'},
 'spacy': {'is_available': False, 'version': False},
 'ginza': {'is_available': False, 'version': False},
 'kytea': {'is_available': False, 'version': False},
 'jumanpp': {'is_available': False, 'version': False},
 'sentencepiece': {'is_available': False, 'version': False},
 'fugashi-ipadic': {'is_available': False, 'version': False},
 'fugashi-unidic': {'is_available': False, 'version': False},
 'tinysegmenter': {'is_available': False, 'version': False},
 'tiktoken': {'is_available': False, 'version': False}}
```
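Since the result is a plain dict, it is easy to list only the installed tokenizers (a small sketch based on the structure shown above):

```python
# Keep only the tokenizers whose backends are actually installed.
installed = [name for name, info in available_tokenizers.items()
             if info["is_available"]]
print(installed)
#=> ['nagisa', 'janome', 'sudachipy']
```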
Download the livedoor news corpus and compare the processing speed of tokenizers.
```python
from toiro import tokenizers
from toiro import datadownloader
# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
#=> ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews']
# Download the livedoor news corpus and load it as pandas.DataFrame
corpus = corpora[0]
datadownloader.download_corpus(corpus)
train_df, dev_df, test_df = datadownloader.load_corpus(corpus)
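# Each split is a two-column DataFrame: column 1 holds the text
# (column 0 holds the category label).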
texts = train_df[1]
# Compare the processing speed of tokenizers
report = tokenizers.compare(texts)
#=> [1/3] Tokenizer: janome
#=> 100%|███████████████████| 5900/5900 [00:07<00:00, 746.21it/s]
#=> [2/3] Tokenizer: nagisa
#=> 100%|███████████████████| 5900/5900 [00:15<00:00, 370.83it/s]
#=> [3/3] Tokenizer: sudachipy
#=> 100%|███████████████████| 5900/5900 [00:08<00:00, 696.68it/s]
print(report)
{'execution_environment': {'python_version': '3.7.8.final.0 (64 bit)',
 'arch': 'X86_64',
 'brand_raw': 'Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz',
 'count': 8},
 'data': {'number_of_sentences': 5900, 'average_length': 37.69593220338983},
 'janome': {'elapsed_time': 9.114670515060425},
 'nagisa': {'elapsed_time': 15.873093605041504},
 'sudachipy': {'elapsed_time': 9.05256724357605}}
# Compare the words segmented by tokenizers
text = "都庁所在地は新宿区。"
tokenizers.print_words(text, delimiter="|")
#=> janome: 都庁|所在地|は|新宿|区|。
#=> nagisa: 都庁|所在|地|は|新宿|区|。
#=> sudachipy: 都庁|所在地|は|新宿区|。
```
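The elapsed times above are totals; to compare tokenizers on an equal footing you can derive throughput from the report (a minimal sketch, assuming the report dict shown above):

```python
# Convert total elapsed time into sentences per second for each tokenizer.
n_sentences = report["data"]["number_of_sentences"]
for name in ("janome", "nagisa", "sudachipy"):
    rate = n_sentences / report[name]["elapsed_time"]
    print(f"{name}: {rate:.1f} sentences/sec")
```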
Run toiro in Docker
-------------------
You can use all the tokenizers by running the pre-built Docker container from Docker Hub.
```bash
docker run --rm -it taishii/toiro /bin/bash
```
<details>
<summary> How to run the Python interpreter in the Docker container </summary>
<p>
Run the Python interpreter.
```
root@cdd2ad2d7092:/workspace# python3
```
Compare the words segmented by tokenizers.
```python
>>> from toiro import tokenizers
>>> text = "都庁所在地は新宿区。"
>>> tokenizers.print_words(text, delimiter="|")
 mecab-python3: 都庁|所在地|は|新宿|区|。
        janome: 都庁|所在地|は|新宿|区|。
        nagisa: 都庁|所在|地|は|新宿|区|。
     sudachipy: 都庁|所在地|は|新宿区|。
         spacy: 都庁|所在|地|は|新宿|区|。
         ginza: 都庁|所在地|は|新宿区|。
         kytea: 都庁|所在|地|は|新宿|区|。
       jumanpp: 都庁|所在|地|は|新宿|区|。
 sentencepiece: ▁|都|庁|所在地|は|新宿|区|。
fugashi-ipadic: 都庁|所在地|は|新宿|区|。
fugashi-unidic: 都庁|所在|地|は|新宿|区|。
 tinysegmenter: 都庁所|在地|は|新宿|区|。
tiktoken_gpt4o: 都|�|�|所在地|は|新|宿|区|。
 tiktoken_gpt5: 都|�|�|所在地|は|新|宿|区|。
```
</p>
</details>
Get more information about toiro
--------------------------------
The slides at PyCon JP 2020
- [Speaker Deck](https://speakerdeck.com/taishii/pycon-jp-2020)
- [PyConJP2020_Online.ipynb](https://github.com/taishi-i/toiro/blob/master/PyConJP2020/PyConJP2020_Online.ipynb)
Tutorials in Japanese
- [01_getting_started_ja.ipynb](https://github.com/taishi-i/toiro/blob/master/examples/01_getting_started_ja.ipynb)
- [05_svm_vs_bert_benchmarking_application_tasks_ja.ipynb](https://github.com/taishi-i/toiro/blob/master/examples/05_svm_vs_bert_benchmarking_application_tasks_ja.ipynb)