            <p align="center"><img width="50%" src="https://github.com/taishi-i/toiro/blob/master/toiro/datadownloader/data/toiro.png" /></p>

toiro
-----

[![Python package](https://github.com/taishi-i/toiro/actions/workflows/python-package.yml/badge.svg)](https://github.com/taishi-i/toiro/actions/workflows/python-package.yml)
[![PyPI](https://img.shields.io/pypi/v/toiro)](https://pypi.python.org/pypi/toiro)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/toiro)
[![PRs](https://img.shields.io/badge/PRs-welcome-brightgreen)](https://github.com/taishi-i/toiro/pulls)


Toiro is a comparison tool for Japanese tokenizers.
- Compare the processing speed of tokenizers
- Compare the words segmented by each tokenizer
- Compare tokenizer performance on downstream application tasks (e.g., text classification)

It also provides useful functions for natural language processing in Japanese.
- Data downloader for Japanese text corpora
- Preprocessors for these corpora
- Text classifier for Japanese text (e.g., SVM, BERT)

<p align="center"><img width="90%" src="https://github.com/taishi-i/toiro/blob/master/toiro/datadownloader/data/toiro.gif" /></p>


Installation
------------

Python 3.10+ is required. You can install toiro with the following command.
[Janome](https://github.com/mocobeta/janome) is included in the default installation.
```bash
pip install toiro
```
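
As a quick sanity check after installing, you can tokenize a sentence with the bundled Janome backend. This is a minimal sketch using the `print_words` function demonstrated in Getting started below:

```python
from toiro import tokenizers

# Janome ships with the default installation, so it is available
# without installing any additional tokenizers.
tokenizers.print_words("都庁所在地は新宿区。", delimiter="|")
#=> janome: 都庁|所在地|は|新宿|区|。
```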

Adding a tokenizer to toiro
---------------------------

If you want to add a tokenizer to toiro, please install it individually.
This is an example of adding [SudachiPy](https://github.com/WorksApplications/SudachiPy) and [nagisa](https://github.com/taishi-i/nagisa) to toiro.

```bash
pip install sudachipy sudachidict_core
pip install nagisa
```
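
After installing them, you can verify that toiro detects the new tokenizers. This is a quick check with the `available_tokenizers` function described in Getting started below:

```python
from toiro import tokenizers

# Both entries should now report is_available as True.
info = tokenizers.available_tokenizers()
print(info["sudachipy"]["is_available"])  #=> True
print(info["nagisa"]["is_available"])     #=> True
```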

<details>
<summary> How to install other tokenizers </summary>
<p>

[mecab-python3](https://github.com/SamuraiT/mecab-python3)
```
pip install mecab-python3
```

[GiNZA](https://github.com/megagonlabs/ginza)
```
pip install spacy ginza
```

[spaCy](https://github.com/explosion/spaCy)
```
pip install spacy[ja]
```

[KyTea](https://github.com/neubig/kytea)

You need to install the KyTea binary first. See the [official instructions](http://www.phontron.com/kytea/index-ja.html).

```
pip install kytea
```

[Juman++ v2](https://github.com/ku-nlp/jumanpp)

You need to install Juman++ v2 first. See the [official instructions](http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN++).

```
pip install pyknp
```

[SentencePiece](https://github.com/google/sentencepiece)
```
pip install sentencepiece
```

[fugashi-ipadic](https://github.com/polm/fugashi)
```
pip install fugashi ipadic
```

[fugashi-unidic](https://github.com/polm/fugashi)
```
pip install fugashi unidic-lite
```

[tinysegmenter](https://github.com/SamuraiT/tinysegmenter)
```
pip install tinysegmenter3
```

[tiktoken](https://github.com/openai/tiktoken)
```
pip install tiktoken
```

</p>
</details>

If you want to install all the tokenizers at once, use the following command (in shells such as zsh, quote the argument: `pip install 'toiro[all_tokenizers]'`).
```bash
pip install toiro[all_tokenizers]
```

Getting started
---------------

You can check the tokenizers available in your Python environment.
```python
from toiro import tokenizers

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers)
```

Toiro supports 12 Japanese tokenizers and 1 BPE tokenizer (tiktoken). The output below is from an environment where SudachiPy and nagisa have been added.
```python
{'nagisa': {'is_available': True, 'version': '0.2.7'},
 'janome': {'is_available': True, 'version': '0.3.10'},
 'mecab-python3': {'is_available': False, 'version': False},
 'sudachipy': {'is_available': True, 'version': '0.4.9'},
 'spacy': {'is_available': False, 'version': False},
 'ginza': {'is_available': False, 'version': False},
 'kytea': {'is_available': False, 'version': False},
 'jumanpp': {'is_available': False, 'version': False},
 'sentencepiece': {'is_available': False, 'version': False},
 'fugashi-ipadic': {'is_available': False, 'version': False},
 'fugashi-unidic': {'is_available': False, 'version': False},
 'tinysegmenter': {'is_available': False, 'version': False},
 'tiktoken': {'is_available': False, 'version': False}}
```
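
If you only need the names of the tokenizers that are actually installed, you can filter this dictionary. A small sketch based on the structure shown above:

```python
from toiro import tokenizers

# Keep only the tokenizers whose is_available flag is True.
info = tokenizers.available_tokenizers()
installed = [name for name, meta in info.items() if meta["is_available"]]
print(installed)
#=> ['nagisa', 'janome', 'sudachipy']
```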

Download the livedoor news corpus and compare the processing speed of tokenizers.
```python
from toiro import tokenizers
from toiro import datadownloader

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
#=> ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews']

# Download the livedoor news corpus and load it as pandas.DataFrame
corpus = corpora[0]
datadownloader.download_corpus(corpus)
train_df, dev_df, test_df = datadownloader.load_corpus(corpus)
texts = train_df[1]

# Compare the processing speed of tokenizers
report = tokenizers.compare(texts)
#=> [1/3] Tokenizer: janome
#=> 100%|███████████████████| 5900/5900 [00:07<00:00, 746.21it/s]
#=> [2/3] Tokenizer: nagisa
#=> 100%|███████████████████| 5900/5900 [00:15<00:00, 370.83it/s]
#=> [3/3] Tokenizer: sudachipy
#=> 100%|███████████████████| 5900/5900 [00:08<00:00, 696.68it/s]
print(report)
{'execution_environment': {'python_version': '3.7.8.final.0 (64 bit)',
  'arch': 'X86_64',
  'brand_raw': 'Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz',
  'count': 8},
 'data': {'number_of_sentences': 5900, 'average_length': 37.69593220338983},
 'janome': {'elapsed_time': 9.114670515060425},
 'nagisa': {'elapsed_time': 15.873093605041504},
 'sudachipy': {'elapsed_time': 9.05256724357605}}
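
# A rough throughput figure (sentences per second) can be derived from
# the report above; the numbers below are approximations from this run.
n_sentences = report["data"]["number_of_sentences"]
for name in ("janome", "nagisa", "sudachipy"):
    rate = n_sentences / report[name]["elapsed_time"]
    print(f"{name}: {rate:.1f} sentences/sec")
#=> janome: 647.3 sentences/sec
#=> nagisa: 371.7 sentences/sec
#=> sudachipy: 651.7 sentences/sec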

# Compare the words segmented by each tokenizer
text = "都庁所在地は新宿区。"
tokenizers.print_words(text, delimiter="|")
#=>        janome: 都庁|所在地|は|新宿|区|。
#=>        nagisa: 都庁|所在|地|は|新宿|区|。
#=>     sudachipy: 都庁|所在地|は|新宿区|。
```
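
The text classifier mentioned at the top of this README can be approximated with a plain scikit-learn pipeline on the corpus loaded above. The sketch below is illustrative rather than toiro's built-in classifier API: it assumes scikit-learn is installed, that column 0 of the loaded DataFrames holds the label and column 1 the text (column 1 is what `train_df[1]` selects above), and it tokenizes with the bundled Janome.

```python
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Use Janome to segment Japanese text for the TF-IDF vectorizer.
janome_tokenizer = Tokenizer()

def tokenize(text):
    return [token.surface for token in janome_tokenizer.tokenize(text)]

# Assumed layout: column 0 = label, column 1 = text.
model = make_pipeline(TfidfVectorizer(analyzer=tokenize), LinearSVC())
model.fit(train_df[1], train_df[0])
print(model.score(test_df[1], test_df[0]))  # accuracy on the test split
```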

Run toiro in Docker
-------------------

You can use all the tokenizers by running a Docker container pulled from Docker Hub.

```bash
docker run --rm -it taishii/toiro /bin/bash
```

<details>
<summary> How to run the Python interpreter in the Docker container </summary>
<p>

Run the Python interpreter.
```
root@cdd2ad2d7092:/workspace# python3
```

Compare the words segmented by each tokenizer:
```python
>>> from toiro import tokenizers
>>> text = "都庁所在地は新宿区。"
>>> tokenizers.print_words(text, delimiter="|")
 mecab-python3: 都庁|所在地|は|新宿|区|。
        janome: 都庁|所在地|は|新宿|区|。
        nagisa: 都庁|所在|地|は|新宿|区|。
     sudachipy: 都庁|所在地|は|新宿区|。
         spacy: 都庁|所在|地|は|新宿|区|。
         ginza: 都庁|所在地|は|新宿区|。
         kytea: 都庁|所在|地|は|新宿|区|。
       jumanpp: 都庁|所在|地|は|新宿|区|。
 sentencepiece: ▁|都|庁|所在地|は|新宿|区|。
fugashi-ipadic: 都庁|所在地|は|新宿|区|。
fugashi-unidic: 都庁|所在|地|は|新宿|区|。
 tinysegmenter: 都庁所|在地|は|新宿|区|。
tiktoken_gpt4o: 都|�|�|所在地|は|新|宿|区|。
 tiktoken_gpt5: 都|�|�|所在地|は|新|宿|区|。
```

</p>
</details>

Get more information about toiro
--------------------------------

Slides from PyCon JP 2020
- [Speaker Deck](https://speakerdeck.com/taishii/pycon-jp-2020)
- [PyConJP2020_Online.ipynb](https://github.com/taishi-i/toiro/blob/master/PyConJP2020/PyConJP2020_Online.ipynb)

Tutorials in Japanese
- [01_getting_started_ja.ipynb](https://github.com/taishi-i/toiro/blob/master/examples/01_getting_started_ja.ipynb)
- [05_svm_vs_bert_benchmarking_application_tasks_ja.ipynb](https://github.com/taishi-i/toiro/blob/master/examples/05_svm_vs_bert_benchmarking_application_tasks_ja.ipynb)

            
