| Field | Value |
| --- | --- |
| Name | konoha |
| Version | 5.5.5 |
| Summary | A tiny sentence/word tokenizer for Japanese text written in Python |
| Upload time | 2024-02-20 12:38:09 |
| Author | himkt |
| Requires Python | >=3.8.0,<4.0.0 |
| License | MIT |
| Requirements | No requirements were recorded. |
# 🌿 Konoha: Simple wrapper of Japanese Tokenizers
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/himkt/konoha/blob/main/example/Konoha_Example.ipynb)
<p align="center"><img src="https://user-images.githubusercontent.com/5164000/120913279-e7d62380-c6d0-11eb-8d17-6571277cdf27.gif" width="95%"></p>
[![GitHub stars](https://img.shields.io/github/stars/himkt/konoha?style=social)](https://github.com/himkt/konoha/stargazers)
[![Downloads](https://pepy.tech/badge/konoha)](https://pepy.tech/project/konoha)
[![Downloads](https://pepy.tech/badge/konoha/month)](https://pepy.tech/project/konoha/month)
[![Downloads](https://pepy.tech/badge/konoha/week)](https://pepy.tech/project/konoha/week)
[![Build Status](https://github.com/himkt/konoha/actions/workflows/ci.yml/badge.svg)](https://github.com/himkt/konoha/actions/workflows/ci.yml)
[![Documentation Status](https://readthedocs.org/projects/konoha/badge/?version=latest)](https://konoha.readthedocs.io/en/latest/?badge=latest)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/konoha)
[![PyPI](https://img.shields.io/pypi/v/konoha.svg)](https://pypi.python.org/pypi/konoha)
[![GitHub Issues](https://img.shields.io/github/issues/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)
`Konoha` is a Python library that provides an easy-to-use, integrated interface to various Japanese tokenizers,
making it easy to switch between tokenizers and speed up your pre-processing.
## Supported tokenizers
<a href="https://github.com/buruzaemon/natto-py"><img src="https://img.shields.io/badge/MeCab-natto--py-ff69b4"></a>
<a href="https://github.com/chezou/Mykytea-python"><img src="https://img.shields.io/badge/KyTea-Mykytea--python-ff69b4"></a>
<a href="https://github.com/mocobeta/janome"><img src="https://img.shields.io/badge/Janome-janome-ff69b4"></a>
<a href="https://github.com/WorksApplications/SudachiPy"><img src="https://img.shields.io/badge/Sudachi-sudachipy-ff69b4"></a>
<a href="https://github.com/google/sentencepiece"><img src="https://img.shields.io/badge/Sentencepiece-sentencepiece-ff69b4"></a>
<a href="https://github.com/taishi-i/nagisa"><img src="https://img.shields.io/badge/nagisa-nagisa-ff69b4"></a>
Also, `konoha` provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.
## Quick Start with Docker
Simply run the following on your computer:
```bash
docker run --rm -p 8000:8000 -t himkt/konoha # from DockerHub
```
Or you can build the image on your machine:
```bash
git clone https://github.com/himkt/konoha # download konoha
cd konoha && docker-compose up --build # build and launch container
```
Tokenization is done by posting a JSON object to `localhost:8000/api/v1/tokenize`.
You can also batch-tokenize by passing `texts: ["１つ目の入力", "２つ目の入力"]` to `localhost:8000/api/v1/batch_tokenize`.

(API documentation is available at `localhost:8000/redoc`; you can view it in your web browser.)

Send a request using `curl` from your terminal.
Note that the endpoint paths changed in v4.6.4.
Please check the release notes (https://github.com/himkt/konoha/releases/tag/v4.6.4).
```bash
$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
  -d '{"tokenizer": "mecab", "text": "これはペンです"}'

{
  "tokens": [
    [
      {
        "surface": "これ",
        "part_of_speech": "名詞"
      },
      {
        "surface": "は",
        "part_of_speech": "助詞"
      },
      {
        "surface": "ペン",
        "part_of_speech": "名詞"
      },
      {
        "surface": "です",
        "part_of_speech": "助動詞"
      }
    ]
  ]
}
```
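
A batch request follows the same pattern. Here is a minimal sketch using the `texts` field and the `batch_tokenize` endpoint described above; the response presumably mirrors the nested `tokens` structure shown above, with one token list per input.

```bash
# Minimal sketch of a batch request: one tokenizer, multiple input texts
curl localhost:8000/api/v1/batch_tokenize -X POST -H "Content-Type: application/json" \
  -d '{"tokenizer": "mecab", "texts": ["１つ目の入力", "２つ目の入力"]}'
```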
## Installation
We recommend installing konoha with `pip install 'konoha[all]'`.

- Install konoha with a specific tokenizer: `pip install 'konoha[(tokenizer_name)]'`.
- Install konoha with a specific tokenizer and remote file support: `pip install 'konoha[(tokenizer_name),remote]'`

To use a particular tokenizer, install konoha with the corresponding extra (e.g. `konoha[mecab]`, `konoha[sudachi]`, etc.) or install the tokenizer packages individually.
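
For example, to install everything at once, or just MeCab support together with remote file support (combining extras in one command is standard pip syntax):

```bash
# All optional tokenizers
pip install 'konoha[all]'

# MeCab support plus remote (S3) file support
pip install 'konoha[mecab,remote]'
```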
## Example
### Word level tokenization
```python
from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [自然, 言語, 処理, を, 勉強, し, て, い, ます]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [▁, 自然, 言語, 処理, を, 勉強, し, ています]
```
For more details, see the `example/` directory.
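
The rule-based tokenizers mentioned earlier go through the same interface. A minimal sketch, assuming they are selected by name like the dictionary-based tokenizers (the exact name strings `'whitespace'` and `'character'` are assumptions; check the documentation if they differ):

```python
from konoha import WordTokenizer

# Assumed tokenizer names for the rule-based tokenizers.
whitespace_tokenizer = WordTokenizer('whitespace')
print(whitespace_tokenizer.tokenize('natural language processing'))

character_tokenizer = WordTokenizer('character')
print(character_tokenizer.tokenize('自然言語処理'))
```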
### Remote files
Konoha supports loading dictionaries and models from cloud storage (currently Amazon S3).
This requires installing konoha with the `remote` option; see [Installation](#installation).
```python
from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))
```
### Sentence level tokenization
```python
from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない。だが，「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，「かわいい。それで十分だろう」。']
```
You can change the symbol used by the sentence splitter and the bracket expressions it keeps intact.
1. sentence splitter
```python
sentence = "私は猫だ。名前なんてものはない．だが，「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer(period="．")
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。名前なんてものはない．', 'だが，「かわいい。それで十分だろう」。']
```
2. bracket expression
```python
import re

sentence = "私は猫だ。名前なんてものはない。だが，『かわいい。それで十分だろう』。"

tokenizer = SentenceTokenizer(
    patterns=SentenceTokenizer.PATTERNS + [re.compile(r"『.*?』")],
)
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，『かわいい。それで十分だろう』。']
```
## Test
```
python -m pytest
```
## Article
- [Built konoha, a library that lets you switch tokenizers nicely](https://qiita.com/klis/items/bb9ffa4d9c886af0f531)
- [Implemented AllenNLP integration for the Japanese text analysis tool Konoha](https://qiita.com/klis/items/f1d29cb431d1bf879898)
## Acknowledgement
The Sentencepiece model used in the tests was provided by @yoheikikuta. Thanks!
## Raw data

```json
{
"_id": null,
"home_page": "",
"name": "konoha",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8.0,<4.0.0",
"maintainer_email": "",
"keywords": "",
"author": "himkt",
"author_email": "himkt@klis.tsukuba.ac.jp",
"download_url": "https://files.pythonhosted.org/packages/bf/32/d93630f9a5d7759ebfd3bc2c4873d815d5d5746f063582f0594e6a46ea80/konoha-5.5.5.tar.gz",
"platform": null,
"description": "# \ud83c\udf3f Konoha: Simple wrapper of Japanese Tokenizers\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/himkt/konoha/blob/main/example/Konoha_Example.ipynb)\n<p align=\"center\"><img src=\"https://user-images.githubusercontent.com/5164000/120913279-e7d62380-c6d0-11eb-8d17-6571277cdf27.gif\" width=\"95%\"></p>\n\n[![GitHub stars](https://img.shields.io/github/stars/himkt/konoha?style=social)](https://github.com/himkt/konoha/stargazers)\n\n[![Downloads](https://pepy.tech/badge/konoha)](https://pepy.tech/project/konoha)\n[![Downloads](https://pepy.tech/badge/konoha/month)](https://pepy.tech/project/konoha/month)\n[![Downloads](https://pepy.tech/badge/konoha/week)](https://pepy.tech/project/konoha/week)\n\n[![Build Status](https://github.com/himkt/konoha/actions/workflows/ci.yml/badge.svg)](https://github.com/himkt/konoha/actions/workflows/ci.yml)\n[![Documentation Status](https://readthedocs.org/projects/konoha/badge/?version=latest)](https://konoha.readthedocs.io/en/latest/?badge=latest)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/konoha)\n[![PyPI](https://img.shields.io/pypi/v/konoha.svg)](https://pypi.python.org/pypi/konoha)\n[![GitHub Issues](https://img.shields.io/github/issues/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)\n[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)\n\n`Konoha` is a Python library for providing easy-to-use integrated interface of various Japanese tokenizers,\nwhich enables you to switch a tokenizer and boost your pre-processing.\n\n## Supported tokenizers\n\n<a href=\"https://github.com/buruzaemon/natto-py\"><img src=\"https://img.shields.io/badge/MeCab-natto--py-ff69b4\"></a>\n<a href=\"https://github.com/chezou/Mykytea-python\"><img src=\"https://img.shields.io/badge/KyTea-Mykytea--python-ff69b4\"></a>\n<a href=\"https://github.com/mocobeta/janome\"><img src=\"https://img.shields.io/badge/Janome-janome-ff69b4\"></a>\n<a href=\"https://github.com/WorksApplications/SudachiPy\"><img src=\"https://img.shields.io/badge/Sudachi-sudachipy-ff69b4\"></a>\n<a href=\"https://github.com/google/sentencepiece\"><img src=\"https://img.shields.io/badge/Sentencepiece-sentencepiece-ff69b4\"></a>\n<a href=\"https://github.com/taishi-i/nagisa\"><img src=\"https://img.shields.io/badge/nagisa-nagisa-ff69b4\"></a>\n\nAlso, `konoha` provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.\n\n\n## Quick Start with Docker\n\nSimply run followings on your computer:\n\n```bash\ndocker run --rm -p 8000:8000 -t himkt/konoha # from DockerHub\n```\n\nOr you can build image on your machine:\n\n```bash\ngit clone https://github.com/himkt/konoha # download konoha\ncd konoha && docker-compose up --build # build and launch container\n```\n\nTokenization is done by posting a json object to `localhost:8000/api/v1/tokenize`.\nYou can also batch tokenize by passing `texts: [\"\uff11\u3064\u76ee\u306e\u5165\u529b\", \"\uff12\u3064\u76ee\u306e\u5165\u529b\"]` to `localhost:8000/api/v1/batch_tokenize`.\n\n(API documentation is available on `localhost:8000/redoc`, you can check it using your web browser)\n\nSend a request using `curl` on your terminal.\nNote that a path to an endpoint is changed in v4.6.4.\nPlease check our release note (https://github.com/himkt/konoha/releases/tag/v4.6.4).\n\n```json\n$ curl 
localhost:8000/api/v1/tokenize -X POST -H \"Content-Type: application/json\" \\\n -d '{\"tokenizer\": \"mecab\", \"text\": \"\u3053\u308c\u306f\u30da\u30f3\u3067\u3059\"}'\n\n{\n \"tokens\": [\n [\n {\n \"surface\": \"\u3053\u308c\",\n \"part_of_speech\": \"\u540d\u8a5e\"\n },\n {\n \"surface\": \"\u306f\",\n \"part_of_speech\": \"\u52a9\u8a5e\"\n },\n {\n \"surface\": \"\u30da\u30f3\",\n \"part_of_speech\": \"\u540d\u8a5e\"\n },\n {\n \"surface\": \"\u3067\u3059\",\n \"part_of_speech\": \"\u52a9\u52d5\u8a5e\"\n }\n ]\n ]\n}\n```\n\n\n## Installation\n\n\nI recommend you to install konoha by `pip install 'konoha[all]'`.\n\n- Install konoha with a specific tokenizer: `pip install 'konoha[(tokenizer_name)]`.\n- Install konoha with a specific tokenizer and remote file support: `pip install 'konoha[(tokenizer_name),remote]'`\n\nIf you want to install konoha with a tokenizer, please install konoha with a specific tokenizer\n(e.g. `konoha[mecab]`, `konoha[sudachi]`, ...etc) or install tokenizers individually.\n\n\n## Example\n\n### Word level tokenization\n\n```python\nfrom konoha import WordTokenizer\n\nsentence = '\u81ea\u7136\u8a00\u8a9e\u51e6\u7406\u3092\u52c9\u5f37\u3057\u3066\u3044\u307e\u3059'\n\ntokenizer = WordTokenizer('MeCab')\nprint(tokenizer.tokenize(sentence))\n# => [\u81ea\u7136, \u8a00\u8a9e, \u51e6\u7406, \u3092, \u52c9\u5f37, \u3057, \u3066, \u3044, \u307e\u3059]\n\ntokenizer = WordTokenizer('Sentencepiece', model_path=\"data/model.spm\")\nprint(tokenizer.tokenize(sentence))\n# => [\u2581, \u81ea\u7136, \u8a00\u8a9e, \u51e6\u7406, \u3092, \u52c9\u5f37, \u3057, \u3066\u3044\u307e\u3059]\n```\n\nFor more detail, please see the `example/` directory.\n\n### Remote files\n\nKonoha supports dictionary and model on cloud storage (currently supports Amazon S3).\nIt requires installing konoha with the `remote` option, see [Installation](#installation).\n\n```python\n# download user dictionary from S3\nword_tokenizer = WordTokenizer(\"mecab\", user_dictionary_path=\"s3://abc/xxx.dic\")\nprint(word_tokenizer.tokenize(sentence))\n\n# download system dictionary from S3\nword_tokenizer = WordTokenizer(\"mecab\", system_dictionary_path=\"s3://abc/yyy\")\nprint(word_tokenizer.tokenize(sentence))\n\n# download model file from S3\nword_tokenizer = WordTokenizer(\"sentencepiece\", model_path=\"s3://abc/zzz.model\")\nprint(word_tokenizer.tokenize(sentence))\n```\n\n### Sentence level tokenization\n\n```python\nfrom konoha import SentenceTokenizer\n\nsentence = \"\u79c1\u306f\u732b\u3060\u3002\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\u3002\u3060\u304c\uff0c\u300c\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300d\u3002\"\n\ntokenizer = SentenceTokenizer()\nprint(tokenizer.tokenize(sentence))\n# => ['\u79c1\u306f\u732b\u3060\u3002', '\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\u3002', '\u3060\u304c\uff0c\u300c\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300d\u3002']\n```\n\nYou can change symbols for a sentence splitter and bracket expression.\n\n1. 
sentence splitter\n\n```python\nsentence = \"\u79c1\u306f\u732b\u3060\u3002\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\uff0e\u3060\u304c\uff0c\u300c\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300d\u3002\"\n\ntokenizer = SentenceTokenizer(period=\"\uff0e\")\nprint(tokenizer.tokenize(sentence))\n# => ['\u79c1\u306f\u732b\u3060\u3002\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\uff0e', '\u3060\u304c\uff0c\u300c\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300d\u3002']\n```\n\n2. bracket expression\n\n```python\nsentence = \"\u79c1\u306f\u732b\u3060\u3002\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\u3002\u3060\u304c\uff0c\u300e\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300f\u3002\"\n\ntokenizer = SentenceTokenizer(\n patterns=SentenceTokenizer.PATTERNS + [re.compile(r\"\u300e.*?\u300f\")],\n)\nprint(tokenizer.tokenize(sentence))\n# => ['\u79c1\u306f\u732b\u3060\u3002', '\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\u3002', '\u3060\u304c\uff0c\u300e\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300f\u3002']\n```\n\n\n## Test\n\n```\npython -m pytest\n```\n\n## Article\n\n- [\u30c8\u30fc\u30af\u30ca\u30a4\u30b6\u3092\u3044\u3044\u611f\u3058\u306b\u5207\u308a\u66ff\u3048\u308b\u30e9\u30a4\u30d6\u30e9\u30ea konoha \u3092\u4f5c\u3063\u305f](https://qiita.com/klis/items/bb9ffa4d9c886af0f531)\n- [\u65e5\u672c\u8a9e\u89e3\u6790\u30c4\u30fc\u30eb Konoha \u306b AllenNLP \u9023\u643a\u6a5f\u80fd\u3092\u5b9f\u88c5\u3057\u305f](https://qiita.com/klis/items/f1d29cb431d1bf879898)\n\n## Acknowledgement\n\nSentencepiece model used in test is provided by @yoheikikuta. Thanks!\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A tiny sentence/word tokenizer for Japanese text written in Python",
"version": "5.5.5",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9da6db852163785c2acb75539878e8bd10094bfc915cd075c0f737ff21c2a139",
"md5": "3b5e887aff18e21d3f93c835f945b03e",
"sha256": "2814652c83cfecad84fec8df3b439738986d2c829375d5d6cf324b14ae5b3508"
},
"downloads": -1,
"filename": "konoha-5.5.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3b5e887aff18e21d3f93c835f945b03e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8.0,<4.0.0",
"size": 18123,
"upload_time": "2024-02-20T12:38:07",
"upload_time_iso_8601": "2024-02-20T12:38:07.474242Z",
"url": "https://files.pythonhosted.org/packages/9d/a6/db852163785c2acb75539878e8bd10094bfc915cd075c0f737ff21c2a139/konoha-5.5.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "bf32d93630f9a5d7759ebfd3bc2c4873d815d5d5746f063582f0594e6a46ea80",
"md5": "78ae386f1f3cbb2df1e4c3b925273772",
"sha256": "7e87417d6c1ab57c616e7f4cdb5a7ae8c2c0855ec7a58bd900d903fddba29811"
},
"downloads": -1,
"filename": "konoha-5.5.5.tar.gz",
"has_sig": false,
"md5_digest": "78ae386f1f3cbb2df1e4c3b925273772",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8.0,<4.0.0",
"size": 14183,
"upload_time": "2024-02-20T12:38:09",
"upload_time_iso_8601": "2024-02-20T12:38:09.140584Z",
"url": "https://files.pythonhosted.org/packages/bf/32/d93630f9a5d7759ebfd3bc2c4873d815d5d5746f063582f0594e6a46ea80/konoha-5.5.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-02-20 12:38:09",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "konoha"
}
```