| Field | Value |
| --- | --- |
| Name | konoha |
| Version | 5.5.5 |
| Summary | A tiny sentence/word tokenizer for Japanese text written in Python |
| Upload time | 2024-02-20 12:38:09 |
| Author | himkt |
| Requires Python | >=3.8.0,<4.0.0 |
| License | MIT |
| Requirements | No requirements were recorded. |
# 🌿 Konoha: Simple wrapper of Japanese Tokenizers
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/himkt/konoha/blob/main/example/Konoha_Example.ipynb)
<p align="center"><img src="https://user-images.githubusercontent.com/5164000/120913279-e7d62380-c6d0-11eb-8d17-6571277cdf27.gif" width="95%"></p>
[![GitHub stars](https://img.shields.io/github/stars/himkt/konoha?style=social)](https://github.com/himkt/konoha/stargazers)
[![Downloads](https://pepy.tech/badge/konoha)](https://pepy.tech/project/konoha)
[![Downloads](https://pepy.tech/badge/konoha/month)](https://pepy.tech/project/konoha/month)
[![Downloads](https://pepy.tech/badge/konoha/week)](https://pepy.tech/project/konoha/week)
[![Build Status](https://github.com/himkt/konoha/actions/workflows/ci.yml/badge.svg)](https://github.com/himkt/konoha/actions/workflows/ci.yml)
[![Documentation Status](https://readthedocs.org/projects/konoha/badge/?version=latest)](https://konoha.readthedocs.io/en/latest/?badge=latest)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/konoha)
[![PyPI](https://img.shields.io/pypi/v/konoha.svg)](https://pypi.python.org/pypi/konoha)
[![GitHub Issues](https://img.shields.io/github/issues/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)
`Konoha` is a Python library that provides an easy-to-use, integrated interface to various Japanese tokenizers,
making it easy to switch between tokenizers and speed up your pre-processing.
## Supported tokenizers
<a href="https://github.com/buruzaemon/natto-py"><img src="https://img.shields.io/badge/MeCab-natto--py-ff69b4"></a>
<a href="https://github.com/chezou/Mykytea-python"><img src="https://img.shields.io/badge/KyTea-Mykytea--python-ff69b4"></a>
<a href="https://github.com/mocobeta/janome"><img src="https://img.shields.io/badge/Janome-janome-ff69b4"></a>
<a href="https://github.com/WorksApplications/SudachiPy"><img src="https://img.shields.io/badge/Sudachi-sudachipy-ff69b4"></a>
<a href="https://github.com/google/sentencepiece"><img src="https://img.shields.io/badge/Sentencepiece-sentencepiece-ff69b4"></a>
<a href="https://github.com/taishi-i/nagisa"><img src="https://img.shields.io/badge/nagisa-nagisa-ff69b4"></a>
Also, `konoha` provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.
## Quick Start with Docker
Simply run the following on your computer:
```bash
docker run --rm -p 8000:8000 -t himkt/konoha # from DockerHub
```
Or you can build the image on your machine:
```bash
git clone https://github.com/himkt/konoha # download konoha
cd konoha && docker-compose up --build # build and launch container
```
Tokenization is done by posting a JSON object to `localhost:8000/api/v1/tokenize`.
You can also batch-tokenize by passing `texts: ["１つ目の入力", "２つ目の入力"]` to `localhost:8000/api/v1/batch_tokenize`.

(API documentation is available at `localhost:8000/redoc`; you can view it in your web browser.)

Send a request using `curl` from your terminal.
Note that the endpoint paths changed in v4.6.4.
Please check the release notes (https://github.com/himkt/konoha/releases/tag/v4.6.4).
```bash
$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
  -d '{"tokenizer": "mecab", "text": "これはペンです"}'

{
  "tokens": [
    [
      {
        "surface": "これ",
        "part_of_speech": "名詞"
      },
      {
        "surface": "は",
        "part_of_speech": "助詞"
      },
      {
        "surface": "ペン",
        "part_of_speech": "名詞"
      },
      {
        "surface": "です",
        "part_of_speech": "助動詞"
      }
    ]
  ]
}
```
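
A batch request follows the same pattern. Here is a minimal sketch using the `texts` field and the `batch_tokenize` endpoint described above; the response presumably mirrors the nested `tokens` structure shown above, with one token list per input.

```bash
# Minimal sketch of a batch request: one tokenizer, multiple input texts
curl localhost:8000/api/v1/batch_tokenize -X POST -H "Content-Type: application/json" \
  -d '{"tokenizer": "mecab", "texts": ["１つ目の入力", "２つ目の入力"]}'
```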
## Installation
We recommend installing konoha with `pip install 'konoha[all]'`.

- Install konoha with a specific tokenizer: `pip install 'konoha[(tokenizer_name)]'`.
- Install konoha with a specific tokenizer and remote file support: `pip install 'konoha[(tokenizer_name),remote]'`

To use a particular tokenizer, install konoha with the corresponding extra (e.g. `konoha[mecab]`, `konoha[sudachi]`, etc.) or install the tokenizer packages individually.
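
For example, to install everything at once, or just MeCab support together with remote file support (combining extras in one command is standard pip syntax):

```bash
# All optional tokenizers
pip install 'konoha[all]'

# MeCab support plus remote (S3) file support
pip install 'konoha[mecab,remote]'
```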
## Example
### Word level tokenization
```python
from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [自然, 言語, 処理, を, 勉強, し, て, い, ます]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [▁, 自然, 言語, 処理, を, 勉強, し, ています]
```
For more details, see the `example/` directory.
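
The rule-based tokenizers mentioned earlier go through the same interface. A minimal sketch, assuming they are selected by name like the dictionary-based tokenizers (the exact name strings `'whitespace'` and `'character'` are assumptions; check the documentation if they differ):

```python
from konoha import WordTokenizer

# Assumed tokenizer names for the rule-based tokenizers.
whitespace_tokenizer = WordTokenizer('whitespace')
print(whitespace_tokenizer.tokenize('natural language processing'))

character_tokenizer = WordTokenizer('character')
print(character_tokenizer.tokenize('自然言語処理'))
```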
### Remote files
Konoha supports loading dictionaries and models from cloud storage (currently Amazon S3).
This requires installing konoha with the `remote` option; see [Installation](#installation).
```python
from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))
```
### Sentence level tokenization
```python
from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない。だが，「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，「かわいい。それで十分だろう」。']
```
You can change the symbol used by the sentence splitter and the bracket expressions it keeps intact.
1. sentence splitter
```python
sentence = "私は猫だ。名前なんてものはない．だが，「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer(period="．")
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。名前なんてものはない．', 'だが，「かわいい。それで十分だろう」。']
```
2. bracket expression
```python
import re

sentence = "私は猫だ。名前なんてものはない。だが，『かわいい。それで十分だろう』。"

tokenizer = SentenceTokenizer(
    patterns=SentenceTokenizer.PATTERNS + [re.compile(r"『.*?』")],
)
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，『かわいい。それで十分だろう』。']
```
## Test
```
python -m pytest
```
## Article
- [Built konoha, a library that lets you switch tokenizers nicely](https://qiita.com/klis/items/bb9ffa4d9c886af0f531)
- [Implemented AllenNLP integration for the Japanese text analysis tool Konoha](https://qiita.com/klis/items/f1d29cb431d1bf879898)
## Acknowledgement
The Sentencepiece model used in the tests was provided by @yoheikikuta. Thanks!
## Raw data

```json
{
"_id": null,
"home_page": "",
"name": "konoha",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8.0,<4.0.0",
"maintainer_email": "",
"keywords": "",
"author": "himkt",
"author_email": "himkt@klis.tsukuba.ac.jp",
"download_url": "https://files.pythonhosted.org/packages/bf/32/d93630f9a5d7759ebfd3bc2c4873d815d5d5746f063582f0594e6a46ea80/konoha-5.5.5.tar.gz",
"platform": null,
"description": "# \ud83c\udf3f Konoha: Simple wrapper of Japanese Tokenizers\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/himkt/konoha/blob/main/example/Konoha_Example.ipynb)\n<p align=\"center\"><img src=\"https://user-images.githubusercontent.com/5164000/120913279-e7d62380-c6d0-11eb-8d17-6571277cdf27.gif\" width=\"95%\"></p>\n\n[![GitHub stars](https://img.shields.io/github/stars/himkt/konoha?style=social)](https://github.com/himkt/konoha/stargazers)\n\n[![Downloads](https://pepy.tech/badge/konoha)](https://pepy.tech/project/konoha)\n[![Downloads](https://pepy.tech/badge/konoha/month)](https://pepy.tech/project/konoha/month)\n[![Downloads](https://pepy.tech/badge/konoha/week)](https://pepy.tech/project/konoha/week)\n\n[![Build Status](https://github.com/himkt/konoha/actions/workflows/ci.yml/badge.svg)](https://github.com/himkt/konoha/actions/workflows/ci.yml)\n[![Documentation Status](https://readthedocs.org/projects/konoha/badge/?version=latest)](https://konoha.readthedocs.io/en/latest/?badge=latest)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/konoha)\n[![PyPI](https://img.shields.io/pypi/v/konoha.svg)](https://pypi.python.org/pypi/konoha)\n[![GitHub Issues](https://img.shields.io/github/issues/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)\n[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)\n\n`Konoha` is a Python library for providing easy-to-use integrated interface of various Japanese tokenizers,\nwhich enables you to switch a tokenizer and boost your pre-processing.\n\n## Supported tokenizers\n\n<a href=\"https://github.com/buruzaemon/natto-py\"><img src=\"https://img.shields.io/badge/MeCab-natto--py-ff69b4\"></a>\n<a href=\"https://github.com/chezou/Mykytea-python\"><img src=\"https://img.shields.io/badge/KyTea-Mykytea--python-ff69b4\"></a>\n<a href=\"https://github.com/mocobeta/janome\"><img src=\"https://img.shields.io/badge/Janome-janome-ff69b4\"></a>\n<a href=\"https://github.com/WorksApplications/SudachiPy\"><img src=\"https://img.shields.io/badge/Sudachi-sudachipy-ff69b4\"></a>\n<a href=\"https://github.com/google/sentencepiece\"><img src=\"https://img.shields.io/badge/Sentencepiece-sentencepiece-ff69b4\"></a>\n<a href=\"https://github.com/taishi-i/nagisa\"><img src=\"https://img.shields.io/badge/nagisa-nagisa-ff69b4\"></a>\n\nAlso, `konoha` provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.\n\n\n## Quick Start with Docker\n\nSimply run followings on your computer:\n\n```bash\ndocker run --rm -p 8000:8000 -t himkt/konoha # from DockerHub\n```\n\nOr you can build image on your machine:\n\n```bash\ngit clone https://github.com/himkt/konoha # download konoha\ncd konoha && docker-compose up --build # build and launch container\n```\n\nTokenization is done by posting a json object to `localhost:8000/api/v1/tokenize`.\nYou can also batch tokenize by passing `texts: [\"\uff11\u3064\u76ee\u306e\u5165\u529b\", \"\uff12\u3064\u76ee\u306e\u5165\u529b\"]` to `localhost:8000/api/v1/batch_tokenize`.\n\n(API documentation is available on `localhost:8000/redoc`, you can check it using your web browser)\n\nSend a request using `curl` on your terminal.\nNote that a path to an endpoint is changed in v4.6.4.\nPlease check our release note (https://github.com/himkt/konoha/releases/tag/v4.6.4).\n\n```json\n$ curl 
localhost:8000/api/v1/tokenize -X POST -H \"Content-Type: application/json\" \\\n -d '{\"tokenizer\": \"mecab\", \"text\": \"\u3053\u308c\u306f\u30da\u30f3\u3067\u3059\"}'\n\n{\n \"tokens\": [\n [\n {\n \"surface\": \"\u3053\u308c\",\n \"part_of_speech\": \"\u540d\u8a5e\"\n },\n {\n \"surface\": \"\u306f\",\n \"part_of_speech\": \"\u52a9\u8a5e\"\n },\n {\n \"surface\": \"\u30da\u30f3\",\n \"part_of_speech\": \"\u540d\u8a5e\"\n },\n {\n \"surface\": \"\u3067\u3059\",\n \"part_of_speech\": \"\u52a9\u52d5\u8a5e\"\n }\n ]\n ]\n}\n```\n\n\n## Installation\n\n\nI recommend you to install konoha by `pip install 'konoha[all]'`.\n\n- Install konoha with a specific tokenizer: `pip install 'konoha[(tokenizer_name)]`.\n- Install konoha with a specific tokenizer and remote file support: `pip install 'konoha[(tokenizer_name),remote]'`\n\nIf you want to install konoha with a tokenizer, please install konoha with a specific tokenizer\n(e.g. `konoha[mecab]`, `konoha[sudachi]`, ...etc) or install tokenizers individually.\n\n\n## Example\n\n### Word level tokenization\n\n```python\nfrom konoha import WordTokenizer\n\nsentence = '\u81ea\u7136\u8a00\u8a9e\u51e6\u7406\u3092\u52c9\u5f37\u3057\u3066\u3044\u307e\u3059'\n\ntokenizer = WordTokenizer('MeCab')\nprint(tokenizer.tokenize(sentence))\n# => [\u81ea\u7136, \u8a00\u8a9e, \u51e6\u7406, \u3092, \u52c9\u5f37, \u3057, \u3066, \u3044, \u307e\u3059]\n\ntokenizer = WordTokenizer('Sentencepiece', model_path=\"data/model.spm\")\nprint(tokenizer.tokenize(sentence))\n# => [\u2581, \u81ea\u7136, \u8a00\u8a9e, \u51e6\u7406, \u3092, \u52c9\u5f37, \u3057, \u3066\u3044\u307e\u3059]\n```\n\nFor more detail, please see the `example/` directory.\n\n### Remote files\n\nKonoha supports dictionary and model on cloud storage (currently supports Amazon S3).\nIt requires installing konoha with the `remote` option, see [Installation](#installation).\n\n```python\n# download user dictionary from S3\nword_tokenizer = WordTokenizer(\"mecab\", user_dictionary_path=\"s3://abc/xxx.dic\")\nprint(word_tokenizer.tokenize(sentence))\n\n# download system dictionary from S3\nword_tokenizer = WordTokenizer(\"mecab\", system_dictionary_path=\"s3://abc/yyy\")\nprint(word_tokenizer.tokenize(sentence))\n\n# download model file from S3\nword_tokenizer = WordTokenizer(\"sentencepiece\", model_path=\"s3://abc/zzz.model\")\nprint(word_tokenizer.tokenize(sentence))\n```\n\n### Sentence level tokenization\n\n```python\nfrom konoha import SentenceTokenizer\n\nsentence = \"\u79c1\u306f\u732b\u3060\u3002\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\u3002\u3060\u304c\uff0c\u300c\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300d\u3002\"\n\ntokenizer = SentenceTokenizer()\nprint(tokenizer.tokenize(sentence))\n# => ['\u79c1\u306f\u732b\u3060\u3002', '\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\u3002', '\u3060\u304c\uff0c\u300c\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300d\u3002']\n```\n\nYou can change symbols for a sentence splitter and bracket expression.\n\n1. 
sentence splitter\n\n```python\nsentence = \"\u79c1\u306f\u732b\u3060\u3002\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\uff0e\u3060\u304c\uff0c\u300c\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300d\u3002\"\n\ntokenizer = SentenceTokenizer(period=\"\uff0e\")\nprint(tokenizer.tokenize(sentence))\n# => ['\u79c1\u306f\u732b\u3060\u3002\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\uff0e', '\u3060\u304c\uff0c\u300c\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300d\u3002']\n```\n\n2. bracket expression\n\n```python\nsentence = \"\u79c1\u306f\u732b\u3060\u3002\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\u3002\u3060\u304c\uff0c\u300e\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300f\u3002\"\n\ntokenizer = SentenceTokenizer(\n patterns=SentenceTokenizer.PATTERNS + [re.compile(r\"\u300e.*?\u300f\")],\n)\nprint(tokenizer.tokenize(sentence))\n# => ['\u79c1\u306f\u732b\u3060\u3002', '\u540d\u524d\u306a\u3093\u3066\u3082\u306e\u306f\u306a\u3044\u3002', '\u3060\u304c\uff0c\u300e\u304b\u308f\u3044\u3044\u3002\u305d\u308c\u3067\u5341\u5206\u3060\u308d\u3046\u300f\u3002']\n```\n\n\n## Test\n\n```\npython -m pytest\n```\n\n## Article\n\n- [\u30c8\u30fc\u30af\u30ca\u30a4\u30b6\u3092\u3044\u3044\u611f\u3058\u306b\u5207\u308a\u66ff\u3048\u308b\u30e9\u30a4\u30d6\u30e9\u30ea konoha \u3092\u4f5c\u3063\u305f](https://qiita.com/klis/items/bb9ffa4d9c886af0f531)\n- [\u65e5\u672c\u8a9e\u89e3\u6790\u30c4\u30fc\u30eb Konoha \u306b AllenNLP \u9023\u643a\u6a5f\u80fd\u3092\u5b9f\u88c5\u3057\u305f](https://qiita.com/klis/items/f1d29cb431d1bf879898)\n\n## Acknowledgement\n\nSentencepiece model used in test is provided by @yoheikikuta. Thanks!\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A tiny sentence/word tokenizer for Japanese text written in Python",
"version": "5.5.5",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9da6db852163785c2acb75539878e8bd10094bfc915cd075c0f737ff21c2a139",
"md5": "3b5e887aff18e21d3f93c835f945b03e",
"sha256": "2814652c83cfecad84fec8df3b439738986d2c829375d5d6cf324b14ae5b3508"
},
"downloads": -1,
"filename": "konoha-5.5.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3b5e887aff18e21d3f93c835f945b03e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8.0,<4.0.0",
"size": 18123,
"upload_time": "2024-02-20T12:38:07",
"upload_time_iso_8601": "2024-02-20T12:38:07.474242Z",
"url": "https://files.pythonhosted.org/packages/9d/a6/db852163785c2acb75539878e8bd10094bfc915cd075c0f737ff21c2a139/konoha-5.5.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "bf32d93630f9a5d7759ebfd3bc2c4873d815d5d5746f063582f0594e6a46ea80",
"md5": "78ae386f1f3cbb2df1e4c3b925273772",
"sha256": "7e87417d6c1ab57c616e7f4cdb5a7ae8c2c0855ec7a58bd900d903fddba29811"
},
"downloads": -1,
"filename": "konoha-5.5.5.tar.gz",
"has_sig": false,
"md5_digest": "78ae386f1f3cbb2df1e4c3b925273772",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8.0,<4.0.0",
"size": 14183,
"upload_time": "2024-02-20T12:38:09",
"upload_time_iso_8601": "2024-02-20T12:38:09.140584Z",
"url": "https://files.pythonhosted.org/packages/bf/32/d93630f9a5d7759ebfd3bc2c4873d815d5d5746f063582f0594e6a46ea80/konoha-5.5.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-02-20 12:38:09",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "konoha"
}
```