# Bunkai
[![PyPI version](https://badge.fury.io/py/bunkai.svg)](https://badge.fury.io/py/bunkai)
[![Python Versions](https://img.shields.io/pypi/pyversions/bunkai.svg)](https://pypi.org/project/bunkai/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Downloads](https://pepy.tech/badge/bunkai/week)](https://pepy.tech/project/bunkai)
[![CI](https://github.com/megagonlabs/bunkai/actions/workflows/ci.yml/badge.svg)](https://github.com/megagonlabs/bunkai/actions/workflows/ci.yml)
[![Typos](https://github.com/megagonlabs/bunkai/actions/workflows/typos.yml/badge.svg)](https://github.com/megagonlabs/bunkai/actions/workflows/typos.yml)
[![CodeQL](https://github.com/megagonlabs/bunkai/actions/workflows/codeql-analysis.yml/badge.svg)](https://github.com/megagonlabs/bunkai/actions/workflows/codeql-analysis.yml)
[![Maintainability](https://api.codeclimate.com/v1/badges/640b02fa0164c131da10/maintainability)](https://codeclimate.com/github/megagonlabs/bunkai/maintainability)
[![Test Coverage](https://api.codeclimate.com/v1/badges/640b02fa0164c131da10/test_coverage)](https://codeclimate.com/github/megagonlabs/bunkai/test_coverage)
[![markdownlint](https://img.shields.io/badge/markdown-lint-lightgrey)](https://github.com/markdownlint/markdownlint)
[![jsonlint](https://img.shields.io/badge/json-lint-lightgrey)](https://github.com/dmeranda/demjson)
[![yamllint](https://img.shields.io/badge/yaml-lint-lightgrey)](https://github.com/adrienverge/yamllint)
Bunkai is a sentence boundary (SB) disambiguation tool for Japanese text.
## Quick Start
### Install
```console
$ pip install -U bunkai
```
### Disambiguation without Models
```console
$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\n2文書目の先頭行です。▁改行はU+2581で表現します。' \
| bunkai
宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★
2文書目の先頭行です。▁│改行はU+2581で表現します。
```
- Each input line is treated as one document. Represent line breaks within a document with ``▁`` (U+2581).
- Sentence boundaries in the output are marked with ``│`` (U+2502).
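The input and output conventions above can be handled with plain string operations. The following is a minimal sketch, not part of Bunkai itself: the helper names `to_bunkai_line` and `split_sentences` are hypothetical, and the code assumes only the two marker characters described above.

```python
from typing import List

LINE_BREAK = "\u2581"  # ▁ stands for a line break inside a document
BOUNDARY = "\u2502"  # │ marks a sentence boundary in the output


def to_bunkai_line(document: str) -> str:
    """Encode a multi-line document as the single line the CLI expects."""
    return document.replace("\r\n", "\n").replace("\n", LINE_BREAK)


def split_sentences(output_line: str) -> List[str]:
    """Split one line of CLI output into sentences, restoring line breaks."""
    return [s.replace(LINE_BREAK, "\n") for s in output_line.split(BOUNDARY)]


if __name__ == "__main__":
    print(to_bunkai_line("2文書目の先頭行です。\n改行はU+2581で表現します。"))
    print(split_sentences("2文書目の先頭行です。▁│改行はU+2581で表現します。"))
```

The encoded line can be piped to `bunkai` as in the command above, and each output line can then be passed to `split_sentences`.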
### Disambiguation for Line Breaks with a Model
To also disambiguate sentence boundaries at line breaks, pass the ``--model`` option with the path to a model directory.
First, install the extras required by the ``--model`` option.
```console
$ pip install -U 'bunkai[lb]'
```
Second, set up a model. This takes some time.
```console
$ bunkai --model bunkai-model-directory --setup
```
Then run Bunkai with the model directory specified.
```console
$ echo -e "文の途中で改行を▁入れる文章ってありますよね▁それも対象です。" | bunkai --model bunkai-model-directory
文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。
```
### Morphological Analysis Result
You can get morphological analysis results with the ``--ma`` option.
```console
$ echo -e '形態素解析し▁ます。結果を 表示します!' | bunkai --ma
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
▁
EOS
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
結果 名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
記号,空白,*,*,*,*, ,*,*
表示 名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
! 記号,一般,*,*,*,*,!,!,!
EOS
```
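The output above has one token per line (a surface form followed by a comma-separated feature string), with an `EOS` line closing each sentence. A minimal sketch of grouping such output into sentences follows. It assumes the surface form and the features are separated by a tab character, as in MeCab-style output, and `parse_ma_output` is a hypothetical helper name, not a Bunkai function.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, feature string; empty if absent)


def parse_ma_output(text: str) -> List[List[Token]]:
    """Group token lines into sentences, using EOS lines as separators."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        if not line:
            continue
        # Lines without a tab (such as the ▁ marker) get empty features.
        surface, _, features = line.partition("\t")
        current.append((surface, features))
    if current:  # keep trailing tokens if the final EOS is missing
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = (
        "ます\t助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス\n"
        "。\t記号,句点,*,*,*,*,。,。,。\n"
        "EOS\n"
    )
    for sentence in parse_ma_output(sample):
        print("".join(surface for surface, _ in sentence))
```

Keeping the feature string unsplit avoids guessing at the meaning of individual fields; split it on `,` if a specific field is needed.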
### Python Library
You can also use Bunkai as a Python library.
```python
from bunkai import Bunkai

bunkai = Bunkai()
# Calling the instance yields the sentences of the input text one by one.
for sentence in bunkai("はい。このようにpythonライブラリとしても使えます!"):
    print(sentence)
```
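To get output resembling the command-line format from Python, the sentences can be joined with the boundary marker. This is a sketch under assumptions: `mark_boundaries` is a hypothetical helper that accepts any callable mapping a text to an iterable of sentences (such as the `Bunkai()` instance above), and `naive_splitter` is only a stand-in so the block runs without Bunkai installed.

```python
from typing import Callable, Iterable, Iterator

BOUNDARY = "\u2502"  # │, the marker shown between sentences in CLI output


def mark_boundaries(segmenter: Callable[[str], Iterable[str]], text: str) -> str:
    """Join the sentences produced by `segmenter` with the boundary marker."""
    return BOUNDARY.join(segmenter(text))


def naive_splitter(text: str) -> Iterator[str]:
    """Stand-in segmenter for the demo: split after each '。' only."""
    start = 0
    for i, ch in enumerate(text):
        if ch == "。":
            yield text[start : i + 1]
            start = i + 1
    if start < len(text):
        yield text[start:]


if __name__ == "__main__":
    # With Bunkai installed, pass `Bunkai()` instead of `naive_splitter`.
    print(mark_boundaries(naive_splitter, "はい。このように使えます。"))
```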
For more examples, see [examples](example).
## Documents
- [Documents](docs)
## References
- Yuta Hayashibe and Kensuke Mitsuzawa.
Sentence Boundary Detection on Line Breaks in Japanese.
Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp.71-75.
November 2020.
[[PDF]](https://www.aclweb.org/anthology/2020.wnut-1.10.pdf)
[[bib]](https://www.aclweb.org/anthology/2020.wnut-1.10.bib)
## License
Apache License 2.0
Raw data
{
"_id": null,
"home_page": "https://github.com/megagonlabs/bunkai",
"name": "bunkai",
"maintainer": "Yuta Hayashibe",
"docs_url": null,
"requires_python": ">=3.8,<3.12",
"maintainer_email": "hayashibe@megagon.ai",
"keywords": "Japanese,Sentence boundary disambiguation",
"author": "Yuta Hayashibe",
"author_email": "hayashibe@megagon.ai",
"download_url": "https://files.pythonhosted.org/packages/a4/54/1b985512562ef7bc3103aa375512522ea0fd5f4a1e006bd0e267b9f5b495/bunkai-1.5.7.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Sentence boundary disambiguation tool for Japanese texts",
"version": "1.5.7",
"split_keywords": [
"japanese",
"sentence boundary disambiguation"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "944cec5a463519512d8482b175a0d7c98af9e44b0cf064ed081f77399554dcee",
"md5": "ce6a467c338d82a0e9d88f143e7307a6",
"sha256": "3454b357f944c5a89da8b01601485446680f5237d3a63d4361cc003e7c84ea40"
},
"downloads": -1,
"filename": "bunkai-1.5.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ce6a467c338d82a0e9d88f143e7307a6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8,<3.12",
"size": 62205,
"upload_time": "2023-02-09T10:08:08",
"upload_time_iso_8601": "2023-02-09T10:08:08.161675Z",
"url": "https://files.pythonhosted.org/packages/94/4c/ec5a463519512d8482b175a0d7c98af9e44b0cf064ed081f77399554dcee/bunkai-1.5.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a4541b985512562ef7bc3103aa375512522ea0fd5f4a1e006bd0e267b9f5b495",
"md5": "c7433599db628b83c41afdac921e9bc9",
"sha256": "04ea4bb81d34ecbd16a08e88afc2fbafcae0ef7ba86564a8ea81c4f7da6b68c5"
},
"downloads": -1,
"filename": "bunkai-1.5.7.tar.gz",
"has_sig": false,
"md5_digest": "c7433599db628b83c41afdac921e9bc9",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8,<3.12",
"size": 45477,
"upload_time": "2023-02-09T10:08:09",
"upload_time_iso_8601": "2023-02-09T10:08:09.936958Z",
"url": "https://files.pythonhosted.org/packages/a4/54/1b985512562ef7bc3103aa375512522ea0fd5f4a1e006bd0e267b9f5b495/bunkai-1.5.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-02-09 10:08:09",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "megagonlabs",
"github_project": "bunkai",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"lcname": "bunkai"
}