# Bunkai

- Name: bunkai
- Version: 1.5.7
- Home page: https://github.com/megagonlabs/bunkai
- Summary: Sentence boundary disambiguation tool for Japanese texts
- Upload time: 2023-02-09 10:08:09
- Maintainer: Yuta Hayashibe
- Author: Yuta Hayashibe
- Requires Python: `>=3.8,<3.12`
- License: Apache-2.0
- Keywords: japanese, sentence boundary disambiguation

[![PyPI version](https://badge.fury.io/py/bunkai.svg)](https://badge.fury.io/py/bunkai)
[![Python Versions](https://img.shields.io/pypi/pyversions/bunkai.svg)](https://pypi.org/project/bunkai/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Downloads](https://pepy.tech/badge/bunkai/week)](https://pepy.tech/project/bunkai)

[![CI](https://github.com/megagonlabs/bunkai/actions/workflows/ci.yml/badge.svg)](https://github.com/megagonlabs/bunkai/actions/workflows/ci.yml)
[![Typos](https://github.com/megagonlabs/bunkai/actions/workflows/typos.yml/badge.svg)](https://github.com/megagonlabs/bunkai/actions/workflows/typos.yml)
[![CodeQL](https://github.com/megagonlabs/bunkai/actions/workflows/codeql-analysis.yml/badge.svg)](https://github.com/megagonlabs/bunkai/actions/workflows/codeql-analysis.yml)
[![Maintainability](https://api.codeclimate.com/v1/badges/640b02fa0164c131da10/maintainability)](https://codeclimate.com/github/megagonlabs/bunkai/maintainability)
[![Test Coverage](https://api.codeclimate.com/v1/badges/640b02fa0164c131da10/test_coverage)](https://codeclimate.com/github/megagonlabs/bunkai/test_coverage)
[![markdownlint](https://img.shields.io/badge/markdown-lint-lightgrey)](https://github.com/markdownlint/markdownlint)
[![jsonlint](https://img.shields.io/badge/json-lint-lightgrey)](https://github.com/dmeranda/demjson)
[![yamllint](https://img.shields.io/badge/yaml-lint-lightgrey)](https://github.com/adrienverge/yamllint)

Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts.

## Quick Start

### Install

```console
$ pip install -U bunkai
```

### Disambiguation without Models

```console
$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\n2文書目の先頭行です。▁改行はU+2581で表現します。' \
    | bunkai
宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★
2文書目の先頭行です。▁│改行はU+2581で表現します。
```

- Each input line is treated as one document. Represent line breaks inside a document with ``▁`` (U+2581).
- The output marks sentence boundaries with ``│`` (U+2502).
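
The input and output conventions above can be handled with a few lines of plain Python. The following is a minimal sketch, not part of Bunkai itself: the helper names `encode_document` and `decode_output` are hypothetical, and rejecting documents that already contain a reserved character is a precaution chosen for this sketch.

```python
from typing import List

LINE_BREAK = "\u2581"  # ▁ stands in for a newline inside a document
BOUNDARY = "\u2502"  # │ marks a sentence boundary in the output


def encode_document(text: str) -> str:
    """Collapse a multi-line document into a single input line (hypothetical helper)."""
    if LINE_BREAK in text or BOUNDARY in text:
        # Precaution of this sketch: avoid ambiguity with the reserved characters.
        raise ValueError("document already contains a reserved character")
    return text.replace("\r\n", "\n").replace("\n", LINE_BREAK)


def decode_output(line: str) -> List[str]:
    """Split one output line into sentences and restore real newlines (hypothetical helper)."""
    sentences = line.rstrip("\n").split(BOUNDARY)
    return [sentence.replace(LINE_BREAK, "\n") for sentence in sentences]
```

The encoded line can be piped to the `bunkai` command, and each line it prints can be passed to `decode_output`.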

### Disambiguation for Line Breaks with a Model

To disambiguate sentence boundaries at line breaks as well, add the ``--model`` option with the path to the model.

First, install the extras required by the ``--model`` option.

```console
$ pip install -U 'bunkai[lb]'
```

Second, set up a model. This takes some time.

```console
$ bunkai --model bunkai-model-directory --setup
```

Then, run Bunkai with ``--model`` pointing to that directory.

```console
$ echo -e "文の途中で改行を▁入れる文章ってありますよね▁それも対象です。" | bunkai --model bunkai-model-directory
文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。
```

### Morphological Analysis Result

You can get morphological analysis results with the ``--ma`` option.

```console
$ echo -e '形態素解析し▁ます。結果を 表示します!' | bunkai --ma
形態素	名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析	名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
▁
EOS
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS
結果	名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
 	記号,空白,*,*,*,*, ,*,*
表示	名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
!	記号,一般,*,*,*,*,!,!,!
EOS
```
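
If you want to post-process this output, the sketch below groups the lines into sentences. It is an illustration only and not part of Bunkai: the names `Token` and `parse_ma_output` are hypothetical, and it assumes the layout seen in the sample above, namely a surface form, a tab, and comma-separated features, with each `EOS` line closing a sentence. Lines without a tab, such as the ``▁`` line, are kept as tokens with an empty feature list.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    surface: str
    features: List[str]  # empty for lines without a tab, such as the line-break marker


def parse_ma_output(output: str) -> List[List[Token]]:
    """Group `--ma` style lines into sentences (hypothetical helper)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in output.split("\n"):
        if not line:
            continue  # skip empty lines, e.g. after the final newline
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        surface, tab, feature_str = line.partition("\t")
        # Naive split: assumes no feature value itself contains a comma.
        features = feature_str.split(",") if tab else []
        current.append(Token(surface, features))
    if current:
        sentences.append(current)  # tolerate output without a trailing EOS
    return sentences
```

Joining the `surface` fields of one sentence reproduces its text, including any ``▁`` markers.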

### Python Library

You can also use Bunkai as a Python library.

```python
from bunkai import Bunkai
bunkai = Bunkai()
for sentence in bunkai("はい。このようにpythonライブラリとしても使えます!"):
    print(sentence)
```

For more examples, see [examples](example).

## Documents

- [Documents](docs)

## References

- Yuta Hayashibe and Kensuke Mitsuzawa.
    Sentence Boundary Detection on Line Breaks in Japanese.
    Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp.71-75.
    November 2020.
    [[PDF]](https://www.aclweb.org/anthology/2020.wnut-1.10.pdf)
    [[bib]](https://www.aclweb.org/anthology/2020.wnut-1.10.bib)

## License

Apache License 2.0

            
