[![Current PyPI packages](https://badge.fury.io/py/unidic-combo.svg)](https://pypi.org/project/unidic-combo/)
# UniDic-COMBO
[UniDic2UD](https://github.com/KoichiYasuoka/UniDic2UD) + [COMBO-pytorch](https://gitlab.clarin-pl.eu/syntactic-tools/combo) wrapper for [spaCy](https://spacy.io)
## Basic Usage
```py
>>> import unidic_combo
>>> nlp=unidic_combo.load("kindai")
>>> doc=nlp("澤山居つた兄弟が一疋も見えぬ")
>>> print(unidic_combo.to_conllu(doc))
# text = 澤山居つた兄弟が一疋も見えぬ
1	澤山	沢山	ADV	副詞	_	2	advmod	_	SpaceAfter=No|Translit=タクサン
2	居つ	居る	VERB	動詞-非自立可能	_	4	acl	_	SpaceAfter=No|Translit=オッ
3	た	た	AUX	助動詞	_	2	aux	_	SpaceAfter=No|Translit=タ
4	兄弟	兄弟	NOUN	名詞-普通名詞-一般	_	9	nsubj	_	SpaceAfter=No|Translit=キョウダイ
5	が	が	ADP	助詞-格助詞	_	4	case	_	SpaceAfter=No|Translit=ガ
6	一	一	NUM	名詞-数詞	_	7	nummod	_	SpaceAfter=No|Translit=イチ
7	疋	匹	NOUN	接尾辞-名詞的-助数詞	_	9	obl	_	SpaceAfter=No|Translit=ピキ
8	も	も	ADP	助詞-係助詞	_	7	case	_	SpaceAfter=No|Translit=モ
9	見え	見える	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|Translit=ミエ
10	ぬ	ず	AUX	助動詞	_	9	aux	_	SpaceAfter=No|Translit=ヌ

>>> import deplacy
>>> deplacy.render(doc,Japanese=True)
澤山 ADV  <══╗     advmod(連用修飾語)
居つ VERB ═╗═╝<╗   acl(連体修飾節)
た   AUX  <╝   ║   aux(動詞補助成分)
兄弟 NOUN ═╗═══╝<╗ nsubj(主語)
が   ADP  <╝     ║ case(格表示)
一   NUM  <╗     ║ nummod(数量による修飾語)
疋   NOUN ═╝═╗<╗ ║ obl(斜格補語)
も   ADP  <══╝ ║ ║ case(格表示)
見え VERB ═╗═══╝═╝ ROOT(親)
ぬ   AUX  <╝       aux(動詞補助成分)

>>> from deplacy.deprelja import deprelja
>>> for b in unidic_combo.bunsetu_spans(doc):
...   for t in b.lefts:
...     print(unidic_combo.bunsetu_span(t),"->",b,"("+deprelja[t.dep_]+")")
...
澤山 -> 居つた (連用修飾語)
居つた -> 兄弟が (連体修飾節)
兄弟が -> 見えぬ (主語)
一疋も -> 見えぬ (斜格補語)
```
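
The string returned by `unidic_combo.to_conllu(doc)` is tab-separated CoNLL-U, so it can be post-processed without spaCy. The following is a minimal sketch, assuming the ten-column layout shown above. `parse_conllu` and `SAMPLE` are hypothetical names introduced for illustration, and `SAMPLE` copies the output above rather than calling the library.

```python
# SAMPLE reproduces the CoNLL-U output shown above (tab-separated columns).
SAMPLE = "\n".join([
    "# text = 澤山居つた兄弟が一疋も見えぬ",
    "1\t澤山\t沢山\tADV\t副詞\t_\t2\tadvmod\t_\tSpaceAfter=No|Translit=タクサン",
    "2\t居つ\t居る\tVERB\t動詞-非自立可能\t_\t4\tacl\t_\tSpaceAfter=No|Translit=オッ",
    "3\tた\tた\tAUX\t助動詞\t_\t2\taux\t_\tSpaceAfter=No|Translit=タ",
    "4\t兄弟\t兄弟\tNOUN\t名詞-普通名詞-一般\t_\t9\tnsubj\t_\tSpaceAfter=No|Translit=キョウダイ",
    "5\tが\tが\tADP\t助詞-格助詞\t_\t4\tcase\t_\tSpaceAfter=No|Translit=ガ",
    "6\t一\t一\tNUM\t名詞-数詞\t_\t7\tnummod\t_\tSpaceAfter=No|Translit=イチ",
    "7\t疋\t匹\tNOUN\t接尾辞-名詞的-助数詞\t_\t9\tobl\t_\tSpaceAfter=No|Translit=ピキ",
    "8\tも\tも\tADP\t助詞-係助詞\t_\t7\tcase\t_\tSpaceAfter=No|Translit=モ",
    "9\t見え\t見える\tVERB\t動詞-一般\t_\t0\troot\t_\tSpaceAfter=No|Translit=ミエ",
    "10\tぬ\tず\tAUX\t助動詞\t_\t9\taux\t_\tSpaceAfter=No|Translit=ヌ",
])


def parse_conllu(text):
    """Return one dict per token line, skipping comments and blank lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Ignore malformed lines and non-integer IDs or heads.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        rows.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return rows


rows = parse_conllu(SAMPLE)
forms = {r["id"]: r["form"] for r in rows}
for r in rows:
    head = forms.get(r["head"], "ROOT")  # head 0 marks the root token
    print(r["form"], "->", head, "(" + r["deprel"] + ")")
```

With the package installed, `parse_conllu(unidic_combo.to_conllu(doc))` should accept the live output in the same way, provided the columns are tab-separated as above.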
`unidic_combo.load(UniDic,BERT=True)` loads a spaCy Language pipeline for UniDic2UD + COMBO-pytorch. The available `UniDic` options are:
* `UniDic="gendai"`: Use [現代書き言葉UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_bccwj).
* `UniDic="spoken"`: Use [現代話し言葉UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_csj).
* `UniDic="novel"`: Use [近現代口語小説UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_novel).
* `UniDic="qkana"`: Use [旧仮名口語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_qkana).
* `UniDic="kindai"`: Use [近代文語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_kindai).
* `UniDic="kinsei"`: Use [近世江戸口語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_kinsei-edo).
* `UniDic="kyogen"`: Use [中世口語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_chusei-kougo).
* `UniDic="wakan"`: Use [中世文語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_chusei-bungo).
* `UniDic="wabun"`: Use [中古和文UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_wabun).
* `UniDic="manyo"`: Use [上代語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_jodai).
* `UniDic=None`: Use [unidic-lite](https://github.com/polm/unidic-lite) (default).
The `BERT=True`/`BERT=False` option enables or disables the use of [bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking).
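
Because the dictionary is selected by a plain string, a typo only surfaces when the pipeline is loaded. The sketch below validates the name against the options listed above before calling `unidic_combo.load`. `UNIDIC_OPTIONS`, `check_unidic`, and `load_pipeline` are hypothetical helpers, not part of the package, and the import is deferred so that the check itself works without the library.

```python
# Dictionary names documented in this README; None selects unidic-lite.
UNIDIC_OPTIONS = (
    "gendai", "spoken", "novel", "qkana", "kindai",
    "kinsei", "kyogen", "wakan", "wabun", "manyo",
)


def check_unidic(name):
    """Return name unchanged if it is None or a documented option."""
    if name is not None and name not in UNIDIC_OPTIONS:
        raise ValueError(
            "unknown UniDic option %r; expected None or one of %s"
            % (name, ", ".join(UNIDIC_OPTIONS))
        )
    return name


def load_pipeline(name=None, bert=True):
    """Validate the option, then build the spaCy pipeline (hypothetical wrapper)."""
    import unidic_combo  # imported lazily; requires the package to be installed

    return unidic_combo.load(check_unidic(name), BERT=bert)
```

For example, `load_pipeline("qkana", bert=False)` would request the 旧仮名口語UniDic pipeline without BERT, while `load_pipeline("edo")` raises `ValueError` before any model is loaded.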
## Installation for Linux
```sh
pip3 install unidic_combo
```
## Installation for Cygwin64
Make sure the `python37-devel`, `python37-pip`, `python37-cython`, `python37-numpy`, `python37-cffi`, `gcc-g++`, `mingw64-x86_64-gcc-g++`, `gcc-fortran`, `git`, `curl`, `make`, `cmake`, `libopenblas`, `liblapack-devel`, `libhdf5-devel`, `libfreetype-devel`, and `libuv-devel` packages are installed, then run:
```sh
curl -L https://raw.githubusercontent.com/KoichiYasuoka/UniDic-COMBO/master/cygwin64.sh | sh
```
## Installation for macOS
```sh
g++ --version
pip3 install unidic_combo --user
python3 -m spacy download en_core_web_sm --user
```
If [Jsonnet](https://github.com/google/jsonnet) fails to install, try the following before installing UniDic-COMBO:
```sh
( echo '#! /bin/sh' ; echo 'exec gcc `echo $* | sed "s/-arch [^ ]*//g"`' ) > /tmp/clang
chmod 755 /tmp/clang
env PATH="/tmp:$PATH" pip3 install jsonnet --user
```
If [fugashi](https://github.com/polm/fugashi) fails to install, try installing [MeCab](https://github.com/taku910/mecab) before installing UniDic-COMBO:
```sh
cd /tmp
git clone --depth=1 https://github.com/taku910/mecab
cd mecab/mecab
./configure --with-charset=UTF8
make && sudo make install
```
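
After any of the installation routes above, a quick check can show which Python modules are still missing before a pipeline is loaded. This is a minimal sketch using only the standard library. `missing_modules` is a hypothetical helper, and the module names in `REQUIRED` (in particular `_jsonnet` for the Jsonnet binding) are assumptions about how the packages named in this README are imported.

```python
import importlib.util

# Assumed import names for the packages mentioned in the installation notes.
REQUIRED = ["unidic_combo", "spacy", "fugashi", "_jsonnet"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("not importable:", ", ".join(missing))
else:
    print("all required modules were found")
```

The check only looks modules up and does not import them, so it stays fast and does not load any model.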
## Benchmarks
Results of [舞姬/雪國/荒野より-Benchmarks](https://colab.research.google.com/github/KoichiYasuoka/UniDic-COMBO/blob/master/benchmark.ipynb)

|[舞姬](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/maihime-benchmark.tar.gz)|LAS|MLAS|BLEX|
|---------------|-----|-----|-----|
|UniDic="kindai"|84.91|77.78|85.19|
|UniDic="qkana" |83.02|77.78|85.19|
|UniDic="kinsei"|75.93|67.86|71.43|

|[雪國](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/yukiguni-benchmark.tar.gz)|LAS|MLAS|BLEX|
|---------------|-----|-----|-----|
|UniDic="qkana" |87.50|82.35|78.43|
|UniDic="kindai"|83.19|78.43|74.51|
|UniDic="kinsei"|78.57|73.08|69.23|

|[荒野より](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/koyayori-benchmark.tar.gz)|LAS|MLAS|BLEX|
|---------------|-----|-----|-----|
|UniDic="kindai"|78.53|59.46|59.46|
|UniDic="qkana" |77.49|59.46|59.46|
|UniDic="kinsei"|76.04|59.46|59.46|
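
To help read the LAS column: the labeled attachment score is conventionally the share of tokens whose head and dependency label both match the gold annotation. The sketch below computes it under the simplifying assumption that the gold and system analyses have identical tokenization. The linked notebook remains the authoritative procedure for the figures above, and `las` and the `system` list (with one deliberately wrong label) are hypothetical illustrations, not benchmark data.

```python
def las(gold, system):
    """Percentage of tokens whose (head, deprel) pair matches the gold pair."""
    if len(gold) != len(system):
        raise ValueError("this sketch assumes identical tokenization")
    if not gold:
        return 0.0
    correct = sum(1 for g, s in zip(gold, system) if g == s)
    return 100.0 * correct / len(gold)


# (head, deprel) pairs taken from the CoNLL-U example in Basic Usage.
gold = [
    (2, "advmod"), (4, "acl"), (2, "aux"), (9, "nsubj"), (4, "case"),
    (7, "nummod"), (9, "obl"), (7, "case"), (0, "root"), (9, "aux"),
]

# Hypothetical system output: token 7 is labeled nmod instead of obl.
system = list(gold)
system[6] = (9, "nmod")

print(las(gold, system))
```

MLAS and BLEX add further conditions on part-of-speech tags, features, and lemmas, so they are not covered by this sketch.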
## Reference
* 安岡孝一: [TransformersのBERTは共通テスト『国語』を係り受け解析する夢を見るか](http://hdl.handle.net/2433/261872), 東洋学へのコンピュータ利用, 第33回研究セミナー (2021年3月5日), pp.3-34.
Raw data
{
"_id": null,
"home_page": "https://github.com/KoichiYasuoka/UniDic-COMBO",
"name": "unidic-combo",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": null,
"keywords": "NLP Japanese spaCy",
"author": "Koichi Yasuoka",
"author_email": "yasuoka@kanji.zinbun.kyoto-u.ac.jp",
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "GPL",
"summary": "UniDic2UD + COMBO-pytorch wrapper for spaCy",
"version": "1.4.3",
"project_urls": {
"COMBO-pytorch": "https://gitlab.clarin-pl.eu/syntactic-tools/combo",
"Homepage": "https://github.com/KoichiYasuoka/UniDic-COMBO",
"Source": "https://github.com/KoichiYasuoka/UniDic-COMBO",
"Tracker": "https://github.com/KoichiYasuoka/UniDic-COMBO/issues"
},
"split_keywords": [
"nlp",
"japanese",
"spacy"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "1a99e2fdb5da622c0d0687752ef5c079bf303489c8709b8e7e48d4006ed03575",
"md5": "a409f15c7b10f36015b0dab4881f072a",
"sha256": "5990cc5bfb70501857703a15d9f72504deeaf8f4e85907192a36669652eaf7d1"
},
"downloads": -1,
"filename": "unidic_combo-1.4.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a409f15c7b10f36015b0dab4881f072a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 72236,
"upload_time": "2024-11-20T10:46:57",
"upload_time_iso_8601": "2024-11-20T10:46:57.556686Z",
"url": "https://files.pythonhosted.org/packages/1a/99/e2fdb5da622c0d0687752ef5c079bf303489c8709b8e7e48d4006ed03575/unidic_combo-1.4.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-20 10:46:57",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "KoichiYasuoka",
"github_project": "UniDic-COMBO",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "unidic-combo"
}