unidic-combo

* Name: unidic-combo
* Version: 1.4.2
* Home page: https://github.com/KoichiYasuoka/UniDic-COMBO
* Summary: UniDic2UD + COMBO-pytorch wrapper for spaCy
* Upload time: 2023-09-25 15:59:58
* Author: Koichi Yasuoka
* Requires Python: >=3.6
* License: GPL
* Keywords: nlp, japanese, spacy

[![Current PyPI packages](https://badge.fury.io/py/unidic-combo.svg)](https://pypi.org/project/unidic-combo/)

# UniDic-COMBO

[UniDic2UD](https://github.com/KoichiYasuoka/UniDic2UD) + [COMBO-pytorch](https://gitlab.clarin-pl.eu/syntactic-tools/combo) wrapper for [spaCy](https://spacy.io)

## Basic Usage

```py
>>> import unidic_combo
>>> nlp=unidic_combo.load("kindai")
>>> doc=nlp("澤山居つた兄弟が一疋も見えぬ")
>>> print(unidic_combo.to_conllu(doc))
# text = 澤山居つた兄弟が一疋も見えぬ
1	澤山	沢山	ADV	副詞	_	2	advmod	_	SpaceAfter=No|Translit=タクサン
2	居つ	居る	VERB	動詞-非自立可能	_	4	acl	_	SpaceAfter=No|Translit=オッ
3	た	た	AUX	助動詞	_	2	aux	_	SpaceAfter=No|Translit=タ
4	兄弟	兄弟	NOUN	名詞-普通名詞-一般	_	9	nsubj	_	SpaceAfter=No|Translit=キョウダイ
5	が	が	ADP	助詞-格助詞	_	4	case	_	SpaceAfter=No|Translit=ガ
6	一	一	NUM	名詞-数詞	_	7	nummod	_	SpaceAfter=No|Translit=イチ
7	疋	匹	NOUN	接尾辞-名詞的-助数詞	_	9	obl	_	SpaceAfter=No|Translit=ピキ
8	も	も	ADP	助詞-係助詞	_	7	case	_	SpaceAfter=No|Translit=モ
9	見え	見える	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|Translit=ミエ
10	ぬ	ず	AUX	助動詞	_	9	aux	_	SpaceAfter=No|Translit=ヌ

>>> import deplacy
>>> deplacy.render(doc,Japanese=True)
澤山 ADV  <══╗     advmod(連用修飾語)
居つ VERB ═╗═╝<╗   acl(連体修飾節)
た   AUX  <╝   ║   aux(動詞補助成分)
兄弟 NOUN ═╗═══╝<╗ nsubj(主語)
が   ADP  <╝     ║ case(格表示)
一   NUM  <╗     ║ nummod(数量による修飾語)
疋   NOUN ═╝═╗<╗ ║ obl(斜格補語)
も   ADP  <══╝ ║ ║ case(格表示)
見え VERB ═╗═══╝═╝ ROOT(親)
ぬ   AUX  <╝       aux(動詞補助成分)

>>> from deplacy.deprelja import deprelja
>>> for b in unidic_combo.bunsetu_spans(doc):
...   for t in b.lefts:
...     print(unidic_combo.bunsetu_span(t),"->",b,"("+deprelja[t.dep_]+")")
...
澤山 -> 居つた (連用修飾語)
居つた -> 兄弟が (連体修飾節)
兄弟が -> 見えぬ (主語)
一疋も -> 見えぬ (斜格補語)
```
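The CoNLL-U text printed above can be post-processed with plain Python. The following is a minimal sketch, not part of UniDic-COMBO: `parse_conllu` and `CONLLU_FIELDS` are hypothetical names introduced here, and the sample string reuses two token lines from the output above.

```python
# Column names of the CoNLL-U format (10 tab-separated fields per token).
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")

def parse_conllu(text):
    """Turn CoNLL-U text into a list of per-token dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        # Skip blank lines and comment lines such as "# text = ..."
        if not line.strip() or line.startswith("#"):
            continue
        values = line.split("\t")
        if len(values) != len(CONLLU_FIELDS):
            continue  # ignore anything that is not a 10-column token line
        rows.append(dict(zip(CONLLU_FIELDS, values)))
    return rows

# Two token lines copied from the output above
sample = (
    "# text = 澤山居つた兄弟が一疋も見えぬ\n"
    "1\t澤山\t沢山\tADV\t副詞\t_\t2\tadvmod\t_\tSpaceAfter=No|Translit=タクサン\n"
    "9\t見え\t見える\tVERB\t動詞-一般\t_\t0\troot\t_\tSpaceAfter=No|Translit=ミエ\n"
)
for row in parse_conllu(sample):
    print(row["form"], row["upos"], row["head"], row["deprel"])
```

In practice the same function could be applied to the string returned by `unidic_combo.to_conllu(doc)`.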

`unidic_combo.load(UniDic,BERT=True)` loads a spaCy Language pipeline for UniDic2UD + COMBO-pytorch. The available `UniDic` options are:

* `UniDic="gendai"`: Use [現代書き言葉UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_bccwj).
* `UniDic="spoken"`: Use [現代話し言葉UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_csj).
* `UniDic="novel"`: Use [近現代口語小説UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_novel).
* `UniDic="qkana"`: Use [旧仮名口語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_qkana).
* `UniDic="kindai"`: Use [近代文語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_kindai).
* `UniDic="kinsei"`: Use [近世江戸口語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_kinsei-edo).
* `UniDic="kyogen"`: Use [中世口語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_chusei-kougo).
* `UniDic="wakan"`: Use [中世文語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_chusei-bungo).
* `UniDic="wabun"`: Use [中古和文UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_wabun).
* `UniDic="manyo"`: Use [上代語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_jodai).
* `UniDic=None`: Use [unidic-lite](https://github.com/polm/unidic-lite) (default).

The `BERT=True`/`BERT=False` option enables or disables the use of [bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking).
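As a sketch of how these options might be used together, the wrapper below checks a dictionary name against the list above before calling `unidic_combo.load`. `UNIDIC_OPTIONS` and `load_pipeline` are hypothetical names introduced for illustration; only `unidic_combo.load(UniDic,BERT=...)` comes from the package.

```python
# Dictionary names listed above; None selects unidic-lite (the default).
UNIDIC_OPTIONS = ("gendai", "spoken", "novel", "qkana", "kindai",
                  "kinsei", "kyogen", "wakan", "wabun", "manyo", None)

def load_pipeline(unidic=None, bert=True):
    """Hypothetical helper: validate the name, then call unidic_combo.load."""
    if unidic not in UNIDIC_OPTIONS:
        raise ValueError("unknown UniDic option: %r" % (unidic,))
    # Imported lazily so that the check above works without the package installed
    import unidic_combo
    return unidic_combo.load(unidic, BERT=bert)
```

For example, `load_pipeline("qkana", bert=False)` would request the 旧仮名口語UniDic pipeline without BERT, while `load_pipeline("modern")` raises `ValueError` before anything is loaded.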

## Installation for Linux

```sh
pip3 install unidic_combo
```
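After installation, a quick check that the modules used in Basic Usage can be found may be helpful. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper, and the list of module names is an assumption based on the imports shown above.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

# Modules imported in the Basic Usage example, plus spaCy itself
for name in ("unidic_combo", "spacy", "deplacy"):
    print(name, "OK" if is_installed(name) else "missing")
```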

## Installation for Cygwin64

Make sure the following packages are installed: `python37-devel` `python37-pip` `python37-cython` `python37-numpy` `python37-cffi` `gcc-g++` `mingw64-x86_64-gcc-g++` `gcc-fortran` `git` `curl` `make` `cmake` `libopenblas` `liblapack-devel` `libhdf5-devel` `libfreetype-devel` `libuv-devel`. Then run:
```sh
curl -L https://raw.githubusercontent.com/KoichiYasuoka/UniDic-COMBO/master/cygwin64.sh | sh
```

## Installation for macOS

```sh
# Check that a C++ compiler is available
g++ --version
pip3 install unidic_combo --user
python3 -m spacy download en_core_web_sm --user
```

If [Jsonnet](https://github.com/google/jsonnet) fails to install, try the following before installing UniDic-COMBO:

```sh
( echo '#! /bin/sh' ; echo 'exec gcc `echo $* | sed "s/-arch [^ ]*//g"`' ) > /tmp/clang
chmod 755 /tmp/clang
env PATH="/tmp:$PATH" pip3 install jsonnet --user
```

If [fugashi](https://github.com/polm/fugashi) fails to install, try installing [MeCab](https://github.com/taku910/mecab) before installing UniDic-COMBO:

```sh
cd /tmp
git clone --depth=1 https://github.com/taku910/mecab
cd mecab/mecab
./configure --with-charset=UTF8
make && sudo make install
```

## Benchmarks

Results of the [舞姬/雪國/荒野より-Benchmarks](https://colab.research.google.com/github/KoichiYasuoka/UniDic-COMBO/blob/master/benchmark.ipynb) notebook, with LAS, MLAS, and BLEX scores for each dictionary:

|[舞姬](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/maihime-benchmark.tar.gz)|LAS|MLAS|BLEX|
|---------------|-----|-----|-----|
|UniDic="kindai"|84.91|77.78|85.19|
|UniDic="qkana" |83.02|77.78|85.19|
|UniDic="kinsei"|75.93|67.86|71.43|

|[雪國](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/yukiguni-benchmark.tar.gz)|LAS|MLAS|BLEX|
|---------------|-----|-----|-----|
|UniDic="qkana" |87.50|82.35|78.43|
|UniDic="kindai"|83.19|78.43|74.51|
|UniDic="kinsei"|78.57|73.08|69.23|

|[荒野より](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/koyayori-benchmark.tar.gz)|LAS|MLAS|BLEX|
|---------------|-----|-----|-----|
|UniDic="kindai"|78.53|59.46|59.46|
|UniDic="qkana" |77.49|59.46|59.46|
|UniDic="kinsei"|76.04|59.46|59.46|
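For readers unfamiliar with the LAS column: the labeled attachment score is the share of tokens whose head and dependency label both match the gold annotation. The sketch below is a simplified illustration that assumes identical tokenization in both analyses; it is not the evaluation code used by the benchmark notebook and does not cover MLAS or BLEX. The function `las` is a hypothetical helper, and the `pred` list is made up for the example.

```python
def las(gold, predicted):
    """Percentage of tokens whose (head, deprel) pair matches the gold annotation."""
    if len(gold) != len(predicted):
        raise ValueError("this sketch assumes identical tokenization")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)

# (head, deprel) of the first four tokens in the Basic Usage example
gold = [(2, "advmod"), (4, "acl"), (2, "aux"), (9, "nsubj")]
# Invented prediction that gets the last label wrong
pred = [(2, "advmod"), (4, "acl"), (2, "aux"), (9, "obl")]
print(round(las(gold, pred), 2))  # 75.0
```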

## Reference

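* Koichi Yasuoka (安岡孝一): [TransformersのBERTは共通テスト『国語』を係り受け解析する夢を見るか](http://hdl.handle.net/2433/261872) (Does Transformers' BERT Dream of Dependency-Parsing the Common Test "Japanese Language"?), 東洋学へのコンピュータ利用 (Computer Applications in Oriental Studies), 33rd Research Seminar (March 5, 2021), pp.3-34.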



            
