unidic2ud


Nameunidic2ud JSON
Version 3.0.5 PyPI version JSON
download
home_pagehttps://github.com/KoichiYasuoka/UniDic2UD
SummaryTokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese
upload_time2024-10-30 03:12:13
maintainerNone
docs_urlNone
authorKoichi Yasuoka
requires_python>=3.6
licenseMIT
keywords unidic udpipe mecab nlp
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            [![Current PyPI packages](https://badge.fury.io/py/unidic2ud.svg)](https://pypi.org/project/unidic2ud/)

# UniDic2UD

Tokenizer, POS-tagger, lemmatizer, and dependency-parser for modern and contemporary Japanese, working on [Universal Dependencies](https://universaldependencies.org/format.html).

## Basic usage

```py
>>> import unidic2ud
>>> nlp=unidic2ud.load("kindai")
>>> s=nlp("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(s)
# text = 其國を治めんと欲する者は先づ其家を齊ふ
1	其	其の	DET	連体詞	_	2	det	_	SpaceAfter=No|Translit=ソノ
2	國	国	NOUN	名詞-普通名詞-一般	_	4	obj	_	SpaceAfter=No|Translit=クニ
3	を	を	ADP	助詞-格助詞	_	2	case	_	SpaceAfter=No|Translit=ヲ
4	治め	収める	VERB	動詞-一般	_	7	advcl	_	SpaceAfter=No|Translit=オサメ
5	ん	む	AUX	助動詞	_	4	aux	_	SpaceAfter=No|Translit=ン
6	と	と	ADP	助詞-格助詞	_	4	case	_	SpaceAfter=No|Translit=ト
7	欲する	欲する	VERB	動詞-一般	_	8	acl	_	SpaceAfter=No|Translit=ホッスル
8	者	者	NOUN	名詞-普通名詞-一般	_	14	nsubj	_	SpaceAfter=No|Translit=モノ
9	は	は	ADP	助詞-係助詞	_	8	case	_	SpaceAfter=No|Translit=ハ
10	先づ	先ず	ADV	副詞	_	14	advmod	_	SpaceAfter=No|Translit=マヅ
11	其	其の	DET	連体詞	_	12	det	_	SpaceAfter=No|Translit=ソノ
12	家	家	NOUN	名詞-普通名詞-一般	_	14	obj	_	SpaceAfter=No|Translit=ウチ
13	を	を	ADP	助詞-格助詞	_	12	case	_	SpaceAfter=No|Translit=ヲ
14	齊ふ	整える	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|Translit=トトノフ

>>> t=s[7]
>>> print(t.id,t.form,t.lemma,t.upos,t.xpos,t.feats,t.head.id,t.deprel,t.deps,t.misc)
7 欲する 欲する VERB 動詞-一般 _ 8 acl _ SpaceAfter=No|Translit=ホッスル

>>> print(s.to_tree())
    其 <══╗         det(決定詞)
    國 ═╗═╝<╗       obj(目的語)
    を <╝   ║       case(格表示)
  治め ═╗═╗═╝<╗     advcl(連用修飾節)
    ん <╝ ║   ║     aux(動詞補助成分)
    と <══╝   ║     case(格表示)
欲する ═══════╝<╗   acl(連体修飾節)
    者 ═╗═══════╝<╗ nsubj(主語)
    は <╝         ║ case(格表示)
  先づ <══════╗   ║ advmod(連用修飾語)
    其 <══╗   ║   ║ det(決定詞)
    家 ═╗═╝<╗ ║   ║ obj(目的語)
    を <╝   ║ ║   ║ case(格表示)
  齊ふ ═════╝═╝═══╝ root(親)

>>> f=open("trial.svg","w")
>>> f.write(s.to_svg())
>>> f.close()
```
![trial.svg](https://raw.githubusercontent.com/KoichiYasuoka/UniDic2UD/master/trial.png)

`unidic2ud.load(UniDic,UDPipe)` loads a natural language processor pipeline, which uses `UniDic` for tokenizer POS-tagger and lemmatizer, then uses `UDPipe` for dependency-parser. The default `UDPipe` is `UDPipe="japanese-modern"`. Available `UniDic` options are:

* `UniDic="gendai"`: Use [現代書き言葉UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_bccwj).
* `UniDic="spoken"`: Use [現代話し言葉UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_csj).
* `UniDic="novel"`: Use [近現代口語小説UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_novel).
* `UniDic="qkana"`: Use [旧仮名口語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_qkana).
* `UniDic="kindai"`: Use [近代文語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_kindai).
* `UniDic="kinsei"`: Use [近世江戸口語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_kinsei-edo).
* `UniDic="kyogen"`: Use [中世口語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_chusei-kougo).
* `UniDic="wakan"`: Use [中世文語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_chusei-bungo).
* `UniDic="wabun"`: Use [中古和文UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_wabun).
* `UniDic="manyo"`: Use [上代語UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_jodai).
* `UniDic=None`: Use `UDPipe` for tokenizer, POS-tagger, lemmatizer, and dependency-parser.

`unidic2ud.UniDic2UDEntry.to_tree()` has an option `to_tree(BoxDrawingWidth=2)` for old terminals, whose Box Drawing characters are "fullwidth".

You can simply use `unidic2ud` on the command line:
```sh
echo 其國を治めんと欲する者は先づ其家を齊ふ | unidic2ud -U kindai
```

## CaboCha emulator usage

```py
>>> import unidic2ud.cabocha as CaboCha
>>> c=CaboCha.Parser("kindai")
>>> s=c.parse("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(s.toString(CaboCha.FORMAT_TREE_LATTICE))
  其-D
  國を-D
治めんと-D
    欲する-D
        者は-------D
          先づ-----D
              其-D |
              家を-D
                齊ふ
EOS
* 0 1D 0/0 0.000000
其	連体詞,*,*,*,*,*,其の,ソノ,*,DET	O	1<-det-2
* 1 2D 0/1 0.000000
國	名詞,普通名詞,一般,*,*,*,国,クニ,*,NOUN	O	2<-obj-4
を	助詞,格助詞,*,*,*,*,を,ヲ,*,ADP	O	3<-case-2
* 2 3D 0/1 0.000000
治め	動詞,一般,*,*,*,*,収める,オサメ,*,VERB	O	4<-advcl-7
ん	助動詞,*,*,*,*,*,む,ン,*,AUX	O	5<-aux-4
と	助詞,格助詞,*,*,*,*,と,ト,*,ADP	O	6<-case-4
* 3 4D 0/0 0.000000
欲する	動詞,一般,*,*,*,*,欲する,ホッスル,*,VERB	O	7<-acl-8
* 4 8D 0/1 0.000000
者	名詞,普通名詞,一般,*,*,*,者,モノ,*,NOUN	O	8<-nsubj-14
は	助詞,係助詞,*,*,*,*,は,ハ,*,ADP	O	9<-case-8
* 5 8D 0/0 0.000000
先づ	副詞,*,*,*,*,*,先ず,マヅ,*,ADV	O	10<-advmod-14
* 6 7D 0/0 0.000000
其	連体詞,*,*,*,*,*,其の,ソノ,*,DET	O	11<-det-12
* 7 8D 0/1 0.000000
家	名詞,普通名詞,一般,*,*,*,家,ウチ,*,NOUN	O	12<-obj-14
を	助詞,格助詞,*,*,*,*,を,ヲ,*,ADP	O	13<-case-12
* 8 -1D 0/0 0.000000
齊ふ	動詞,一般,*,*,*,*,整える,トトノフ,*,VERB	O	14<-root
EOS
>>> for c in [s.chunk(i) for i in range(s.chunk_size())]:
...   if c.link>=0:
...     print(c,"->",s.chunk(c.link))
...
其 -> 國を
國を -> 治めんと
治めんと -> 欲する
欲する -> 者は
者は -> 齊ふ
先づ -> 齊ふ
其 -> 家を
家を -> 齊ふ
```
`CaboCha.Parser(UniDic)` is an alias for `unidic2ud.load(UniDic,UDPipe="japanese-modern")`, and its default is `UniDic=None`. `CaboCha.Tree.toString(format)` has five available formats:
* `CaboCha.FORMAT_TREE`: tree (numbered as 0)
* `CaboCha.FORMAT_LATTICE`: lattice (numbered as 1)
* `CaboCha.FORMAT_TREE_LATTICE`: tree + lattice (numbered as 2)
* `CaboCha.FORMAT_XML`: XML (numbered as 3)
* `CaboCha.FORMAT_CONLL`: Universal Dependencies CoNLL-U (numbered as 4)

You can simply use `udcabocha` on the command line:
```sh
echo 其國を治めんと欲する者は先づ其家を齊ふ | udcabocha -U kindai -f 2
```
`-U UniDic` specifies `UniDic`. `-f format` specifies the output format in 0 to 4 above (default is `-f 0`) and in 5 to 8 below:
* `-f 5`: `to_tree()`
* `-f 6`: `to_tree(BoxDrawingWidth=2)`
* `-f 7`: `to_svg()`
* `-f 8`: [raw DOT](https://graphviz.readthedocs.io/en/stable/manual.html#using-raw-dot) graph through [Immediate Catena Analysis](https://koichiyasuoka.github.io/deplacy/#deplacydot)

![dot.png](https://raw.githubusercontent.com/KoichiYasuoka/UniDic2UD/master/dot.png)

Try [notebook](https://colab.research.google.com/github/KoichiYasuoka/UniDic2UD/blob/master/udcabocha.ipynb) for Google Colaboratory.

## Usage via spaCy

If you have already installed [spaCy](https://pypi.org/project/spacy/) 2.1.0 or later, you can use `UniDic` via spaCy Language pipeline.

```py
>>> import unidic2ud.spacy
>>> nlp=unidic2ud.spacy.load("kindai")
>>> d=nlp("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(unidic2ud.spacy.to_conllu(d))
# text = 其國を治めんと欲する者は先づ其家を齊ふ
1	其	其の	DET	連体詞	_	2	det	_	SpaceAfter=No|Translit=ソノ
2	國	国	NOUN	名詞-普通名詞-一般	_	4	obj	_	SpaceAfter=No|Translit=クニ
3	を	を	ADP	助詞-格助詞	_	2	case	_	SpaceAfter=No|Translit=ヲ
4	治め	収める	VERB	動詞-一般	_	7	advcl	_	SpaceAfter=No|Translit=オサメ
5	ん	む	AUX	助動詞	_	4	aux	_	SpaceAfter=No|Translit=ン
6	と	と	ADP	助詞-格助詞	_	4	case	_	SpaceAfter=No|Translit=ト
7	欲する	欲する	VERB	動詞-一般	_	8	acl	_	SpaceAfter=No|Translit=ホッスル
8	者	者	NOUN	名詞-普通名詞-一般	_	14	nsubj	_	SpaceAfter=No|Translit=モノ
9	は	は	ADP	助詞-係助詞	_	8	case	_	SpaceAfter=No|Translit=ハ
10	先づ	先ず	ADV	副詞	_	14	advmod	_	SpaceAfter=No|Translit=マヅ
11	其	其の	DET	連体詞	_	12	det	_	SpaceAfter=No|Translit=ソノ
12	家	家	NOUN	名詞-普通名詞-一般	_	14	obj	_	SpaceAfter=No|Translit=ウチ
13	を	を	ADP	助詞-格助詞	_	12	case	_	SpaceAfter=No|Translit=ヲ
14	齊ふ	整える	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|Translit=トトノフ

>>> t=d[6]
>>> print(t.i+1,t.orth_,t.lemma_,t.pos_,t.tag_,t.head.i+1,t.dep_,t.whitespace_,t.norm_)
7 欲する 欲する VERB 動詞-一般 8 acl  ホッスル

>>> from deplacy.deprelja import deprelja
>>> for b in unidic2ud.spacy.bunsetu_spans(d):
...   for t in b.lefts:
...     print(unidic2ud.spacy.bunsetu_span(t),"->",b,"("+deprelja[t.dep_]+")")
...
其 -> 國を (決定詞)
國を -> 治めんと (目的語)
治めんと -> 欲する (連用修飾節)
欲する -> 者は (連体修飾節)
其 -> 家を (決定詞)
者は -> 齊ふ (主語)
先づ -> 齊ふ (連用修飾語)
家を -> 齊ふ (目的語)
```

`unidic2ud.spacy.load(UniDic,parser)` loads a spaCy pipeline, which uses `UniDic` for tokenizer POS-tagger and lemmatizer (as shown above), then uses `parser` for dependency-parser. The default `parser` is `parser="japanese-modern"` and available options are:

* `parser="ja_core_news_sm"`: Use [spaCy Japanese model](https://spacy.io/models/ja) (small).
* `parser="ja_core_news_md"`: Use spaCy Japanese model (middle).
* `parser="ja_core_news_lg"`: Use spaCy Japanese model (large).
* `parser="ja_ginza"`: Use [GiNZA](https://github.com/megagonlabs/ginza).
* `parser="japanese-gsd"`: Use [UDPipe Japanese model](http://hdl.handle.net/11234/1-3131).
* `parser="stanza_ja"`: Use [Stanza Japanese model](https://stanfordnlp.github.io/stanza/available_models.html).

## Installation for Linux

Tar-ball is available for Linux, and is installed by default when you use `pip`:
```sh
pip install unidic2ud
```

By default installation, `UniDic` is invoked through Web APIs. If you want to invoke them locally and faster, you can download `UniDic` which you use just as follows:
```sh
python -m unidic2ud download kindai
python -m unidic2ud dictlist
```
Licenses of dictionaries and models are: GPL/LGPL/BSD for `gendai` and `spoken`; CC BY-NC-SA 4.0 for others.

## Installation for Cygwin

Make sure to get `gcc-g++` `python37-pip` `python37-devel` packages, and then:
```sh
pip3.7 install unidic2ud
```
Use `python3.7` command in [Cygwin](https://www.cygwin.com/install.html) instead of `python`.

## Installation for Jupyter Notebook (Google Colaboratory)

```py
!pip install unidic2ud
```
## Benchmarks

Results of [舞姬/雪國/荒野より-Benchmarks](https://colab.research.google.com/github/KoichiYasuoka/UniDic2UD/blob/master/benchmark/benchmark.ipynb)

|[舞姬](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/maihime-benchmark.tar.gz)|LAS|MLAS|BLEX|
|---------------|-----|-----|-----|
|UniDic="kindai"|81.13|70.37|77.78|
|UniDic="qkana" |79.25|70.37|77.78|
|UniDic="kinsei"|72.22|60.71|64.29|

|[雪國](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/yukiguni-benchmark.tar.gz)|LAS|MLAS|BLEX|
|---------------|-----|-----|-----|
|UniDic="qkana" |89.29|85.71|81.63|
|UniDic="kinsei"|89.29|85.71|77.55|
|UniDic="kindai"|84.96|81.63|77.55|

|[荒野より](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/koyayori-benchmark.tar.gz)|LAS|MLAS|BLEX|
|---------------|-----|-----|-----|
|UniDic="kindai"|76.44|61.54|53.85|
|UniDic="qkana" |75.39|61.54|53.85|
|UniDic="kinsei"|71.88|58.97|51.28|

## Author

Koichi Yasuoka (安岡孝一)

## References

* 安岡孝一: [形態素解析部の付け替えによる近代日本語(旧字旧仮名)の係り受け解析](http://hdl.handle.net/2433/254677), 情報処理学会研究報告, Vol.2020-CH-124「人文科学とコンピュータ」, No.3 (2020年9月5日), pp.1-8.
* 安岡孝一: [漢日英Universal Dependencies平行コーパスとその差異](http://hdl.handle.net/2433/245218), 人文科学とコンピュータシンポジウム「じんもんこん2019」論文集 (2019年12月), pp.43-50.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/KoichiYasuoka/UniDic2UD",
    "name": "unidic2ud",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "unidic udpipe mecab nlp",
    "author": "Koichi Yasuoka",
    "author_email": "yasuoka@kanji.zinbun.kyoto-u.ac.jp",
    "download_url": "https://files.pythonhosted.org/packages/35/29/521a8b83a86559333dfaf70bc81957a1e846e06f89971d83c3d18931fd27/unidic2ud-3.0.5.tar.gz",
    "platform": null,
    "description": "[![Current PyPI packages](https://badge.fury.io/py/unidic2ud.svg)](https://pypi.org/project/unidic2ud/)\n\n# UniDic2UD\n\nTokenizer, POS-tagger, lemmatizer, and dependency-parser for modern and contemporary Japanese, working on [Universal Dependencies](https://universaldependencies.org/format.html).\n\n## Basic usage\n\n```py\n>>> import unidic2ud\n>>> nlp=unidic2ud.load(\"kindai\")\n>>> s=nlp(\"\u5176\u570b\u3092\u6cbb\u3081\u3093\u3068\u6b32\u3059\u308b\u8005\u306f\u5148\u3065\u5176\u5bb6\u3092\u9f4a\u3075\")\n>>> print(s)\n# text = \u5176\u570b\u3092\u6cbb\u3081\u3093\u3068\u6b32\u3059\u308b\u8005\u306f\u5148\u3065\u5176\u5bb6\u3092\u9f4a\u3075\n1\t\u5176\t\u5176\u306e\tDET\t\u9023\u4f53\u8a5e\t_\t2\tdet\t_\tSpaceAfter=No|Translit=\u30bd\u30ce\n2\t\u570b\t\u56fd\tNOUN\t\u540d\u8a5e-\u666e\u901a\u540d\u8a5e-\u4e00\u822c\t_\t4\tobj\t_\tSpaceAfter=No|Translit=\u30af\u30cb\n3\t\u3092\t\u3092\tADP\t\u52a9\u8a5e-\u683c\u52a9\u8a5e\t_\t2\tcase\t_\tSpaceAfter=No|Translit=\u30f2\n4\t\u6cbb\u3081\t\u53ce\u3081\u308b\tVERB\t\u52d5\u8a5e-\u4e00\u822c\t_\t7\tadvcl\t_\tSpaceAfter=No|Translit=\u30aa\u30b5\u30e1\n5\t\u3093\t\u3080\tAUX\t\u52a9\u52d5\u8a5e\t_\t4\taux\t_\tSpaceAfter=No|Translit=\u30f3\n6\t\u3068\t\u3068\tADP\t\u52a9\u8a5e-\u683c\u52a9\u8a5e\t_\t4\tcase\t_\tSpaceAfter=No|Translit=\u30c8\n7\t\u6b32\u3059\u308b\t\u6b32\u3059\u308b\tVERB\t\u52d5\u8a5e-\u4e00\u822c\t_\t8\tacl\t_\tSpaceAfter=No|Translit=\u30db\u30c3\u30b9\u30eb\n8\t\u8005\t\u8005\tNOUN\t\u540d\u8a5e-\u666e\u901a\u540d\u8a5e-\u4e00\u822c\t_\t14\tnsubj\t_\tSpaceAfter=No|Translit=\u30e2\u30ce\n9\t\u306f\t\u306f\tADP\t\u52a9\u8a5e-\u4fc2\u52a9\u8a5e\t_\t8\tcase\t_\tSpaceAfter=No|Translit=\u30cf\n10\t\u5148\u3065\t\u5148\u305a\tADV\t\u526f\u8a5e\t_\t14\tadvmod\t_\tSpaceAfter=No|Translit=\u30de\u30c5\n11\t\u5176\t\u5176\u306e\tDET\t\u9023\u4f53\u8a5e\t_\t12\tdet\t_\tSpaceAfter=No|Translit=\u30bd\u30ce\n12\t\u5bb6\t\u5bb6\tNOUN\t\u540d\u8a5e-\u666e\u901a\u540d\u8a5e-\u4e00\u822c\t_\t14\tobj\t_\tSpaceAfter=No|Translit=\u30a6\u30c1\n13\t\u3092\t\u3092\tADP\t\u52a9\u8a5e-\u683c\u52a9\u8a5e\t_\t12\tcase\t_\tSpaceAfter=No|Translit=\u30f2\n14\t\u9f4a\u3075\t\u6574\u3048\u308b\tVERB\t\u52d5\u8a5e-\u4e00\u822c\t_\t0\troot\t_\tSpaceAfter=No|Translit=\u30c8\u30c8\u30ce\u30d5\n\n>>> t=s[7]\n>>> print(t.id,t.form,t.lemma,t.upos,t.xpos,t.feats,t.head.id,t.deprel,t.deps,t.misc)\n7 \u6b32\u3059\u308b \u6b32\u3059\u308b VERB \u52d5\u8a5e-\u4e00\u822c _ 8 acl _ SpaceAfter=No|Translit=\u30db\u30c3\u30b9\u30eb\n\n>>> print(s.to_tree())\n    \u5176 <\u2550\u2550\u2557         det(\u6c7a\u5b9a\u8a5e)\n    \u570b \u2550\u2557\u2550\u255d<\u2557       obj(\u76ee\u7684\u8a9e)\n    \u3092 <\u255d   \u2551       case(\u683c\u8868\u793a)\n  \u6cbb\u3081 \u2550\u2557\u2550\u2557\u2550\u255d<\u2557     advcl(\u9023\u7528\u4fee\u98fe\u7bc0)\n    \u3093 <\u255d \u2551   \u2551     aux(\u52d5\u8a5e\u88dc\u52a9\u6210\u5206)\n    \u3068 <\u2550\u2550\u255d   \u2551     case(\u683c\u8868\u793a)\n\u6b32\u3059\u308b \u2550\u2550\u2550\u2550\u2550\u2550\u2550\u255d<\u2557   acl(\u9023\u4f53\u4fee\u98fe\u7bc0)\n    \u8005 \u2550\u2557\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u255d<\u2557 nsubj(\u4e3b\u8a9e)\n    \u306f <\u255d         \u2551 case(\u683c\u8868\u793a)\n  \u5148\u3065 <\u2550\u2550\u2550\u2550\u2550\u2550\u2557   \u2551 advmod(\u9023\u7528\u4fee\u98fe\u8a9e)\n    \u5176 <\u2550\u2550\u2557   \u2551   \u2551 det(\u6c7a\u5b9a\u8a5e)\n    \u5bb6 \u2550\u2557\u2550\u255d<\u2557 \u2551   \u2551 obj(\u76ee\u7684\u8a9e)\n    \u3092 <\u255d   \u2551 \u2551   \u2551 case(\u683c\u8868\u793a)\n  \u9f4a\u3075 \u2550\u2550\u2550\u2550\u2550\u255d\u2550\u255d\u2550\u2550\u2550\u255d root(\u89aa)\n\n>>> f=open(\"trial.svg\",\"w\")\n>>> f.write(s.to_svg())\n>>> f.close()\n```\n![trial.svg](https://raw.githubusercontent.com/KoichiYasuoka/UniDic2UD/master/trial.png)\n\n`unidic2ud.load(UniDic,UDPipe)` loads a natural language processor pipeline, which uses `UniDic` for tokenizer POS-tagger and lemmatizer, then uses `UDPipe` for dependency-parser. The default `UDPipe` is `UDPipe=\"japanese-modern\"`. Available `UniDic` options are:\n\n* `UniDic=\"gendai\"`: Use [\u73fe\u4ee3\u66f8\u304d\u8a00\u8449UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_bccwj).\n* `UniDic=\"spoken\"`: Use [\u73fe\u4ee3\u8a71\u3057\u8a00\u8449UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_csj).\n* `UniDic=\"novel\"`: Use [\u8fd1\u73fe\u4ee3\u53e3\u8a9e\u5c0f\u8aacUniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_novel).\n* `UniDic=\"qkana\"`: Use [\u65e7\u4eee\u540d\u53e3\u8a9eUniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_qkana).\n* `UniDic=\"kindai\"`: Use [\u8fd1\u4ee3\u6587\u8a9eUniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_kindai).\n* `UniDic=\"kinsei\"`: Use [\u8fd1\u4e16\u6c5f\u6238\u53e3\u8a9eUniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_kinsei-edo).\n* `UniDic=\"kyogen\"`: Use [\u4e2d\u4e16\u53e3\u8a9eUniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_chusei-kougo).\n* `UniDic=\"wakan\"`: Use [\u4e2d\u4e16\u6587\u8a9eUniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_chusei-bungo).\n* `UniDic=\"wabun\"`: Use [\u4e2d\u53e4\u548c\u6587UniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_wabun).\n* `UniDic=\"manyo\"`: Use [\u4e0a\u4ee3\u8a9eUniDic](https://clrd.ninjal.ac.jp/unidic/download_all.html#unidic_jodai).\n* `UniDic=None`: Use `UDPipe` for tokenizer, POS-tagger, lemmatizer, and dependency-parser.\n\n`unidic2ud.UniDic2UDEntry.to_tree()` has an option `to_tree(BoxDrawingWidth=2)` for old terminals, whose Box Drawing characters are \"fullwidth\".\n\nYou can simply use `unidic2ud` on the command line:\n```sh\necho \u5176\u570b\u3092\u6cbb\u3081\u3093\u3068\u6b32\u3059\u308b\u8005\u306f\u5148\u3065\u5176\u5bb6\u3092\u9f4a\u3075 | unidic2ud -U kindai\n```\n\n## CaboCha emulator usage\n\n```py\n>>> import unidic2ud.cabocha as CaboCha\n>>> c=CaboCha.Parser(\"kindai\")\n>>> s=c.parse(\"\u5176\u570b\u3092\u6cbb\u3081\u3093\u3068\u6b32\u3059\u308b\u8005\u306f\u5148\u3065\u5176\u5bb6\u3092\u9f4a\u3075\")\n>>> print(s.toString(CaboCha.FORMAT_TREE_LATTICE))\n  \u5176-D\n  \u570b\u3092-D\n\u6cbb\u3081\u3093\u3068-D\n    \u6b32\u3059\u308b-D\n        \u8005\u306f-------D\n          \u5148\u3065-----D\n              \u5176-D |\n              \u5bb6\u3092-D\n                \u9f4a\u3075\nEOS\n* 0 1D 0/0 0.000000\n\u5176\t\u9023\u4f53\u8a5e,*,*,*,*,*,\u5176\u306e,\u30bd\u30ce,*,DET\tO\t1<-det-2\n* 1 2D 0/1 0.000000\n\u570b\t\u540d\u8a5e,\u666e\u901a\u540d\u8a5e,\u4e00\u822c,*,*,*,\u56fd,\u30af\u30cb,*,NOUN\tO\t2<-obj-4\n\u3092\t\u52a9\u8a5e,\u683c\u52a9\u8a5e,*,*,*,*,\u3092,\u30f2,*,ADP\tO\t3<-case-2\n* 2 3D 0/1 0.000000\n\u6cbb\u3081\t\u52d5\u8a5e,\u4e00\u822c,*,*,*,*,\u53ce\u3081\u308b,\u30aa\u30b5\u30e1,*,VERB\tO\t4<-advcl-7\n\u3093\t\u52a9\u52d5\u8a5e,*,*,*,*,*,\u3080,\u30f3,*,AUX\tO\t5<-aux-4\n\u3068\t\u52a9\u8a5e,\u683c\u52a9\u8a5e,*,*,*,*,\u3068,\u30c8,*,ADP\tO\t6<-case-4\n* 3 4D 0/0 0.000000\n\u6b32\u3059\u308b\t\u52d5\u8a5e,\u4e00\u822c,*,*,*,*,\u6b32\u3059\u308b,\u30db\u30c3\u30b9\u30eb,*,VERB\tO\t7<-acl-8\n* 4 8D 0/1 0.000000\n\u8005\t\u540d\u8a5e,\u666e\u901a\u540d\u8a5e,\u4e00\u822c,*,*,*,\u8005,\u30e2\u30ce,*,NOUN\tO\t8<-nsubj-14\n\u306f\t\u52a9\u8a5e,\u4fc2\u52a9\u8a5e,*,*,*,*,\u306f,\u30cf,*,ADP\tO\t9<-case-8\n* 5 8D 0/0 0.000000\n\u5148\u3065\t\u526f\u8a5e,*,*,*,*,*,\u5148\u305a,\u30de\u30c5,*,ADV\tO\t10<-advmod-14\n* 6 7D 0/0 0.000000\n\u5176\t\u9023\u4f53\u8a5e,*,*,*,*,*,\u5176\u306e,\u30bd\u30ce,*,DET\tO\t11<-det-12\n* 7 8D 0/1 0.000000\n\u5bb6\t\u540d\u8a5e,\u666e\u901a\u540d\u8a5e,\u4e00\u822c,*,*,*,\u5bb6,\u30a6\u30c1,*,NOUN\tO\t12<-obj-14\n\u3092\t\u52a9\u8a5e,\u683c\u52a9\u8a5e,*,*,*,*,\u3092,\u30f2,*,ADP\tO\t13<-case-12\n* 8 -1D 0/0 0.000000\n\u9f4a\u3075\t\u52d5\u8a5e,\u4e00\u822c,*,*,*,*,\u6574\u3048\u308b,\u30c8\u30c8\u30ce\u30d5,*,VERB\tO\t14<-root\nEOS\n>>> for c in [s.chunk(i) for i in range(s.chunk_size())]:\n...   if c.link>=0:\n...     print(c,\"->\",s.chunk(c.link))\n...\n\u5176 -> \u570b\u3092\n\u570b\u3092 -> \u6cbb\u3081\u3093\u3068\n\u6cbb\u3081\u3093\u3068 -> \u6b32\u3059\u308b\n\u6b32\u3059\u308b -> \u8005\u306f\n\u8005\u306f -> \u9f4a\u3075\n\u5148\u3065 -> \u9f4a\u3075\n\u5176 -> \u5bb6\u3092\n\u5bb6\u3092 -> \u9f4a\u3075\n```\n`CaboCha.Parser(UniDic)` is an alias for `unidic2ud.load(UniDic,UDPipe=\"japanese-modern\")`, and its default is `UniDic=None`. `CaboCha.Tree.toString(format)` has five available formats:\n* `CaboCha.FORMAT_TREE`: tree (numbered as 0)\n* `CaboCha.FORMAT_LATTICE`: lattice (numbered as 1)\n* `CaboCha.FORMAT_TREE_LATTICE`: tree + lattice (numbered as 2)\n* `CaboCha.FORMAT_XML`: XML (numbered as 3)\n* `CaboCha.FORMAT_CONLL`: Universal Dependencies CoNLL-U (numbered as 4)\n\nYou can simply use `udcabocha` on the command line:\n```sh\necho \u5176\u570b\u3092\u6cbb\u3081\u3093\u3068\u6b32\u3059\u308b\u8005\u306f\u5148\u3065\u5176\u5bb6\u3092\u9f4a\u3075 | udcabocha -U kindai -f 2\n```\n`-U UniDic` specifies `UniDic`. `-f format` specifies the output format in 0 to 4 above (default is `-f 0`) and in 5 to 8 below:\n* `-f 5`: `to_tree()`\n* `-f 6`: `to_tree(BoxDrawingWidth=2)`\n* `-f 7`: `to_svg()`\n* `-f 8`: [raw DOT](https://graphviz.readthedocs.io/en/stable/manual.html#using-raw-dot) graph through [Immediate Catena Analysis](https://koichiyasuoka.github.io/deplacy/#deplacydot)\n\n![dot.png](https://raw.githubusercontent.com/KoichiYasuoka/UniDic2UD/master/dot.png)\n\nTry [notebook](https://colab.research.google.com/github/KoichiYasuoka/UniDic2UD/blob/master/udcabocha.ipynb) for Google Colaboratory.\n\n## Usage via spaCy\n\nIf you have already installed [spaCy](https://pypi.org/project/spacy/) 2.1.0 or later, you can use `UniDic` via spaCy Language pipeline.\n\n```py\n>>> import unidic2ud.spacy\n>>> nlp=unidic2ud.spacy.load(\"kindai\")\n>>> d=nlp(\"\u5176\u570b\u3092\u6cbb\u3081\u3093\u3068\u6b32\u3059\u308b\u8005\u306f\u5148\u3065\u5176\u5bb6\u3092\u9f4a\u3075\")\n>>> print(unidic2ud.spacy.to_conllu(d))\n# text = \u5176\u570b\u3092\u6cbb\u3081\u3093\u3068\u6b32\u3059\u308b\u8005\u306f\u5148\u3065\u5176\u5bb6\u3092\u9f4a\u3075\n1\t\u5176\t\u5176\u306e\tDET\t\u9023\u4f53\u8a5e\t_\t2\tdet\t_\tSpaceAfter=No|Translit=\u30bd\u30ce\n2\t\u570b\t\u56fd\tNOUN\t\u540d\u8a5e-\u666e\u901a\u540d\u8a5e-\u4e00\u822c\t_\t4\tobj\t_\tSpaceAfter=No|Translit=\u30af\u30cb\n3\t\u3092\t\u3092\tADP\t\u52a9\u8a5e-\u683c\u52a9\u8a5e\t_\t2\tcase\t_\tSpaceAfter=No|Translit=\u30f2\n4\t\u6cbb\u3081\t\u53ce\u3081\u308b\tVERB\t\u52d5\u8a5e-\u4e00\u822c\t_\t7\tadvcl\t_\tSpaceAfter=No|Translit=\u30aa\u30b5\u30e1\n5\t\u3093\t\u3080\tAUX\t\u52a9\u52d5\u8a5e\t_\t4\taux\t_\tSpaceAfter=No|Translit=\u30f3\n6\t\u3068\t\u3068\tADP\t\u52a9\u8a5e-\u683c\u52a9\u8a5e\t_\t4\tcase\t_\tSpaceAfter=No|Translit=\u30c8\n7\t\u6b32\u3059\u308b\t\u6b32\u3059\u308b\tVERB\t\u52d5\u8a5e-\u4e00\u822c\t_\t8\tacl\t_\tSpaceAfter=No|Translit=\u30db\u30c3\u30b9\u30eb\n8\t\u8005\t\u8005\tNOUN\t\u540d\u8a5e-\u666e\u901a\u540d\u8a5e-\u4e00\u822c\t_\t14\tnsubj\t_\tSpaceAfter=No|Translit=\u30e2\u30ce\n9\t\u306f\t\u306f\tADP\t\u52a9\u8a5e-\u4fc2\u52a9\u8a5e\t_\t8\tcase\t_\tSpaceAfter=No|Translit=\u30cf\n10\t\u5148\u3065\t\u5148\u305a\tADV\t\u526f\u8a5e\t_\t14\tadvmod\t_\tSpaceAfter=No|Translit=\u30de\u30c5\n11\t\u5176\t\u5176\u306e\tDET\t\u9023\u4f53\u8a5e\t_\t12\tdet\t_\tSpaceAfter=No|Translit=\u30bd\u30ce\n12\t\u5bb6\t\u5bb6\tNOUN\t\u540d\u8a5e-\u666e\u901a\u540d\u8a5e-\u4e00\u822c\t_\t14\tobj\t_\tSpaceAfter=No|Translit=\u30a6\u30c1\n13\t\u3092\t\u3092\tADP\t\u52a9\u8a5e-\u683c\u52a9\u8a5e\t_\t12\tcase\t_\tSpaceAfter=No|Translit=\u30f2\n14\t\u9f4a\u3075\t\u6574\u3048\u308b\tVERB\t\u52d5\u8a5e-\u4e00\u822c\t_\t0\troot\t_\tSpaceAfter=No|Translit=\u30c8\u30c8\u30ce\u30d5\n\n>>> t=d[6]\n>>> print(t.i+1,t.orth_,t.lemma_,t.pos_,t.tag_,t.head.i+1,t.dep_,t.whitespace_,t.norm_)\n7 \u6b32\u3059\u308b \u6b32\u3059\u308b VERB \u52d5\u8a5e-\u4e00\u822c 8 acl  \u30db\u30c3\u30b9\u30eb\n\n>>> from deplacy.deprelja import deprelja\n>>> for b in unidic2ud.spacy.bunsetu_spans(d):\n...   for t in b.lefts:\n...     print(unidic2ud.spacy.bunsetu_span(t),\"->\",b,\"(\"+deprelja[t.dep_]+\")\")\n...\n\u5176 -> \u570b\u3092 (\u6c7a\u5b9a\u8a5e)\n\u570b\u3092 -> \u6cbb\u3081\u3093\u3068 (\u76ee\u7684\u8a9e)\n\u6cbb\u3081\u3093\u3068 -> \u6b32\u3059\u308b (\u9023\u7528\u4fee\u98fe\u7bc0)\n\u6b32\u3059\u308b -> \u8005\u306f (\u9023\u4f53\u4fee\u98fe\u7bc0)\n\u5176 -> \u5bb6\u3092 (\u6c7a\u5b9a\u8a5e)\n\u8005\u306f -> \u9f4a\u3075 (\u4e3b\u8a9e)\n\u5148\u3065 -> \u9f4a\u3075 (\u9023\u7528\u4fee\u98fe\u8a9e)\n\u5bb6\u3092 -> \u9f4a\u3075 (\u76ee\u7684\u8a9e)\n```\n\n`unidic2ud.spacy.load(UniDic,parser)` loads a spaCy pipeline, which uses `UniDic` for tokenizer POS-tagger and lemmatizer (as shown above), then uses `parser` for dependency-parser. The default `parser` is `parser=\"japanese-modern\"` and available options are:\n\n* `parser=\"ja_core_news_sm\"`: Use [spaCy Japanese model](https://spacy.io/models/ja) (small).\n* `parser=\"ja_core_news_md\"`: Use spaCy Japanese model (middle).\n* `parser=\"ja_core_news_lg\"`: Use spaCy Japanese model (large).\n* `parser=\"ja_ginza\"`: Use [GiNZA](https://github.com/megagonlabs/ginza).\n* `parser=\"japanese-gsd\"`: Use [UDPipe Japanese model](http://hdl.handle.net/11234/1-3131).\n* `parser=\"stanza_ja\"`: Use [Stanza Japanese model](https://stanfordnlp.github.io/stanza/available_models.html).\n\n## Installation for Linux\n\nTar-ball is available for Linux, and is installed by default when you use `pip`:\n```sh\npip install unidic2ud\n```\n\nBy default installation, `UniDic` is invoked through Web APIs. If you want to invoke them locally and faster, you can download `UniDic` which you use just as follows:\n```sh\npython -m unidic2ud download kindai\npython -m unidic2ud dictlist\n```\nLicenses of dictionaries and models are: GPL/LGPL/BSD for `gendai` and `spoken`; CC BY-NC-SA 4.0 for others.\n\n## Installation for Cygwin\n\nMake sure to get `gcc-g++` `python37-pip` `python37-devel` packages, and then:\n```sh\npip3.7 install unidic2ud\n```\nUse `python3.7` command in [Cygwin](https://www.cygwin.com/install.html) instead of `python`.\n\n## Installation for Jupyter Notebook (Google Colaboratory)\n\n```py\n!pip install unidic2ud\n```\n## Benchmarks\n\nResults of [\u821e\u59ec/\u96ea\u570b/\u8352\u91ce\u3088\u308a-Benchmarks](https://colab.research.google.com/github/KoichiYasuoka/UniDic2UD/blob/master/benchmark/benchmark.ipynb)\n\n|[\u821e\u59ec](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/maihime-benchmark.tar.gz)|LAS|MLAS|BLEX|\n|---------------|-----|-----|-----|\n|UniDic=\"kindai\"|81.13|70.37|77.78|\n|UniDic=\"qkana\" |79.25|70.37|77.78|\n|UniDic=\"kinsei\"|72.22|60.71|64.29|\n\n|[\u96ea\u570b](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/yukiguni-benchmark.tar.gz)|LAS|MLAS|BLEX|\n|---------------|-----|-----|-----|\n|UniDic=\"qkana\" |89.29|85.71|81.63|\n|UniDic=\"kinsei\"|89.29|85.71|77.55|\n|UniDic=\"kindai\"|84.96|81.63|77.55|\n\n|[\u8352\u91ce\u3088\u308a](https://github.com/KoichiYasuoka/UniDic2UD/blob/master/benchmark/koyayori-benchmark.tar.gz)|LAS|MLAS|BLEX|\n|---------------|-----|-----|-----|\n|UniDic=\"kindai\"|76.44|61.54|53.85|\n|UniDic=\"qkana\" |75.39|61.54|53.85|\n|UniDic=\"kinsei\"|71.88|58.97|51.28|\n\n## Author\n\nKoichi Yasuoka (\u5b89\u5ca1\u5b5d\u4e00)\n\n## References\n\n* \u5b89\u5ca1\u5b5d\u4e00: [\u5f62\u614b\u7d20\u89e3\u6790\u90e8\u306e\u4ed8\u3051\u66ff\u3048\u306b\u3088\u308b\u8fd1\u4ee3\u65e5\u672c\u8a9e(\u65e7\u5b57\u65e7\u4eee\u540d)\u306e\u4fc2\u308a\u53d7\u3051\u89e3\u6790](http://hdl.handle.net/2433/254677), \u60c5\u5831\u51e6\u7406\u5b66\u4f1a\u7814\u7a76\u5831\u544a, Vol.2020-CH-124\u300c\u4eba\u6587\u79d1\u5b66\u3068\u30b3\u30f3\u30d4\u30e5\u30fc\u30bf\u300d, No.3 (2020\u5e749\u67085\u65e5), pp.1-8.\n* \u5b89\u5ca1\u5b5d\u4e00: [\u6f22\u65e5\u82f1Universal Dependencies\u5e73\u884c\u30b3\u30fc\u30d1\u30b9\u3068\u305d\u306e\u5dee\u7570](http://hdl.handle.net/2433/245218), \u4eba\u6587\u79d1\u5b66\u3068\u30b3\u30f3\u30d4\u30e5\u30fc\u30bf\u30b7\u30f3\u30dd\u30b8\u30a6\u30e0\u300c\u3058\u3093\u3082\u3093\u3053\u30932019\u300d\u8ad6\u6587\u96c6 (2019\u5e7412\u6708), pp.43-50.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese",
    "version": "3.0.5",
    "project_urls": {
        "Homepage": "https://github.com/KoichiYasuoka/UniDic2UD",
        "Source": "https://github.com/KoichiYasuoka/UniDic2UD",
        "Tracker": "https://github.com/KoichiYasuoka/UniDic2UD/issues",
        "japanese-modern": "https://github.com/UniversalDependencies/UD_Japanese-Modern",
        "ud-ja-kanbun": "https://corpus.kanji.zinbun.kyoto-u.ac.jp/gitlab/Kanbun/ud-ja-kanbun"
    },
    "split_keywords": [
        "unidic",
        "udpipe",
        "mecab",
        "nlp"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3529521a8b83a86559333dfaf70bc81957a1e846e06f89971d83c3d18931fd27",
                "md5": "d1f3048fea799768f67051f88ecd95b2",
                "sha256": "78200492dd7a4e4b0be6eb9bcc07c792706debeafd95284a51b53b2cd1c0d367"
            },
            "downloads": -1,
            "filename": "unidic2ud-3.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "d1f3048fea799768f67051f88ecd95b2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 10066540,
            "upload_time": "2024-10-30T03:12:13",
            "upload_time_iso_8601": "2024-10-30T03:12:13.390394Z",
            "url": "https://files.pythonhosted.org/packages/35/29/521a8b83a86559333dfaf70bc81957a1e846e06f89971d83c3d18931fd27/unidic2ud-3.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-30 03:12:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "KoichiYasuoka",
    "github_project": "UniDic2UD",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "unidic2ud"
}
        
Elapsed time: 0.38542s