udkanbun


| Field | Value |
|---|---|
| Name | udkanbun |
| Version | 3.4.5 |
| Home page | https://github.com/KoichiYasuoka/UD-Kanbun |
| Summary | Tokenizer POS-tagger and Dependency-parser for Classical Chinese |
| Upload time | 2024-11-20 10:32:51 |
| Author | Koichi Yasuoka |
| Requires Python | >=3.6 |
| License | MIT |
| Keywords | udpipe, mecab, nlp |
| Requirements | No requirements were recorded. |

[![Current PyPI packages](https://badge.fury.io/py/udkanbun.svg)](https://pypi.org/project/udkanbun/)

# UD-Kanbun

Tokenizer, POS-tagger, and dependency parser for Classical Chinese texts (漢文/文言文), based on [Universal Dependencies](https://universaldependencies.org/format.html).

## Basic usage

```py
>>> import udkanbun
>>> lzh=udkanbun.load()
>>> s=lzh("不入虎穴不得虎子")
>>> print(s)
# text = 不入虎穴不得虎子
1	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	2	advmod	_	Gloss=not|SpaceAfter=No
2	入	入	VERB	v,動詞,行為,移動	_	0	root	_	Gloss=enter|SpaceAfter=No
3	虎	虎	NOUN	n,名詞,主体,動物	_	4	nmod	_	Gloss=tiger|SpaceAfter=No
4	穴	穴	NOUN	n,名詞,固定物,地形	Case=Loc	2	obj	_	Gloss=cave|SpaceAfter=No
5	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	6	advmod	_	Gloss=not|SpaceAfter=No
6	得	得	VERB	v,動詞,行為,得失	_	2	parataxis	_	Gloss=get|SpaceAfter=No
7	虎	虎	NOUN	n,名詞,主体,動物	_	8	nmod	_	Gloss=tiger|SpaceAfter=No
8	子	子	NOUN	n,名詞,人,関係	_	6	obj	_	Gloss=child|SpaceAfter=No

>>> t=s[1]
>>> print(t.id,t.form,t.lemma,t.upos,t.xpos,t.feats,t.head.id,t.deprel,t.deps,t.misc)
1 不 不 ADV v,副詞,否定,無界 Polarity=Neg 2 advmod _ Gloss=not|SpaceAfter=No

>>> print(s.kaeriten())
不㆑入㆓虎穴㆒不㆑得㆓虎子㆒

>>> print(s.to_tree())
不 <════╗   advmod
入 ═══╗═╝═╗ root
虎 <╗ ║   ║ nmod
穴 ═╝<╝   ║ obj
不 <════╗ ║ advmod
得 ═══╗═╝<╝ parataxis
虎 <╗ ║     nmod
子 ═╝<╝     obj

>>> with open("trial.svg","w",encoding="utf-8") as f:
...     f.write(s.to_svg())
...
```
![trial.svg](https://raw.githubusercontent.com/KoichiYasuoka/UD-Kanbun/master/trial.png)

`udkanbun.load()` takes two options, shown here with their defaults: `udkanbun.load(MeCab=True,Danku=False)`. By default, the UD-Kanbun pipeline uses [MeCab](https://taku910.github.io/mecab/) for tokenization and POS-tagging, then [UDPipe](http://ufal.mff.cuni.cz/udpipe) for dependency parsing. With `MeCab=False`, the pipeline uses UDPipe for every stage. With `Danku=True`, the pipeline tries to segment sentences automatically.
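
A minimal sketch of how the `MeCab` and `Danku` options might be wrapped in a helper. The function name `load_pipeline` and its parameter names are hypothetical; only `udkanbun.load(MeCab=...,Danku=...)` comes from the description above.

```py
def load_pipeline(use_mecab=True, segment_sentences=False):
    """Return a UD-Kanbun pipeline configured via the options described above.

    `load_pipeline`, `use_mecab`, and `segment_sentences` are hypothetical
    names used only for this sketch.
    """
    # Imported lazily so the helper can be defined even where the
    # package is not yet installed.
    import udkanbun

    return udkanbun.load(MeCab=use_mecab, Danku=segment_sentences)
```

With the package installed, `load_pipeline(use_mecab=False)` would select the UDPipe-only pipeline, and `load_pipeline(segment_sentences=True)` would enable automatic sentence segmentation.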

`udkanbun.UDKanbunEntry.to_tree()` accepts the option `to_tree(BoxDrawingWidth=2)` for older terminals that render Box Drawing characters as "fullwidth". `to_tree(kaeriten=True,Japanese=True)` is convenient for Japanese users.
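
To make the structure behind the tree view concrete, here is a simplified sketch that prints an indented outline from the ID, FORM, HEAD, and DEPREL values shown in Basic usage. It is not how `to_tree()` is implemented; the helper names `children_by_head` and `outline` are hypothetical.

```py
# (id, form, head, deprel) copied from the CoNLL-U output in Basic usage
TOKENS = [
    (1, "不", 2, "advmod"),
    (2, "入", 0, "root"),
    (3, "虎", 4, "nmod"),
    (4, "穴", 2, "obj"),
    (5, "不", 6, "advmod"),
    (6, "得", 2, "parataxis"),
    (7, "虎", 8, "nmod"),
    (8, "子", 6, "obj"),
]


def children_by_head(tokens):
    """Group tokens under the id of their head (0 is the artificial root)."""
    children = {}
    for tid, form, head, deprel in tokens:
        children.setdefault(head, []).append((tid, form, deprel))
    return children


def outline(tokens):
    """Return one indented line per token, walking the tree from the root."""
    children = children_by_head(tokens)
    lines = []

    def walk(head, depth):
        for tid, form, deprel in children.get(head, []):
            lines.append("  " * depth + f"{form} ({deprel})")
            walk(tid, depth + 1)

    walk(0, 0)
    return lines


print("\n".join(outline(TOKENS)))
```

Each dependent appears one level below its head, so 入 (root) sits at the top with 不, 穴, and 得 directly beneath it.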

You can simply use `udkanbun` on the command line:
```sh
echo 不入虎穴不得虎子 | udkanbun
```
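
The tab-separated CoNLL-U text that `print(s)` produces in Basic usage can be post-processed with plain Python. The sketch below parses two sample lines copied from that output; `parse_conllu` and `FIELDS` are hypothetical names, not part of `udkanbun`.

```py
# Two token lines copied from the CoNLL-U output in Basic usage.
SAMPLE = (
    "# text = 不入虎穴不得虎子\n"
    "1\t不\t不\tADV\tv,副詞,否定,無界\tPolarity=Neg\t2\tadvmod\t_\tGloss=not|SpaceAfter=No\n"
    "2\t入\t入\tVERB\tv,動詞,行為,移動\t_\t0\troot\t_\tGloss=enter|SpaceAfter=No\n"
)

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Parse CoNLL-U text into a list of dicts, skipping comments and blanks."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # not a token line
        row = dict(zip(FIELDS, cols))
        # MISC is either "_" or "|"-separated Key=Value pairs.
        if row["misc"] == "_":
            row["misc"] = {}
        else:
            pairs = (kv.partition("=") for kv in row["misc"].split("|"))
            row["misc"] = {key: value for key, _, value in pairs}
        rows.append(row)
    return rows


rows = parse_conllu(SAMPLE)
for r in rows:
    print(r["id"], r["form"], r["upos"], r["head"], r["deprel"], r["misc"].get("Gloss"))
```

The ID and HEAD columns are kept as strings here, since CoNLL-U also allows range IDs for multiword tokens.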

## Usage via spaCy

If you have already installed [spaCy](https://pypi.org/project/spacy/) 2.1.0 or later, you can use UD-Kanbun via a spaCy Language pipeline.

```py
>>> import udkanbun.spacy
>>> lzh=udkanbun.spacy.load()
>>> d=lzh("不入虎穴不得虎子")
>>> print(type(d))
<class 'spacy.tokens.doc.Doc'>
>>> print(udkanbun.spacy.to_conllu(d))
# text = 不入虎穴不得虎子
1	不	不	ADV	v,副詞,否定,無界	_	2	advmod	_	Gloss=not|SpaceAfter=No
2	入	入	VERB	v,動詞,行為,移動	_	0	root	_	Gloss=enter|SpaceAfter=No
3	虎	虎	NOUN	n,名詞,主体,動物	_	4	nmod	_	Gloss=tiger|SpaceAfter=No
4	穴	穴	NOUN	n,名詞,固定物,地形	_	2	obj	_	Gloss=cave|SpaceAfter=No
5	不	不	ADV	v,副詞,否定,無界	_	6	advmod	_	Gloss=not|SpaceAfter=No
6	得	得	VERB	v,動詞,行為,得失	_	2	parataxis	_	Gloss=get|SpaceAfter=No
7	虎	虎	NOUN	n,名詞,主体,動物	_	8	nmod	_	Gloss=tiger|SpaceAfter=No
8	子	子	NOUN	n,名詞,人,関係	_	6	obj	_	Gloss=child|SpaceAfter=No

>>> t=d[0]
>>> print(t.i+1,t.orth_,t.lemma_,t.pos_,t.tag_,t.head.i+1,t.dep_,t.whitespace_,t.norm_)
1 不 不 ADV v,副詞,否定,無界 2 advmod  not
```
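
One detail to watch when converting spaCy indices by hand: spaCy tokens are 0-based and a root token is its own head, whereas CoNLL-U uses 1-based IDs and `0` as the HEAD of the root. The sketch below shows the conversion on a hypothetical stand-in class (`FakeToken`) that exposes only `.i` and `.head`; `conllu_head` is also a hypothetical helper, not part of `udkanbun.spacy`.

```py
def conllu_head(token):
    """Return the 1-based CoNLL-U HEAD for a spaCy-style token (0 for the root)."""
    # In spaCy, the root of a sentence has itself as head.
    return 0 if token.head.i == token.i else token.head.i + 1


class FakeToken:
    """Hypothetical stand-in exposing only `.i` and `.head`, like a spaCy token."""

    def __init__(self, i):
        self.i = i
        self.head = self  # self-headed until attached to another token


neg, enter = FakeToken(0), FakeToken(1)
neg.head = enter  # 不 depends on 入; 入 stays self-headed, i.e. the root

print(conllu_head(neg), conllu_head(enter))  # prints: 2 0
```

Applied to the root token, the plain expression `t.head.i+1` would give the token's own ID rather than `0`, which is why the explicit check is needed.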

## Installation for Linux

A source tarball is available for Linux and is installed by default when you use `pip`:
```sh
pip install udkanbun
```
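
As a quick sanity check after installation, the following sketch uses only the standard library to confirm the interpreter meets the package's `>=3.6` requirement and that the module can be found. The helper names `meets_python_requirement` and `is_importable` are hypothetical.

```py
import importlib.util
import sys


def meets_python_requirement(version_info=sys.version_info, minimum=(3, 6)):
    """Check a (major, minor, ...) version tuple against the minimum version."""
    return tuple(version_info[:2]) >= minimum


def is_importable(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("Python >= 3.6:", meets_python_requirement())
print("udkanbun importable:", is_importable("udkanbun"))
```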

## Installation for Cygwin

Make sure the `gcc-g++`, `python37-pip`, and `python37-devel` packages are installed, and then run:
```sh
pip3.7 install udkanbun
```
In [Cygwin](https://www.cygwin.com/install.html), use the `python3.7` command instead of `python`.

## Installation for Jupyter Notebook (Google Colaboratory)

```py
!pip install udkanbun
```

Try the [notebook](https://colab.research.google.com/github/KoichiYasuoka/UD-Kanbun/blob/master/udkanbun.ipynb) on Google Colaboratory.

## Author

Koichi Yasuoka (安岡孝一)

## References

* Koichi Yasuoka: [Universal Dependencies Treebank of the Four Books in Classical Chinese](http://hdl.handle.net/2433/245217), DADH2019: 10th International Conference of Digital Archives and Digital Humanities (December 2019), pp.20-28.
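* 安岡孝一 (Koichi Yasuoka): [四書を学んだMeCab+UDPipeはセンター試験の漢文を読めるのか](http://hdl.handle.net/2433/237383) (Can MeCab+UDPipe Trained on the Four Books Read the Classical Chinese of the National Center Test?), 東洋学へのコンピュータ利用, 第30回研究セミナー (Computer Applications in Oriental Studies, 30th Research Seminar; March 8, 2019), pp.3-110.
* 安岡孝一 (Koichi Yasuoka): [漢文の依存文法解析と返り点の関係について](http://hdl.handle.net/2433/235609) (On the Relationship between Dependency Parsing of Classical Chinese and Kaeriten), 日本漢字学会第1回研究大会予稿集 (Proceedings of the 1st Annual Meeting of the Japan Society for Kanji Studies; December 1, 2018), pp.33-48.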


            

        