# handic-py
This package installs [HanDic](https://github.com/okikirmui/handic), a dictionary for morphological analysis of Korean, via pip so that it can be used from Python.
To perform morphological analysis with this package, a MeCab wrapper such as [mecab-python3](https://github.com/SamuraiT/mecab-python3) is also required.
**[notice]** Starting with v0.1.0, this package uses calendar versioning that follows the dictionary version.
## Installation
from PyPI:
```Shell
pip install handic
```
## Usage
Since HanDic expects Hangul Jamo (Unicode Hangul Jamo) as input, convert Hangul text (Unicode Hangul Syllables) beforehand, using a module such as [jamotools](https://pypi.org/project/jamotools/) or the `tools/k2jamo.py` script included in HanDic.
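As a standard-library-only sketch (not using jamotools), Unicode NFD normalization also decomposes precomposed Hangul Syllables into conjoining Hangul Jamo:

```python
import unicodedata

# NFD normalization decomposes precomposed Hangul Syllables (U+AC00-U+D7A3)
# into conjoining Unicode Hangul Jamo (U+1100-U+11FF).
jamo = unicodedata.normalize('NFD', '한국어')
print([hex(ord(ch)) for ch in jamo])
# '한' decomposes into U+1112 (initial ㅎ), U+1161 (ㅏ), U+11AB (final ㄴ)
```

jamotools offers additional conversion options, but for standard modern syllables NFD produces the same conjoining Jamo.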
### Basic
example:
```Python
import MeCab
import handic
import jamotools
mecaboption = f'-r /dev/null -d {handic.DICDIR}'
tokenizer = MeCab.Tagger(mecaboption)
tokenizer.parse('')
# Definition of "형태소" (morpheme) from the 《표준국어대사전》 (Standard Korean Language Dictionary)
sentence = u'뜻을 가진 가장 작은 말의 단위. ‘이야기책’의 ‘이야기’, ‘책’ 따위이다.'
jamo = jamotools.split_syllables(sentence, jamo_type="JAMO")
node = tokenizer.parseToNode(jamo)
while node:
    print(node.surface, node.feature)
    node = node.next
```
result:
```Shell
BOS/EOS,*,*,*,*,*,*,*,*,*,*
뜻 Noun,普通,*,*,*,뜻,뜻,*,*,B,NNG
을 Ending,助詞,対格,*,*,을02,을,*,*,*,JKO
가지 Verb,自立,*,語基2,*,가지다,가지,*,*,A,VV
ᆫ Ending,語尾,連体形,*,2接続,ㄴ05,ㄴ,*,*,*,ETM
가장 Adverb,一般,*,*,*,가장01,가장,*,*,A,MAG
작으 Adjective,自立,*,語基2,*,작다01,작으,*,*,A,VA
ᆫ Ending,語尾,連体形,*,2接続,ㄴ05,ㄴ,*,*,*,ETM
말 Noun,普通,動作,*,*,말01,말,*,*,A,NNG
의 Ending,助詞,属格,*,*,의10,의,*,*,*,JKG
단위 Noun,普通,*,*,*,단위02,단위,單位,*,C,NNG
. Symbol,ピリオド,*,*,*,.,.,*,*,*,SF
‘ Symbol,カッコ,引用符-始,*,*,‘,‘,*,*,*,SS
이야기책 Noun,普通,*,*,*,이야기책,이야기책,이야기冊,*,*,NNG
’ Symbol,カッコ,引用符-終,*,*,’,’,*,*,*,SS
의 Ending,助詞,属格,*,*,의10,의,*,*,*,JKG
‘ Symbol,カッコ,引用符-始,*,*,‘,‘,*,*,*,SS
이야기 Noun,普通,動作,*,*,이야기,이야기,*,*,A,NNG
’ Symbol,カッコ,引用符-終,*,*,’,’,*,*,*,SS
, Symbol,コンマ,*,*,*,",",",",*,*,*,SP
‘ Symbol,カッコ,引用符-始,*,*,‘,‘,*,*,*,SS
책 Noun,普通,*,*,*,책01,책,冊,*,A,NNG
’ Symbol,カッコ,引用符-終,*,*,’,’,*,*,*,SS
따위 Noun,依存名詞,*,*,*,따위,따위,*,*,*,NNB
이 Siteisi,非自立,*,語基1,*,이다,이,*,*,*,VCP
다 Ending,語尾,終止形,*,1接続,다06,다,*,*,*,EF
. Symbol,ピリオド,*,*,*,.,.,*,*,*,SF
BOS/EOS,*,*,*,*,*,*,*,*,*,*
```
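Note that the surface forms in this output are Hangul Jamo sequences. As a minimal sketch using only the standard library, NFC normalization recomposes them into precomposed syllables for display:

```python
import unicodedata

# The Jamo sequence for '뜻' as it appears in the parser output:
# U+1104 (initial ㄸ), U+1173 (ㅡ), U+11BA (final ㅅ).
surface = '\u1104\u1173\u11ba'
print(unicodedata.normalize('NFC', surface))  # → 뜻
```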
### Tokenize
example:
```Python
mecaboption = f'-r /dev/null -d {handic.DICDIR} -Otokenize'
tokenizer = MeCab.Tagger(mecaboption)
print(tokenizer.parse(jamo))
```
result:
```Shell
뜻 을 가지 ㄴ 가장 작으 ㄴ 말 의 단위 . ‘ 이야기책 ’ 의 ‘ 이야기 ’ , ‘ 책 ’ 따위 이 다 .
```
### Extracting specific POS
example:
```Python
mecaboption = f'-r /dev/null -d {handic.DICDIR}'
tokenizer = MeCab.Tagger(mecaboption)
tokenizer.parse('')
node = tokenizer.parseToNode(jamo)
while node:
    # extract only common nouns (pos-tag: NNG)
    if node.feature.split(',')[10] in ['NNG']:
        print(node.feature.split(',')[5])
    node = node.next
```
result:
```Shell
뜻
말01
단위02
이야기책
이야기
책01
```
## Features
Below is the list of features HanDic provides. For more information, see the [HanDic 품사 정보](https://github.com/okikirmui/handic/blob/main/docs/pos_detail.md).
- 품사1, 품사2, 품사3: part of speech (index: 0-2)
- 활용형: conjugation "base" (e.g. `語基1`, `語基2`, `語基3`) (index: 3)
- 접속 정보: which "base" the ending attaches to (e.g. `1接続`, `2接続`, etc.) (index: 4)
- 사전 항목: base form (index: 5)
- 표층형: surface form (index: 6)
- 한자: Hanja, for Sino-Korean words (index: 7)
- 보충 정보: miscellaneous information (index: 8)
- 학습 수준: learning level (index: 9)
- 세종계획 품사 태그: Sejong project POS tag (index: 10)
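As an illustrative sketch, a feature string can be split into a dict keyed by these indices. The English field names below are ad-hoc labels for this example, not part of HanDic:

```python
# Ad-hoc English labels for the 11 feature fields listed above.
FEATURE_FIELDS = [
    'pos1', 'pos2', 'pos3',  # 품사1-3 (index 0-2)
    'conjugation_base',      # 활용형 (index 3)
    'attachment_info',       # 접속 정보 (index 4)
    'lemma',                 # 사전 항목 (index 5)
    'surface',               # 표층형 (index 6)
    'hanja',                 # 한자 (index 7)
    'misc',                  # 보충 정보 (index 8)
    'learning_level',        # 학습 수준 (index 9)
    'sejong_pos_tag',        # 세종계획 품사 태그 (index 10)
]

# A feature string taken from the example output above.
feature = 'Noun,普通,*,*,*,뜻,뜻,*,*,B,NNG'
fields = dict(zip(FEATURE_FIELDS, feature.split(',')))
print(fields['lemma'], fields['sejong_pos_tag'])  # 뜻 NNG
```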
## License
This code is licensed under the MIT license. HanDic is copyright Yoshinori Sugai and distributed under the [BSD license](./LICENSE.handic).
## Acknowledgment
This repository is forked from [unidic-lite](https://github.com/polm/unidic-lite), with modifications including file additions and deletions.