# nkhandic-py
![PyPI - Version](https://img.shields.io/pypi/v/nkhandic)
This package installs [NK-HanDic](https://github.com/okikirmui/nkhandic), a dictionary for morphological analysis of the North Korean language, via pip so that it can be used from Python.
To use this package for morphological analysis, a MeCab wrapper such as [mecab-python3](https://github.com/SamuraiT/mecab-python3) is required.
**[notice]** Releases after v0.1.3 use calendar versioning that follows the dictionary version.
## Installation
from PyPI:
```Shell
pip install nkhandic
```
## Usage
NK-HanDic requires Hangul Jamo (the Unicode Hangul Jamo block) as input, so convert Hangul (Unicode Hangul Syllables) first, using a module such as [jamotools](https://pypi.org/project/jamotools/) or the `tools/k2jamo.py` script included in NK-HanDic.
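To illustrate what this conversion does, here is a minimal sketch that uses only the standard library. It assumes that the Jamo NK-HanDic expects are the conjoining Jamo (U+1100 block) that Unicode NFD normalization produces for precomposed syllables. The helper names `to_jamo` and `to_hangul` are hypothetical, and the output has not been checked against jamotools for every input, so treat it as an illustration rather than a drop-in replacement.

```Python
import unicodedata


def to_jamo(text):
    """Decompose precomposed Hangul syllables into conjoining Jamo (NFD)."""
    return unicodedata.normalize("NFD", text)


def to_hangul(text):
    """Recompose conjoining Jamo into Hangul syllables (NFC)."""
    return unicodedata.normalize("NFC", text)


jamo = to_jamo("한")
# One syllable becomes three code points: initial, medial, final.
print([hex(ord(c)) for c in jamo])  # ['0x1112', '0x1161', '0x11ab']
print(to_hangul(jamo) == "한")      # True
```

Note that NFD also decomposes other precomposed characters (for example accented Latin letters), which may or may not matter for your input.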
### Basic
example:
```Python
import MeCab
import nkhandic
import jamotools
mecaboption = f'-r /dev/null -d {nkhandic.DICDIR}'
tokenizer = MeCab.Tagger(mecaboption)
tokenizer.parse('')
# Rodong Sinmun editorial, May 1, 2024
sentence = u'경애하는 총비서동지에 대한 절대적인 충성심을 지니고 당중앙의 구상과 결심을 철저한 실천행동으로 받들어나가야 한다.'
jamo = jamotools.split_syllables(sentence, jamo_type="JAMO")
node = tokenizer.parseToNode(jamo)
while node:
    print(node.surface, node.feature)
    node = node.next
```
result:
```Shell
BOS/EOS,*,*,*,*,*,*,*,*,*,*
경애 Noun,普通,*,*,*,경애01,경애,敬愛,*,*,NNG
하 Suffix,動詞派生,*,語基2,*,하다02,하,*,*,*,XSV
는 Ending,語尾,連体形,*,1接続,는03,는,*,*,*,ETM
총비서 Noun,普通,*,*,*,총비서,총비서,總秘書,*,*,NNG
동지 Noun,普通,*,*,*,동지006,동지,同志,*,*,NNG,&북한어,"이름 아래 쓰여 존경과 흠모의 정을 나타내는 말."
에 Ending,助詞,処格,*,*,에04,에,*,*,*,JKB
대하 Verb,自立,*,語基2,*,대하다02,대하,對하,*,B,VV
ᆫ Ending,語尾,連体形,*,2接続,ㄴ05,ㄴ,*,*,*,ETM
절대 Noun,普通,*,*,*,절대05,절대,絶對,*,C,NNG
적 Suffix,名詞派生,*,*,*,적18,적,的,사상적,*,XSN
이 Siteisi,非自立,*,語基2,*,이다,이,*,*,*,VCP
ᆫ Ending,語尾,連体形,*,2接続,ㄴ05,ㄴ,*,*,*,ETM
충성심 Noun,普通,*,*,*,충성심,충성심,忠誠心,*,*,NNG
을 Ending,助詞,対格,*,*,을02,을,*,*,*,JKO
지니 Verb,自立,*,語基1,*,지니다,지니,*,*,C,VV
고 Ending,語尾,接続形,*,1接続,고25,고,*,*,*,EC
당중앙 Noun,普通,*,*,*,당중앙001,당중앙,黨中央,*,*,NNG,&북한어,"북한에서, 당 대회와 당 대회 사이에 노동당의 노선과 정책을 세우고 그 집행을 조직하고 지도하는 최고 지도 기관."
의 Ending,助詞,属格,*,*,의10,의,*,*,*,JKG
구상 Noun,普通,動作,*,*,구상08,구상,構想,*,*,NNG
과 Ending,助詞,接続助詞,*,*,과12,과,*,*,*,JC
결심 Noun,普通,動作,*,*,결심01,결심,決心,*,C,NNG
을 Ending,助詞,対格,*,*,을02,을,*,*,*,JKO
철저 Noun,普通,状態,*,*,철저,철저,徹底,*,*,NNG
하 Suffix,形容詞派生,*,語基2,*,하다02,하,*,*,*,XSA
ᆫ Ending,語尾,連体形,*,2接続,ㄴ05,ㄴ,*,*,*,ETM
실천 Noun,普通,動作,*,*,실천01,실천,實踐,*,C,NNG
행동 Noun,普通,動作,*,*,행동,행동,行動,*,B,NNG
으로 Ending,助詞,具格,*,*,으로,으로,*,*,*,JKB
받들어 Verb,自立,ㄹ語幹,語基3,*,받들다,받들어,*,*,*,VV
나가 Verb,非自立,*,語基1,3接続,나가다,나가,*,*,A,VX
야 Ending,語尾,接続形,*,3接続,야80,야,*,"-아야/어야",*,EC
하 Verb,非自立,*,語基2,*,하다01,하,*,*,A,VX
ᆫ다 Ending,語尾,終止形,*,2接続,ㄴ다01,ㄴ다,*,*,*,EF
. Symbol,ピリオド,*,*,*,.,.,*,*,*,SF
BOS/EOS,*,*,*,*,*,*,*,*,*,*
```
### Extracting specific POS
example:
```Python
# Extract only common nouns (pos-tag: NNG)
node = tokenizer.parseToNode(jamo)
while node:
    if node.feature.split(',')[10] in ['NNG']:
        # Print the dictionary entry (base form)
        print(node.feature.split(',')[5])
    node = node.next
```
result:
```Shell
경애01
총비서
동지006
절대05
충성심
당중앙001
구상08
결심01
철저
실천01
행동
```
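The same filter can be written as a small function that works on plain feature strings, which makes it easy to try without MeCab installed. This is only a sketch: `base_forms` is a hypothetical helper, and it assumes that the trailing digits on a base form (as in `경애01`) are homonym numbers that can be dropped for display.

```Python
import re


def base_forms(features, tags=("NNG",)):
    """Return base forms (index 5) of entries whose POS tag (index 10) is in tags."""
    result = []
    for feature in features:
        fields = feature.split(",")
        # Skip entries that have too few fields to carry a POS tag.
        if len(fields) <= 10 or fields[10] not in tags:
            continue
        # Assumption: trailing digits are homonym numbers, e.g. 경애01 -> 경애.
        result.append(re.sub(r"\d+$", "", fields[5]))
    return result


# Feature strings copied from the result above.
sample = [
    "BOS/EOS,*,*,*,*,*,*,*,*,*,*",
    "Noun,普通,*,*,*,경애01,경애,敬愛,*,*,NNG",
    "Suffix,動詞派生,*,語基2,*,하다02,하,*,*,*,XSV",
    "Verb,自立,*,語基1,*,지니다,지니,*,*,C,VV",
]
print(base_forms(sample))                       # ['경애']
print(base_forms(sample, tags=("NNG", "VV")))   # ['경애', '지니다']
```

With a real tagger, the `features` argument could be built by collecting `node.feature` while walking the nodes as in the example above.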
### Tokenize
example:
```Python
mecaboption = f'-r /dev/null -d {nkhandic.DICDIR} -Otokenize'
tokenizer = MeCab.Tagger(mecaboption)
print(tokenizer.parse(jamo))
```
result:
```Shell
경애 하 는 총비서 동지 에 대하 ㄴ 절대 적 이 ㄴ 충성심 을 지니 고 당중앙 의 구상 과 결심 을 철저 하 ㄴ 실천 행동 으로 받들어 나가 야 하 ㄴ다 .
```
## Features
Here is the list of features included in NK-HanDic. For more information, see [NK-HanDic POS information](https://github.com/okikirmui/nkhandic/blob/main/docs/pos_detail.md).
- POS 1, POS 2, POS 3 (품사1, 품사2, 품사3): part of speech (index: 0-2)
- Conjugated form (활용형): conjugation "base" (e.g. `語基1`, `語基2`, `語基3`) (index: 3)
- Connection information (접속 정보): which "base" the ending attaches to (e.g. `1接続`, `2接続`) (index: 4)
- Dictionary entry (사전 항목): base form (index: 5)
- Surface form (표층형): surface (index: 6)
- Hanja (한자): for Sino-Korean words (index: 7)
- Supplementary information (보충 정보): miscellaneous information (index: 8)
- Learning level (학습 수준): learning level (index: 9)
- Sejong Project POS tag (세종계획 품사 태그): pos-tag (index: 10)
- North Korean word marker (조선어 표시, optional): for North Korean words (index: 11)
- Definition (뜻풀이, optional): for North Korean words (index: 12)
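As a sketch of how these indexes can be used, the following maps a feature string to a dict of named fields. The English keys in `FIELD_NAMES` and the helper `parse_feature` are hypothetical names chosen for this illustration, not part of NK-HanDic. The standard `csv` module is used instead of `str.split(',')` because the optional definition field is double-quoted and may itself contain commas.

```Python
import csv

# Hypothetical English keys for feature indexes 0-12.
FIELD_NAMES = [
    "pos1", "pos2", "pos3", "conjugation", "connection", "base_form",
    "surface", "hanja", "note", "level", "pos_tag", "nk_marker", "definition",
]


def parse_feature(feature):
    """Map a comma-separated feature string to a dict of named fields."""
    # csv.reader keeps commas inside double-quoted fields intact.
    values = next(csv.reader([feature]))
    # zip stops at the shorter sequence, so optional fields may be absent.
    return dict(zip(FIELD_NAMES, values))


# Feature string copied from the result in the Usage section.
feature = (
    'Noun,普通,*,*,*,당중앙001,당중앙,黨中央,*,*,NNG,&북한어,'
    '"북한에서, 당 대회와 당 대회 사이에 노동당의 노선과 정책을 세우고 '
    '그 집행을 조직하고 지도하는 최고 지도 기관."'
)
entry = parse_feature(feature)
print(entry["pos_tag"], entry["hanja"])  # NNG 黨中央
print(entry.get("definition"))
```

Entries without the optional fields simply lack the `nk_marker` and `definition` keys, so `dict.get` is the safer way to read them.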
## License
This code is licensed under the MIT license. NK-HanDic is copyrighted by Yoshinori Sugai and is distributed under the [BSD license](./LICENSE.nkhandic).
## Acknowledgment
This repository is forked from [unidic-lite](https://github.com/polm/unidic-lite) with some modifications and file additions and deletions.