handic


| Field | Value |
|---|---|
| Name | handic |
| Version | 25.1.1 |
| home_page | None |
| Summary | HanDic package for installing via pip. |
| upload_time | 2025-01-01 14:55:11 |
| maintainer | None |
| docs_url | None |
| author | Yoshinori Sugai |
| requires_python | >=3.8 |
| license | MIT License |
| keywords | handic, mecab, korean language, morphological analysis, morphological analysis dictionary, korean text processing |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# handic-py

This package installs [HanDic](https://github.com/okikirmui/handic), a dictionary for morphological analysis of the Korean language, via pip so that it can be used from Python.

To use this package for morphological analysis, a MeCab wrapper such as [mecab-python3](https://github.com/SamuraiT/mecab-python3) is required.

**[notice]** Releases after v0.1.0 use calendar versioning that follows the dictionary version.

## Installation

from PyPI:

```Shell
pip install handic
```

## Usage

HanDic requires Hangul Jamo (the Unicode Hangul Jamo block) as input, so convert Hangul (Unicode Hangul Syllables) first, using a module such as [jamotools](https://pypi.org/project/jamotools/) or the `tools/k2jamo.py` script included in HanDic.
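
For precomposed Hangul syllables, Unicode canonical decomposition (NFD) yields conjoining Jamo, so the conversion can be sketched with the standard library alone. This is a minimal sketch, not the method used by `jamotools` or `tools/k2jamo.py`. The helper names `to_jamo` and `to_syllables` are hypothetical, and whether the output matches those tools for every input (for example, text that already contains Hangul Compatibility Jamo) is an assumption to verify before relying on it.

```python
import unicodedata


def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining Jamo (NFD).

    Note: NFD also decomposes other composed characters, such as accented
    Latin letters, not only Hangul.
    """
    return unicodedata.normalize("NFD", text)


def to_syllables(jamo: str) -> str:
    """Recompose conjoining Jamo into Hangul syllables (NFC)."""
    return unicodedata.normalize("NFC", jamo)


jamo = to_jamo("뜻")
# Show the code points of the decomposed string.
print([f"U+{ord(ch):04X}" for ch in jamo])
# Round trip: recomposition gives back the original syllable.
print(to_syllables(jamo) == "뜻")
```

The same NFC recomposition may be handy for displaying `node.surface` values, which are returned in Jamo form.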

### Basic

example:

```Python
import MeCab
import handic
import jamotools

mecaboption = f'-r /dev/null -d {handic.DICDIR}'

tokenizer = MeCab.Tagger(mecaboption)
tokenizer.parse('')

# Definition of "형태소" (morpheme) from the Standard Korean Language Dictionary (《표준국어대사전》)
sentence = '뜻을 가진 가장 작은 말의 단위. ‘이야기책’의 ‘이야기’, ‘책’ 따위이다.'

jamo = jamotools.split_syllables(sentence, jamo_type="JAMO")

node = tokenizer.parseToNode(jamo)
while node:
    print(node.surface, node.feature)
    node = node.next
```

result:

```Shell
BOS/EOS,*,*,*,*,*,*,*,*,*,*
뜻    Noun,普通,*,*,*,뜻,뜻,*,*,B,NNG
을    Ending,助詞,対格,*,*,을02,을,*,*,*,JKO
가지  Verb,自立,*,語基2,*,가지다,가지,*,*,A,VV
ᆫ       Ending,語尾,連体形,*,2接続,ㄴ05,ㄴ,*,*,*,ETM
가장 Adverb,一般,*,*,*,가장01,가장,*,*,A,MAG
작으 Adjective,自立,*,語基2,*,작다01,작으,*,*,A,VA
ᆫ       Ending,語尾,連体形,*,2接続,ㄴ05,ㄴ,*,*,*,ETM
말    Noun,普通,動作,*,*,말01,말,*,*,A,NNG
의     Ending,助詞,属格,*,*,의10,의,*,*,*,JKG
단위 Noun,普通,*,*,*,단위02,단위,單位,*,C,NNG
.       Symbol,ピリオド,*,*,*,.,.,*,*,*,SF
‘      Symbol,カッコ,引用符-始,*,*,‘,‘,*,*,*,SS
이야기책   Noun,普通,*,*,*,이야기책,이야기책,이야기冊,*,*,NNG
’      Symbol,カッコ,引用符-終,*,*,’,’,*,*,*,SS
의     Ending,助詞,属格,*,*,의10,의,*,*,*,JKG
‘      Symbol,カッコ,引用符-始,*,*,‘,‘,*,*,*,SS
이야기       Noun,普通,動作,*,*,이야기,이야기,*,*,A,NNG
’      Symbol,カッコ,引用符-終,*,*,’,’,*,*,*,SS
,       Symbol,コンマ,*,*,*,",",",",*,*,*,SP
‘      Symbol,カッコ,引用符-始,*,*,‘,‘,*,*,*,SS
책    Noun,普通,*,*,*,책01,책,冊,*,A,NNG
’      Symbol,カッコ,引用符-終,*,*,’,’,*,*,*,SS
따위  Noun,依存名詞,*,*,*,따위,따위,*,*,*,NNB
이     Siteisi,非自立,*,語基1,*,이다,이,*,*,*,VCP
다     Ending,語尾,終止形,*,1接続,다06,다,*,*,*,EF
.       Symbol,ピリオド,*,*,*,.,.,*,*,*,SF
BOS/EOS,*,*,*,*,*,*,*,*,*,*
```

### Tokenize

example:

```Python
mecaboption = f'-r /dev/null -d {handic.DICDIR} -Otokenize'
tokenizer = MeCab.Tagger(mecaboption)

print(tokenizer.parse(jamo))
```

result:

```Shell
뜻 을 가지 ㄴ 가장 작으 ㄴ 말 의 단위 . ‘ 이야기책 ’ 의 ‘ 이야기 ’ , ‘ 책 ’ 따위 이 다 .
```

### Extracting specific POS

example:

```Python
mecaboption = f'-r /dev/null -d {handic.DICDIR}'

tokenizer = MeCab.Tagger(mecaboption)
tokenizer.parse('')

node = tokenizer.parseToNode(jamo)
while node:
    # Extract only common nouns (일반명사, pos-tag: NNG)
    features = node.feature.split(',')
    if features[10] in ['NNG']:
        print(features[5])
    node = node.next
```

result:

```Shell
뜻
말01
단위02
이야기책
이야기
책01
```

## Features

Here is the list of features included in HanDic. For more information, see [HanDic 품사 정보](https://github.com/okikirmui/handic/blob/main/docs/pos_detail.md) (HanDic part-of-speech information).

  - 품사1, 품사2, 품사3: part of speech (index: 0-2)
  - 활용형: conjugation "base" (e.g. `語基1`, `語基2`, `語基3`) (index: 3)
  - 접속 정보: which "base" the ending attaches to (e.g. `1接続`, `2接続`) (index: 4)
  - 사전 항목: base form (index: 5)
  - 표층형: surface form (index: 6)
  - 한자: Hanja, for Sino-Korean words (index: 7)
  - 보충 정보: miscellaneous information (index: 8)
  - 학습 수준: learning level (index: 9)
  - 세종계획 품사 태그: Sejong Project POS tag (index: 10)
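
The comma token in the basic example result has the feature string `Symbol,コンマ,*,*,*,",",",",*,*,*,SP`, in which the comma fields are quoted. A plain `split(',')` therefore yields more than eleven items for that token. The sketch below reads the feature string with the standard `csv` module and maps the eleven indexes to names. The English field names and the helper `parse_feature` are hypothetical, and it is an assumption that the quoting always follows CSV conventions as it does in that line.

```python
import csv

# Hypothetical English labels for indexes 0-10 in the list above.
FIELDS = [
    "pos1", "pos2", "pos3",   # 품사1, 품사2, 품사3
    "conjugation_base",       # 활용형
    "connection",             # 접속 정보
    "base",                   # 사전 항목
    "surface",                # 표층형
    "hanja",                  # 한자
    "misc",                   # 보충 정보
    "level",                  # 학습 수준
    "pos_tag",                # 세종계획 품사 태그
]


def parse_feature(feature: str) -> dict:
    """Split a feature string with CSV quoting rules and name each field."""
    values = next(csv.reader([feature]))
    return dict(zip(FIELDS, values))


row = parse_feature('Symbol,コンマ,*,*,*,",",",",*,*,*,SP')
print(row["pos_tag"], repr(row["base"]))
```

With a real tagger, `parse_feature(node.feature)` could replace `node.feature.split(',')` so that fields are accessed by name rather than by index.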

## License

This code is licensed under the MIT license. HanDic is copyright Yoshinori Sugai and distributed under the [BSD license](./LICENSE.handic). 

## Acknowledgment

This repository is forked from [unidic-lite](https://github.com/polm/unidic-lite) with some modifications and file additions and deletions.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "handic",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "handic, MeCab, Korean Language, morphological analysis, morphological analysis dictionary, korean text processing",
    "author": "Yoshinori Sugai",
    "author_email": "okikirmui+github@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/92/a2/248bb05d46943542a6d7061db11b51550460d99f94f3d515918a531c65f2/handic-25.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "HanDic package for installing via pip.",
    "version": "25.1.1",
    "project_urls": {
        "Repository": "https://github.com/okikirmui/handic-py"
    },
    "split_keywords": [
        "handic",
        " mecab",
        " korean language",
        " morphological analysis",
        " morphological analysis dictionary",
        " korean text processing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5471c4e9ea8d18f559da877f9d05724da4a11a6725065fc5e48c614e104c66a4",
                "md5": "38b69f46a7f5e52a7098d5c5e755a988",
                "sha256": "a7a2b50a2f8c097a50132be73b268a7b1c20cf4214c03786e6dc343fd4376bf5"
            },
            "downloads": -1,
            "filename": "handic-25.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "38b69f46a7f5e52a7098d5c5e755a988",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 4876114,
            "upload_time": "2025-01-01T14:55:03",
            "upload_time_iso_8601": "2025-01-01T14:55:03.167109Z",
            "url": "https://files.pythonhosted.org/packages/54/71/c4e9ea8d18f559da877f9d05724da4a11a6725065fc5e48c614e104c66a4/handic-25.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "92a2248bb05d46943542a6d7061db11b51550460d99f94f3d515918a531c65f2",
                "md5": "6d281db91034cd3f809585e6a49626d4",
                "sha256": "9f8566eef3cff80e0ea0c53d5a4fe8fa285af67b2e2ef7aeff7751069eda8a35"
            },
            "downloads": -1,
            "filename": "handic-25.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "6d281db91034cd3f809585e6a49626d4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 4863351,
            "upload_time": "2025-01-01T14:55:11",
            "upload_time_iso_8601": "2025-01-01T14:55:11.198029Z",
            "url": "https://files.pythonhosted.org/packages/92/a2/248bb05d46943542a6d7061db11b51550460d99f94f3d515918a531c65f2/handic-25.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-01 14:55:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "okikirmui",
    "github_project": "handic-py",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "handic"
}
        