# Pecab
<a href="https://github.com/hyunwoongko/pecab/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/hyunwoongko/pecab.svg" /></a>
<a href="https://github.com/hyunwoongko/pecab/issues"><img alt="Issues" src="https://img.shields.io/github/issues/hyunwoongko/pecab"/></a>
[![Action Status Windows](https://github.com/eubinecto/pecab/actions/workflows/test_windows.yml/badge.svg)](https://github.com/eubinecto/pecab/actions)
[![Action Status Ubuntu](https://github.com/eubinecto/pecab/actions/workflows/test_ubuntu.yml/badge.svg)](https://github.com/eubinecto/pecab/actions)
[![Action Status macOS](https://github.com/eubinecto/pecab/actions/workflows/test_macos.yml/badge.svg)](https://github.com/eubinecto/pecab/actions)
Pecab is a pure Python Korean morpheme analyzer based on [Mecab](https://github.com/taku910/mecab).
Mecab is a CRF-based morpheme analyzer made by Taku Kudo in 2011. It is both very fast and very accurate, which is why it remains popular even though it is quite old.
However, it is also known as one of the trickiest libraries to install, and many people have struggled to install Mecab.
For a few years I have wanted to build a pure Python version of Mecab that is easy to install while inheriting Mecab's advantages.
Pecab is the result: it produces results very similar to Mecab's and is easy to install at the same time.
For more details, please refer to the following.
## Installation
```console
pip install pecab
```
## Usage
The user API of Pecab is inspired by [KoNLPy](https://github.com/konlpy/konlpy),
one of the most famous natural language processing packages in South Korea.
#### 1) `PeCab()`: creates a `PeCab` object.
```python
from pecab import PeCab
pecab = PeCab()
```
#### 2) `morphs(text)`: splits text into morphemes.
```python
pecab.morphs("아버지가방에들어가시다")
['아버지', '가', '방', '에', '들어가', '시', '다']
```
#### 3) `pos(text)`: returns morphemes and POS tags together.
```python
pecab.pos("이것은 문장입니다.")
[('이것', 'NP'), ('은', 'JX'), ('문장', 'NNG'), ('입니다', 'VCP+EF'), ('.', 'SF')]
```
#### 4) `nouns(text)`: returns all nouns in the input text.
```python
pecab.nouns("자장면을 먹을까? 짬뽕을 먹을까? 그것이 고민이로다.")
["자장면", "짬뽕", "그것", "고민"]
```
#### 5) `PeCab(user_dict=List[str])`: applies a user dictionary.
Note that words in the user dictionary **cannot contain spaces**.
- Without `user_dict`
```python
from pecab import PeCab
pecab = PeCab()
pecab.pos("저는 삼성디지털프라자에서 지펠냉장고를 샀어요.")
[('저', 'NP'), ('는', 'JX'), ('삼성', 'NNP'), ('디지털', 'NNP'), ('프라자', 'NNP'), ('에서', 'JKB'), ('지', 'NNP'), ('펠', 'NNP'), ('냉장고', 'NNG'), ('를', 'JKO'), ('샀', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]
```
- With `user_dict`
```python
from pecab import PeCab
user_dict = ["삼성디지털프라자", "지펠냉장고"]
pecab = PeCab(user_dict=user_dict)
pecab.pos("저는 삼성디지털프라자에서 지펠냉장고를 샀어요.")
[('저', 'NP'), ('는', 'JX'), ('삼성디지털프라자', 'NNG'), ('에서', 'JKB'), ('지펠냉장고', 'NNG'), ('를', 'JKO'), ('샀', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]
```
#### 6) `PeCab(split_compound=bool)`: divides compound words into smaller pieces.
```python
from pecab import PeCab
pecab = PeCab(split_compound=True)
pecab.morphs("가벼운 냉장고를 샀어요.")
['가볍', 'ᆫ', '냉장', '고', '를', '사', 'ㅏㅆ', '어요', '.']
```
#### 7) `ANY_PECAB_FUNCTION(text, drop_space=bool)`: determines whether spaces are returned or not.
This option works with all of `morphs`, `pos`, and `nouns`; its default value is `True`.
```python
from pecab import PeCab
pecab = PeCab()
pecab.pos("토끼정에서 크림 우동을 시켰어요.")
[('토끼', 'NNG'), ('정', 'NNG'), ('에서', 'JKB'), ('크림', 'NNG'), ('우동', 'NNG'), ('을', 'JKO'), ('시켰', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]
pecab.pos("토끼정에서 크림 우동을 시켰어요.", drop_space=False)
[('토끼', 'NNG'), ('정', 'NNG'), ('에서', 'JKB'), (' ', 'SP'), ('크림', 'NNG'), (' ', 'SP'), ('우동', 'NNG'), ('을', 'JKO'), (' ', 'SP'), ('시켰', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]
```
## Implementation Details
In fact, there was already a pure Python Korean morpheme analyzer before Pecab:
[Pynori](https://github.com/gritmind/python-nori).
I have used Pynori a lot, and I am very grateful to its developer.
However, Pynori had some problems that needed improvement,
so I started building Pecab from its codebase, focusing on solving those problems.
### 1) 50–100 times faster loading and lower memory usage
When a Pynori object is created, it reads the matrix and vocabulary files from disk and builds a Trie at runtime.
This is quite a heavy task; in fact, the first time I ran Pynori, my computer froze for almost 10 seconds.
I solved this with two key ideas: **1) zero-copy memory mapping** and **2) a double-array trie**.
The first key idea was **zero-copy memory mapping**,
which allows data on disk to be used as-is, almost without copying it into physical memory.
Pynori takes close to 5 seconds just to load the `mecab_csv.pkl` file into memory, which is a very heavy burden.
So I designed the matrix file to be saved with `numpy.memmap` and the vocabulary with a memory-mappable `pyarrow.Table`.
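As a minimal sketch of the idea (the file name and array contents here are illustrative, not Pecab's actual artifacts), `numpy.memmap` lets an on-disk array be accessed lazily, page by page, instead of being deserialized up front:

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "matrix.dat")

# build time: write the matrix (toy values here) to disk once
matrix = np.arange(12, dtype=np.int16).reshape(3, 4)
fp = np.memmap(path, dtype=np.int16, mode="w+", shape=(3, 4))
fp[:] = matrix
fp.flush()

# load time: the OS maps file pages on demand; nothing is copied up front
view = np.memmap(path, dtype=np.int16, mode="r", shape=(3, 4))
print(view[1, 2])  # only the page containing this element is actually read
```

The vocabulary side works the same way in spirit: pyarrow can memory-map a `Table` stored in its IPC file format, so it too becomes readable without a full copy.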
However, there was one problem with this design.
The Trie data structure used in Pynori is quite difficult to store in memmap form:
numpy only handles arrays and matrices well, and pyarrow mostly handles tables.
So I initially wanted to use a table instead of a trie.
However, indexing a particular key in a table has linear time complexity, O(n),
so searching could actually become much slower than before.
That is where the second key idea came in: the **Double-Array Trie (DATrie)**.
Unlike general tries, a DATrie consists of just two simple integer arrays (`base` and `check`) rather than a complex node-based structure,
and all keys can be retrieved from them directly. And these two arrays are very easy to store with memmap!
Because a Double-Array Trie can be saved as memmap files so easily, it was one of the best options for me.
I wanted to implement everything in Python to keep installation simple, but unfortunately I could not find a DATrie implementation in pure Python.
So I wrote a pure Python version myself; you can find the implementation [here](https://github.com/hyunwoongko/pydatrie).
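To illustrate the `base`/`check` mechanism (a simplified sketch, not pydatrie's actual code), here is a minimal double-array trie supporting membership queries. A child of node `s` under character code `c` lives at index `base[s] + c`, and that slot belongs to `s` only if `check[base[s] + c] == s`:

```python
from collections import deque


class DATrie:
    """Minimal double-array trie supporting membership queries."""

    END = 0  # reserved code that marks "end of key"

    def __init__(self, keys):
        chars = sorted({ch for key in keys for ch in key})
        self.code = {ch: i + 1 for i, ch in enumerate(chars)}

        # first build an ordinary nested-dict trie ...
        root = {}
        for key in keys:
            node = root
            for ch in key:
                node = node.setdefault(ch, {})
            node[self.END] = {}  # terminal marker

        # ... then flatten it into the two integer arrays, breadth-first
        self.base = [0] * 1024
        self.check = [-1] * 1024
        self.check[0] = 0  # the root lives at index 0

        queue = deque([(0, root)])
        while queue:
            idx, node = queue.popleft()
            codes = [self.END if ch == self.END else self.code[ch]
                     for ch in node]
            # pick the smallest base where every child slot is still free
            b = 1
            while any(self.check[b + c] != -1 for c in codes):
                b += 1
            self.base[idx] = b
            for ch, child in node.items():
                c = self.END if ch == self.END else self.code[ch]
                self.check[b + c] = idx  # claim this slot for parent idx
                if ch != self.END:
                    queue.append((b + c, child))

    def __contains__(self, key):
        s = 0
        for ch in key:
            c = self.code.get(ch)
            if c is None:
                return False  # character never seen in any key
            t = self.base[s] + c
            if t >= len(self.check) or self.check[t] != s:
                return False
            s = t
        # the key exists only if this node carries the terminal marker
        return self.check[self.base[s] + self.END] == s
```

Lookup touches nothing but the two flat integer arrays, which is exactly why they can be served straight out of memmap files with no build step at load time.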
In conclusion, reading these two files now takes roughly 50–100 times less time than before,
and memory consumption also dropped significantly because the data does not actually reside in memory.
### 2) User-friendly and Pythonic API
Another difficulty I had while using Pynori was its user API.
It has a fairly Java-like API, and using it required passing many parameters when creating the main object.
I wanted Pecab to be very easy to use, like Mecab, without requiring users to parse the output themselves.
After thinking it over, I settled on an API similar to KoNLPy, which users are already familiar with.
I believe this API is much more user-friendly and makes the library easier to use.
## License
Pecab project is licensed under the terms of the **Apache License 2.0**.
```
Copyright 2022 Hyunwoong Ko.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```