<div align="center">
# 🏅Khmer natural language processing toolkit🏅
[CircleCI](https://circleci.com/gh/VietHoang1512/khmer-nltk/tree/main)
[Codacy](https://www.codacy.com/gh/VietHoang1512/khmer-nltk/dashboard?utm_source=github.com&utm_medium=referral&utm_content=VietHoang1512/khmer-nltk&utm_campaign=Badge_Grade)
[pre-commit](https://github.com/pre-commit/pre-commit)
[Code style: black](https://github.com/psf/black)
[PyPI](https://pypi.org/project/khmer-nltk/)

[Downloads](https://pepy.tech/project/khmer-nltk)
[DOI](https://zenodo.org/badge/latestdoi/313328421)
</div>
## 🎯TODO
- [X] Sentence Segmentation
- [X] Word Segmentation
- [X] Part-of-speech Tagging
- [ ] Named Entity Recognition
- [ ] Text classification
## 💪Installation
```bash
pip install khmer-nltk
```
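To confirm the install from Python, one option is to ask for the installed distribution's version. This is a minimal sketch, assuming Python 3.8+ for `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of khmer-nltk.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "khmer-nltk" is the distribution name used by `pip install` above.
print(installed_version("khmer-nltk"))
```

The call prints a version string when the package is installed and `None` otherwise.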
## 🏹 Quick tour
[[Blog]](https://towardsdatascience.com/khmer-natural-language-processing-in-python-c770afb84784)
For evaluation results of each khmer-nltk component, please refer to the README of the corresponding sub-module.
### Sentence tokenization
```python
>>> from khmernltk import sentence_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(sentence_tokenize(raw_text))
['ខួបឆ្នាំទី២៨!', '២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី']
```
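Sentence and word tokenization compose naturally: split into sentences first, then split each sentence into words. The sketch below is one way to wire that up. `tokenize_document` is a hypothetical helper, and the tokenizers are passed in as arguments so the function itself does not depend on khmer-nltk. The stand-in tokenizers are toy functions for illustration only and do not reflect the library's behavior.

```python
from typing import Callable, List


def tokenize_document(
    text: str,
    split_sentences: Callable[[str], List[str]],
    split_words: Callable[[str], List[str]],
) -> List[List[str]]:
    """Return one list of word tokens per sentence."""
    return [split_words(sentence) for sentence in split_sentences(text)]


# Toy stand-ins so the sketch runs without khmer-nltk installed.
def toy_sentences(text: str) -> List[str]:
    return [part.strip() for part in text.split(".") if part.strip()]


def toy_words(sentence: str) -> List[str]:
    return sentence.split()


print(tokenize_document("a b. c d e.", toy_sentences, toy_words))
```

With khmer-nltk installed, the same helper could presumably be called as `tokenize_document(raw_text, sentence_tokenize, lambda s: word_tokenize(s, return_tokens=True))`.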
### [Word tokenization](khmernltk/word_tokenize)
```python
>>> from khmernltk import word_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(word_tokenize(raw_text, return_tokens=True))
['ខួប', 'ឆ្នាំ', 'ទី', '២៨', '!', ' ', '២៣', ' ', 'តុលា', ' ', 'ស្មារតី', 'ផ្សះផ្សា', 'ជាតិ', 'រវាង', 'ខ្មែរ', 'និង', 'ខ្មែរ', ' ', 'ឈាន', 'ទៅ', 'បញ្ចប់', 'សង្រ្គាម', ' ', 'នាំ', 'ពន្លឺ', 'សន្តិភាព', ' ', 'និង', 'ការរួបរួម', 'ជាថ្មី']
```
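As the output above shows, spaces in the input come back as their own `' '` tokens. If downstream code only wants words and punctuation, they can be filtered out afterwards. This is a minimal sketch on a shortened copy of that output; `drop_whitespace` is a hypothetical helper, not a khmer-nltk function.

```python
from typing import List


def drop_whitespace(tokens: List[str]) -> List[str]:
    """Remove tokens that are empty or consist only of whitespace."""
    return [token for token in tokens if token.strip()]


# First nine tokens of the word_tokenize output shown above.
tokens = ['ខួប', 'ឆ្នាំ', 'ទី', '២៨', '!', ' ', '២៣', ' ', 'តុលា']
words = drop_whitespace(tokens)
print(len(tokens), len(words))  # prints "9 7"
```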
### [POS Tagging](khmernltk/pos_tag)
### Usage
```python
>>> from khmernltk import pos_tag
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(pos_tag(raw_text))
[('ខួប', 'n'), ('ឆ្នាំ', 'n'), ('ទី', 'n'), ('២៨', '1'), ('!', '.'), (' ', 'n'), ('២៣', '1'), (' ', 'n'), ('តុលា', 'n'), (' ', 'n'), ('ស្មារតី', 'n'), ('ផ្សះផ្សា', 'n'), ('ជាតិ', 'n'), ('រវាង', 'o'), ('ខ្មែរ', 'n'), ('និង', 'o'), ('ខ្មែរ', 'n'), (' ', 'n'), ('ឈាន', 'v'), ('ទៅ', 'v'), ('បញ្ចប់', 'v'), ('សង្រ្គាម', 'n'), (' ', 'n'), ('នាំ', 'v'), ('ពន្លឺ', 'n'), ('សន្តិភាព', 'n'), (' ', 'n'), ('និង', 'o'), ('ការរួបរួម', 'n'), ('ជាថ្មី', 'o')]
```
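`pos_tag` returns plain `(token, tag)` tuples, so standard-library tools are enough to summarize them. Note that in the output above the `' '` tokens are tagged `'n'`, so it may be worth dropping them before counting. This is a sketch on a shortened copy of that output; `tag_counts` is a hypothetical helper, not part of khmer-nltk.

```python
from collections import Counter
from typing import List, Tuple


def tag_counts(tagged: List[Tuple[str, str]]) -> Counter:
    """Count tags, skipping whitespace-only tokens."""
    return Counter(tag for token, tag in tagged if token.strip())


# First nine pairs of the pos_tag output shown above.
tagged = [('ខួប', 'n'), ('ឆ្នាំ', 'n'), ('ទី', 'n'), ('២៨', '1'), ('!', '.'),
          (' ', 'n'), ('២៣', '1'), (' ', 'n'), ('តុលា', 'n')]
print(tag_counts(tagged).most_common())
```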
### ✍️ Citation
```bibtex
@misc{hoang-khmer-nltk,
author = {Phan Viet Hoang},
title = {Khmer Natural Language Processing Toolkit},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/VietHoang1512/khmer-nltk}}
}
```
#### Used in:
- [stopes: A library for preparing data for machine translation research](https://github.com/facebookresearch/stopes)
- [LASER Language-Agnostic SEntence Representations](https://github.com/facebookresearch/LASER)
- [Pretrained Models and Evaluation Data for the Khmer Language](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9645441)
- [Multilingual Open Text 1.0: Public Domain News in 44 Languages](https://arxiv.org/pdf/2201.05609.pdf)
- [ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System](https://arxiv.org/pdf/2205.14981.pdf)
- [Shared Task on Cross-lingual Open-Retrieval QA](https://www.aclweb.org/portal/content/shared-task-cross-lingual-open-retrieval-qa)
- [No Language Left Behind: Scaling Human-Centered Machine Translation](https://research.facebook.com/publications/no-language-left-behind/)
- [Wordless](https://github.com/BLKSerene/Wordless)
### 👨‍🎓 References
- [NLP: Text Segmentation Using Conditional Random Fields](https://medium.com/@phylypo/nlp-text-segmentation-using-conditional-random-fields-e8ff1d2b6060)
- [Khmer Word Segmentation Using Conditional Random Fields](https://www2.nict.go.jp/astrec-att/member/ding/KhNLP2015-SEG.pdf)
- [Word Segmentation of Khmer Text Using Conditional Random Fields](https://medium.com/@phylypo/segmentation-of-khmer-text-using-conditional-random-fields-3a2d4d73956a)
### 📜 Advisor
- Prof. [Huong Le Thanh](https://users.soict.hust.edu.vn/huonglt/)
Raw data
{
"_id": null,
"home_page": "https://github.com/VietHoang1710/khmer-nltk",
"name": "khmer-nltk",
"maintainer": "",
"docs_url": null,
"requires_python": ">3.5",
"maintainer_email": "",
"keywords": "",
"author": "Phan Viet Hoang",
"author_email": "phanviethoang1512@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/e1/ee/5c8f448d0354788c28f75d6f1cfa5078f697e0153b5e6093d34da37c300a/khmer-nltk-1.6.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": "A Khmer language processing toolkit",
"version": "1.6",
"project_urls": {
"Homepage": "https://github.com/VietHoang1710/khmer-nltk"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "2a751ece979693214298e216a227087a173379118319620ce6569f49e40776e6",
"md5": "af2e91c9b7d015927a7684d9aa24552f",
"sha256": "d3474a06faf07b7b5bb40922377dd2433aa0db79eff9155b0087783355e7921d"
},
"downloads": -1,
"filename": "khmer_nltk-1.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "af2e91c9b7d015927a7684d9aa24552f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">3.5",
"size": 6989369,
"upload_time": "2023-09-19T01:30:00",
"upload_time_iso_8601": "2023-09-19T01:30:00.389110Z",
"url": "https://files.pythonhosted.org/packages/2a/75/1ece979693214298e216a227087a173379118319620ce6569f49e40776e6/khmer_nltk-1.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e1ee5c8f448d0354788c28f75d6f1cfa5078f697e0153b5e6093d34da37c300a",
"md5": "f3d71ace85a2037163a8739a98a9eb4e",
"sha256": "4ecfa4ddef8f88cde63b9f0d7f5544bb8389b3bb87a835f37ae7f6a2a9e3fd57"
},
"downloads": -1,
"filename": "khmer-nltk-1.6.tar.gz",
"has_sig": false,
"md5_digest": "f3d71ace85a2037163a8739a98a9eb4e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">3.5",
"size": 6979260,
"upload_time": "2023-09-19T01:30:06",
"upload_time_iso_8601": "2023-09-19T01:30:06.887608Z",
"url": "https://files.pythonhosted.org/packages/e1/ee/5c8f448d0354788c28f75d6f1cfa5078f697e0153b5e6093d34da37c300a/khmer-nltk-1.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-09-19 01:30:06",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "VietHoang1710",
"github_project": "khmer-nltk",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"circle": true,
"lcname": "khmer-nltk"
}