# chiecthuyenngoaixa
[![GitHub issues](https://img.shields.io/github/issues/IoeCmcomc/chiecthuyenngoaixa)](https://github.com/IoeCmcomc/chiecthuyenngoaixa/issues)
[![GitHub license](https://img.shields.io/github/license/IoeCmcomc/chiecthuyenngoaixa)](https://github.com/IoeCmcomc/chiecthuyenngoaixa/blob/master/LICENSE)
[![Documentation Status](https://readthedocs.org/projects/chiecthuyenngoaixa/badge/?version=latest)](https://chiecthuyenngoaixa.readthedocs.io/en/latest/?badge=latest)
![PyPI](https://img.shields.io/pypi/v/chiecthuyenngoaixa)
![PyPI - Downloads](https://img.shields.io/pypi/dm/chiecthuyenngoaixa)
[Tiếng Việt](README-vi.md "Vietnamese version")
**chiecthuyenngoaixa** is a Python library that provides functions and
classes for common tasks in _processing Vietnamese text_, such as
removing diacritics, converting numbers to words, sorting strings,
validation and more.
This library is written in pure Python and has no dependencies. Python 3.8
and above are supported.
## Installation
Chiecthuyenngoaixa is available on
[PyPI](https://pypi.org/project/chiecthuyenngoaixa/). Open a terminal or
_Command Prompt_ (on Windows) and run the following command:
``` console
pip install chiecthuyenngoaixa
```
If you are using [Poetry](https://python-poetry.org/), use this instead:
``` console
poetry add chiecthuyenngoaixa
```
## Basic usage
Once installed, the library is available as the `ctnx` module (an abbreviation of
_chiecthuyenngoaixa_).
Some commonly used functions and classes can be imported directly. For
example:
- To convert Vietnamese text to ASCII-only text:
```python
>>> from ctnx import remove_diacritics
>>> remove_diacritics("Đàn ong thấy cái lon thì bu vào.")
'Dan ong thay cai lon thi bu vao.'
```
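For intuition only, the general idea behind diacritic removal can be sketched with the standard library's `unicodedata` module. This is not taken from the `ctnx` source, whose implementation is not described here; `strip_diacritics_sketch` is a hypothetical name, and the explicit handling of `đ`/`Đ` is needed because those letters have no Unicode decomposition.

```python
import unicodedata


def strip_diacritics_sketch(text: str) -> str:
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    # đ/Đ have no canonical decomposition, so map them by hand first.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics_sketch("Đàn ong thấy cái lon thì bu vào."))
```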
- To convert a number to Vietnamese text:
```python
>>> from ctnx import num_to_words
>>> num_to_words(123456789021003.45)
'một trăm hai mươi ba nghìn bốn trăm năm mươi sáu tỉ bảy trăm tám mươi chín triệu không trăm hai mươi mốt nghìn không trăm linh ba phẩy bốn mươi lăm'
```
- To sort Vietnamese strings:
```python
>>> from ctnx import ViSortKey
>>> lines = ['Hà Nam', 'Hải Dương', 'Hà Nội', 'Hà Tĩnh', 'Hải Phòng', 'Hậu Giang', 'Hoà Bình', 'Hưng Yên', 'Hạ Long', 'Hà Giang', 'Điện Biên']
>>> sorted(lines, key=ViSortKey)
['Điện Biên', 'Hà Giang', 'Hà Nam', 'Hà Nội', 'Hà Tĩnh', 'Hải Dương', 'Hải Phòng', 'Hạ Long', 'Hậu Giang', 'Hoà Bình', 'Hưng Yên']
```
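The ordering in the sorting example above is consistent with comparing strings character by character, first by position in the Vietnamese alphabet and then by tone. A minimal sketch of such a key follows, assuming the conventional tone order (ngang, huyền, hỏi, ngã, sắc, nặng). `vi_sort_key_sketch` is a hypothetical name and may well differ from how `ViSortKey` works internally.

```python
import unicodedata

# Assumed orderings: the 29-letter Vietnamese alphabet and the
# conventional tone order (no mark, grave, hook above, tilde, acute, dot below).
ALPHABET = unicodedata.normalize("NFC", "aăâbcdđeêghiklmnoôơpqrstuưvxy")
TONE_MARKS = ["", "\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]


def vi_sort_key_sketch(text: str):
    """Hypothetical key: one (letter rank, tone rank) pair per character."""
    key = []
    for ch in unicodedata.normalize("NFC", text.lower()):
        parts = unicodedata.normalize("NFD", ch)
        tone = next((m for m in TONE_MARKS[1:] if m in parts), "")
        # Remove the tone mark, then recompose so â, ă, ơ, ư stay whole letters.
        base = unicodedata.normalize("NFC", parts.replace(tone, "")) if tone else ch
        if len(base) == 1 and base in ALPHABET:
            key.append((ALPHABET.index(base), TONE_MARKS.index(tone)))
        else:
            # Spaces, digits and other characters sort before the alphabet.
            key.append((-1, ord(ch)))
    return key


names = ["Hạ Long", "Hà Nội", "Điện Biên", "Hải Phòng"]
ordered = sorted(names, key=vi_sort_key_sketch)
# ordered == ['Điện Biên', 'Hà Nội', 'Hải Phòng', 'Hạ Long']
```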
Other functions and classes live in separate submodules. For example:
- To normalize Vietnamese text written with confusable (look-alike) characters:
```python
>>> from ctnx.misc import normalize_confusables
>>> normalize_confusables("𝕮𝖍𝖎ế𝖈 𝖙𝖍𝖚𝖞ề𝖓 𝖓𝖌𝖔à𝖎 𝖝𝖆")
'Chiếc thuyền ngoài xa'
```
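As a rough illustration of one ingredient such normalization can use, Unicode NFKC normalization folds styled mathematical letters (like the Fraktur ones above) back to plain letters. This is only a sketch with a hypothetical name, `fold_compat_sketch`. NFKC does not cover look-alikes from other scripts, such as Cyrillic letters, so a real implementation would likely need an explicit mapping table as well.

```python
import unicodedata


def fold_compat_sketch(text: str) -> str:
    """Hypothetical helper: fold compatibility characters with NFKC."""
    return unicodedata.normalize("NFKC", text)


result = fold_compat_sketch("𝕮𝖍𝖎ế𝖈 𝖙𝖍𝖚𝖞ề𝖓 𝖓𝖌𝖔à𝖎 𝖝𝖆")
# result == 'Chiếc thuyền ngoài xa'
```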
- To extract information from a Vietnamese National Citizen ID (_Căn cước công dân_) number:
```python
>>> from ctnx import validation
>>> validation.is_valid_cccd("024192123456")
True
>>> validation.parse_cccd("024192123456")
CccdResult(id='123456', is_male=False, birth_year=1992, birth_country='vn', birth_province='Bắc Giang')
```
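The fields in the result above follow from the fixed layout of a 12-digit CCCD number: a 3-digit province code, one digit encoding gender and birth century, two digits for the birth year, and a 6-digit serial. The sketch below only splits those fields; `split_cccd_sketch` and `CccdFieldsSketch` are hypothetical names, no province lookup or province-code validation is attempted, and the layout is stated here as an assumption rather than taken from the `ctnx` source.

```python
import re
from typing import NamedTuple, Optional


class CccdFieldsSketch(NamedTuple):
    province_code: str
    is_male: bool
    birth_year: int
    serial: str


def split_cccd_sketch(number: str) -> Optional[CccdFieldsSketch]:
    """Hypothetical helper: split a 12-digit CCCD number into its fields."""
    if not re.fullmatch(r"[0-9]{12}", number):
        return None
    century_gender = int(number[3])
    # Assumed encoding: 0/1 = born 1900-1999 (male/female),
    # 2/3 = born 2000-2099 (male/female), and so on.
    century = 19 + century_gender // 2
    return CccdFieldsSketch(
        province_code=number[:3],
        is_male=century_gender % 2 == 0,
        birth_year=century * 100 + int(number[4:6]),
        serial=number[6:],
    )


print(split_cccd_sketch("024192123456"))
```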
- To extract tones from a Vietnamese syllable or text:
```python
>>> from ctnx.misc import separate_tone
>>> separate_tone("Đẩu")
('Đâu', '?')
>>> toneNames = {'': 'thanh', '/': 'sắc', '\\': 'huyền', '?': 'hỏi', '~': 'ngã', '.': 'nặng'}
>>> ' '.join(toneNames[separate_tone(syll)[1]] for syll in "Tôi thầm cảm ơn Đẩu đã giữ mình ở nán lại".split(' '))
'thanh huyền hỏi thanh hỏi ngã ngã huyền hỏi sắc nặng'
```
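Tone separation of this kind can be approximated with Unicode decomposition: after NFD normalization the tone is a standalone combining mark that can be pulled out, and the rest is recomposed. The sketch below uses the same tone symbols as the example above; `separate_tone_sketch` is a hypothetical name and is not the library's implementation.

```python
import unicodedata

# Combining tone marks mapped to the symbols used in the example above.
TONE_SYMBOLS = {
    "\u0301": "/",   # sắc (acute)
    "\u0300": "\\",  # huyền (grave)
    "\u0309": "?",   # hỏi (hook above)
    "\u0303": "~",   # ngã (tilde)
    "\u0323": ".",   # nặng (dot below)
}


def separate_tone_sketch(syllable: str):
    """Hypothetical helper: return (syllable without its tone, tone symbol)."""
    tone = ""
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_SYMBOLS:
            tone = TONE_SYMBOLS[ch]
        else:
            kept.append(ch)
    # Recompose so that letters such as â, ơ and ư come back as single characters.
    return unicodedata.normalize("NFC", "".join(kept)), tone


pair = separate_tone_sketch("Đẩu")
# pair == ('Đâu', '?')
```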
- To manipulate Vietnamese syllables:
```python
>>> from ctnx.syllable import Syllable
>>> text = "ba ngày một trận nhẹ năm ngày một trận nặng"
>>> a = [Syllable.from_string(x) for x in text.split(' ')]
>>> a
[Syllable(b, a, ), Syllable(ng, ay, , \), Syllable(m, ô, t, .), Syllable(tr, â, n, .), Syllable(nh, e, , .), Syllable(n, ă, m), Syllable(ng, ay, , \), Syllable(m, ô, t, .), Syllable(tr, â, n, .), Syllable(n, ă, ng, .)]
>>> for syll in a:
...     syll.onset = 'nh'
...
>>> a
[Syllable(nh, a, ), Syllable(nh, ay, , \), Syllable(nh, ô, t, .), Syllable(nh, â, n, .), Syllable(nh, e, , .), Syllable(nh, ă, m), Syllable(nh, ay, , \), Syllable(nh, ô, t, .), Syllable(nh, â, n, .), Syllable(nh, ă, ng, .)]
>>> ' '.join(str(x) for x in a)
'nha nhày nhột nhận nhẹ nhăm nhày nhột nhận nhặng'
```
For more usage examples, see the documentation hosted at [chiecthuyenngoaixa.readthedocs.io](https://chiecthuyenngoaixa.readthedocs.io/en/latest/).
Raw data
{
"_id": null,
"home_page": "https://github.com/IoeCmcomc/chiecthuyenngoaixa",
"name": "chiecthuyenngoaixa",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.8",
"maintainer_email": null,
"keywords": "Vietnamese, text",
"author": "IoeCmcomc",
"author_email": "53734763+IoeCmcomc@users.noreply.github.com",
"download_url": "https://files.pythonhosted.org/packages/a6/cf/ffcad9def27dd6dd83b272332afe33ed71a78fc1d6b35cf69e4a919265f9/chiecthuyenngoaixa-0.2.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "An utility library for processing Vietnamese texts",
"version": "0.2.1",
"project_urls": {
"Homepage": "https://github.com/IoeCmcomc/chiecthuyenngoaixa",
"Repository": "https://github.com/IoeCmcomc/chiecthuyenngoaixa"
},
"split_keywords": [
"vietnamese",
" text"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "7bd5792127373238f16dce2a9bac82f555b86bd609b4484251a959110a0266db",
"md5": "216f42e250262a90b0e792b2f723d04e",
"sha256": "ae267f722dae249159e7f52f3d31a7e884971a1de01846aa19f8d638f10f97ed"
},
"downloads": -1,
"filename": "chiecthuyenngoaixa-0.2.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "216f42e250262a90b0e792b2f723d04e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.8",
"size": 29312,
"upload_time": "2024-06-01T08:33:48",
"upload_time_iso_8601": "2024-06-01T08:33:48.548259Z",
"url": "https://files.pythonhosted.org/packages/7b/d5/792127373238f16dce2a9bac82f555b86bd609b4484251a959110a0266db/chiecthuyenngoaixa-0.2.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a6cfffcad9def27dd6dd83b272332afe33ed71a78fc1d6b35cf69e4a919265f9",
"md5": "570b5581535a32104905ae4fe215e7bd",
"sha256": "550abe20852a57aaa8244ab680b8eac94c67014ec2498ea630dc6fec1187f2cb"
},
"downloads": -1,
"filename": "chiecthuyenngoaixa-0.2.1.tar.gz",
"has_sig": false,
"md5_digest": "570b5581535a32104905ae4fe215e7bd",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.8",
"size": 29474,
"upload_time": "2024-06-01T08:33:50",
"upload_time_iso_8601": "2024-06-01T08:33:50.142521Z",
"url": "https://files.pythonhosted.org/packages/a6/cf/ffcad9def27dd6dd83b272332afe33ed71a78fc1d6b35cf69e4a919265f9/chiecthuyenngoaixa-0.2.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-06-01 08:33:50",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "IoeCmcomc",
"github_project": "chiecthuyenngoaixa",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "chiecthuyenngoaixa"
}