# olunicodenormalizer
ᱚᱞ-ᱪᱦᱤᱠᱤ (Ol Chiki) Unicode normalization for words
# install
```bash
pip install olunicodenormalizer
```
# usage
**initialization and cleaning**
```python
# import the normalizer and a pretty-printer for the result dict
from olunicodenormalizer import Normalizer
from pprint import pprint

# initialize with default settings
norm = Normalizer()

# normalize a single word
word = 'ᱡᱚᱦᱟᱨ'
result = norm(word)
print(f"Non-norm:{word}; Norm:{result['normalized']}")
print("--------------------------------------------------")
pprint(result)
```
> output
```
Non-norm:ᱡᱚᱦᱟᱨ; Norm:ᱡᱚᱦᱟᱨ
--------------------------------------------------
{'given': 'ᱡᱚᱦᱟᱨ', 'normalized': 'ᱡᱚᱦᱟᱨ', 'ops': []}
```
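The result shown above is a plain dict, so it can be inspected with ordinary Python. The following is a minimal sketch, assuming only the three keys visible in that output (`given`, `normalized`, `ops`); the helper name `describe_result` is hypothetical and not part of the package:

```python
from typing import Any, Dict

def describe_result(result: Dict[str, Any]) -> str:
    """Classify a result dict using only the keys shown in the output above."""
    if result["normalized"] is None:
        return "rejected"    # the normalizer returned no usable word
    if result["normalized"] == result["given"]:
        return "unchanged"   # the input was already in normalized form
    return "changed"         # normalization altered the word

# Result dict copied from the output above.
sample = {'given': 'ᱡᱚᱦᱟᱨ', 'normalized': 'ᱡᱚᱦᱟᱨ', 'ops': []}
print(describe_result(sample))  # unchanged
```

The sketch deliberately ignores the contents of `ops`, since the README does not document what its entries look like.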
```python
# initialize without English (default): English text normalizes to None
norm = Normalizer()
print("without english:", norm("ASD123")["normalized"])
# initialize with English allowed: the English word is kept
norm = Normalizer(allow_english=True)
print("with english:", norm("ASD123")["normalized"])
```
> output
```
without english: None
with english: ASD123
```
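Since a rejected word comes back with `normalized` set to `None`, running text usually needs a filtering step. The following is a minimal sketch, assuming the normalizer is applied to one whitespace-separated word at a time. `normalize_words` and `fake_normalizer` are hypothetical names introduced here, and the stand-in only imitates the default behaviour shown in the output above so that the block runs without the package installed:

```python
from typing import Any, Callable, Dict, List

def normalize_words(text: str, normalizer: Callable[[str], Dict[str, Any]]) -> List[str]:
    """Normalize each whitespace-separated word and drop rejected ones."""
    kept: List[str] = []
    for word in text.split():
        normalized = normalizer(word)["normalized"]
        if normalized is not None:  # None marks a word the normalizer rejected
            kept.append(normalized)
    return kept

def fake_normalizer(word: str) -> Dict[str, Any]:
    """Stand-in (NOT the library): rejects pure-ASCII words, passes others through."""
    normalized = None if word.isascii() else word
    return {"given": word, "normalized": normalized, "ops": []}

print(normalize_words("ᱡᱚᱦᱟᱨ ASD123", fake_normalizer))
```

With the package installed, a `Normalizer()` instance can be passed in place of `fake_normalizer`, since it is called with a word and returns a dict with a `normalized` key.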
Change Log
===========
0.0.5 (9/03/2022)
-------------------
- added details for execution map
- checkop typo correction
0.0.6 (9/03/2022)
-------------------
- broken diacritics op addition
0.0.7 (11/03/2022)
-------------------
- Assamese replacement
- word op and unicode op mapping
- modifier list modification
- doc string for call and initialization
- verbosity removal
- typo correction for operation
- unit test updates
- 'এ' replacement correction
- NonGylphUnicodes
- Legacy symbols option
- legacy mapper added
- added bn:bd declaration
0.0.8 (14/03/2022)
-------------------
- MultipleConsonantDiacritics handling change
- to+hosonto correction
- invalid hosonto correction
0.0.9 (15/04/2022)
-------------------
- base normalizer
- language class
- olchiki extension
- complex root normalization
0.0.10 (15/04/2022)
-------------------
- added conjuncts
- exception for english words
0.0.11 (15/04/2022)
-------------------
- fixed no space char issue for olchiki
0.0.12 (26/04/2022)
-------------------
- fixed consonants orders
0.0.13 (26/04/2022)
-------------------
- fixed non char followed by diacritics
0.0.14 (01/05/2022)
-------------------
- word based normalization
- encoding fix
0.0.15 (02/05/2022)
-------------------
- import correction
0.0.16 (02/05/2022)
-------------------
- local variable issue
0.0.17 (17/05/2022)
-------------------
- nukta mod break
0.0.18 (08/06/2022)
-------------------
- no space chars fix
0.0.19 (15/06/2022)
-------------------
- no space chars further fix
- base_olchiki_compose to avoid false op flags
- added foreign conjuncts
0.0.20 (01/08/2022)
-------------------
- এ্যা replacement correction
0.0.21 (01/08/2022)
-------------------
- "য","ব" + hosonto combination correction
- added 'ব্ল্য' in conjuncts
0.0.22 (22/08/2022)
-------------------
- \u200d combination limiting
0.0.23 (23/08/2022)
-------------------
- \u200d condition change
0.0.24 (26/08/2022)
-------------------
- \u200d error handling
0.0.25 (10/09/22)
-------------------
- removed unnecessary operations: fixRefOrder,fixOrdersForCC
- added conjuncts: 'র্ন্ত','ঠ্য','ভ্ল'
0.1.0 (20/10/22)
-------------------
- added indic parser
- fixed language class
0.1.1 (21/10/22)
-------------------
- added nukta and diacritic maps for indics
- cleaned conjuncts for now
- fixed issues with no-space and connector
0.1.2 (10/12/22)
-------------------
- allow halant ending for Indic languages except Ol Chiki
0.1.3 (10/12/22)
-------------------
- broken char break cases for halant
0.1.4 (01/01/23)
-------------------
- added sylhetinagri
0.1.5 (01/01/23)
-------------------
- cleaned Punjabi double quotes in diac map
0.0.1 (26/08/23)
-------------------
- added olchiki punctuations
Raw data
{
"_id": null,
"home_page": "",
"name": "olunicodenormalizer",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "olchiki,unicode,text normalization,indic",
"author": "Shivnath Kisku",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/9c/78/2214a475860a52c6f50e492dc7ed6ddfe993f6a31766b339399e961fa2d5/olunicodenormalizer-1.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Olchiki Unicode Normalization Toolkit",
"version": "1.0.0",
"project_urls": null,
"split_keywords": [
"olchiki",
"unicode",
"text normalization",
"indic"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "93f0464ee6d8c35dc4b7ab1928c999b9f918fc233fda44e6c6d85a92c6f548c1",
"md5": "91b5a26f5175b996a9ecd29a90357193",
"sha256": "9fbf46c8cd3d1d4f8beaf78aff0a0c5eea1732174f4c0d8c9d891384da0c0126"
},
"downloads": -1,
"filename": "olunicodenormalizer-1.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "91b5a26f5175b996a9ecd29a90357193",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 18125,
"upload_time": "2023-08-26T04:48:42",
"upload_time_iso_8601": "2023-08-26T04:48:42.213540Z",
"url": "https://files.pythonhosted.org/packages/93/f0/464ee6d8c35dc4b7ab1928c999b9f918fc233fda44e6c6d85a92c6f548c1/olunicodenormalizer-1.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "9c782214a475860a52c6f50e492dc7ed6ddfe993f6a31766b339399e961fa2d5",
"md5": "4b06157b41c7c12385554e9c5ca5e89b",
"sha256": "de9eb75611d7315f58af3401d7c12f1b9387e9cb699d5a4ebeab90b11b1bd8ab"
},
"downloads": -1,
"filename": "olunicodenormalizer-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "4b06157b41c7c12385554e9c5ca5e89b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 19862,
"upload_time": "2023-08-26T04:48:44",
"upload_time_iso_8601": "2023-08-26T04:48:44.779218Z",
"url": "https://files.pythonhosted.org/packages/9c/78/2214a475860a52c6f50e492dc7ed6ddfe993f6a31766b339399e961fa2d5/olunicodenormalizer-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-08-26 04:48:44",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "olunicodenormalizer"
}
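The record above lists a `sha256` digest for each release file, which can be compared against a locally downloaded copy. The following is a minimal sketch using only the standard library; the helper names `iter_chunks`, `sha256_hex` and `matches_digest` are hypothetical, and the local file path is an assumption:

```python
import hashlib
from typing import Iterable, Iterator

def iter_chunks(path: str, chunk_size: int = 65536) -> Iterator[bytes]:
    """Yield the file's bytes in fixed-size chunks to keep memory use low."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk

def sha256_hex(chunks: Iterable[bytes]) -> str:
    """Return the hexadecimal SHA-256 digest of the concatenated chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()

def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA-256 digest with the value from the metadata."""
    return sha256_hex(iter_chunks(path)) == expected_hex.lower()
```

For the sdist listed above, the call would look like `matches_digest("olunicodenormalizer-1.0.0.tar.gz", "de9eb75611d7315f58af3401d7c12f1b9387e9cb699d5a4ebeab90b11b1bd8ab")`, assuming the archive has been saved to the current directory.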