olunicodenormalizer


Name: olunicodenormalizer
Version: 1.0.0
Summary: Olchiki Unicode Normalization Toolkit
upload_time: 2023-08-26 04:48:44
author: Shivnath Kisku
license: MIT
keywords: olchiki, unicode, text normalization, indic
requirements: No requirements were recorded.

# olunicodenormalizer
Word-level Unicode normalization for Ol Chiki (ᱚᱞ-ᱪᱦᱤᱠᱤ) text.
# install
```bash
pip install olunicodenormalizer
```
# usage
**initialization and cleaning**
```python
# import the normalizer and a pretty-printer for the result dict
from olunicodenormalizer import Normalizer
from pprint import pprint

# initialize with default settings
bnorm = Normalizer()

# normalize a single word; the call returns a dict
word = 'ᱡᱚᱦᱟᱨ'
result = bnorm(word)
print(f"Non-norm:{word}; Norm:{result['normalized']}")
print("--------------------------------------------------")
pprint(result)
```
> output 

```
Non-norm:ᱡᱚᱦᱟᱨ; Norm:ᱡᱚᱦᱟᱨ
--------------------------------------------------
{'given': 'ᱡᱚᱦᱟᱨ', 'normalized': 'ᱡᱚᱦᱟᱨ', 'ops': []}
```



```python
# initialize without English (default): "ASD123" normalizes to None
norm = Normalizer()
print("without english:", norm("ASD123")["normalized"])
# initialize with English allowed: "ASD123" is returned unchanged
norm = Normalizer(allow_english=True)
print("with english:", norm("ASD123")["normalized"])
```
> output

```
without english: None
with english: ASD123
```
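
To see in advance which words are likely to come back as `None` under the default settings, it can help to check whether a word consists only of Ol Chiki characters. The sketch below relies on the fact that the Ol Chiki Unicode block spans U+1C50 to U+1C7F. The helper names are hypothetical, and how the library itself decides what to reject is an assumption not confirmed by this README.

```python
from typing import List

# The Ol Chiki Unicode block covers U+1C50..U+1C7F.
OL_CHIKI_START = 0x1C50
OL_CHIKI_END = 0x1C7F


def is_ol_chiki_char(ch: str) -> bool:
    """Return True if a single character lies in the Ol Chiki block."""
    return OL_CHIKI_START <= ord(ch) <= OL_CHIKI_END


def is_ol_chiki_word(word: str) -> bool:
    """Return True for a non-empty word made only of Ol Chiki characters."""
    return bool(word) and all(is_ol_chiki_char(ch) for ch in word)


def code_points(word: str) -> List[str]:
    """List the code points of a word, e.g. for debugging mixed-script input."""
    return [f"U+{ord(ch):04X}" for ch in word]


if __name__ == "__main__":
    for sample in ("ᱡᱚᱦᱟᱨ", "ASD123"):
        print(sample, is_ol_chiki_word(sample), code_points(sample))
```

This check is stricter than the normalizer may be: punctuation, digits, or joiner characters outside the block make `is_ol_chiki_word` return `False` even if the library would accept them.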

 


Change Log
===========

0.0.5 (9/03/2022)
-------------------
- added details for execution map
- checkop typo correction

0.0.6 (9/03/2022)
-------------------
- broken diacritics op addition

0.0.7 (11/03/2022)
-------------------
- Assamese replacement
- word op and unicode op mapping
- modifier list modification
- doc string for call and initialization
- verbosity removal
- typo correction for operation
- unit test updates
- 'এ' replacement correction
- NonGylphUnicodes
- Legacy symbols option
- legacy mapper added 
- added bn:bd declaration

0.0.8 (14/03/2022)
-------------------
- MultipleConsonantDiacritics handling change
- to+hosonto correction
- invalid hosonto correction 

0.0.9 (15/04/2022)
-------------------
- base normalizer
- language class
- olchiki extension
- complex root normalization 

0.0.10 (15/04/2022)
-------------------
- added conjuncts
- exception for english words

0.0.11 (15/04/2022)
-------------------
- fixed no space char issue for olchiki

0.0.12 (26/04/2022)
-------------------
- fixed consonants orders 

0.0.13 (26/04/2022)
-------------------
- fixed non char followed by diacritics 

0.0.14 (01/05/2022)
-------------------
- word based normalization
- encoding fix

0.0.15 (02/05/2022)
-------------------
- import correction

0.0.16 (02/05/2022)
-------------------
- local variable issue

0.0.17 (17/05/2022)
-------------------
- nukta mod break

0.0.18 (08/06/2022)
-------------------
- no space chars fix


0.0.19 (15/06/2022)
-------------------
- no space chars further fix
- base_olchiki_compose to avoid false op flags
- added foreign conjuncts


0.0.20 (01/08/2022)
-------------------
- এ্যা replacement correction

0.0.21 (01/08/2022)
-------------------
- "য","ব" + hosonto combination correction
- added 'ব্ল্য' in conjuncts

0.0.22 (22/08/2022)
-------------------
- \u200d combination limiting

0.0.23 (23/08/2022)
-------------------
- \u200d condition change

0.0.24 (26/08/2022)
-------------------
- \u200d error handling

0.0.25 (10/09/22)
-------------------
- removed unnecessary operations: fixRefOrder,fixOrdersForCC
- added conjuncts: 'র্ন্ত','ঠ্য','ভ্ল'

0.1.0 (20/10/22)
-------------------
- added indic parser
- fixed language class

0.1.1 (21/10/22)
-------------------
- added nukta and diacritic maps for indics 
- cleaned conjuncts for now
- fixed issues with no-space and connector

0.1.2 (10/12/22)
-------------------
- allow halant ending for indic language except olchiki

0.1.3 (10/12/22)
-------------------
- broken char break cases for halant 

0.1.4 (01/01/23)
-------------------
- added Sylheti Nagri

0.1.5 (01/01/23)
-------------------
- cleaned panjabi double quotes in diac map 

0.0.1 (26/08/23)
-------------------
- added olchiki punctuations 


            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "olunicodenormalizer",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "olchiki,unicode,text normalization,indic",
    "author": "Shivnath Kisku",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/9c/78/2214a475860a52c6f50e492dc7ed6ddfe993f6a31766b339399e961fa2d5/olunicodenormalizer-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Olchiki Unicode Normalization Toolkit",
    "version": "1.0.0",
    "project_urls": null,
    "split_keywords": [
        "olchiki",
        "unicode",
        "text normalization",
        "indic"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "93f0464ee6d8c35dc4b7ab1928c999b9f918fc233fda44e6c6d85a92c6f548c1",
                "md5": "91b5a26f5175b996a9ecd29a90357193",
                "sha256": "9fbf46c8cd3d1d4f8beaf78aff0a0c5eea1732174f4c0d8c9d891384da0c0126"
            },
            "downloads": -1,
            "filename": "olunicodenormalizer-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "91b5a26f5175b996a9ecd29a90357193",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 18125,
            "upload_time": "2023-08-26T04:48:42",
            "upload_time_iso_8601": "2023-08-26T04:48:42.213540Z",
            "url": "https://files.pythonhosted.org/packages/93/f0/464ee6d8c35dc4b7ab1928c999b9f918fc233fda44e6c6d85a92c6f548c1/olunicodenormalizer-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9c782214a475860a52c6f50e492dc7ed6ddfe993f6a31766b339399e961fa2d5",
                "md5": "4b06157b41c7c12385554e9c5ca5e89b",
                "sha256": "de9eb75611d7315f58af3401d7c12f1b9387e9cb699d5a4ebeab90b11b1bd8ab"
            },
            "downloads": -1,
            "filename": "olunicodenormalizer-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "4b06157b41c7c12385554e9c5ca5e89b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 19862,
            "upload_time": "2023-08-26T04:48:44",
            "upload_time_iso_8601": "2023-08-26T04:48:44.779218Z",
            "url": "https://files.pythonhosted.org/packages/9c/78/2214a475860a52c6f50e492dc7ed6ddfe993f6a31766b339399e961fa2d5/olunicodenormalizer-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-26 04:48:44",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "olunicodenormalizer"
}
        