# bnUnicodeNormalizer
A Bangla Unicode normalization toolkit for word-level normalization
# install
```bash
pip install bnunicodenormalizer
```
# usage
**initialization and cleaning**
```python
# import
from bnunicodenormalizer import Normalizer
from pprint import pprint
# initialize
bnorm=Normalizer()
# normalize
word = 'াটোবাকো'
result=bnorm(word)
print(f"Non-norm:{word}; Norm:{result['normalized']}")
print("--------------------------------------------------")
pprint(result)
```
> output
```
Non-norm:াটোবাকো; Norm:টোবাকো
--------------------------------------------------
{'given': 'াটোবাকো',
'normalized': 'টোবাকো',
'ops': [{'after': 'টোবাকো',
'before': 'াটোবাকো',
'operation': 'InvalidUnicode'}]}
```
**a call to the normalizer returns a dictionary in the following format:**
* ```given``` = the provided text
* ```normalized``` = the normalized text (None if the text length becomes 0 during normalization)
* ```ops``` = list of operations (dictionaries) that were executed on the given text to create the normalized text
* each dictionary in ops has:
* ```operation```: the name of the operation / problem in given text
* ```before``` : what the text looked like before the specific operation
* ```after``` : what the text looks like after the specific operation
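
For example, a minimal sketch (reusing the same ```Normalizer``` call as above) that traces each operation recorded in the result:

```python
from bnunicodenormalizer import Normalizer

bnorm = Normalizer()
result = bnorm('াটোবাকো')
# 'normalized' is None when the cleaned text becomes empty
if result['normalized'] is not None:
    for op in result['ops']:
        # each op records its name and the text before/after it ran
        print(f"{op['operation']}: {op['before']} -> {op['after']}")
```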
**allowing English text**
```python
# initialize without english (default)
norm=Normalizer()
print("without english:",norm("ASD123")["normalized"])
# --> returns None
norm=Normalizer(allow_english=True)
print("with english:",norm("ASD123")["normalized"])
```
> output
```
without english: None
with english: ASD123
```
# Initialization: Bangla Normalizer
```python
'''
initialize a normalizer
args:
allow_english : allow English letters, numbers and punctuation [default:False]
keep_legacy_symbols : legacy symbols will be considered as valid unicodes [default:False]
'৺':Isshar
'৻':Ganda
'ঀ':Anji (not '৭')
'ঌ':li
'ৡ':dirgho li
'ঽ':Avagraha
'ৠ':Vocalic Rr (not 'ঋ')
'৲':rupi
'৴':currency numerator 1
'৵':currency numerator 2
'৶':currency numerator 3
'৷':currency numerator 4
'৸':currency numerator one less than the denominator
'৹':Currency Denominator Sixteen
legacy_maps : a dictionary for changing legacy symbols into a more commonly used unicode
a default legacy map is included in the language class as well,
legacy_maps={'ঀ':'৭',
'ঌ':'৯',
'ৡ':'৯',
'৵':'৯',
'৻':'ৎ',
'ৠ':'ঋ',
'ঽ':'ই'}
pass-
* legacy_maps=None; for keeping the legacy symbols as they are
* legacy_maps="default"; for using the default legacy map
* legacy_maps=custom dictionary (type: dict); which will map your desired legacy symbols to any symbols you want
* the keys in the custom dict must belong to the legacy symbols
* the values in the custom dict must belong to either vowels, consonants, numbers or diacritics
vowels = ['অ', 'আ', 'ই', 'ঈ', 'উ', 'ঊ', 'ঋ', 'এ', 'ঐ', 'ও', 'ঔ']
consonants = ['ক', 'খ', 'গ', 'ঘ', 'ঙ', 'চ', 'ছ','জ', 'ঝ', 'ঞ',
'ট', 'ঠ', 'ড', 'ঢ', 'ণ', 'ত', 'থ', 'দ', 'ধ', 'ন',
'প', 'ফ', 'ব', 'ভ', 'ম', 'য', 'র', 'ল', 'শ', 'ষ',
'স', 'হ','ড়', 'ঢ়', 'য়','ৎ']
numbers = ['০', '১', '২', '৩', '৪', '৫', '৬', '৭', '৮', '৯']
vowel_diacritics = ['া', 'ি', 'ী', 'ু', 'ূ', 'ৃ', 'ে', 'ৈ', 'ো', 'ৌ']
consonant_diacritics = ['ঁ', 'ং', 'ঃ']
> for example you may want to map 'ঽ':Avagraha as 'হ' based on visual similarity
(default:'ই')
** legacy conditions: keep_legacy_symbols and legacy_maps operate as follows
case-1) keep_legacy_symbols=True and legacy_maps=None
: all legacy symbols will be considered valid unicodes. None of them will be changed
case-2) keep_legacy_symbols=True and legacy_maps=valid dictionary example:{'ঀ':'ক'}
: all legacy symbols will be considered valid unicodes. Only 'ঀ' will be changed to 'ক'; others will be untouched
case-3) keep_legacy_symbols=False and legacy_maps=None
: all legacy symbols will be removed
case-4) keep_legacy_symbols=False and legacy_maps=valid dictionary example:{'ঽ':'ই','ৠ':'ঋ'}
: 'ঽ' will be changed to 'ই' and 'ৠ' will be changed to 'ঋ'. All other legacy symbols will be removed
'''
```
```python
my_legacy_maps={'ঌ':'ই',
'ৡ':'ই',
'৵':'ই',
'ৠ':'ই',
'ঽ':'ই'}
text="৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹"
# case 1
norm=Normalizer(keep_legacy_symbols=True,legacy_maps=None)
print("case-1 normalized text: ",norm(text)["normalized"])
# case 2
norm=Normalizer(keep_legacy_symbols=True,legacy_maps=my_legacy_maps)
print("case-2 normalized text: ",norm(text)["normalized"])
# case 2-default
norm=Normalizer(keep_legacy_symbols=True)
print("case-2 default normalized text: ",norm(text)["normalized"])
# case 3
norm=Normalizer(keep_legacy_symbols=False,legacy_maps=None)
print("case-3 normalized text: ",norm(text)["normalized"])
# case 4
norm=Normalizer(keep_legacy_symbols=False,legacy_maps=my_legacy_maps)
print("case-4 normalized text: ",norm(text)["normalized"])
# case 4-default
norm=Normalizer(keep_legacy_symbols=False)
print("case-4 default normalized text: ",norm(text)["normalized"])
```
> output
```
case-1 normalized text: ৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹
case-2 normalized text: ৺,৻,ঀ,ই,ই,ই,ই,৲,৴,ই,৶,৷,৸,৹
case-2 default normalized text: ৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹
case-3 normalized text: ,,,,,,,,,,,,,
case-4 normalized text: ,,,ই,ই,ই,ই,,,ই,,,,
case-4 default normalized text: ,,,,,,,,,,,,,
```
# Operations
* base operations available for all Indic languages:
```python
self.word_level_ops={"LegacySymbols" :self.mapLegacySymbols,
"BrokenDiacritics" :self.fixBrokenDiacritics}
self.decomp_level_ops={"BrokenNukta" :self.fixBrokenNukta,
"InvalidUnicode" :self.cleanInvalidUnicodes,
"InvalidConnector" :self.cleanInvalidConnector,
"FixDiacritics" :self.cleanDiacritics,
"VowelDiacriticAfterVowel" :self.cleanVowelDiacriticComingAfterVowel}
```
* extensions for Bangla
```python
self.decomp_level_ops["ToAndHosontoNormalize"] = self.normalizeToandHosonto
# invalid folas
self.decomp_level_ops["NormalizeConjunctsDiacritics"] = self.cleanInvalidConjunctDiacritics
# complex root cleanup
self.decomp_level_ops["ComplexRootNormalization"] = self.convertComplexRoots
```
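
These operation names appear as the ```operation``` field in a result's ```ops``` list, so you can check which rule fired for a given word. A small sketch (the exact op names printed may vary by input):

```python
from bnunicodenormalizer import Normalizer

bnorm = Normalizer()
# 'উত্স' is the to+hosonto example shown below; it normalizes to 'উৎস'
res = bnorm('উত্স')
print(res['normalized'])
# list which operations were applied, e.g. ['ToAndHosontoNormalize']
print([op['operation'] for op in res['ops']])
```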
# Normalization Problem Examples
**In all examples (a) is the non-normalized form and (b) is the normalized form**
* Broken diacritics:
```
# Example-1:
(a)'আরো'==(b)'আরো' -> False
(a) breaks as:['আ', 'র', 'ে', 'া']
(b) breaks as:['আ', 'র', 'ো']
# Example-2:
(a)পৌঁছে==(b)পৌঁছে -> False
(a) breaks as:['প', 'ে', 'ৗ', 'ঁ', 'ছ', 'ে']
(b) breaks as:['প', 'ৌ', 'ঁ', 'ছ', 'ে']
# Example-3:
(a)সংস্কৄতি==(b)সংস্কৃতি -> False
(a) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৄ', 'ত', 'ি']
(b) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৃ', 'ত', 'ি']
```
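
The "breaks as" lists are simply the sequences of Unicode code points in each string, which ```list()``` reproduces. A quick sketch for Example-1, with the code points written as escapes so the two visually identical forms stay distinct:

```python
from bnunicodenormalizer import Normalizer

word_a = '\u0986\u09b0\u09c7\u09be'  # 'আরো' typed with broken diacritics: ে + া
word_b = '\u0986\u09b0\u09cb'        # 'আরো' with the composed diacritic ো
print(word_a == word_b)              # False, despite looking identical
print(list(word_a))                  # ['আ', 'র', 'ে', 'া']
print(list(word_b))                  # ['আ', 'র', 'ো']
# the normalizer composes the broken pair, so the results should match
print(Normalizer()(word_a)['normalized'] == word_b)  # expected: True
```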
* Nukta Normalization:
```
Example-1:
(a)কেন্দ্রীয়==(b)কেন্দ্রীয় -> False
(a) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য', '়']
(b) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য়']
Example-2:
(a)রযে়ছে==(b)রয়েছে -> False
(a) breaks as:['র', 'য', 'ে', '়', 'ছ', 'ে']
(b) breaks as:['র', 'য়', 'ে', 'ছ', 'ে']
Example-3:
(a)জ়ন্য==(b)জন্য -> False
(a) breaks as:['জ', '়', 'ন', '্', 'য']
(b) breaks as:['জ', 'ন', '্', 'য']
```
* Invalid hosonto
```
# Example-1:
(a)দুই্টি==(b)দুইটি-->False
(a) breaks as ['দ', 'ু', 'ই', '্', 'ট', 'ি']
(b) breaks as ['দ', 'ু', 'ই', 'ট', 'ি']
# Example-2:
(a)এ্তে==(b)এতে-->False
(a) breaks as ['এ', '্', 'ত', 'ে']
(b) breaks as ['এ', 'ত', 'ে']
# Example-3:
(a)নেট্ওয়ার্ক==(b)নেটওয়ার্ক-->False
(a) breaks as ['ন', 'ে', 'ট', '্', 'ও', 'য়', 'া', 'র', '্', 'ক']
(b) breaks as ['ন', 'ে', 'ট', 'ও', 'য়', 'া', 'র', '্', 'ক']
# Example-4:
(a)এস্আই==(b)এসআই-->False
(a) breaks as ['এ', 'স', '্', 'আ', 'ই']
(b) breaks as ['এ', 'স', 'আ', 'ই']
# Example-5:
(a)'চু্ক্তি'==(b)'চুক্তি' -> False
(a) breaks as:['চ', 'ু', '্', 'ক', '্', 'ত', 'ি']
(b) breaks as:['চ', 'ু','ক', '্', 'ত', 'ি']
# Example-6:
(a)'যু্ক্ত'==(b)'যুক্ত' -> False
(a) breaks as:['য', 'ু', '্', 'ক', '্', 'ত']
(b) breaks as:['য', 'ু', 'ক', '্', 'ত']
# Example-7:
(a)'কিছু্ই'==(b)'কিছুই' -> False
(a) breaks as:['ক', 'ি', 'ছ', 'ু', '্', 'ই']
(b) breaks as:['ক', 'ি', 'ছ', 'ু','ই']
```
* To+hosonto:
```
# Example-1:
(a)বুত্পত্তি==(b)বুৎপত্তি-->False
(a) breaks as ['ব', 'ু', 'ত', '্', 'প', 'ত', '্', 'ত', 'ি']
(b) breaks as ['ব', 'ু', 'ৎ', 'প', 'ত', '্', 'ত', 'ি']
# Example-2:
(a)উত্স==(b)উৎস-->False
(a) breaks as ['উ', 'ত', '্', 'স']
(b) breaks as ['উ', 'ৎ', 'স']
```
* Unwanted doubles (consecutive doubles):
```
# Example-1:
(a)'যুুদ্ধ'==(b)'যুদ্ধ' -> False
(a) breaks as:['য', 'ু', 'ু', 'দ', '্', 'ধ']
(b) breaks as:['য', 'ু', 'দ', '্', 'ধ']
# Example-2:
(a)'দুুই'==(b)'দুই' -> False
(a) breaks as:['দ', 'ু', 'ু', 'ই']
(b) breaks as:['দ', 'ু', 'ই']
# Example-3:
(a)'প্রকৃৃতির'==(b)'প্রকৃতির' -> False
(a) breaks as:['প', '্', 'র', 'ক', 'ৃ', 'ৃ', 'ত', 'ি', 'র']
(b) breaks as:['প', '্', 'র', 'ক', 'ৃ', 'ত', 'ি', 'র']
# Example-4:
(a)আমাকোা==(b)'আমাকো'-> False
(a) breaks as:['আ', 'ম', 'া', 'ক', 'ে', 'া', 'া']
(b) breaks as:['আ', 'ম', 'া', 'ক', 'ো']
```
* Vowels and modifiers followed by vowel diacritics:
```
# Example-1:
(a)উুলু==(b)উলু-->False
(a) breaks as ['উ', 'ু', 'ল', 'ু']
(b) breaks as ['উ', 'ল', 'ু']
# Example-2:
(a)আর্কিওোলজি==(b)আর্কিওলজি-->False
(a) breaks as ['আ', 'র', '্', 'ক', 'ি', 'ও', 'ো', 'ল', 'জ', 'ি']
(b) breaks as ['আ', 'র', '্', 'ক', 'ি', 'ও', 'ল', 'জ', 'ি']
# Example-3:
(a)একএে==(b)একত্রে-->False
(a) breaks as ['এ', 'ক', 'এ', 'ে']
(b) breaks as ['এ', 'ক', 'ত', '্', 'র', 'ে']
```
* Repeated folas:
```
# Example-1:
(a)গ্র্রামকে==(b)গ্রামকে-->False
(a) breaks as ['গ', '্', 'র', '্', 'র', 'া', 'ম', 'ক', 'ে']
(b) breaks as ['গ', '্', 'র', 'া', 'ম', 'ক', 'ে']
```
## IMPORTANT NOTE
**The normalization is purely based on how Bangla text is used in ```Bangladesh``` (bn:bd). It does not necessarily cover every variation of textual content found in other regions**
# unit testing
* clone the repository
* change working directory to ```tests```
* run: ```python3 -m unittest test_normalizer.py```
# Issue Reporting
* when reporting an issue, please provide the following information:
    * the invalid text
    * the expected valid text
    * why the output is expected
* clone the repository
* add a test case in **tests/test_normalizer.py** after **line no:91**
```python
# Dummy Non-Bangla,Numbers and Space cases/ Invalid start end cases
# english
self.assertEqual(norm('ASD1234')["normalized"],None)
self.assertEqual(ennorm('ASD1234')["normalized"],'ASD1234')
# random
self.assertEqual(norm('িত')["normalized"],'ত')
self.assertEqual(norm('সং্যুক্তি')["normalized"],"সংযুক্তি")
# Ending
self.assertEqual(norm("অজানা্")["normalized"],"অজানা")
#--------------------------------------------- insert your assertions here----------------------------------------
'''
### case: give a comment about your case
## (a) invalid text==(b) valid text <---- an example of your case
self.assertEqual(norm(invalid text)["normalized"],expected output)
or
self.assertEqual(ennorm(invalid text)["normalized"],expected output) <----- for including english text
'''
# your case goes here-
```
* run the unit tests
* make sure your added test case fails with the current code (i.e., it demonstrates the issue)
# Indic Base Normalizer
* to use the Indic language normalizer for 'devanagari', 'gujarati', 'odiya', 'tamil', 'panjabi', 'malayalam', 'sylhetinagri':
```python
from bnunicodenormalizer import IndicNormalizer
norm=IndicNormalizer('devanagari')
```
* initialization
```python
'''
initialize a normalizer
args:
language : language identifier from 'devanagari', 'gujarati', 'odiya', 'tamil', 'panjabi', 'malayalam','sylhetinagri'
allow_english : allow English letters, numbers and punctuation [default:False]
'''
```
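
A hedged usage sketch: the call interface mirrors the Bangla ```Normalizer```, returning the same ```given```/```normalized```/```ops``` dictionary (the Devanagari input word here is only an illustration):

```python
from bnunicodenormalizer import IndicNormalizer

norm = IndicNormalizer('devanagari')
result = norm('हिन्दी')  # hypothetical input word
print(result['normalized'])
```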
# ABOUT US
* Authors: [Bengali.AI](https://bengali.ai/) in association with the OCR Team, [APSIS Solutions Limited](https://apsissolutions.com/)
* **Cite Bengali.AI multipurpose grapheme dataset paper**
```bibtex
@inproceedings{alam2021large,
title={A large multi-target dataset of common bengali handwritten graphemes},
author={Alam, Samiul and Reasat, Tahsin and Sushmit, Asif Shahriyar and Siddique, Sadi Mohammad and Rahman, Fuad and Hasan, Mahady and Humayun, Ahmed Imtiaz},
booktitle={International Conference on Document Analysis and Recognition},
pages={383--398},
year={2021},
organization={Springer}
}
```
Change Log
===========
0.0.5 (9/03/2022)
-------------------
- added details for execution map
- checkop typo correction
0.0.6 (9/03/2022)
-------------------
- broken diacritics op addition
0.0.7 (11/03/2022)
-------------------
- assamese replacement
- word op and unicode op mapping
- modifier list modification
- doc string for call and initialization
- verbosity removal
- typo correction for operation
- unit test updates
- 'এ' replacement correction
- NonGylphUnicodes
- Legacy symbols option
- legacy mapper added
- added bn:bd declaration
0.0.8 (14/03/2022)
-------------------
- MultipleConsonantDiacritics handling change
- to+hosonto correction
- invalid hosonto correction
0.0.9 (15/04/2022)
-------------------
- base normalizer
- language class
- bangla extension
- complex root normalization
0.0.10 (15/04/2022)
-------------------
- added conjuncts
- exception for english words
0.0.11 (15/04/2022)
-------------------
- fixed no space char issue for bangla
0.0.12 (26/04/2022)
-------------------
- fixed consonants orders
0.0.13 (26/04/2022)
-------------------
- fixed non char followed by diacritics
0.0.14 (01/05/2022)
-------------------
- word based normalization
- encoding fix
0.0.15 (02/05/2022)
-------------------
- import correction
0.0.16 (02/05/2022)
-------------------
- local variable issue
0.0.17 (17/05/2022)
-------------------
- nukta mod break
0.0.18 (08/06/2022)
-------------------
- no space chars fix
0.0.19 (15/06/2022)
-------------------
- no space chars further fix
- base_bangla_compose to avoid false op flags
- added foreign conjuncts
0.0.20 (01/08/2022)
-------------------
- এ্যা replacement correction
0.0.21 (01/08/2022)
-------------------
- "য","ব" + hosonto combination correction
- added 'ব্ল্য' in conjuncts
0.0.22 (22/08/2022)
-------------------
- \u200d combination limiting
0.0.23 (23/08/2022)
-------------------
- \u200d condition change
0.0.24 (26/08/2022)
-------------------
- \u200d error handling
0.0.25 (10/09/22)
-------------------
- removed unnecessary operations: fixRefOrder,fixOrdersForCC
- added conjuncts: 'র্ন্ত','ঠ্য','ভ্ল'
0.1.0 (20/10/22)
-------------------
- added indic parser
- fixed language class
0.1.1 (21/10/22)
-------------------
- added nukta and diacritic maps for indics
- cleaned conjuncts for now
- fixed issues with no-space and connector
0.1.2 (10/12/22)
-------------------
- allow halant ending for indic language except bangla
0.1.3 (10/12/22)
-------------------
- broken char break cases for halant
0.1.4 (01/01/23)
-------------------
- added sylhetinagri
0.1.5 (01/01/23)
-------------------
- cleaned panjabi double quotes in diac map
0.1.6 (15/04/23)
-------------------
- added bangla punctuations