indicparser

* Name: indicparser
* Version: 0.0.10
* Home page: https://github.com/mnansary/indicparser
* Summary: Grapheme Parser for indic languages
* Upload time: 2022-12-31 06:33:19
* Author: Bengali.AI
* License: MIT
* Keywords: grapheme, indic languages
* Requirements: No requirements were recorded.

# indicparser
Grapheme parser for Indic languages.

# Installation

```bash
pip install indicparser
```
# Usage
* initializing the parser

```python
from indicparser import graphemeParser
gp=graphemeParser("bangla")
```
* extracting graphemes

```python
text="  শাটিকাপ   মার"
graphemes=gp.process(text)
print("Graphemes:",graphemes)
```
> Graphemes: [' ', ' ', 'শা', 'টি', 'কা', 'প', ' ', ' ', ' ', 'মা', 'র']

* extracting graphemes while merging consecutive spaces and stripping leading and trailing spaces
```python
graphemes=gp.process(text,merge_spaces=True)
print("Graphemes (space corrected):",graphemes)
```
> Graphemes (space corrected): ['শা', 'টি', 'কা', 'প', ' ', 'মা', 'র']
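
The effect of `merge_spaces=True` on the list above can be illustrated with a small standalone sketch. The helper `merge_space_graphemes` is hypothetical and is not part of indicparser; it only mimics the behaviour visible in the two outputs shown here (runs of spaces collapsed to one, leading and trailing spaces dropped), and the library's actual implementation may differ.

```python
def merge_space_graphemes(graphemes):
    """Hypothetical helper: collapse runs of ' ' and strip edge spaces."""
    merged = []
    for g in graphemes:
        # Skip a space if nothing has been kept yet (leading space) or
        # the previously kept item is already a space (repeated space).
        if g == " " and (not merged or merged[-1] == " "):
            continue
        merged.append(g)
    # At most one trailing space can remain; drop it.
    if merged and merged[-1] == " ":
        merged.pop()
    return merged


# Grapheme list copied from the first example output above.
raw = [' ', ' ', 'শা', 'টি', 'কা', 'প', ' ', ' ', ' ', 'মা', 'র']
print(merge_space_graphemes(raw))
# → ['শা', 'টি', 'কা', 'প', ' ', 'মা', 'র']
```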

* numbers, punctuation, and English text are also handled by default

```python
text="এটাকি 2441139 ? না ভাই wrong number"
graphemes=gp.process(text,merge_spaces=True)
print("Graphemes:",graphemes)
```
> Graphemes: ['এ', 'টা', 'কি', ' ', '2', '4', '4', '1', '1', '3', '9', ' ', '?', ' ', 'না', ' ', 'ভা', 'ই', ' ', 'w', 'r', 'o', 'n', 'g', ' ', 'n', 'u', 'm', 'b', 'e', 'r']
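
Because spaces are returned as graphemes of their own, a space-merged list like the one above can be regrouped into words with a few lines of plain Python. This is a minimal sketch: `split_into_words` is a hypothetical helper, not an indicparser function, and the input is simply the output list copied from the example above.

```python
def split_into_words(graphemes):
    """Hypothetical helper: group graphemes into words, splitting on ' '."""
    words, current = [], []
    for g in graphemes:
        if g == " ":
            # A space closes the current word, if there is one.
            if current:
                words.append(current)
                current = []
        else:
            current.append(g)
    if current:
        words.append(current)
    return words


# Output list copied from the mixed-text example above.
graphemes = ['এ', 'টা', 'কি', ' ', '2', '4', '4', '1', '1', '3', '9', ' ',
             '?', ' ', 'না', ' ', 'ভা', 'ই', ' ', 'w', 'r', 'o', 'n', 'g',
             ' ', 'n', 'u', 'm', 'b', 'e', 'r']

words = split_into_words(graphemes)
# Print each word next to its grapheme count.
for word in words:
    print("".join(word), len(word))
```

Under this grouping, digits, punctuation marks, and Latin letters each count as one grapheme, which matches how they appear in the parser output above.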

* available languages

```python
from indicparser import languages
languages.keys()
```
> dict_keys(['bangla', 'malyalam', 'tamil', 'gujrati', 'panjabi', 'odiya', 'hindi','nagri'])


# Normalization
* For best results, normalize the text before parsing
* An example Bangla Unicode normalizer can be found [here](https://pypi.org/project/bnunicodenormalizer/)
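
To see why normalization matters, note that the same visible Bangla text can be stored as different code point sequences, which a grapheme parser may then segment differently. The sketch below uses only Python's standard `unicodedata` module with the generic NFC form. This is an assumption chosen for illustration: it is not the linked bnunicodenormalizer package, whose API is not shown here, and language-specific normalizers may apply further corrections beyond NFC.

```python
import unicodedata


def normalize_nfc(text):
    """Apply generic Unicode NFC normalization (standard library only)."""
    return unicodedata.normalize("NFC", text)


# The same syllable written two ways:
decomposed = "\u0995\u09c7\u09be"  # ka + vowel sign E + vowel sign AA
composed = "\u0995\u09cb"          # ka + vowel sign O

print(decomposed == composed)                 # → False
print(normalize_nfc(decomposed) == composed)  # → True
```

After normalization, both spellings are the same string, so they would reach the parser in a single consistent form.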

# ABOUT
* Authors: [Bengali.AI](https://bengali.ai/)

* **Cite Bengali.AI multipurpose grapheme dataset paper**
```bibtex
@inproceedings{alam2021large,
  title={A large multi-target dataset of common bengali handwritten graphemes},
  author={Alam, Samiul and Reasat, Tahsin and Sushmit, Asif Shahriyar and Siddique, Sadi Mohammad and Rahman, Fuad and Hasan, Mahady and Humayun, Ahmed Imtiaz},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={383--398},
  year={2021},
  organization={Springer}
}
```

Change Log
===========

0.0.1 (12/02/2022)
-------------------
- First Release

0.0.2 (12/02/2022)
-------------------
- Basic Documentation 
- Modifier removal
- space correction
- text mode parser

0.0.3 (14/02/2022)
-------------------
- Connector ending 
- Exception case for component construction in bangla
- Added test

0.0.4 (14/02/2022)
-------------------
- pip test stable
- added malformed word detection

0.0.5 (19/02/2022)
-------------------
- encoding correction
- no space char handling

0.0.6 (15/04/2022)
-------------------
- removed malformed word detection [not useful]
- removed component calculation [not consistent]

0.0.7 (26/04/2022)
-------------------
- addition order correction

0.0.8 (21/10/2022)
-------------------
- allow middle Connector

0.0.9 (31/12/2022)
-------------------
- added sylheti nagri

0.0.10 (31/12/2022)
-------------------
- added sylheti nagri alternate hosonto



            
