## Bangla NLP Toolkit
Created by <b>A F M Mahfuzul Kabir</b> \
<a href='https://mahfuzulkabir.com'>mahfuzulkabir.com</a> \
https://www.linkedin.com/in/mahfuzulkabir
## Installation
Install the 'csebuetnlp normalizer' first with:
````
pip install git+https://github.com/csebuetnlp/normalizer
````
Then install the package with:
````
pip install banglanlptoolkit
````
## Introduction
This package contains several toolkits for Bangla NLP text processing and augmentation. The available tools are listed below.
- Bangla Text Normalizer
- Bangla Punctuation Generator
- Bangla Text Augmentation
## Documentation
- For detailed use of Bangla Text Normalizer, follow [this documentation](https://github.com/Kabir5296/banglanlptoolkit/blob/main/docs/Normalization.md).
- For detailed use of Bangla Punctuation Generation, follow [this documentation](https://github.com/Kabir5296/banglanlptoolkit/blob/main/docs/Punctuations.md).
- For detailed use of Bangla Text Augmentation (both online and offline), follow [this documentation](https://github.com/Kabir5296/banglanlptoolkit/blob/main/docs/Augmentations.md).
Thank you very much for using my package. I maintain it on my own, so I might not always be available to fix issues right away. If you do encounter one, feel free to let me know and I'll fix it as soon as I can.
## Bangla Text Normalizer
Bangla text normalization, a known problem in language processing, converts Bangla text data into a consistent, computer-readable format. Unicode normalization brings all characters of a text string into the same Unicode form and removes unwanted characters. The csebuetnlp normalizer is the one used for models such as BanglaBERT and BanglaT5.
The package uses two normalization toolkits for Bangla text processing. The Unicode normalizer is taken from <a href='https://github.com/mnansary/bnUnicodeNormalizer'>here</a>. The other normalizer is used specifically for the BanglaT5 translation module and is taken from <a href='https://github.com/csebuetnlp/normalizer'>here</a>.
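To make the idea of "the same Unicode form" concrete, here is a minimal sketch that uses only Python's standard `unicodedata` module. It is not how either normalizer above is implemented; the function name `normalize_text` and the small `UNWANTED` set are assumptions made for illustration.

```python
import unicodedata

# Hypothetical set of characters to drop; the real normalizers apply their own rules.
UNWANTED = {"\u200b", "\ufeff"}  # zero-width space, byte-order mark


def normalize_text(text: str) -> str:
    """Drop a few unwanted characters, then bring the text to Unicode NFC form."""
    cleaned = "".join(ch for ch in text if ch not in UNWANTED)
    return unicodedata.normalize("NFC", cleaned)


# The same syllable encoded two ways:
# U+0995 U+09C7 U+09BE (two vowel-sign code points) vs. U+0995 U+09CB (one code point).
decomposed = "\u0995\u09c7\u09be"
composed = "\u0995\u09cb"

print(decomposed == composed)                                  # False
print(normalize_text(decomposed) == normalize_text(composed))  # True
```

Both strings render identically, but a model's tokenizer would see different code points unless the text is normalized first.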
## Bangla Punctuation Generator
Good punctuation generation models for Bangla were scarce even a few months ago. However, with the development of Bangla AI models, we now have very good punctuation generation models for our language as well.
The package uses an open-source punctuation generation model from <a href='https://www.kaggle.com/datasets/tugstugi/bengali-ai-asr-submission/data'>this</a> Kaggle dataset. I currently host this model on my Hugging Face account so it can be used without any token. You can replace it with any model you like.
## Bangla Text Augmentation
The package uses three kinds of text augmentation techniques.
- Bangla Token Replacement
- Back Translation
- Bangla Paraphrasing
The token replacement method uses a fill-mask model to mask random tokens in a sentence and replace them with predicted ones. The package uses the BanglishBERT Generator model by CSEBUETNLP for this task. The model can be found <a href='https://huggingface.co/csebuetnlp/banglishbert_generator'>here</a>.
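The mask-and-refill loop can be sketched independently of any model. The snippet below is an illustration under stated assumptions, not the package's code: `replace_tokens`, its `rate` parameter, and the `[MASK]` placeholder are hypothetical, and `fill_mask` stands in for whatever callable wraps a real fill-mask model.

```python
import random
from typing import Callable, Optional


def replace_tokens(
    sentence: str,
    fill_mask: Callable[[str], str],
    rate: float = 0.15,
    mask_token: str = "[MASK]",
    rng: Optional[random.Random] = None,
) -> str:
    """Mask each whitespace-separated token with probability `rate` and refill it.

    `fill_mask` receives the sentence with exactly one mask token and must
    return the replacement token (hypothetical interface for this sketch).
    """
    rng = rng or random.Random()
    tokens = sentence.split()
    for i in range(len(tokens)):
        if rng.random() < rate:
            masked = tokens[:i] + [mask_token] + tokens[i + 1:]
            tokens[i] = fill_mask(" ".join(masked))
    return " ".join(tokens)


# Stub predictor so the sketch runs without downloading a model.
def stub_fill_mask(masked_sentence: str) -> str:
    return "<predicted>"


print(replace_tokens("আমি ভাত খাই", stub_fill_mask, rate=1.0))
```

In practice the stub would be replaced by a function that queries a masked language model and returns its top prediction for the masked position.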
The back translation method translates sentences from Bangla to English and then back to Bangla. The package uses the bn-en and en-bn models of BanglaT5 by CSEBUETNLP for this task. The models can be found here: <a href='https://huggingface.co/csebuetnlp/banglat5_nmt_bn_en'>bn2en</a>, <a href='https://huggingface.co/csebuetnlp/banglat5_nmt_en_bn'>en2bn</a>.
The paraphrasing toolkit uses the Bangla paraphrase model of BanglaT5 by CSEBUETNLP. The model can be found <a href='https://huggingface.co/csebuetnlp/banglat5_banglaparaphrase'>here</a>.
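The round trip itself is simple to express once the two translators are treated as plain callables. The following is a sketch only: `back_translate`, `bn_to_en`, and `en_to_bn` are hypothetical names, and the dictionary-backed stubs stand in for real translation models. Dropping round trips that come back unchanged is a choice made for this example, not something the package is known to do.

```python
from typing import Callable, Iterable, List


def back_translate(
    sentences: Iterable[str],
    bn_to_en: Callable[[str], str],
    en_to_bn: Callable[[str], str],
) -> List[str]:
    """Translate Bangla -> English -> Bangla and keep results that differ from the input."""
    augmented = []
    for original in sentences:
        english = bn_to_en(original)
        round_trip = en_to_bn(english)
        if round_trip != original:  # an identical round trip adds no new sample
            augmented.append(round_trip)
    return augmented


# Dictionary-backed stubs so the sketch runs without any model.
BN_EN = {"আমি ভাত খাই": "I eat rice"}
EN_BN = {"I eat rice": "আমি ভাত খাচ্ছি"}

print(back_translate(["আমি ভাত খাই"], BN_EN.get, EN_BN.get))
```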
The package supports both online and offline augmentation. Offline augmentation generates a new dataframe of augmented texts from the original dataframe, which can be kept in a variable or saved to a file for later use. While offline augmentation can be faster because it makes better use of processing power (GPU parallelism), having to save the augmented data every once in a while can get a bit annoying. Online augmentation is also popular: the data is augmented 'on the fly' inside a predefined custom dataset class. Sentences are augmented during training or inference, with no hassle of saving the data separately.
From <b>version 1.1.5</b>, I'm happy to introduce online augmentation techniques in this package. The design was inspired by <b>torchvision.transforms</b>, meaning you can stack several augmentation techniques with a <b>compose</b> class. You can also write your own custom augmentation or transform classes and use them with <b>compose</b>.
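As a rough picture of online augmentation, the sketch below applies an augmentation inside `__getitem__`, so nothing has to be saved ahead of time. The class name `OnlineAugmentedDataset` and the `augment` argument are assumptions for illustration; the class only implements `__len__` and `__getitem__`, the two methods a map-style PyTorch dataset provides, and does not import torch.

```python
from typing import Callable, Optional, Sequence


class OnlineAugmentedDataset:
    """Map-style dataset sketch that augments each text when it is fetched."""

    def __init__(
        self,
        texts: Sequence[str],
        augment: Optional[Callable[[str], str]] = None,
    ) -> None:
        self.texts = texts
        self.augment = augment  # any callable taking and returning a string

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, index: int) -> str:
        text = self.texts[index]
        # Augmentation happens here, at access time, not in a separate offline pass.
        return self.augment(text) if self.augment else text


# Stub augmentation so the sketch runs on its own.
dataset = OnlineAugmentedDataset(["আমি ভাত খাই"], augment=lambda t: t + " ।")
print(len(dataset), dataset[0])
```

With a randomized augmentation, each access can yield a different variant of the same sentence, which is the main difference from an offline pass.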
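The stacking idea can be sketched in a few lines of plain Python. This is modeled on how `torchvision.transforms.Compose` chains callables and is not the package's actual class; `Compose`, `CollapseWhitespace`, and `AppendDanda` below are hypothetical names written for this example.

```python
from typing import Callable, Iterable


class Compose:
    """Apply a sequence of text transforms one after another."""

    def __init__(self, transforms: Iterable[Callable[[str], str]]) -> None:
        self.transforms = list(transforms)

    def __call__(self, text: str) -> str:
        for transform in self.transforms:
            text = transform(text)
        return text


class CollapseWhitespace:
    """Custom transform: squeeze runs of whitespace into single spaces."""

    def __call__(self, text: str) -> str:
        return " ".join(text.split())


class AppendDanda:
    """Custom transform: end the sentence with a danda if it has no final mark."""

    def __call__(self, text: str) -> str:
        return text if text.endswith(("।", "?", "!")) else text + "।"


pipeline = Compose([CollapseWhitespace(), AppendDanda()])
print(pipeline("আমি   ভাত  খাই"))
```

Because every transform is just a callable from string to string, a custom class only needs a `__call__` method to fit into the chain.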
## Inspired by
- <a href='https://amitness.com/2020/05/data-augmentation-for-nlp/'>A Visual Survey of Data Augmentation in NLP</a>
- <a href='https://huggingface.co/csebuetnlp'>CSE BUET NLP</a>
- <a href='https://github.com/mnansary/bnUnicodeNormalizer'>Bangla Unicode Normalizer by Bengali Ai</a>
- <a href='https://github.com/sagorbrur/bnaug'>Bangla Text Augmentation</a>
<b>If you use this package, please don't forget to cite the links and papers mentioned.</b>