Name | bnaug |
Version | 1.1.2 |
Summary | bnaug is a text augmentation tool for Bangla text. |
home_page | https://github.com/sagorbrur/bnaug |
author | Sagor Sarker |
license | MIT |
requires_python | >=3.7 |
upload_time | 2023-08-30 16:29:07 |
requirements | No requirements were recorded. |
# bnaug (Bangla Text Augmentation)
__bnaug__ is a text augmentation tool for Bangla text.
## Installation
```
pip install bnaug
```
- Dependencies
  - pytorch >=1.7.0
## Demo Notebook
- [bnaug demo](https://github.com/sagorbrur/bnaug/blob/main/notebook/bnaug_demo.ipynb)
## Necessary Model Links
- [word2vec](https://huggingface.co/sagorsarker/bangla_word2vec/resolve/main/bangla_word2vec_gen4.zip)
- [glove vector](https://huggingface.co/sagorsarker/bangla-glove-vectors/resolve/main/bn_glove.300d.zip)
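
These archives need to be downloaded and unpacked before use. A minimal sketch of a download helper (the destination directory is an assumption; adjust paths to your setup):

```python
import urllib.request
import zipfile
from pathlib import Path

def download_and_extract(url, dest_dir):
    """Download a zip archive from `url` and extract it into `dest_dir`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / Path(url).name
    # urlretrieve handles http(s):// as well as local file:// URLs
    urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest

# Example (hypothetical local destination):
# download_and_extract(
#     "https://huggingface.co/sagorsarker/bangla_word2vec/resolve/main/bangla_word2vec_gen4.zip",
#     "models/bangla_word2vec",
# )
```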
## Sentence Augmentation
### Token Replacement
- Mask generation based augmentation
```py
from bnaug.sentence import TokenReplacement
tokr = TokenReplacement()
text = "আমি ঢাকায় বাস করি।"
output = tokr.masking_based(text, sen_n=5)
```
- Word2Vec based augmentation
```py
from bnaug.sentence import TokenReplacement
tokr = TokenReplacement()
text = "আমি ঢাকায় বাস করি।"
model = "msc/bangla_word2vec/bnwiki_word2vec.model"
output = tokr.word2vec_based(text, model=model, sen_n=5, word_n=5)
print(output)
```
- Glove based augmentation
```py
from bnaug.sentence import TokenReplacement
tokr = TokenReplacement()
text = "আমি ঢাকায় বাস করি।"
vector = "msc/bn_glove.300d.txt"
output = tokr.glove_based(text, vector_path=vector, sen_n=5, word_n=5)
print(output)
```
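
Conceptually, both vector-based methods substitute a word with its nearest neighbours in embedding space. A self-contained toy sketch of that idea, using a tiny hand-made embedding table (illustrative values, not bnaug's real word2vec/GloVe vectors):

```python
import math

# Toy 2-d embeddings: "ঢাকায়" (in Dhaka) and "চট্টগ্রামে" (in Chattogram)
# point in a similar direction, so they count as neighbours.
embeddings = {
    "ঢাকায়": [0.9, 0.1],
    "চট্টগ্রামে": [0.85, 0.2],
    "বাস": [0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbor(word):
    """Return the most similar other word in the toy table."""
    target = embeddings[word]
    others = [(w, cosine(target, vec)) for w, vec in embeddings.items() if w != word]
    return max(others, key=lambda pair: pair[1])[0]

def replace_token(tokens, word):
    """Swap every occurrence of `word` with its nearest embedding neighbour."""
    return [nearest_neighbor(t) if t == word else t for t in tokens]

print(replace_token(["আমি", "ঢাকায়", "বাস", "করি।"], "ঢাকায়"))
# -> ['আমি', 'চট্টগ্রামে', 'বাস', 'করি।']
```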
### Back Translation
Back-translation-based augmentation first translates a Bangla sentence to English and then translates the English back to Bangla.
```py
from bnaug.sentence import BackTranslation
bt = BackTranslation()
text = "বাংলা ভাষা আন্দোলন তদানীন্তন পূর্ব পাকিস্তানে সংঘটিত একটি সাংস্কৃতিক ও রাজনৈতিক আন্দোলন। "
output = bt.get_augmented_sentences(text)
print(output)
```
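
The round trip can be sketched with stand-in translators (plain lookup tables here as a hypothetical illustration; bnaug uses real translation models). Lexical choices may change on the way back, which is what produces the paraphrase:

```python
# Hypothetical stand-in translation tables, not bnaug's actual models.
bn_to_en = {"আমি": "I", "ঢাকায়": "in Dhaka", "থাকি": "live"}
en_to_bn = {"I": "আমি", "in Dhaka": "ঢাকায়", "live": "বাস করি"}

def translate(tokens, table):
    # Unknown tokens pass through unchanged.
    return [table.get(t, t) for t in tokens]

def back_translate(tokens):
    """Bangla -> English -> Bangla round trip."""
    english = translate(tokens, bn_to_en)
    return translate(english, en_to_bn)

print(back_translate(["আমি", "ঢাকায়", "থাকি"]))
# -> ['আমি', 'ঢাকায়', 'বাস করি']  ("থাকি" came back as the synonym "বাস করি")
```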
### Text Generation
- Paraphrase generation
```py
from bnaug.sentence import TextGeneration
tg = TextGeneration()
text = "বিমানটি যখন মাটিতে নামার জন্য এয়ারপোর্টের কাছাকাছি আসছে, তখন ল্যান্ডিং গিয়ারের খোপের ঢাকনাটি খুলে যায়।"
output = tg.parapharse_generation(text)
print(output)
```
### Random Augmentation
- Randomly remove parts of a sentence and generate new sentences
At present it removes words, stopwords, punctuation, and numbers to generate new sentences
```py
from bnaug.sentence import RandomAugmentation
raug = RandomAugmentation()
sentence = "আমি ১০০ বাকি দিলাম"
output = raug.random_remove(sentence)
print(output)
```
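
The underlying idea of random word removal can be sketched in a few lines (a toy reimplementation for illustration, not bnaug's actual code; the `seed` parameter is added here for reproducibility):

```python
import random

def remove_random_word(sentence, seed=None):
    """Drop one randomly chosen word from the sentence."""
    rng = random.Random(seed)
    words = sentence.split()
    if len(words) <= 1:
        return sentence  # nothing sensible to remove
    words.pop(rng.randrange(len(words)))
    return " ".join(words)

print(remove_random_word("আমি ১০০ বাকি দিলাম", seed=0))
```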
or apply individually
```py
from bnaug import randaug
text = "১০০ বাকি দিলাম"
output = randaug.remove_digits(text)
print(output)
text = "১০০! বাকি দিলাম?"
output = randaug.remove_punctuations(text)
print(output)
text = "আমি ১০০ বাকি দিলাম"
output = randaug.remove_stopwords(text)
print(output)
text = "আমি ১০০ বাকি দিলাম"
output = randaug.remove_random_word(text)
print(output)
text = "আমি ১০০ বাকি দিলাম"
output = randaug.remove_random_char(text)
print(output)
```
## Inspired by
- [nlpaug](https://github.com/makcedward/nlpaug)
- [amitness blog post](https://amitness.com/2020/05/data-augmentation-for-nlp/)