<p align="center">
<a href="https://github.com/ai-forever/augmentex/blob/main/LICENSE">
<img alt="License" src="https://img.shields.io/badge/License-MIT-yellow.svg">
</a>
<a href="https://github.com/ai-forever/augmentex/releases">
<img alt="Release" src="https://img.shields.io/badge/release-v1.2.1-blue">
</a>
<a href="https://arxiv.org/abs/2308.09435">
<img alt="Paper" src="https://img.shields.io/badge/arXiv-2308.09435-red">
</a>
</p>
# Augmentex — a library for augmenting texts with errors
Augmentex introduces rule-based and common-statistics-based (powered by the [KartaSlov](https://kartaslov.ru) project)
approaches to inserting errors into text. It is described in detail in the [Paper](https://www.dialog-21.ru/media/5914/martynovnplusetal056.pdf)
and in this 🗣️[Talk](https://youtu.be/yFfkV0Qjuu0?si=XmKfocCSLnKihxS_).
## Contents
- [Augmentex — a library for augmenting texts with errors](#augmentex--a-library-for-augmenting-texts-with-errors)
  - [Contents](#contents)
  - [Installation](#installation)
  - [Implemented functionality](#implemented-functionality)
  - [Usage](#usage)
    - [**Word level**](#word-level)
    - [**Character level**](#character-level)
    - [**Batch processing**](#batch-processing)
    - [**Compute your own statistics**](#compute-your-own-statistics)
    - [**Google Colab example**](#google-colab-example)
  - [Contributing](#contributing)
    - [Issue](#issue)
    - [Pull request](#pull-request)
  - [References](#references)
  - [Authors](#authors)
## Installation
```commandline
pip install augmentex
```
## Implemented functionality
We collected statistics from different languages and from different input sources. This table shows what functionality the library currently supports.
| | Russian | English |
| -----------:|:-----------:|:-----------:|
| PC keyboard | ✅ | ✅ |
| Mobile kb | ✅ | ❌ |
We plan to extend the library to new languages and additional input sources.
## Usage
🖇️ Augmentex lets you corrupt text at two levels of granularity and offers a set of
methods suited to each level:
- **Word level**:
  - _replace_ - replace a random word with its incorrect counterpart;
  - _delete_ - delete a random word;
  - _swap_ - swap two random words;
  - _stopword_ - add random words from a stop-list;
  - _split_ - add spaces between the letters of a word;
  - _reverse_ - change the case of the first letter of a random word;
  - _text2emoji_ - change a word to the corresponding emoji;
  - _ngram_ - replace n-grams in a word with erroneous ones.
- **Character level**:
  - _shift_ - randomly swap upper / lower case in a string;
  - _orfo_ - substitute correct characters with their common incorrect counterparts;
  - _typo_ - substitute correct characters as if they were mistyped on a keyboard;
  - _delete_ - delete a random character;
  - _insert_ - insert a random character;
  - _multiply_ - repeat a random character;
  - _swap_ - swap two adjacent characters.
### **Word level**
```python
from augmentex import WordAug

word_aug = WordAug(
    unit_prob=0.4, # Percentage of the phrase to which augmentations will be applied
    min_aug=1, # Minimum number of augmentations
    max_aug=5, # Maximum number of augmentations
    lang="eng", # supports: "rus", "eng"
    platform="pc", # supports: "pc", "mobile"
    random_seed=42,
)
```
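The way `unit_prob`, `min_aug` and `max_aug` interact can be pictured with a small sketch. This is an assumption about the behaviour, not the library's actual code: the helper name `augmentation_count` is hypothetical, and the exact rounding and clamping rules inside Augmentex may differ.

```python
def augmentation_count(n_units: int, unit_prob: float, min_aug: int, max_aug: int) -> int:
    """Hypothetical helper: how many units (words or characters) to corrupt."""
    # Scale the number of units by unit_prob, then clamp to [min_aug, max_aug].
    count = round(n_units * unit_prob)
    count = max(count, min_aug)
    count = min(count, max_aug)
    # Never request more augmentations than there are units.
    return min(count, n_units)


# "Screw you guys, I am going home. (c)" has 8 whitespace-separated words.
words = "Screw you guys, I am going home. (c)".split()
print(augmentation_count(len(words), unit_prob=0.4, min_aug=1, max_aug=5))  # 3
```

Under this reading, the eight-word sample phrase gets three word-level augmentations, which is consistent with the _replace_ and _delete_ outputs shown in this section, where three words are affected.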
1. Replace a random word with its incorrect counterpart;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="replace")
# Screw to guys, I to going com. (c)
```
2. Delete random word;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="delete")
# you I am home. (c)
```
3. Swap two random words;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="swap")
# Screw I guys, am home. going you (c)
```
4. Add random words from stop-list;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="stopword")
# like Screw you guys, I am going completely home. by the way (c)
```
5. Add spaces between the letters of a word;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="split")
# Screw y o u guys, I am going h o m e . (c)
```
6. Change the case of the first letter of a random word;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="reverse")
# Screw You guys, i Am going home. (c)
```
7. Change a word to the corresponding emoji;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="text2emoji")
# Screw you guys, I am going home. (c)
```
8. Replace n-grams in a word with erroneous ones.
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="ngram")
# Scren you guys, I am going home. (c)
```
### **Character level**
```python
from augmentex import CharAug

char_aug = CharAug(
    unit_prob=0.3, # Percentage of the phrase to which augmentations will be applied
    min_aug=1, # Minimum number of augmentations
    max_aug=5, # Maximum number of augmentations
    mult_num=3, # Maximum number of repetitions of characters (only for the multiply method)
    lang="eng", # supports: "rus", "eng"
    platform="pc", # supports: "pc", "mobile"
    random_seed=42,
)
```
1. Randomly swap upper / lower case in a string;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="shift")
# Screw YoU guys, I am going Home. (C)
```
2. Substitute correct characters with their common incorrect counterparts;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="orfo")
# Sedew you guya, I am going home. (c)
```
3. Substitute correct characters as if they are mistyped on a keyboard;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="typo")
# Sxrew you gugs, I am going home. (x)
```
4. Delete random character;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="delete")
# crew you guys Iam goinghme. (c)
```
5. Insert random character;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="insert")
# Screw you ughuys, I vam gcoing hxome. (c)
```
6. Multiply random character;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="multiply")
# Screw yyou guyss, I am ggoinng home. (c)
```
7. Swap two adjacent characters.
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="swap")
# Srcewy ou guys,I am oging hmoe. (c)
```
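To make the idea behind the _typo_ action concrete, here is a minimal pure-Python sketch of keyboard-neighbour substitution. It does not use Augmentex and is not its implementation: the `NEIGHBOURS` map is a hypothetical, heavily truncated QWERTY layout, and the function name `typo` and its parameters are illustrative only. The library relies on its own collected statistics instead.

```python
import random

# Hypothetical, heavily truncated QWERTY neighbour map (an assumption for illustration).
NEIGHBOURS = {
    "c": "xv",
    "y": "tu",
    "o": "ip",
    "a": "sq",
    "e": "wr",
}


def typo(text: str, n_typos: int, seed: int = 42) -> str:
    """Replace up to n_typos characters with a neighbouring key."""
    rng = random.Random(seed)
    # Only positions whose character has known neighbours can be corrupted.
    candidates = [i for i, ch in enumerate(text) if ch in NEIGHBOURS]
    chars = list(text)
    for i in rng.sample(candidates, k=min(n_typos, len(candidates))):
        chars[i] = rng.choice(NEIGHBOURS[chars[i]])
    return "".join(chars)


print(typo("Screw you guys, I am going home. (c)", n_typos=3))
```

Using a seeded `random.Random` instance keeps the output reproducible, which mirrors the role of `random_seed` in the constructor above.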
### **Batch processing**
📁 To process texts in batches, call the `aug_batch` method instead of `augment` and pass it a list of strings.
```python
from augmentex import WordAug

word_aug = WordAug(
    unit_prob=0.4, # Percentage of the phrase to which augmentations will be applied
    min_aug=1, # Minimum number of augmentations
    max_aug=5, # Maximum number of augmentations
    lang="eng", # supports: "rus", "eng"
    platform="pc", # supports: "pc", "mobile"
    random_seed=42,
)

text_list = ["Screw you guys, I am going home. (c)"] * 10
word_aug.aug_batch(text_list, batch_prob=0.5) # without action

text_list = ["Screw you guys, I am going home. (c)"] * 10
word_aug.aug_batch(text_list, batch_prob=0.5, action="replace") # with action
```
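One plausible reading of `batch_prob` is "the share of texts in the batch that get augmented". The sketch below illustrates that reading with a stand-in function; `aug_batch_sketch` and its `augment` callable are hypothetical, and whether Augmentex picks an exact share or samples each text independently is an assumption that should be checked against the library.

```python
import random
from typing import Callable, List


def aug_batch_sketch(
    texts: List[str],
    augment: Callable[[str], str],
    batch_prob: float,
    seed: int = 42,
) -> List[str]:
    """Hypothetical stand-in: augment a batch_prob share of the texts, keep the rest."""
    rng = random.Random(seed)
    n_to_augment = round(len(texts) * batch_prob)
    # Pick which positions in the batch will be corrupted.
    chosen = set(rng.sample(range(len(texts)), k=n_to_augment))
    return [augment(t) if i in chosen else t for i, t in enumerate(texts)]


text_list = ["Screw you guys, I am going home. (c)"] * 10
result = aug_batch_sketch(text_list, augment=str.upper, batch_prob=0.5)
print(sum(r != t for r, t in zip(result, text_list)))  # 5
```

Here `str.upper` only stands in for a real augmentation so that the changed texts are easy to count.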
### **Compute your own statistics**
📊 To use your own statistics for the _replace_ and _orfo_ methods, specify the paths to two parallel corpora: one with error-free texts and one with the same texts containing errors.
Example of txt files:
<table style="width: 100%;">
<tbody>
<tr>
<th>texts_without_errors.txt</th>
<th>texts_with_errors.txt</th>
</tr>
<tr>
<td style="width: 1%;">
<p dir="auto">some text without errors 1<br>
some text without errors 2<br>
some text without errors 3<br>
...</p>
</td>
<td style="width: 1%;">
<p dir="auto">some text with errors 1<br>
some text with errors 2<br>
some text with errors 3<br>
...</p>
</td>
</tr>
</tbody></table>
```python
from augmentex import WordAug

word_aug = WordAug(
    unit_prob=0.4, # Percentage of the phrase to which augmentations will be applied
    min_aug=1, # Minimum number of augmentations
    max_aug=5, # Maximum number of augmentations
    lang="eng", # supports: "rus", "eng"
    platform="pc", # supports: "pc", "mobile"
    random_seed=42,
    correct_texts_path="correct_texts.txt",
    error_texts_path="error_texts.txt",
)
```
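As a rough illustration of what word-level statistics from parallel corpora can look like, the sketch below counts how often each correct word appears as a particular misspelling. It is not the library's algorithm: the function name `replacement_stats` is hypothetical, the alignment is a naive position-by-position match, and in-memory lists stand in for the two text files.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def replacement_stats(
    correct_lines: Iterable[str], error_lines: Iterable[str]
) -> Dict[str, Counter]:
    """Count how often each correct word appears as a given misspelling."""
    stats: Dict[str, Counter] = defaultdict(Counter)
    for correct, error in zip(correct_lines, error_lines):
        correct_words, error_words = correct.split(), error.split()
        # Skip line pairs that cannot be aligned word by word.
        if len(correct_words) != len(error_words):
            continue
        for good, bad in zip(correct_words, error_words):
            if good != bad:
                stats[good][bad] += 1
    return stats


correct = ["I am going home", "see you tomorrow"]
errors = ["I am goin home", "see yuo tomorow"]
print(dict(replacement_stats(correct, errors)))
```

The two corpora must be parallel line by line for any such counting to work, which is why the files are expected to hold the same texts in the same order.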
### **Google Colab example**
You can familiarize yourself with the usage in the example [![Try In Colab!](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1azYUsAd1ofvBI_sPrMftX_ioaspvjEOg?usp=sharing)
## Contributing
### Issue
- If you see an open issue and are willing to work on it, assign yourself to it and write how much time the fix will take. See the Pull request section below.
- If you want to add something new or if you find a bug, you should start by creating a new issue and describing the problem/feature. Don't forget to include the appropriate labels.
### Pull request
How to make a pull request:
1. Clone the repository;
2. Create a new branch, for example `git checkout -b issue-id-short-name`;
3. Make changes to the code (make sure you are definitely working in the new branch);
4. `git push`;
5. Create a pull request to the `develop` branch;
6. Add a brief description of the work done;
7. Expect comments from the authors.
## References
- [SAGE](https://github.com/ai-forever/sage) - a superlib developed jointly with our friends from the AGI NLP team, which provides advanced spelling corruption and spell-checking techniques, including ones that use Augmentex.
## Authors
- [Aleksandr Abramov](https://github.com/Ab1992ao) — Source code and algorithm author;
- [Mark Baushenko](https://github.com/e0xextazy) — Source code lead developer.