textnoisr

Name	textnoisr
Version	1.1.1
home_page	https://preligens-lab.github.io/textnoisr
Summary	Add noise to text at the character level
upload_time	2024-03-01 19:20:16
maintainer	Félix Martel
docs_url	None
author	Lilian Sanselme
requires_python	>=3.10,<3.12
license	BSD-2-Clause
keywords	nlp natural language processing text augmentation ocr typo
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# `textnoisr`: Adding random noise to a dataset


[![build-doc](https://github.com/preligens-lab/textnoisr/actions/workflows/build-doc.yml/badge.svg)](https://github.com/preligens-lab/textnoisr/actions/workflows/build-doc.yml)
[![code-style](https://github.com/preligens-lab/textnoisr/actions/workflows/code-style.yml/badge.svg)](https://github.com/preligens-lab/textnoisr/actions/workflows/code-style.yml)
[![nightly-test](https://github.com/preligens-lab/textnoisr/actions/workflows/nightly-test.yml/badge.svg)](https://github.com/preligens-lab/textnoisr/actions/workflows/nightly-test.yml)
[![unit-test](https://github.com/preligens-lab/textnoisr/actions/workflows/unit-test.yml/badge.svg)](https://github.com/preligens-lab/textnoisr/actions/workflows/unit-test.yml)



`textnoisr` is a Python package that lets you **add random noise to a text dataset**
and **control very accurately** the quality of the result.

Here is an example where the dataset consists of the first few lines of [the Zen of Python](https://peps.python.org/pep-0020/):

**Raw text**

```
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
...
```

**Noisy text**

```
TheO Zen of Python, by Tim Pfter

BzeautiUful is ebtter than ugly.
Eqxplicin is better than imlicit.
Simple is beateUr than comdplex.
Complex is better than comwlicated.
Flat is bejAter than neseed.
...
```

Four types of "actions" are implemented:

* **insert** a random character, e.g.        STEAM  →  ST<span style="background-color:LightBlue">R</span>EAM,
* **delete** a random character, e.g.        <span style="background-color:Crimson">S</span>TEAM  →  TEAM,
* **substitute** a random character, e.g.    STEA<span style="background-color:LightGreen">M</span>  →  STEA<span style="background-color:LightGreen">L</span>,
* **swap** two **consecutive** characters, e.g.  STE<span style="background-color:Orange">AM</span>  →  STE<span style="background-color:Orange">MA</span>.
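
As a minimal sketch, the four actions can be written as plain string operations. The function names below are hypothetical and are not part of the `textnoisr` API. Positions and characters are passed explicitly here so that the result is deterministic, whereas the package picks them at random.

```python
def insert_char(text: str, index: int, char: str) -> str:
    """Insert `char` before position `index`."""
    return text[:index] + char + text[index:]


def delete_char(text: str, index: int) -> str:
    """Remove the character at position `index`."""
    return text[:index] + text[index + 1:]


def substitute_char(text: str, index: int, char: str) -> str:
    """Replace the character at position `index` with `char`."""
    return text[:index] + char + text[index + 1:]


def swap_chars(text: str, index: int) -> str:
    """Exchange the characters at positions `index` and `index + 1`."""
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


print(insert_char("STEAM", 2, "R"))      # STREAM
print(delete_char("STEAM", 0))           # TEAM
print(substitute_char("STEAM", 4, "L"))  # STEAL
print(swap_chars("STEAM", 3))            # STEMA
```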


The general philosophy of the package is that only **one single parameter**
is needed to control the noise level.
This "noise level" is applied character-wise,
and corresponds _roughly_ to the probability for a character to be impacted.

More precisely, this noise level is calibrated so that
the [Character Error Rate](https://huggingface.co/spaces/evaluate-metric/cer)
of a noised dataset converges to this value as the amount of text increases.
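
To make the metric concrete, here is a small self-contained sketch of the Character Error Rate, computed as the Levenshtein distance divided by the length of the reference text. The helper `substitute_noise` is a hypothetical stand-in written for this illustration and is not the package's implementation. It only shows the kind of convergence described above on a random sample.

```python
import random
import string


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # delete char_a
                    current[j - 1] + 1,                    # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by the reference length."""
    return levenshtein(reference, hypothesis) / len(reference)


def substitute_noise(text: str, p: float, rng: random.Random) -> str:
    """Hypothetical helper: replace each letter by a different one with probability p."""
    noisy = []
    for char in text:
        if rng.random() < p:
            char = rng.choice([c for c in string.ascii_lowercase if c != char])
        noisy.append(char)
    return "".join(noisy)


rng = random.Random(0)
clean = "".join(rng.choice(string.ascii_lowercase) for _ in range(1500))
noisy = substitute_noise(clean, p=0.1, rng=rng)

print(cer("STEAM", "STREAM"))  # 0.2: one insertion over five reference characters
print(cer(clean, noisy))       # close to 0.1 on this sample, not exactly equal
```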


**Why a whole package for such a simple task?**

> In the case of inserting, deleting and substituting characters at random with a probability $p$,
> the Character Error Rate is simply the average number of those operations per character,
> so it will converge to the input value $p$ due to the
> [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers).
>
> However, the case of swapping consecutive characters is **not trivial at all** for two reasons:
>
> * First, swapping two characters is not an "atomic operation" with respect to the Character Error Rate metric.
>
> * Second, we do not want to swap the same character over and over again
> when the probability of applying the swap action is high:<br>
> <span style="background-color:Orange">ST</span>EAM  →  <span style="background-color:Orange">TS</span>EAM<br>
> T<span style="background-color:Orange">SE</span>AM  →  T<span style="background-color:Orange">ES</span>AM<br>
> TE<span style="background-color:Orange">SA</span>M  →  TE<span style="background-color:Orange">AS</span>M<br>
> TEA<span style="background-color:Orange">SM</span>  →  TEA<span style="background-color:Orange">MS</span><br>
> This would be equivalent to <span style="background-color:Orange">S</span>TEAM  →
> TEAM<span style="background-color:Orange">S</span>, so this cannot be considered "swapping consecutive characters".
> To prevent this, a character that has just been swapped must not be swapped again.
> This breaks the independence between one character and the following one,
> and makes the Law of Large Numbers inapplicable.
>
> We use Markov chains to model the swapping of characters.
> This allows us to compute and correct the corresponding bias, making it straightforward
> for the user to get the desired Character Error Rate, as if the Law of Large Numbers could be applied!
>
> All the details of this unbiasing [are here](docs/swap_unbiasing.md).
> The goal of this package is for the user to be confident in the result
> without worrying about the implementation details.
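
The swap problem can be reproduced with a short sketch. The two functions below are hypothetical and written only for this illustration. In particular, `guarded_swap` uses one simple rule (skip the character that was just moved) as an assumption about how re-swapping might be avoided, and it does not include the bias correction that the package applies.

```python
import random


def naive_swap(text: str, p: float, rng: random.Random) -> str:
    """Swap positions i and i+1 with probability p, at every position in turn."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def guarded_swap(text: str, p: float, rng: random.Random) -> tuple[str, int]:
    """Same, but a character that has just been swapped is never swapped again."""
    chars = list(text)
    swaps = 0
    i = 0
    while i < len(chars) - 1:
        if rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            swaps += 1
            i += 2  # skip the character that was just moved forward
        else:
            i += 1
    return "".join(chars), swaps


rng = random.Random(0)
print(naive_swap("STEAM", 1.0, rng))    # TEAMS: the "S" drifts to the end
print(guarded_swap("STEAM", 1.0, rng))  # ('TSAEM', 2)

# For this guarded rule, each decision consumes 2 characters with probability p
# and 1 character otherwise, so the long-run swap rate is p / (1 + p), not p.
p = 0.3
_, swaps = guarded_swap("x" * 200_000, p, rng)
print(swaps / 200_000, p / (1 + p))  # the two values should be close
```

In this sketch the observed swap rate differs from the requested probability, and under plain Levenshtein distance a swap of two different adjacent characters counts as two edits rather than one. Both effects have to be accounted for before the requested noise level can match the resulting Character Error Rate.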


---


The documentation follows this plan:

* You may want to follow [a quick tutorial](docs/tutorial.md) to learn the basics of the package.
* The [Results](docs/results.md) page illustrates how **no calibration is needed** in order to add noise to a corpus with a target Character Error Rate.
* The [How this works section](docs/how_this_works.md) explains the mechanisms and some design choices of this package.
We have been extra careful to explain how some statistical biases have been avoided,
so that the package is both user-friendly and correct.
A [dedicated page](docs/swap_unbiasing.md) dives deep into the case of the `swap` action.
* The [API Reference](docs/api.md) details all the technical descriptions needed.

There is also [a Medium article](https://medium.com/earthcube-stories/textnoisr-a-journey-into-noise-calibration-for-nlp-4d39279ef0f6) about this project.

Raw data

            {
    "_id": null,
    "home_page": "https://preligens-lab.github.io/textnoisr",
    "name": "textnoisr",
    "maintainer": "F\u00e9lix Martel",
    "docs_url": null,
    "requires_python": ">=3.10,<3.12",
    "maintainer_email": "felix.martel@preligens.com",
    "keywords": "nlp,natural language processing,text,augmentation,ocr,typo",
    "author": "Lilian Sanselme",
    "author_email": "lilian.sanselme@preligens.com",
    "download_url": "https://files.pythonhosted.org/packages/b0/c7/f909cee8b92766ef5403c13afebdcb2fdac564afc55fb74f4ad0a585f653/textnoisr-1.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD-2-Clause",
    "summary": "Add noise to text at the character level",
    "version": "1.1.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/preligens-lab/textnoisr/issues",
        "Documentation": "https://preligens-lab.github.io/textnoisr",
        "Homepage": "https://preligens-lab.github.io/textnoisr",
        "Repository": "https://github.com/preligens-lab/textnoisr"
    },
    "split_keywords": [
        "nlp",
        "natural language processing",
        "text",
        "augmentation",
        "ocr",
        "typo"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "77e5cfff5334d0d9877d036697d0098bc52b9686073e2761ee32f9d16ab55725",
                "md5": "aa6b522ddb1a8ac8cf525fea260f733a",
                "sha256": "e99b545fdb51aeb2120ba3a849633a2d870da1329d657f43d5286758aeb3f339"
            },
            "downloads": -1,
            "filename": "textnoisr-1.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "aa6b522ddb1a8ac8cf525fea260f733a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10,<3.12",
            "size": 9567,
            "upload_time": "2024-03-01T19:20:13",
            "upload_time_iso_8601": "2024-03-01T19:20:13.523666Z",
            "url": "https://files.pythonhosted.org/packages/77/e5/cfff5334d0d9877d036697d0098bc52b9686073e2761ee32f9d16ab55725/textnoisr-1.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b0c7f909cee8b92766ef5403c13afebdcb2fdac564afc55fb74f4ad0a585f653",
                "md5": "809f97f244ebd41f56efbddc2e8ac0a9",
                "sha256": "de2b510a05f9771fe85e7b76cfb4d8a1b944176ed52ba34c47fdbd2d907f1e1b"
            },
            "downloads": -1,
            "filename": "textnoisr-1.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "809f97f244ebd41f56efbddc2e8ac0a9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10,<3.12",
            "size": 10728,
            "upload_time": "2024-03-01T19:20:16",
            "upload_time_iso_8601": "2024-03-01T19:20:16.010693Z",
            "url": "https://files.pythonhosted.org/packages/b0/c7/f909cee8b92766ef5403c13afebdcb2fdac564afc55fb74f4ad0a585f653/textnoisr-1.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-01 19:20:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "preligens-lab",
    "github_project": "textnoisr",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "textnoisr"
}
