perturbers

- Name: perturbers
- Version: 0.0.0
- Home page: https://github.com/FairNLP/perturbers
- Summary: Perturber models for neural data augmentation.
- Upload time: 2024-03-03 13:19:51
- Author: FairNLP
- Requires Python: >=3.7
- License: Apache Software License 2.0
- Keywords: perturbers
- Requirements: No requirements were recorded.

# Perturbers

This codebase builds on the work of [Qian et al. (2022)](https://arxiv.org/abs/2205.12586). With this
library, you can integrate neural augmentation models into your NLP pipelines or train new perturber models from
scratch.

# Installation

`perturbers` is available on PyPI and can be installed using pip:

```bash
pip install perturbers
```

# Usage

Using a perturber is as simple as creating a new instance of the `Perturber` class and calling the `generate` method
with the sentence you want to perturb along with the target word and the attribute you want to change:

```python
from perturbers import Perturber

perturber = Perturber()
unperturbed = "Jack was passionate about rock climbing and his love for the sport was infectious to all men around him."
perturber.generate(unperturbed, "Jack", "female")
# "Jane was passionate about rock climbing and her love for the sport was infectious to all men around her."
```

You can also perturb a sentence without specifying a target word or attribute:

```python
perturber("Jack was passionate.", retry_unchanged=True)
# "Jackie was passionate."
```
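To use perturbation as a data augmentation step, you can wrap the call in a small helper. The following is a minimal sketch, not part of the library: `augment` and the stand-in `swap_name` function are hypothetical names, and the helper assumes the perturbation call returns the perturbed sentence as a string, as the comments in the examples above suggest.

```python
from typing import Callable, Iterable, List


def augment(sentences: Iterable[str], perturb: Callable[[str], str]) -> List[str]:
    """Return the original sentences followed by their distinct perturbed variants."""
    originals = list(sentences)
    seen = set(originals)
    augmented = list(originals)
    for sentence in originals:
        candidate = perturb(sentence)
        # Skip outputs that are unchanged or already present in the data.
        if candidate not in seen:
            seen.add(candidate)
            augmented.append(candidate)
    return augmented


def swap_name(sentence: str) -> str:
    """Stand-in for a real perturber so the sketch runs without downloading a model."""
    return sentence.replace("Jack", "Jackie")


corpus = ["Jack was passionate.", "It rained all day."]
print(augment(corpus, swap_name))
```

With the library installed, the stand-in could be replaced by the call shown above, for example `augment(corpus, lambda s: perturber(s, retry_unchanged=True))`.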

## Training a new perturber model

To train a new perturber model, see the `train_perturber.py` script, which trains a new perturber
model on the PANDA dataset. The script currently supports only BART models, although in principle any
encoder-decoder model could be used.

Perturber models are evaluated based on the following metrics:

- `bleu4`: The 4-gram BLEU score of the perturbed sentence compared to the original sentence
- `perplexity`: The perplexity of the perturbed sentence
- `perplexity_perturbed`: The perplexity of only the perturbed tokens from the perturbed sentence
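
The difference between the two perplexity metrics can be illustrated with a small sketch. This is not the implementation used by this codebase: the function names, the made-up token probabilities, and the assumption that source and output tokens are aligned position by position are all simplifications for illustration. Perplexity is computed here as the exponential of the mean negative log-likelihood.

```python
import math
from typing import Sequence


def perplexity(log_probs: Sequence[float]) -> float:
    """Exponential of the mean negative log-likelihood (natural log)."""
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))


def perplexity_perturbed(
    log_probs: Sequence[float],
    source_tokens: Sequence[str],
    output_tokens: Sequence[str],
) -> float:
    """Perplexity over only the positions where the output differs from the source.

    Assumes both token sequences have equal length and are position-aligned,
    which is a simplification made for this sketch.
    """
    if not (len(log_probs) == len(source_tokens) == len(output_tokens)):
        raise ValueError("sequences must have the same length")
    changed = [
        lp for lp, src, out in zip(log_probs, source_tokens, output_tokens) if src != out
    ]
    return perplexity(changed)


source = "Jack loves his dog".split()
output = "Jane loves her dog".split()
# Made-up probabilities assigned to each output token: copied tokens are
# easy to predict, perturbed tokens are harder.
token_probs = [0.25, 0.99, 0.25, 0.99]
log_probs = [math.log(p) for p in token_probs]

full = perplexity(log_probs)
changed_only = perplexity_perturbed(log_probs, source, output)
print(f"full sentence: {full:.3f}, perturbed tokens only: {changed_only:.3f}")
```

Because the copied tokens pull the full-sentence value toward 1, the perturbed-only value is the more informative of the two when most of the sentence is unchanged.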

## Pre-trained models

In addition to the codebase, we also provide pre-trained perturber models in a variety of sizes:

| Model                                                             | Base model                                                   | Parameters | Perplexity | Perplexity (perturbed idx)* | BLEU4  |
|-------------------------------------------------------------------|--------------------------------------------------------------|------------|------------|-----------------------------|--------|
| [perturber-small](https://huggingface.co/fairnlp/perturber-small) | [bart-small](https://huggingface.co/lucadiliello/bart-small) | 70M        | 1.076      | 4.079                       | 0.822  |
| [perturber-base](https://huggingface.co/fairnlp/perturber-base)   | [bart-base](https://huggingface.co/facebook/bart-base)       | 139M       | 1.058      | 2.769                       | 0.794  |
| [perturber (original)](https://huggingface.co/facebook/perturber) | [bart-large](https://huggingface.co/facebook/bart-large)     | 406M       | 1.06**     | N/A                         | 0.88** |

*Perplexity measured only on the perturbed tokens. Because most tokens remain unchanged, perplexity over the full
sentence approaches 1.

**The perplexity and BLEU4 scores are those reported in the original paper and not measured via this codebase.

# Roadmap

- [x] Add default perturber model
- [x] Pretrain small and medium perturber models
- [ ] Train model to identify target words and attributes
- [ ] Add training of unconditional perturber models (i.e. only get a sentence, no target word/attribute)
- [ ] Add self-training by pretraining perturber base model (e.g. BART) on self-perturbed data

Other features could include:

- [ ] Data cleaning of PANDA (remove non-target perturbations)
- [ ] Multilingual perturbation

# Read more

- [Original perturber paper](https://aclanthology.org/2022.emnlp-main.646/)

            
