# Perturbers
This codebase is built upon the great work of [Qian et al. (2022)](https://arxiv.org/abs/2205.12586). Using this
library, you can easily integrate neural augmentation models into your NLP pipelines or train new perturber models
from scratch.
# Installation
`perturbers` is available on PyPI and can be installed using pip:
```bash
pip install perturbers
```
# Usage
Using a perturber is as simple as creating a new instance of the `Perturber` class and calling its `generate` method
with the sentence you want to perturb, the target word, and the attribute you want to change:
```python
from perturbers import Perturber
perturber = Perturber()
unperturbed = "Jack was passionate about rock climbing and his love for the sport was infectious to all men around him."
perturber.generate(unperturbed, "Jack", "female")
# "Jane was passionate about rock climbing and her love for the sport was infectious to all men around her."
```
You can also perturb a sentence without specifying a target word or attribute:
```python
perturber("Jack was passionate.", retry_unchanged=True)
# "Jackie was passionate."
```
## Training a new perturber model
To train a new perturber model, take a look at the `train_perturber.py` script, which trains a new perturber model on
the PANDA dataset. Currently, the scripts only support training BART models out of the box, but in principle any
encoder-decoder model can be used.
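The training objective itself is a standard sequence-to-sequence loss over PANDA's (original, perturbed) pairs. The
sketch below is illustrative only, not the actual `train_perturber.py`: the `<PERT_SEP>` separator and the exact
conditioning format (target word and attribute prepended to the source text) are assumptions based on Qian et al.
(2022).

```python
# Illustrative sketch of one seq2seq training step, not the actual
# train_perturber.py. The "<PERT_SEP>" separator and conditioning format
# are assumptions based on Qian et al. (2022).
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# One PANDA-style pair: conditioned source -> perturbed target
source = "Jack, female <PERT_SEP> Jack was passionate about rock climbing."
target = "Jane was passionate about rock climbing."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Standard teacher-forced cross-entropy loss of an encoder-decoder model
loss = model(**inputs, labels=labels).loss
loss.backward()
```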
Perturber models are evaluated based on the following metrics (a rough sketch of how they could be computed is shown
after the list):
- `bleu4`: The 4-gram BLEU score of the perturbed sentence compared to the original sentence
- `perplexity`: The perplexity of the perturbed sentence
- `perplexity_perturbed`: The perplexity of only the perturbed tokens from the perturbed sentence
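The sketch below shows one way these metrics could be computed; it is not the codebase's implementation. In
particular, computing perplexity under the perturber model itself, teacher-forced on the reference perturbation, is
an assumption (made because the reported full-sentence scores approach 1).

```python
# Hedged sketch of the evaluation metrics; the actual implementation in this
# codebase may differ.
import torch
from nltk.translate.bleu_score import sentence_bleu


def bleu4(original: str, perturbed: str) -> float:
    # 4-gram BLEU of the perturbed sentence against the original
    return sentence_bleu([original.split()], perturbed.split(),
                         weights=(0.25, 0.25, 0.25, 0.25))


def perplexity(model, tokenizer, source: str, target: str) -> float:
    # exp of the teacher-forced cross-entropy over the target tokens;
    # scoring under the perturber model itself is an assumption
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss
    return torch.exp(loss).item()
```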
## Pre-trained models
In addition to the codebase, we provide pre-trained perturber models in a variety of sizes:
| | Base model | Parameters | Perplexity | Perplexity (perturbed idx)* | BLEU4 |
|-------------------------------------------------------------------|--------------------------------------------------------------|------------|------------|-----------------------------|--------|
| [perturber-small](https://huggingface.co/fairnlp/perturber-small) | [bart-small](https://huggingface.co/lucadiliello/bart-small) | 70M | 1.076 | 4.079 | 0.822 |
| [perturber-base](https://huggingface.co/fairnlp/perturber-base) | [bart-base](https://huggingface.co/facebook/bart-base) | 139M | 1.058 | 2.769 | 0.794 |
| [perturber (original)](https://huggingface.co/facebook/perturber) | [bart-large](https://huggingface.co/facebook/bart-large) | 406M | 1.06** | N/A | 0.88** |
*Measures the perplexity of the perturbed tokens only; since the majority of tokens remain unchanged, full-sentence
perplexity scores approach 1.

**Perplexity and BLEU4 scores are those reported in the original paper, not measured with this codebase.
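Since the checkpoints above are ordinary seq2seq models on the Hugging Face Hub, they can presumably also be loaded
directly with `transformers`, bypassing the `Perturber` wrapper. A minimal sketch; the `word, attribute <PERT_SEP>
text` input format is an assumption based on Qian et al. (2022):

```python
# Minimal sketch of loading a specific checkpoint directly with transformers;
# the "word, attribute <PERT_SEP> text" input format is an assumption based
# on Qian et al. (2022).
from transformers import pipeline

perturb = pipeline("text2text-generation", model="fairnlp/perturber-base")
out = perturb("Jack, female <PERT_SEP> Jack was passionate about rock climbing.")
print(out[0]["generated_text"])
```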
# Roadmap
- [x] Add default perturber model
- [x] Pretrain small and medium perturber models
- [ ] Train model to identify target words and attributes
- [ ] Add training of unconditional perturber models (i.e. only get a sentence, no target word/attribute)
- [ ] Add self-training by pretraining perturber base model (e.g. BART) on self-perturbed data
Other features could include:
- [ ] Data cleaning of PANDA (remove non-target perturbations)
- [ ] Multilingual perturbation
# Read more
- [Original perturber paper](https://aclanthology.org/2022.emnlp-main.646/)