adept-augmentations

- Name: adept-augmentations
- Version: 0.1
- Home page: https://github.com/davidberenstein1957/adept-augmentations
- Summary: A Python library aimed at adeptly augmenting NLP training data.
- Upload time: 2023-05-08 09:10:03
- Author: david
- Requires Python: >=3.8,<3.12
- License: Apache
- Keywords: spacy, explainable AI, xai, nlu, visualization, datasets, nlproc, data-centricity, augmentation, data-augmentation
# Adept Augmentations

Welcome to Adept Augmentations, which can be used to create additional data in few-shot Named Entity Recognition (NER) settings!

Adept Augmentations is a Python package that provides data augmentation functionality for NER training data using the `spacy` and `datasets` packages. Currently, we support one augmenter, `EntitySwapAugmenter`; however, we plan on [adding more](#implemented-augmenters).

`EntitySwapAugmenter` takes either a `datasets.Dataset` or a `spacy.tokens.DocBin`. Optionally, you can provide a set of `labels` to be included in the augmentations. It first creates a knowledge base of entities belonging to each label. When running `augmenter.augment()` for `N` runs, it then creates `N` new sentences by randomly swapping the original entities with entities of the same label from the knowledge base.

For example, assume we have a knowledge base for PERSON, LOCATION and PRODUCT entities. We can then create additional data for the sentence "Momofuko Ando created instant noodles in Osaka." using `augmenter.augment(N=2)`, resulting in "David created instant noodles in Madrid." or "Tom created Adept Augmentations in the Netherlands."

Adept Augmentations works for NER labels using the IOB, IOB2, BIOES and BILUO tagging schemes, as well as labels not following any tagging scheme.
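The swapping procedure described above can be sketched in plain Python. This is a hypothetical illustration of the idea, not the library's actual implementation or API; the function names and the `(start, end, label)` span format are assumptions for the sketch.

```python
import random

def build_knowledge_base(examples):
    """Collect every annotated entity span, grouped by label."""
    kb = {}
    for tokens, entities in examples:
        for start, end, label in entities:
            kb.setdefault(label, []).append(tokens[start:end])
    return kb

def swap_entities(tokens, entities, kb, rng):
    """Rebuild a sentence, replacing each entity with a random
    same-label entity drawn from the knowledge base."""
    out, prev_end = [], 0
    for start, end, label in entities:
        out.extend(tokens[prev_end:start])   # copy tokens between entities
        out.extend(rng.choice(kb[label]))    # swap in a same-label entity
        prev_end = end
    out.extend(tokens[prev_end:])
    return out

examples = [
    (["Momofuko", "Ando", "created", "instant", "noodles", "in", "Osaka", "."],
     [(0, 2, "PER"), (6, 7, "LOC")]),
    (["David", "lives", "in", "Madrid", "."],
     [(0, 1, "PER"), (3, 4, "LOC")]),
]
kb = build_knowledge_base(examples)
print(swap_entities(*examples[0], kb, random.Random(0)))
```

Because swapped-in spans may differ in length from the originals, the output is rebuilt token by token rather than edited in place; entity labels are preserved, which is what keeps the gold annotations valid.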

## Usage

### Datasets

```python
from datasets import load_dataset

from adept_augmentations import EntitySwapAugmenter

dataset = load_dataset("conll2003", split="train[:3]")
augmenter = EntitySwapAugmenter(dataset)
aug_dataset = augmenter.augment(N=4)

for entry in aug_dataset["tokens"]:
    print(entry)

# ['EU', 'rejects', 'British', 'call', 'to', 'boycott', 'British', 'lamb', '.']
# ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'German', 'lamb', '.']
# ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
# ['Peter', 'Blackburn']
# ['BRUSSELS', '1996-08-22']
```

### spaCy

```python
import spacy
from spacy.tokens import DocBin

from adept_augmentations import EntitySwapAugmenter

nlp = spacy.load("en_core_web_sm")

# Create some example training data
TRAIN_DATA = [
    "Apple is looking at buying U.K. startup for $1 billion",
    "Microsoft acquires GitHub for $7.5 billion",
]
docs = nlp.pipe(TRAIN_DATA)

# Create a new DocBin
doc_bin = DocBin(docs=docs)

doc_bin = EntitySwapAugmenter(doc_bin).augment(4)
for doc in doc_bin.get_docs(nlp.vocab):
    print(doc.text)

# GitHub is looking at buying U.K. startup for $ 7.5 billion
# Microsoft is looking at buying U.K. startup for $ 1 billion
# Microsoft is looking at buying U.K. startup for $ 7.5 billion
# GitHub is looking at buying U.K. startup for $ 1 billion
# Microsoft acquires Apple for $ 7.5 billion
# Apple acquires Microsoft for $ 1 billion
# Microsoft acquires Microsoft for $ 7.5 billion
# GitHub acquires GitHub for $ 1 billion
```

## Potential performance gains
Data augmentation can significantly improve model performance in low-data scenarios.
To showcase this, we trained a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) NER model on
the 50, 100, 200, 400 and 800 first [CoNLL03](https://huggingface.co/datasets/conll2003) training samples.

The augmented dataset is generated like so:
```python
from datasets import concatenate_datasets, load_dataset

from adept_augmentations import EntitySwapAugmenter

# Load the gold CoNLL03 dataset
dataset = load_dataset("conll2003")

# Select N (50, 100, 200, 400 or 800) samples from the gold training dataset
train_dataset = dataset["train"].select(range(N))

# Generate an augmented dataset with 4 * N samples
augmented_dataset = EntitySwapAugmenter(train_dataset).augment(N=4)

# Combine the original with the augmented
# to produce a dataset 5 times as big as the original
train_dataset = concatenate_datasets([augmented_dataset, train_dataset])
```

Note that the baseline is trained for 5 epochs on the original data, while the augmented runs train for 1 epoch on the 5-times-larger dataset. This way, the number of training steps is identical between the two experiments. All scenarios are executed 5 times, and we report means and standard errors.
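The reported numbers are means with standard errors over the 5 repetitions. A quick sketch of that aggregation (the F1 values below are made up for illustration, not taken from the experiments):

```python
from statistics import mean, stdev

def mean_and_stderr(scores):
    """Mean and standard error (sample stdev / sqrt(n)) of repeated runs."""
    return mean(scores), stdev(scores) / len(scores) ** 0.5

f1_runs = [0.38, 0.41, 0.37, 0.40, 0.39]  # five hypothetical runs
m, se = mean_and_stderr(f1_runs)
print(f"{m:.3f} \u00b1 {se:.3f} F1")
```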

| Samples | Original - 5 Epochs | Augmented - 1 Epoch |
|---------|--|--|
| N=50  | 0.387 ± 0.042 F1 | **0.484 ± 0.054 F1** |
| N=100 | 0.585 ± 0.070 F1 | **0.663 ± 0.038 F1** |
| N=200 | 0.717 ± 0.053 F1 | **0.757 ± 0.025 F1** |
| N=400 | 0.816 ± 0.017 F1 | **0.826 ± 0.011 F1** |
| N=800 | 0.859 ± 0.004 F1 | **0.862 ± 0.002 F1** |

(Note: These results are not optimized and do not indicate maximum performances with SpanMarker.)
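Working out the relative improvement from the table above makes the trend explicit: the gains are largest in the lowest-data settings and shrink as more gold data becomes available.

```python
# F1 means from the table above (augmented vs. original baseline)
baseline = {50: 0.387, 100: 0.585, 200: 0.717, 400: 0.816, 800: 0.859}
augmented = {50: 0.484, 100: 0.663, 200: 0.757, 400: 0.826, 800: 0.862}

for n in baseline:
    gain = (augmented[n] - baseline[n]) / baseline[n] * 100
    print(f"N={n}: +{gain:.1f}% relative F1")
```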

From these results, it is clear that data augmentation using `adept_augmentations` can substantially improve performance in low-data settings, with the largest relative gains at the smallest training sizes.

## Implemented Augmenters

- [X] `EntitySwapAugmenter`
- [ ] `KnowledgeBaseSwapAugmenter`
- [ ] `CoreferenceSwapAugmenter`
- [ ] `SyntaticTreeSwapAugmenter`

## Potential integrations

Potentially, we can look into integrating other augmentation packages, including those that do not preserve gold-standard annotations. Good sources of inspiration are:

- <https://github.com/KennethEnevoldsen/augmenty>
  - <https://kennethenevoldsen.github.io/augmenty/tutorials/introduction.html>
- <https://github.com/QData/TextAttack>
- <https://github.com/infinitylogesh/mutate>

            
