vnerrant


- Name: vnerrant
- Version: 1.0.0
- Summary: The ERRor ANnotation Toolkit (ERRANT). Automatically extract and classify edits in parallel sentences.
- Upload time: 2024-02-13 07:41:21
- Requires Python: >= 3.9
- License: MIT
- Keywords: automatic annotation, grammatical errors, natural language processing

# VNERRANT v1.0.0

## Overview

The main aim of VNERRANT is to automatically annotate parallel English sentences with error type information. Specifically, given an original and corrected sentence pair, VNERRANT will extract the edits that transform the former to the latter and classify them according to a rule-based error type framework. This can be used to standardise parallel datasets or facilitate detailed error type evaluation. Annotated output files are in M2 format and an evaluation script is provided.

### Example

**Original**: This are gramamtical sentence .

**Corrected**: This is a grammatical sentence .

**Output M2**:

```text
S This are gramamtical sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
A 2 2|||M:DET|||a|||REQUIRED|||-NONE-|||0
A 2 3|||R:SPELL|||grammatical|||REQUIRED|||-NONE-|||0
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||1
```
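To make the layout of the annotation lines concrete, here is a minimal sketch of a parser for a single `A` line. It is not part of VNERRANT; the names `M2Edit` and `parse_m2_annotation` are hypothetical, and the sketch assumes the six `|||`-separated fields shown in the example above (token span, error type, correction, two constant fields, annotator id).

```python
from typing import NamedTuple


class M2Edit(NamedTuple):
    """One annotation line of an M2 file (hypothetical helper type)."""
    start: int          # token offset where the edit starts in the original
    end: int            # token offset where the edit ends in the original
    error_type: str     # e.g. "R:VERB:SVA"
    correction: str     # replacement string, "-NONE-" for a noop
    annotator_id: int   # last field of the line


def parse_m2_annotation(line: str) -> M2Edit:
    """Parse a line such as 'A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0'."""
    if not line.startswith("A "):
        raise ValueError(f"not an M2 annotation line: {line!r}")
    # Assumption: exactly six fields, as in the example output above.
    span, error_type, correction, _required, _comment, annotator = line[2:].split("|||")
    start, end = (int(offset) for offset in span.split())
    return M2Edit(start, end, error_type, correction, int(annotator))


edit = parse_m2_annotation("A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0")
print(edit.start, edit.end, edit.error_type, edit.correction)
```

The span `2 2` in the `M:DET` line is empty, which is how an insertion before token 2 is expressed.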

## Installation

### Pip Install

Optionally, first create and activate a dedicated Python 3.9 environment, for example with conda:

```bash
conda create -n vnerrant python=3.9
conda activate vnerrant
```

You have two options for installing VNERRANT:

- Option 1: Install VNERRANT using pip with the following commands:

```bash
pip install -U pip setuptools wheel
pip install vnerrant
```

- Option 2: Alternatively, to install VNERRANT from source, follow these steps:

```bash
git clone https://gitlab.testsprep.online/nlp/research/errant
cd errant  # git clone names the directory after the repository
pip install -U pip setuptools wheel
pip install -e .
```

Then download a spaCy model with the following command:

```bash
python -m spacy download en_core_web_sm
```

The available English models are listed on the [spaCy models page](https://spacy.io/models/en).

## Usage

### CLI

VNERRANT provides two main commands: `convert` and `evaluate`. You can run them from anywhere on the command line without invoking a specific Python script.

1. `vnerrant convert parallel-to-m2`

This is the main annotation command that takes an original text file and at least one parallel corrected text file as input, and outputs an annotated M2 file. By default, it is assumed that the original and corrected text files are word tokenised with one sentence per line.
Example:

```cli
vnerrant convert parallel-to-m2 -o <orig_file> -c <cor_file1> [<cor_file2> ...] -out <out_m2>
```

2. `vnerrant convert m2-to-m2`

This is a variant of `parallel-to-m2` that operates on an M2 file instead of parallel text files. This makes it easier to reprocess existing M2 files. You must also specify whether you want to use gold or auto edits; i.e. `-gold` will only classify the existing edits, while `-auto` will extract and classify automatic edits. In both settings, uncorrected edits and noops are preserved.
Example:

```cli
vnerrant convert m2-to-m2 -i <in_m2> -o <out_m2> {-auto|-gold}
```

3. `vnerrant evaluate m2`

This is the evaluation command that compares a hypothesis M2 file against a reference M2 file. The default behaviour evaluates the hypothesis overall in terms of span-based correction. The `-cat {1,2,3}` flag can be used to evaluate error types at increasing levels of granularity, while the `-ds` or `-dt` flag can be used to evaluate in terms of span-based or token-based detection (i.e. ignoring the correction). All scores are presented in terms of Precision, Recall and F-score (default: F0.5), and counts for True Positives (TP), False Positives (FP) and False Negatives (FN) are also shown.
Examples:

```cli
vnerrant evaluate m2 -hyp <hyp_m2> -ref <ref_m2>
vnerrant evaluate m2 -hyp <hyp_m2> -ref <ref_m2> -cat {1,2,3}
vnerrant evaluate m2 -hyp <hyp_m2> -ref <ref_m2> -ds
vnerrant evaluate m2 -hyp <hyp_m2> -ref <ref_m2> -ds -cat {1,2,3}
```

All these scripts also have additional advanced command line options which can be displayed using the `-h` flag.
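
As a rough illustration of how the reported scores relate to the TP, FP and FN counts, the sketch below computes Precision, Recall and F-beta with beta = 0.5. It is not VNERRANT's evaluation code; the function name and the handling of empty denominators (scoring 1.0 when nothing was predicted or nothing was expected) are assumptions.

```python
def precision_recall_fscore(tp: int, fp: int, fn: int, beta: float = 0.5):
    """Return (precision, recall, F-beta) from raw edit counts."""
    # Assumption: an empty denominator is scored as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-beta weights precision more heavily than recall when beta < 1.
    b2 = beta * beta
    fscore = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fscore


p, r, f = precision_recall_fscore(tp=2, fp=1, fn=2)
print(f"P={p:.4f} R={r:.4f} F0.5={f:.4f}")
```

With beta = 0.5, a system that proposes few but correct edits scores higher than one that proposes many edits with the same number of hits.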

### API

VNERRANT also provides a Python API.

### Quick Start

```python
import vnerrant

annotator = vnerrant.load('en')

orig = 'My    name    is   the     John'
cor = 'My name is John'
edits = annotator.annotate_with_pre_and_post_processing(orig, cor)

for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)
    print(e.o_toks.start_char, e.o_toks.end_char)
    # assert e.o_str == orig[e.o_toks.start_char:e.o_toks.end_char]
```

### Loading

`vnerrant.load(lang, model_name)`

Instantiate a VNERRANT Annotator object. The `lang` parameter currently accepts only `'en'` (English); support for other languages may be added in future releases. `model_name` is the name of the spaCy model to use. If you have already loaded spaCy, you can optionally pass it as the `nlp` parameter so that VNERRANT does not load it a second time.

### Annotator Objects

An Annotator object is the main interface for VNERRANT.

#### Methods

<details>
<summary>annotator.parse</summary>

`annotator.parse(string, tokenise=False)`

Lemmatise, POS tag, and parse a text string with spaCy. Set `tokenise` to True to also word-tokenise the string with spaCy. Returns a spaCy Doc object.

</details>

<details>
<summary>annotator.align</summary>

`annotator.align(orig, cor, lev=False)`

Align spacy-parsed original and corrected text. The default uses a linguistically-enhanced Damerau-Levenshtein alignment, but the `lev` flag can be used for a standard Levenshtein alignment. Returns an Alignment object.

</details>

<details>
<summary>annotator.merge</summary>

`annotator.merge(alignment, merging='rules')`

Extract edits from the optimum alignment in an Alignment object. Four different merging strategies are available:

1. rules: Use a rule-based merging strategy (default)
2. all-split: Merge nothing: MSSDI -> M, S, S, D, I
3. all-merge: Merge adjacent non-matches: MSSDI -> M, SSDI
4. all-equal: Merge adjacent same-type non-matches: MSSDI -> M, SS, D, I

Returns a list of Edit objects.
</details>
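
The three non-rule merging strategies can be illustrated on a plain string of alignment operations. The sketch below is not the library's implementation; it only reproduces the groupings listed for `annotator.merge`, assuming `M` marks a match and every other letter marks a non-match. The function names are hypothetical.

```python
from itertools import groupby


def all_split(ops: str) -> list[str]:
    """Merge nothing: every operation is its own group."""
    return list(ops)


def all_merge(ops: str) -> list[str]:
    """Merge every run of adjacent non-matches into one group."""
    groups = []
    for is_match, run in groupby(ops, key=lambda op: op == "M"):
        run = list(run)
        if is_match:
            groups.extend(run)            # matches are never merged
        else:
            groups.append("".join(run))   # e.g. S, S, D, I -> "SSDI"
    return groups


def all_equal(ops: str) -> list[str]:
    """Merge adjacent non-matches only when they are the same operation."""
    groups = []
    for op, run in groupby(ops):
        run = list(run)
        if op == "M":
            groups.extend(run)
        else:
            groups.append("".join(run))   # e.g. S, S -> "SS"
    return groups


for strategy in (all_split, all_merge, all_equal):
    print(strategy.__name__, strategy("MSSDI"))
```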

<details>
<summary>annotator.classify</summary>

`annotator.classify(edit)`

Classify an edit. Sets the `edit.type` attribute in an Edit object and returns the same Edit object.

</details>

<details>
<summary>annotator.annotate</summary>

`annotator.annotate(orig, cor, lev=False, merging='rules')`

Run the full annotation pipeline to align two sequences and extract and classify the edits. Equivalent to running `annotator.align`, `annotator.merge` and `annotator.classify` in sequence. Returns a list of Edit objects.

```python
import vnerrant

annotator = vnerrant.load(lang="en", model_name="en_core_web_sm")
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')

# Step-by-step equivalent of annotator.annotate(orig, cor):
alignment = annotator.align(orig, cor)
edits = annotator.merge(alignment)
for e in edits:
    e = annotator.classify(e)  # sets e.type and returns the same Edit object
```

</details>

<details>
<summary>annotator.import_edit</summary>

`annotator.import_edit(orig, cor, edit, min=True, old_cat=False)`

Load an Edit object from a list. `orig` and `cor` must be spacy-parsed Doc objects and the edit must be of the form: `[o_start, o_end, c_start, c_end(, type)]`. The values must be integers that correspond to the token start and end offsets in the original and corrected Doc objects. The `type` value is an optional string that denotes the error type of the edit (if known). Set `min` to True to minimise the edit (e.g. [a b -> a c] = [b -> c]) and `old_cat` to True to preserve the old error type category (i.e. turn off the classifier).

```python
import vnerrant

annotator = vnerrant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edit = [1, 2, 1, 2, 'SVA'] # are -> is
edit = annotator.import_edit(orig, cor, edit)
print(edit.to_m2())
```

</details>
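
The minimisation performed when `min` is True (`[a b -> a c] = [b -> c]`) can be pictured as trimming tokens that the two sides share at the start and end of the span. The following is a minimal sketch on plain token lists, assuming that is all minimisation does; `minimise_edit` is a hypothetical helper, not a VNERRANT function.

```python
def minimise_edit(orig_toks, cor_toks, o_start, o_end, c_start, c_end):
    """Shrink an edit span by dropping identical leading and trailing tokens."""
    # Drop shared tokens from the front of both spans.
    while o_start < o_end and c_start < c_end and orig_toks[o_start] == cor_toks[c_start]:
        o_start += 1
        c_start += 1
    # Drop shared tokens from the back of both spans.
    while o_start < o_end and c_start < c_end and orig_toks[o_end - 1] == cor_toks[c_end - 1]:
        o_end -= 1
        c_end -= 1
    return o_start, o_end, c_start, c_end


# [a b -> a c] becomes [b -> c]
print(minimise_edit(["a", "b"], ["a", "c"], 0, 2, 0, 2))
```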

### Alignment Objects

An Alignment object is created from two spacy-parsed text sequences.

#### Attributes

`alignment`.**orig**
`alignment`.**cor**
The spacy-parsed original and corrected text sequences.

`alignment`.**cost_matrix**
`alignment`.**op_matrix**
The cost matrix and operation matrix produced by the alignment.

`alignment`.**align_seq**
The first cheapest alignment between the two sequences.
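
To show what a cost matrix, an operation matrix and an alignment sequence look like, here is a minimal sketch of a standard token-level Levenshtein alignment. This is the plain algorithm, not the linguistically-enhanced Damerau-Levenshtein alignment that VNERRANT uses by default, and it works on plain token lists rather than spaCy Docs. The function name and the tie-breaking order (substitution, then deletion, then insertion) are assumptions.

```python
def levenshtein_align(orig, cor):
    """Return (cost_matrix, op_matrix, align_seq) for two token lists."""
    n, m = len(orig), len(cor)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    op = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: delete everything
        cost[i][0], op[i][0] = i, "D"
    for j in range(1, m + 1):          # first row: insert everything
        cost[0][j], op[0][j] = j, "I"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if orig[i - 1] == cor[j - 1]:
                cost[i][j], op[i][j] = cost[i - 1][j - 1], "M"
            else:
                candidates = [
                    (cost[i - 1][j - 1] + 1, "S"),  # substitute
                    (cost[i - 1][j] + 1, "D"),      # delete from original
                    (cost[i][j - 1] + 1, "I"),      # insert from corrected
                ]
                cost[i][j], op[i][j] = min(candidates, key=lambda c: c[0])
    # Backtrace from the bottom-right cell to recover one cheapest alignment.
    seq, i, j = [], n, m
    while i > 0 or j > 0:
        step = op[i][j]
        seq.append(step)
        if step in ("M", "S"):
            i, j = i - 1, j - 1
        elif step == "D":
            i -= 1
        else:
            j -= 1
    seq.reverse()
    return cost, op, seq


orig = "This are gramamtical sentence .".split()
cor = "This is a grammatical sentence .".split()
cost, op, seq = levenshtein_align(orig, cor)
print(cost[-1][-1], "".join(seq))
```

Several alignments can share the same minimal cost; the backtrace returns only one of them, which is the sense in which an alignment sequence is the "first" cheapest one.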

### Edit Objects

An Edit object represents a transformation between two text sequences.

#### Attributes

`edit`.**o_start**
`edit`.**o_end**
`edit`.**o_toks**
`edit`.**o_str**
The start and end offsets, the spacy tokens, and the string for the edit in the *original* text.

`edit`.**c_start**
`edit`.**c_end**
`edit`.**c_toks**
`edit`.**c_str**
The start and end offsets, the spacy tokens, and the string for the edit in the *corrected* text.

`edit`.**type**
The error type string.

#### Method

`edit`.**to_m2**(id=0)
Format the edit for an output M2 file. `id` is the annotator id.
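
For readers who want to see how the Edit attributes map onto an M2 annotation line, the sketch below rebuilds the line format used in the example output at the top of this README. It is an illustration only, not the body of `to_m2`; the function name is hypothetical, and the constant `REQUIRED` and `-NONE-` fields are copied from that example.

```python
def format_m2_edit(o_start, o_end, error_type, c_str, annotator_id=0):
    """Build an M2 annotation line from edit offsets, type and correction."""
    fields = [
        f"A {o_start} {o_end}",  # token span in the original sentence
        error_type,              # e.g. "R:VERB:SVA"
        c_str,                   # correction string
        "REQUIRED",              # constant field in the example output
        "-NONE-",                # constant field in the example output
        str(annotator_id),       # annotator id
    ]
    return "|||".join(fields)


print(format_m2_edit(1, 2, "R:VERB:SVA", "is"))
```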

            

        