ua-gec

Name	ua-gec
Version	2.1.2
home_page	https://github.com/grammarly/ua-gec
Summary	UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian language
upload_time	2024-02-02 11:43:14
author	Oleksiy Syvokon
requires_python	>=3.6
license	License :: OSI Approved :: CC-BY-4.0
keywords	gec ukrainian dataset corpus grammatical error correction grammarly

[Ukrainian version (Українською)](./README_ua.md)

# UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

This repository contains UA-GEC data and an accompanying Python library.

## What's new

* **November 2022**: Version 2.0 released, featuring more data and detailed annotations.
* **January 2021**: Initial release.

See [CHANGELOG.md](./CHANGELOG.md) for detailed updates.


## Data

All corpus data and metadata are stored under `./data`. It has two subfolders,
one for each of the [gec-fluency and gec-only corpus versions](#annotation-format).

Each corpus version contains two subfolders, one per [train and test split](#train-test-split), and each split comes in several data
representations:

`./data/{gec-fluency,gec-only}/{train,test}/annotated` stores documents in the [annotated format](#annotation-format).

`./data/{gec-fluency,gec-only}/{train,test}/source` and `./data/{gec-fluency,gec-only}/{train,test}/target` store the
original and the corrected versions of documents. Text files in these
directories are plain text with no annotation markup. They were generated
from the annotated data, so they are redundant, but we keep them because
plain text is more convenient for some use cases.
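
The layout above can be turned into paths programmatically. The following is a minimal sketch, not part of the `ua_gec` package: the helper names `corpus_dir` and `list_documents` are made up for illustration, and no file extension is assumed because the text above does not specify one.

```python
from pathlib import Path

# Directory names taken from the layout described above.
LAYERS = ("gec-fluency", "gec-only")
SPLITS = ("train", "test")
VIEWS = ("annotated", "source", "target")


def corpus_dir(data_root, layer, split, view):
    """Return the directory for one layer/split/representation combination."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if view not in VIEWS:
        raise ValueError(f"unknown representation: {view!r}")
    return Path(data_root) / layer / split / view


def list_documents(data_root, layer, split, view):
    """List regular files in that directory, sorted by name.

    Hypothetical helper: it makes no assumption about file extensions.
    """
    directory = corpus_dir(data_root, layer, split, view)
    return sorted(p for p in directory.iterdir() if p.is_file())
```

For example, `corpus_dir("./data", "gec-only", "train", "source")` points at the plain-text originals of the GEC-only training split.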


## Metadata

`./data/metadata.csv` stores per-document metadata. It's a CSV file with
the following fields:

- `id` (str): document identifier;
- `author_id` (str): document author identifier;
- `is_native` (int): 1 if the author is a native speaker, 0 otherwise;
- `region` (str): the author's region of birth. A special value "Інше"
  is used both for authors who were born outside Ukraine and authors
  who preferred not to specify their region.
- `gender` (str): could be "Жіноча" (female), "Чоловіча" (male), or "Інша" (other);
- `occupation` (str): one of "Технічна", "Гуманітарна", "Природнича", "Інша";
- `submission_type` (str): one of "essay", "translation", or "text\_donation";
- `source_language` (str): for submissions of the "translation" type, this field
    indicates the source language of the translated text. Possible values are
    "de", "en", "fr", "ru", and "pl";
- `annotator_id` (int): ID of the annotator who corrected the document;
- `partition` (str): one of "test" or "train";
- `is_sensitive` (int): 1 if the document contains profanity or offensive language.
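
As a rough illustration of how these fields might be consumed, here is a sketch that reads the file with the standard `csv` module. The helpers `load_metadata` and `select` are hypothetical names, and the sketch assumes the header row uses exactly the field names listed above; empty integer cells are mapped to `None` rather than guessed.

```python
import csv

# Fields documented above as integers.
INT_FIELDS = ("is_native", "annotator_id", "is_sensitive")


def load_metadata(path):
    """Read metadata.csv into a list of dicts, converting the integer fields."""
    rows = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            for field in INT_FIELDS:
                value = row.get(field)
                # Leave missing or empty cells as None instead of failing.
                row[field] = int(value) if value else None
            rows.append(row)
    return rows


def select(rows, **conditions):
    """Keep the rows whose fields equal all the given values."""
    return [r for r in rows if all(r.get(k) == v for k, v in conditions.items())]
```

A call such as `select(load_metadata("./data/metadata.csv"), partition="train", is_native=1)` would then return the training documents written by native speakers.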

## Annotation format

Annotated files are text files that use the following in-text annotation format:
`{error=>edit:::error_type=Tag}`, where `error` and `edit` are the text before
and after correction, respectively, and `Tag` denotes the error category. For Grammar- and Fluency-related errors, `Tag` also includes a subcategory (for example, `G/Number`).

Example of an annotated sentence:
```
    I {likes=>like:::error_type=G/Number} turtles.
```
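
To make the format concrete, the sketch below parses it with a regular expression and recovers the original and corrected text. It is an illustration only, not the `ua_gec` parser: it assumes annotations are never nested, that `error` and `edit` contain no braces, and that `error_type` is the only metadata key. For real work, prefer the `AnnotatedText` class described later.

```python
import re

# Matches {error=>edit:::error_type=Tag}; `error` may be empty (an insertion)
# and `edit` may be empty (a deletion).
ANNOTATION_RE = re.compile(
    r"\{(?P<error>[^{}]*?)=>(?P<edit>[^{}]*?):::error_type=(?P<tag>[^{}]*)\}"
)


def parse(annotated):
    """Return (source, target, annotations) for one annotated string."""
    annotations = [m.groupdict() for m in ANNOTATION_RE.finditer(annotated)]
    # Keeping the `error` side yields the original text,
    # keeping the `edit` side yields the corrected text.
    source = ANNOTATION_RE.sub(lambda m: m.group("error"), annotated)
    target = ANNOTATION_RE.sub(lambda m: m.group("edit"), annotated)
    return source, target, annotations


source, target, annotations = parse("I {likes=>like:::error_type=G/Number} turtles.")
print(source)       # I likes turtles.
print(target)       # I like turtles.
print(annotations)  # [{'error': 'likes', 'edit': 'like', 'tag': 'G/Number'}]
```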

Below is the list of error types present in the corpus:
- `Spelling`: spelling errors;
- `Punctuation`: punctuation errors.

Grammar-related errors:
- `G/Case`: incorrect usage of case of any notional part of speech;
- `G/Gender`: incorrect usage of gender of any notional part of speech;
- `G/Number`: incorrect usage of number of any notional part of speech;
- `G/Aspect`: incorrect usage of verb aspect;
- `G/Tense`: incorrect usage of verb tense;
- `G/VerbVoice`: incorrect usage of verb voice;
- `G/PartVoice`:  incorrect usage of participle voice;
- `G/VerbAForm`:  incorrect usage of an analytical verb form;
- `G/Prep`: incorrect preposition usage;
- `G/Participle`: incorrect usage of participles;
- `G/UngrammaticalStructure`: digression from syntactic norms;
- `G/Comparison`: incorrect formation of comparison degrees of adjectives and adverbs;
- `G/Conjunction`: incorrect usage of conjunctions;
- `G/Other`: other grammatical errors.

Fluency-related errors:
- `F/Style`: style errors;
- `F/Calque`: word-for-word translation from other languages;
- `F/Collocation`: unnatural collocations;
- `F/PoorFlow`: unnatural sentence flow;
- `F/Repetition`: repetition of words;
- `F/Other`: other fluency errors.
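
Since the tags follow a `Category/Subcategory` pattern, they are easy to split in code. The helper below is a hypothetical sketch (the name `split_tag` is not part of `ua_gec`); it expands the `G` and `F` prefixes to the Grammar and Fluency groups listed above and treats any other tag as a category without a subcategory.

```python
# Prefixes used by the two grouped categories in the lists above.
PREFIXES = {"G": "Grammar", "F": "Fluency"}


def split_tag(tag):
    """Split an error tag into (category, subcategory or None)."""
    head, sep, tail = tag.partition("/")
    if sep and head in PREFIXES:
        return PREFIXES[head], tail
    # Spelling and Punctuation have no subcategory.
    return tag, None


print(split_tag("G/Number"))   # ('Grammar', 'Number')
print(split_tag("F/Calque"))   # ('Fluency', 'Calque')
print(split_tag("Spelling"))   # ('Spelling', None)
```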


An accompanying Python package, `ua_gec`, provides many tools for working with
annotated texts. See its documentation for details.


## Train-test split

We expect users of the corpus to train and tune their models on the __train__ split
only. Feel free to further split it into train-dev (or use cross-validation).

Please use the __test__ split only for reporting scores of your final model.
In particular, never optimize on the test set. Do not tune hyperparameters on
it. Do not use it for model selection in any way.
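
One possible way to carve a dev set out of the train split is sketched below. The function name, the 10% fraction, and the seed are arbitrary choices for illustration. The split is made over unique document ids so that double-annotated documents, which share an id, always land on the same side.

```python
import random


def train_dev_split(doc_ids, dev_fraction=0.1, seed=13):
    """Deterministically split train document ids into (train, dev) lists."""
    unique_ids = sorted(set(doc_ids))
    shuffled = unique_ids[:]
    # A private Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    dev = set(shuffled[:n_dev])
    train = [i for i in unique_ids if i not in dev]
    return train, sorted(dev)
```

Only ids from the train split should be passed in; the test split stays untouched until the final evaluation.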

The next section lists the per-split statistics.


## Statistics

UA-GEC contains:

### GEC+Fluency

| Split     | Documents | Sentences |  Tokens | Authors | Errors |
|:---------:|----------:|----------:|--------:|--------:|-------:|
| train     |     1,706 |    31,038 | 457,017 |     752 | 38,213 |
| test      |       166 |     2,697 |  43,601 |      76 |  7,858 |
| **TOTAL** |     1,872 |    33,735 | 500,618 |     828 | 46,071 |

See [stats.gec-fluency.txt](./stats.gec-fluency.txt) for detailed statistics.


### GEC-only

| Split     | Documents | Sentences |  Tokens | Authors | Errors |
|:---------:|----------:|----------:|--------:|--------:|-------:|
| train     |     1,706 |    31,046 | 457,004 |     752 | 30,049 |
| test      |       166 |     2,704 |  43,605 |      76 |  6,169 |
| **TOTAL** |     1,872 |    33,750 | 500,609 |     828 | 36,218 |

See [stats.gec-only.txt](./stats.gec-only.txt) for detailed statistics.


## Python library

Instead of working with the data files directly, you can use the `ua_gec`
Python package. It bundles the data and provides classes for iterating over
documents, reading metadata, working with annotations, and so on.

### Getting started

Install the package with `pip`:

```
    $ pip install ua_gec
```

Alternatively, you can install it from the source code:

```
    $ cd python
    $ python setup.py develop
```


### Iterating through the corpus

Once the package is installed, you can access annotated documents from Python code:

```python
    
    >>> from ua_gec import Corpus
    >>> corpus = Corpus(partition="train", annotation_layer="gec-only")
    >>> for doc in corpus:
    ...     print(doc.source)         # "I likes it."
    ...     print(doc.target)         # "I like it."
    ...     print(doc.annotated)      # <AnnotatedText("I {likes=>like} it.")>
    ...     print(doc.meta.region)    # "Київська"
```

Note that the `doc.annotated` property is of type `AnnotatedText`. This
class is described in the [next section](#working-with-annotations).
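
As a small example of combining iteration with metadata, the sketch below counts documents per region. The helper `documents_per_region` is a hypothetical name; it only relies on the `doc.meta.region` attribute shown above, so it accepts any iterable of document-like objects.

```python
from collections import Counter


def documents_per_region(docs):
    """Count documents by the author's region of birth."""
    return Counter(doc.meta.region for doc in docs)


# Intended usage (requires the ua_gec package to be installed):
#
#     from ua_gec import Corpus
#     corpus = Corpus(partition="train", annotation_layer="gec-only")
#     for region, n_docs in documents_per_region(corpus).most_common():
#         print(region, n_docs)
```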


### Working with annotations

`ua_gec.AnnotatedText` is a class that provides tools for processing
annotated texts. It can iterate over annotations, get annotation error
type, remove some of the annotations, and more.

Here is an example to get you started. It will remove all F/Style annotations from a text:

```python
    >>> from ua_gec import AnnotatedText
    >>> text = AnnotatedText("I {likes=>like:::error_type=G/Number} it.")
    >>> for ann in text.iter_annotations():
    ...     print(ann.source_text)       # likes
    ...     print(ann.top_suggestion)    # like
    ...     print(ann.meta)              # {'error_type': 'G/Number'}
    ...     if ann.meta["error_type"] == "F/Style":
    ...         text.remove(ann)         # or `text.apply(ann)`
```
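
Building on the same calls, the following sketch tallies error types across several annotated texts. The function `error_type_counts` is a hypothetical helper; it uses only `iter_annotations()` and `ann.meta["error_type"]` from the example above and assumes every annotation carries an `error_type` key (missing keys are counted as `"unknown"`).

```python
from collections import Counter


def error_type_counts(texts):
    """Count annotations by error type across AnnotatedText-like objects."""
    counts = Counter()
    for text in texts:
        for ann in text.iter_annotations():
            counts[ann.meta.get("error_type", "unknown")] += 1
    return counts


# Intended usage (requires the ua_gec package to be installed):
#
#     from ua_gec import Corpus
#     corpus = Corpus(partition="train", annotation_layer="gec-fluency")
#     print(error_type_counts(doc.annotated for doc in corpus).most_common(5))
```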


## Multiple annotators

Some documents are annotated by multiple annotators. Such documents
share `doc_id` but differ in `doc.meta.annotator_id`.

Currently, test sets for gec-fluency and gec-only are annotated by two annotators.
The train sets contain 45 double-annotated docs.
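
To find such documents, the annotator ids can be grouped by document id. The sketch below works on metadata rows given as dicts with `id` and `annotator_id` keys; it assumes the metadata contains one row per (document, annotator) pair, which is not stated explicitly above, and `multi_annotated` is a hypothetical helper name.

```python
from collections import defaultdict


def multi_annotated(rows):
    """Map each document id to its annotator ids, keeping ids with more than one."""
    annotators = defaultdict(set)
    for row in rows:
        annotators[row["id"]].add(row["annotator_id"])
    return {
        doc_id: sorted(ids)
        for doc_id, ids in annotators.items()
        if len(ids) > 1
    }


rows = [
    {"id": "0001", "annotator_id": 1},
    {"id": "0001", "annotator_id": 2},
    {"id": "0002", "annotator_id": 1},
]
print(multi_annotated(rows))  # {'0001': [1, 2]}
```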


## Contributing

* Data and code improvements are welcome. Please submit a pull request.


## Citation

The [accompanying paper](https://arxiv.org/abs/2103.16997) is:

```
@misc{syvokon2021uagec,
      title={UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language},
      author={Oleksiy Syvokon and Olena Nahorna},
      year={2021},
      eprint={2103.16997},
      archivePrefix={arXiv},
      primaryClass={cs.CL}}
```


## Contacts

* nastasiya.osidach@grammarly.com
* olena.nahorna@grammarly.com
* oleksiy.syvokon@gmail.com
* pavlo.kuchmiichuk@gmail.com
