| Field | Value |
| --- | --- |
| Name | prodigy-iaa |
| Version | 0.1.1 |
| Author | Peter Baumgartner |
| Requires Python | >=3.8,<4.0 |
| Upload time | 2023-05-08 14:33:30 |
# ✨ Prodigy - Inter-Annotator Agreement Recipes 🤝
These recipes calculate [Inter-Annotator Agreement](https://en.wikipedia.org/wiki/Inter-rater_reliability) (aka Inter-Rater Reliability) measures for use with [Prodigy](https://prodi.gy/). The measures include Percent (Simple) Agreement, Krippendorff's `Alpha`, and Gwet's `AC2`. All calculations were derived from the equations in [this paper](https://agreestat.com/papers/onkrippendorffalpha_rev10052015.pdf)[^1], and the package includes tests that reproduce the values reported for the datasets referenced in that paper.
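For orientation, Krippendorff's `Alpha` is a disagreement-based chance-corrected coefficient, while Gwet's `AC2` is an agreement-based one; in their general form (the exact weighting and missing-data handling follow the referenced paper):

$$
\alpha = 1 - \frac{D_o}{D_e}
\qquad
\mathrm{AC}_2 = \frac{p_a - p_e}{1 - p_e}
$$

where $D_o$ and $D_e$ are the observed and chance-expected disagreement, and $p_a$ and $p_e$ are the (weighted) observed and chance agreement. Percent (Simple) Agreement applies no chance correction.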
Currently this package supports IAA metrics for binary classification, multiclass classification, and multilabel (binary per label) classification. Span-based IAA measures for NER and Span Categorization will be integrated in the future.
Note that you can also use the measures included here without interfacing directly with Prodigy; see the section on [other use cases](#other-use-cases--use-outside-prodigy).
**Install**
```
pip install prodigy-iaa
```
For development:
```
pip install git+https://github.com/pmbaumgartner/prodigy-iaa
```
This package uses [entry points](https://prodi.gy/docs/install#entry-points) so you should just be able to install and run the commands below.
## Recipes
Which recipe to use depends on the source data structure:
- `iaa.datasets` calculates measures assuming you have multiple datasets in Prodigy, one dataset per annotator.
- `iaa.sessions` calculates measures assuming you have multiple annotators, typically identified by `_session_id`, in a single dataset.
- `iaa.jsonl` operates the same as `iaa.sessions`, but on a file exported to JSONL with `prodigy db-out`.
ℹ️ **Get details on each recipe's arguments with `prodigy <recipe> --help`**
## Example
In this toy example, the command calculates agreement on dataset `my-dataset`, which is a `multiclass` problem -- meaning its data was generated with the `choice` interface using exclusive choices, with the selections stored under the "accept" key. There are 5 total examples, 4 of which have co-incident annotations (i.e. any overlap), and 3 unique annotators.
```
$ prodigy iaa.sessions my-dataset multiclass
ℹ Annotation Statistics
Attribute Value
---------------------------- -----
Examples 5
Categories 3
Co-Incident Examples* 4
Single Annotation Examples 1
Annotators 3
Avg. Annotations per Example 2.60
* (>1 annotation)
ℹ Agreement Statistics
Statistic Value
-------------------------- ------
Percent (Simple) Agreement 0.4167
Krippendorff's Alpha 0.1809
Gwet's AC2 0.1640
```
## Validations & Practical Use
All recipes depend on examples being hashed uniquely and stored under `_task_hash` on the example. There are other validations involved as well:
- Checks if `view_id` is the same for all examples
- Checks if `label` is the same for all examples
- Checks that each annotator has not double-annotated the same `_task_hash`
**If any validations fail, or your data is unique in some way, `iaa.jsonl` is the recipe you want.** Export your data, identify any issues and remedy them, and then calculate your measures on the cleaned exported data.
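For example, a typical workflow with `iaa.jsonl` might look like the sketch below (the `multiclass` argument mirrors the example in the previous section; the exact arguments are listed by `prodigy iaa.jsonl --help`):

```
# Export the dataset, fix any problem examples by hand,
# then compute agreement on the cleaned file.
prodigy db-out my-dataset > my-dataset.jsonl
prodigy iaa.jsonl my-dataset.jsonl multiclass
```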
## Theory
No single measure gives a reasonable measurement of agreement across all datasets; the measures are often conditional on qualities of the data. The metrics included in these recipes have nice properties that make them flexible across annotation situations: they can handle missing values (i.e. incomplete overlap), scale to any number of annotators, scale to any number of categories, and can be customized with your own weighting functions. In addition, the choice of metrics available within this package follows the recommendations in the literature[^2][^3], plus theoretical analysis[^4] demonstrating when certain metrics might be most useful.
Table 13 in [this paper](https://scholar.google.com/scholar?cluster=17269958574032994585&hl=en&as_sdt=0,34&as_vis=1)[^4] highlights systematic issues with each metric. They are as follows:
- **When there is _low agreement_**: Percent (Simple) Agreement can produce high scores.
 - Imagine a binary classification problem with a very low base rate. Annotators can often agree on the negative case, but rarely agree on the positive (a toy numeric sketch of this follows the list).
- **When there are _highly uneven sizes of categories_**: `AC2` can produce low scores, `Alpha` can produce high scores.
- **When there are _N < 20_ co-incident annotated examples**: `Alpha` can produce high scores.
- You probably shouldn't trust _N < 100_ generally.
- **When there are _3 or more categories_**: `AC2` can produce high scores.
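As a toy numeric illustration of the low-base-rate point above (plain Python, not tied to this package's API):

```
# Two annotators, 100 examples, rare positive class: they agree on every
# negative example but never on a positive one, yet simple percent
# agreement still looks very high.
a = ["neg"] * 96 + ["pos"] * 4
b = ["neg"] * 100

percent_agreement = sum(x == y for x, y in zip(a, b)) / len(a)
print(percent_agreement)  # 0.96, despite zero agreement on the positive class
# A chance-corrected measure such as Krippendorff's Alpha is near zero here.
```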
**Summary**: Use simple agreement and `Alpha`. If simple agreement is high and `Alpha` is low, verify with `AC2`[^3]. In general these numbers correlate; if you're getting contradictory or unclear information, increase the number of examples and explore your data.
## Other Use-Cases / Use Outside Prodigy
If you want to calculate these measures in a custom script on your own data, you can use `from prodigy_iaa.measures import calculate_agreement`. See tests in `tests/test_measures.py` for an example. The docstrings for each function should indicate the expected data structures.
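A minimal sketch of such a script is below. Note that the input structure and the exact signature of `calculate_agreement` are assumptions here; check the docstrings and `tests/test_measures.py` for the real expected format.

```
from prodigy_iaa.measures import calculate_agreement

# Hypothetical input structure: one dict per example, mapping annotator id
# to the chosen label; annotators who skipped an example are simply absent.
examples = [
    {"alice": "POS", "bob": "POS", "carol": "NEG"},
    {"alice": "NEG", "bob": "NEG"},
    {"alice": "POS", "carol": "POS"},
]

# Signature assumed -- consult the docstring for the actual arguments
# and return value.
scores = calculate_agreement(examples)
print(scores)  # e.g. percent agreement, Krippendorff's Alpha, Gwet's AC2
```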
You could also use this, for example, to print out some nice output during an `update` callback and get annotation statistics as each user submits examples.
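A hedged sketch of that idea using a custom Prodigy recipe's `update` callback (the recipe name and stream handling are illustrative, and the agreement call itself is left as a comment since its exact signature isn't shown here):

```
import prodigy
from prodigy.components.loaders import JSONL


@prodigy.recipe(
    "choice.with-agreement",
    dataset=("Dataset to save answers into", "positional", None, str),
    source=("Source JSONL file", "positional", None, str),
)
def choice_with_agreement(dataset: str, source: str):
    stream = JSONL(source)  # tasks should contain "options" for the choice UI
    seen = []

    def update(answers):
        # Prodigy calls this with each batch of answered examples.
        seen.extend(answers)
        # Recompute and print agreement over `seen` here, e.g. with
        # prodigy_iaa.measures.calculate_agreement (call omitted).
        print(f"Collected {len(seen)} answers so far")

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "choice",
        "update": update,
    }
```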
If you want to calculate more precise statistics, e.g. comparing two annotators pairwise, you could also write a script for that using these existing functions.
## Tests
Tests require a working version of `prodigy`, so they are not run in CI and must be run locally.
## References
[^1]: K. L. Gwet, “On Krippendorff’s Alpha Coefficient,” p. 16, 2015.
[^2]: J. Lovejoy, B. R. Watson, S. Lacy, and D. Riffe, “Three Decades of Reliability in Communication Content Analyses: Reporting of Reliability Statistics and Coefficient Levels in Three Top Journals,” p. 44.
[^3]: S. Lacy, B. R. Watson, D. Riffe, and J. Lovejoy, “Issues and Best Practices in Content Analysis,” Journalism & Mass Communication Quarterly, vol. 92, no. 4, pp. 791–811, Dec. 2015, doi: 10.1177/1077699015607338.
[^4]: X. Zhao, J. S. Liu, and K. Deng, “Assumptions Behind Intercoder Reliability Indices,” Communication Yearbook, p. 83.