nervaluate

Name	nervaluate
Version	0.2.0
home_page	None
Summary	NER evaluation considering partial match scoring
upload_time	2024-06-04 15:52:18
maintainer	None
docs_url	None
author	David S. Batista, Matthew Upson
requires_python	>=3.8
license	MIT License
keywords	named-entity-recognition ner evaluation-metrics partial-match-scoring nlp
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

[![python](https://img.shields.io/badge/Python-3.9-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
&nbsp;
![example event parameter](https://github.com/davidsbatista/BREDS/actions/workflows/code_checks.yml/badge.svg?event=pull_request)
&nbsp;
![code coverage](https://raw.githubusercontent.com/MantisAI/nervaluate/coverage-badge/coverage.svg?raw=true)
&nbsp;
[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)
&nbsp;
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
&nbsp;
![GitHub](https://img.shields.io/github/license/ivyleavedtoadflax/nervaluate)
&nbsp;
![Pull Requests Welcome](https://img.shields.io/badge/pull%20requests-welcome-brightgreen.svg)
&nbsp;
![PyPI](https://img.shields.io/pypi/v/nervaluate)

# nervaluate

nervaluate is a Python module for evaluating Named Entity Recognition (NER) models as defined in the SemEval 2013 - 9.1 task.

The evaluation metrics output by nervaluate go beyond a simple token/tag based schema, and consider different scenarios 
based on whether all the tokens that belong to a named entity were classified or not, and also whether the correct 
entity type was assigned.

The full problem is described in detail in the [original blog post](http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/) 
by [David Batista](https://github.com/davidsbatista). This package extends the code in the 
[original repository](https://github.com/davidsbatista/NER-Evaluation) which accompanied the blog post.

The code draws heavily on:

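* Segura-Bedmar, I., Martínez, P., & Herrero-Zazo, M. (2013). SemEval-2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), 341–350. [link](https://www.aclweb.org/anthology/S13-2056)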
* https://www.cs.york.ac.uk/semeval-2013/task9/data/uploads/semeval_2013-task-9_1-evaluation-metrics.pdf

## The problem

### Token level evaluation for NER is too simplistic

When running machine learning models for NER, it is common to report metrics at the individual token level. This may 
not be the best approach, as a named entity can be made up of multiple tokens, so a full-entity accuracy would be 
desirable.

When comparing the gold standard annotations with the output of an NER system, different scenarios might occur:

__I. Surface string and entity type match__

|Token|Gold|Prediction|
|---|---|---|
|in|O|O|
|New|B-LOC|B-LOC|
|York|I-LOC|I-LOC|
|.|O|O|

__II. System hypothesized an incorrect entity__

|Token|Gold|Prediction|
|---|---|---|
|an|O|O|
|Awful|O|B-ORG|
|Headache|O|I-ORG|
|in|O|O|

__III. System misses an entity__

|Token|Gold|Prediction|
|---|---|---|
|in|O|O|
|Palo|B-LOC|O|
|Alto|I-LOC|O|
|,|O|O|

Based on these three scenarios we have a simple classification evaluation that can be measured in terms of 
true positives, false positives and false negatives, from which we can subsequently compute precision, recall and 
F1-score for each named-entity type.

However, this simple schema ignores partial matches, as well as cases where the NER system gets
the named-entity surface string correct but the type wrong. We might also want to evaluate these scenarios 
at a full-entity level.

For example:

__IV. System assigns the wrong entity type__

|Token|Gold|Prediction|
|---|---|---|
|I|O|O|
|live|O|O|
|in|O|O|
|Palo|B-LOC|B-ORG|
|Alto|I-LOC|I-ORG|
|,|O|O|

__V. System gets the boundaries of the surface string wrong__

|Token|Gold|Prediction|
|---|---|---|
|Unless|O|B-PER|
|Karl|B-PER|I-PER|
|Smith|I-PER|I-PER|
|resigns|O|O|

__VI. System gets the boundaries and entity type wrong__

|Token|Gold|Prediction|
|---|---|---|
|Unless|O|B-ORG|
|Karl|B-PER|I-ORG|
|Smith|I-PER|I-ORG|
|resigns|O|O|
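
To make the limitation of token-level scoring concrete, here is a minimal sketch that computes plain token accuracy for scenarios IV and V above. The helper `token_accuracy` is a hypothetical name used only for illustration and is not part of nervaluate.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    matches = sum(g == p for g, p in zip(gold, pred))
    return matches / len(gold)


# Scenario IV: right boundaries, wrong entity type.
gold_iv = ["O", "O", "O", "B-LOC", "I-LOC", "O"]
pred_iv = ["O", "O", "O", "B-ORG", "I-ORG", "O"]

# Scenario V: right entity type, wrong boundaries.
gold_v = ["O", "B-PER", "I-PER", "O"]
pred_v = ["B-PER", "I-PER", "I-PER", "O"]

acc_iv = token_accuracy(gold_iv, pred_iv)
acc_v = token_accuracy(gold_v, pred_v)
print(f"scenario IV: {acc_iv:.2f}, scenario V: {acc_v:.2f}")
```

In both cases the system still earns credit for several tokens even though the single entity in the sentence was not recognised correctly, which is the gap the entity-level schemas below are meant to close.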

How can we incorporate these scenarios into evaluation metrics? See the [original blog](http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/) 
for a great explanation; a summary is included here.

We can use the following five metrics to consider different categories of errors:

|Error type|Explanation|
|---|---|
|Correct (COR)|both are the same|
|Incorrect (INC)|the output of a system and the golden annotation don’t match|
|Partial (PAR)|system and the golden annotation are somewhat “similar” but not the same|
|Missing (MIS)|a golden annotation is not captured by a system|
|Spurious (SPU)|system produces a response which doesn’t exist in the golden annotation|

These five metrics can be measured in four different ways:

|Evaluation schema|Explanation|
|---|---|
|Strict|exact boundary surface string match and entity type|
|Exact|exact boundary match over the surface string, regardless of the type|
|Partial|partial boundary match over the surface string, regardless of the type|
|Type|some overlap between the system tagged entity and the gold annotation is required|

These five error types and four evaluation schemas interact in the following ways:

|Scenario|Gold entity|Gold string|Pred entity|Pred string|Type|Partial|Exact|Strict|
|---|---|---|---|---|---|---|---|---|
|III|BRAND|tikosyn| | |MIS|MIS|MIS|MIS|
|II| | |BRAND|healthy|SPU|SPU|SPU|SPU|
|V|DRUG|warfarin|DRUG|of warfarin|COR|PAR|INC|INC|
|IV|DRUG|propranolol|BRAND|propranolol|INC|COR|COR|INC|
|I|DRUG|phenytoin|DRUG|phenytoin|COR|COR|COR|COR|
|VI|GROUP|contraceptives|DRUG|oral contraceptives|INC|PAR|INC|INC|
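
The rows of this table can be read as a small decision rule. The sketch below is a simplified illustration of that rule for a single gold/predicted pair, not nervaluate's actual implementation: the function name `classify_pair`, the `(label, start, end)` tuple layout, inclusive end offsets and the offsets in the usage example are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# (label, start, end); the end offset is assumed to be inclusive.
Span = Tuple[str, int, int]

SCHEMAS = ("ent_type", "partial", "exact", "strict")


def classify_pair(gold: Optional[Span], pred: Optional[Span]) -> Dict[str, str]:
    """Return COR/INC/PAR/MIS/SPU for each schema, following the table above."""
    if gold is None and pred is None:
        raise ValueError("at least one of gold or pred is required")
    if pred is None:  # gold entity with no prediction
        return {schema: "MIS" for schema in SCHEMAS}
    if gold is None:  # prediction with no gold entity
        return {schema: "SPU" for schema in SCHEMAS}

    g_label, g_start, g_end = gold
    p_label, p_start, p_end = pred
    if not (g_start <= p_end and p_start <= g_end):
        raise ValueError("spans do not overlap; score them as MIS and SPU")

    same_bounds = (g_start, g_end) == (p_start, p_end)
    same_label = g_label == p_label
    return {
        "ent_type": "COR" if same_label else "INC",
        "partial": "COR" if same_bounds else "PAR",
        "exact": "COR" if same_bounds else "INC",
        "strict": "COR" if same_bounds and same_label else "INC",
    }


# Scenario V: gold "warfarin" vs predicted "of warfarin" (hypothetical offsets).
scenario_v = classify_pair(("DRUG", 1, 1), ("DRUG", 0, 1))
# Scenario IV: same boundaries, different entity type.
scenario_iv = classify_pair(("DRUG", 3, 3), ("BRAND", 3, 3))
print(scenario_v)
print(scenario_iv)
```

A real evaluator also has to decide which predicted span is paired with which gold span when several overlap; that matching step is deliberately left out here.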

Then precision/recall/f1-score are calculated for each evaluation schema. In order to do this, two more 
quantities need to be calculated:

```
POSSIBLE (POS) = COR + INC + PAR + MIS = TP + FN
ACTUAL (ACT) = COR + INC + PAR + SPU = TP + FP
```

Then we can compute precision/recall/f1-score. Roughly speaking, precision is the percentage of named entities found 
by the NER system that are correct, and recall is the percentage of the named entities in the golden annotations 
that are retrieved by the NER system. This is computed in two different ways depending on whether we want an exact 
match (i.e., strict and exact) or a partial match (i.e., partial and type) scenario:

__Exact Match (i.e., strict and exact)__
```
Precision = (COR / ACT) = TP / (TP + FP)
Recall = (COR / POS) = TP / (TP+FN)
```
__Partial Match (i.e., partial and type)__
```
Precision = (COR + 0.5 × PAR) / ACT = TP / (TP + FP)
Recall = (COR + 0.5 × PAR) / POS = TP / (TP + FN)
```

__Putting it all together:__

|Measure|Type|Partial|Exact|Strict|
|---|---|---|---|---|
|Correct|3|3|3|2|
|Incorrect|2|0|2|3|
|Partial|0|2|0|0|
|Missed|1|1|1|1|
|Spurious|1|1|1|1|
|Precision|0.5|0.66|0.5|0.33|
|Recall|0.5|0.66|0.5|0.33|
|F1|0.5|0.66|0.5|0.33|
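
As a worked check, the sketch below applies the formulas above to the Partial and Strict columns of this table. The function name `precision_recall_f1` and the choice to return 0.0 when a denominator is zero are assumptions made for the example, not something the package documents here.

```python
def precision_recall_f1(cor, inc, par, mis, spu, partial_credit=False):
    """Compute precision, recall and F1 from the five error counts.

    With partial_credit=True each partial match counts as half a correct one,
    as in the "Partial Match" formulas above.
    """
    possible = cor + inc + par + mis  # POS: entities in the gold annotation
    actual = cor + inc + par + spu    # ACT: entities produced by the system
    numerator = cor + 0.5 * par if partial_credit else cor

    # Returning 0.0 on an empty denominator is an assumption of this sketch.
    precision = numerator / actual if actual else 0.0
    recall = numerator / possible if possible else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts taken from the "Partial" and "Strict" columns of the table.
partial_scores = precision_recall_f1(3, 0, 2, 1, 1, partial_credit=True)
strict_scores = precision_recall_f1(2, 3, 0, 1, 1)
print("partial:", [round(score, 3) for score in partial_scores])
print("strict: ", [round(score, 3) for score in strict_scores])
```

The results (two thirds for partial, one third for strict) agree with the table up to rounding.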


## Notes:

In scenarios IV and VI the entity types of `true` and `pred` do not match. In both cases we only score against 
the true entity, not the predicted one. You could argue that the predicted entity should also be scored as spurious, 
but according to the definition of `spurious`:

* Spurious (SPU) : system produces a response which does not exist in the golden annotation;

In this case there exists an annotation, but with a different entity type, so we assume it's only incorrect.

## Installation

```
pip install nervaluate
```

## Example:

The main `Evaluator` class will accept a number of formats:

* [prodi.gy](https://prodi.gy) style lists of spans.
* Nested lists containing NER labels.
* CoNLL style tab delimited strings.

### Prodigy spans

```
true = [
    [{"label": "PER", "start": 2, "end": 4}],
    [{"label": "LOC", "start": 1, "end": 2},
     {"label": "LOC", "start": 3, "end": 4}]
]

pred = [
    [{"label": "PER", "start": 2, "end": 4}],
    [{"label": "LOC", "start": 1, "end": 2},
     {"label": "LOC", "start": 3, "end": 4}]
]

from nervaluate import Evaluator

evaluator = Evaluator(true, pred, tags=['LOC', 'PER'])

# Returns overall metrics and metrics for each tag

results, results_by_tag, result_indices, result_indices_by_tag = evaluator.evaluate()

print(results)
```

```
{
    'ent_type':{
        'correct':3,
        'incorrect':0,
        'partial':0,
        'missed':0,
        'spurious':0,
        'possible':3,
        'actual':3,
        'precision':1.0,
        'recall':1.0
    },
    'partial':{
        'correct':3,
        'incorrect':0,
        'partial':0,
        'missed':0,
        'spurious':0,
        'possible':3,
        'actual':3,
        'precision':1.0,
        'recall':1.0
    },
    'strict':{
        'correct':3,
        'incorrect':0,
        'partial':0,
        'missed':0,
        'spurious':0,
        'possible':3,
        'actual':3,
        'precision':1.0,
        'recall':1.0
    },
    'exact':{
        'correct':3,
        'incorrect':0,
        'partial':0,
        'missed':0,
        'spurious':0,
        'possible':3,
        'actual':3,
        'precision':1.0,
        'recall':1.0
    }
}
```
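
Since the overall results are a plain nested dictionary keyed by schema, they are easy to post-process. The following sketch prints one line per schema; it runs on a trimmed, hand-copied version of the dictionary above so that it does not need nervaluate installed, and `format_summary` is a hypothetical helper name.

```python
# Trimmed copy of the dictionary printed above (only two keys per schema kept).
results = {
    "ent_type": {"precision": 1.0, "recall": 1.0},
    "partial": {"precision": 1.0, "recall": 1.0},
    "strict": {"precision": 1.0, "recall": 1.0},
    "exact": {"precision": 1.0, "recall": 1.0},
}


def format_summary(results):
    """Render precision and recall for each schema as aligned text rows."""
    lines = [f"{'schema':<10}{'precision':>10}{'recall':>10}"]
    for schema, metrics in results.items():
        lines.append(
            f"{schema:<10}{metrics['precision']:>10.2f}{metrics['recall']:>10.2f}"
        )
    return "\n".join(lines)


summary = format_summary(results)
print(summary)
```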

```
print(results_by_tag)
```

```
{
    'LOC':{
        'ent_type':{
            'correct':2,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':2,
            'actual':2,
            'precision':1.0,
            'recall':1.0
        },
        'partial':{
            'correct':2,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':2,
            'actual':2,
            'precision':1.0,
            'recall':1.0
        },
        'strict':{
            'correct':2,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':2,
            'actual':2,
            'precision':1.0,
            'recall':1.0
        },
        'exact':{
            'correct':2,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':2,
            'actual':2,
            'precision':1.0,
            'recall':1.0
        }
    },
    'PER':{
        'ent_type':{
            'correct':1,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':1,
            'actual':1,
            'precision':1.0,
            'recall':1.0
        },
        'partial':{
            'correct':1,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':1,
            'actual':1,
            'precision':1.0,
            'recall':1.0
        },
        'strict':{
            'correct':1,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':1,
            'actual':1,
            'precision':1.0,
            'recall':1.0
        },
        'exact':{
            'correct':1,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':1,
            'actual':1,
            'precision':1.0,
            'recall':1.0
        }
    }
}
```

```
from nervaluate import summary_report_overall_indices

print(summary_report_overall_indices(evaluation_indices=result_indices, error_schema='partial', preds=pred))
```

```
Indices for error schema 'partial':

Correct indices:
  - Instance 0, Entity 0: Label=PER, Start=2, End=4
  - Instance 1, Entity 0: Label=LOC, Start=1, End=2
  - Instance 1, Entity 1: Label=LOC, Start=3, End=4

Incorrect indices:
  - None

Partial indices:
  - None

Missed indices:
  - None

Spurious indices:
  - None
```

### Nested lists

```
true = [
    ['O', 'O', 'B-PER', 'I-PER', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'O'],
]

pred = [
    ['O', 'O', 'B-PER', 'I-PER', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'O'],
]

evaluator = Evaluator(true, pred, tags=['LOC', 'PER'], loader="list")

results, results_by_tag, result_indices, result_indices_by_tag = evaluator.evaluate()
```

### CoNLL style tab delimited

```

true = "word\tO\nword\tO\nword\tB-PER\nword\tI-PER\n"

pred = "word\tO\nword\tO\nword\tB-PER\nword\tI-PER\n"

evaluator = Evaluator(true, pred, tags=['PER'], loader="conll")

results, results_by_tag, result_indices, result_indices_by_tag = evaluator.evaluate()

```
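
Writing such strings by hand is error-prone, so a small helper can build them from token and tag lists. This is a minimal sketch assuming one token and tag per line, separated by a tab, for a single document; `to_conll` is a hypothetical name, and how several documents are separated is not covered here.

```python
def to_conll(tokens, tags):
    """Join tokens and tags into a tab-delimited string, one pair per line."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return "".join(f"{token}\t{tag}\n" for token, tag in zip(tokens, tags))


conll_text = to_conll(["word"] * 4, ["O", "O", "B-PER", "I-PER"])
print(repr(conll_text))
```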

## Extending the package to accept more formats

Additional formats can easily be added to the module by creating a conversion function in `nervaluate/utils.py`, 
for example `conll_to_spans()`. This function must return the spans as the prodigy style dicts shown in the prodigy 
example above.

The new function can then be added to the list of loaders in `nervaluate/nervaluate.py`, and can then be selected 
with the `loader` argument when instantiating the `Evaluator` class.
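
As an illustration of what such a conversion function might look like, the sketch below turns one document's BIO tags into span dicts with `label`, `start` and `end` keys. It is not the code shipped in `nervaluate/utils.py`: the name `bio_tags_to_spans`, the use of inclusive token indices for `end`, and the handling of a stray `I-` tag as the start of a new span are assumptions made for the example.

```python
from typing import Dict, List, Union

SpanDict = Dict[str, Union[str, int]]


def bio_tags_to_spans(tags: List[str]) -> List[SpanDict]:
    """Convert one document's BIO tags into span dicts (inclusive end index)."""
    spans: List[SpanDict] = []
    current = None
    for index, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        # An I- tag with the same label extends the currently open span.
        if prefix == "I" and current is not None and current["label"] == label:
            current["end"] = index
            continue
        # Any other tag closes the open span, if there is one.
        if current is not None:
            spans.append(current)
            current = None
        # B- opens a new span; a stray I- is treated the same way (assumption).
        if prefix in ("B", "I") and label:
            current = {"label": label, "start": index, "end": index}
    if current is not None:
        spans.append(current)
    return spans


document = ["O", "B-LOC", "I-LOC", "B-LOC", "I-LOC", "O"]
spans = bio_tags_to_spans(document)
print(spans)
```

A loader for a whole corpus would apply this function to every document and return a list of span lists, mirroring the nested structure of the prodigy example.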

A list of formats we intend to support is tracked in https://github.com/ivyleavedtoadflax/nervaluate/issues/3.


## Contributing to the nervaluate package

Improvements, new features and bug fixes are welcome. If you wish to participate in the development of nervaluate, 
please read the following guidelines.

## The contribution process at a glance

1. Preparing the development environment
2. Code away!
3. Continuous Integration
4. Submit your changes by opening a pull request

Small fixes and additions can be submitted directly as pull requests, but larger changes should be discussed in 
an issue first. You can expect a reply within a few days, but please be patient if it takes a bit longer. 


## Preparing the development environment

Make sure you have Python 3.8 installed on your system.

macOS
```sh
brew install python@3.8
python3.8 -m pip install --user --upgrade pip
python3.8 -m pip install virtualenv
```

Clone the repository and prepare the development environment:

```sh
git clone git@github.com:MantisAI/nervaluate.git
cd nervaluate
python3.8 -m virtualenv venv
source venv/bin/activate
pip install -r requirements_dev.txt
pip install -e .
```


## Continuous Integration

nervaluate runs continuous integration (CI) on all pull requests. This means that if you open a pull request (PR), 
a full test suite is run on your PR: 

- The code is formatted using `black`
- Linting is done using `pylint` and `flake8`
- Type checking is done using `mypy`
- Tests are run using `pytest`

If you prefer, you can also run the tests and formatting locally:

```sh
make all
```

## Opening a Pull Request

Every PR should be accompanied by a short description of the changes, including:
- Impact and motivation for the changes
- Any open issues that are closed by this PR

---

Give a ⭐️ if this project helped you!

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "nervaluate",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "named-entity-recognition, ner, evaluation-metrics, partial-match-scoring, nlp",
    "author": "David S. Batista, Matthew Upson",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/aa/5b/676f6d6713b35ccb0c88d0c50200408ede1405c72c726f9c9a7c0604492f/nervaluate-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "NER evaluation considering partial match scoring",
    "version": "0.2.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/MantisAI/nervaluate/issues",
        "Homepage": "https://github.com/MantisAI/nervaluate"
    },
    "split_keywords": [
        "named-entity-recognition",
        " ner",
        " evaluation-metrics",
        " partial-match-scoring",
        " nlp"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4bb70a3752f076d592082cf42956dc9d2a45006a2247f56207380ce377623efb",
                "md5": "1bf21aa9dda2957ff6e29ca3cd91eda3",
                "sha256": "346a4e1ba383e3164b4cfb79a1cb7745b0400e4fe7e13122835ef3b11e263c87"
            },
            "downloads": -1,
            "filename": "nervaluate-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1bf21aa9dda2957ff6e29ca3cd91eda3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 13803,
            "upload_time": "2024-06-04T15:52:16",
            "upload_time_iso_8601": "2024-06-04T15:52:16.658858Z",
            "url": "https://files.pythonhosted.org/packages/4b/b7/0a3752f076d592082cf42956dc9d2a45006a2247f56207380ce377623efb/nervaluate-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "aa5b676f6d6713b35ccb0c88d0c50200408ede1405c72c726f9c9a7c0604492f",
                "md5": "1f50d4ac9a930a512cc31c296b4f69f5",
                "sha256": "12c14bb3943f3fb20d1af3508e920a7baa33702ed2a8d29715d94684bcf44664"
            },
            "downloads": -1,
            "filename": "nervaluate-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "1f50d4ac9a930a512cc31c296b4f69f5",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 31156,
            "upload_time": "2024-06-04T15:52:18",
            "upload_time_iso_8601": "2024-06-04T15:52:18.382625Z",
            "url": "https://files.pythonhosted.org/packages/aa/5b/676f6d6713b35ccb0c88d0c50200408ede1405c72c726f9c9a7c0604492f/nervaluate-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-04 15:52:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "MantisAI",
    "github_project": "nervaluate",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "nervaluate"
}
