nlg-metricverse

- Name: nlg-metricverse
- Version: 0.9.9
- Home page: https://github.com/disi-unibo-nlp/nlg-metricverse
- Summary: An End-to-End Library for Evaluating Natural Language Generation.
- Author: DISI UniBo NLP
- License: MIT
- Requires Python: >=3.7
- Keywords: natural-language-processing, natural-language-generation, nlg-evaluation, metrics, language-models, visualization, python, pytorch
- Upload time: 2023-09-19 13:58:08

<h1 align="center">nlg-metricverse 🌌</h1>

<table align="center">
    <tr>
        <td align="left">πŸš€ Spaceship</td>
        <td align="left">
          <a href="https://pypi.org/project/nlg-metricverse"><img src="https://img.shields.io/pypi/v/nlg-metricverse?color=blue" alt="PyPI"></a>
          <a href="https://pypi.org/project/nlg-metricverse"><img src="https://img.shields.io/pypi/pyversions/nlg-metricverse" alt="Python versions"></a>
          <a href="https://www.python.org/"><img src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple&logo=python&logoColor=FFF800" alt="Made with Python"></a>
          <br>
          <a href="https://github.com/disi-unibo-nlp/nlg-metricverse/actions"><img alt="Build status" src="https://github.com/disi-unibo-nlp/nlg-metricverse/actions/workflows/ci.yml/badge.svg"></a>
          <a href="https://github.com/disi-unibo-nlp/nlg-metricverse/issues"><img alt="GitHub issues" src="https://img.shields.io/github/issues/disi-unibo-nlp/nlg-metricverse.svg"></a>
        </td>
    </tr>
    <tr>
        <td align="left">πŸ‘¨β€πŸš€ Astronauts</td>
        <td align="left">
          <a href="https://github.com/disi-unibo-nlp/nlg-metricverse/"><img src="https://badges.frapsoft.com/os/v1/open-source.svg?v=103" alt="Open Source Love svg1"></a>
          <a href="https://github.com/disi-unibo-nlp/nlg-metricverse/blob/main/LICENSE"><img alt="License: MIT" src="https://img.shields.io/pypi/l/nlg-metricverse"></a>
          <a href="https://GitHub.com/Nthakur20/StrapDown.js/graphs/commit-activity"><img src="https://img.shields.io/badge/Maintained%3F-yes-green.svg" alt="Maintenance"></a>
        </td>
    </tr>
    <tr>
        <td align="left">πŸ›°οΈ Training Program</td>
        <td align="left">
          <a href="https://github.com/disi-unibo-nlp/nlg-metricverse/blob/main/notebooks/nlg_metricverse_demo.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
        </td>
    </tr>
    <tr>
        <td align="left">πŸ“• Operating Manual</td>
        <td align="left">
            <a href="https://aclanthology.org/2022.coling-1.306/">COLING22 Long Paper</a>
        </td>
    </tr>
</table>

<br>

> One NLG evaluation library to rule them all

<p align="center">
  <img src="./figures/nlgmetricverse_banner.png" title="nlg-metricverse" alt="">
</p>

### Explore the universe of Natural Language Generation (NLG) evaluation metrics.
NLG Metricverse is an end-to-end Python library for NLG evaluation, devised to provide a living unified codebase for fast application, analysis, comparison, visualization, and prototyping of automatic metrics.
* Spurs the adoption of newly proposed metrics, unleashing their potential.
* Reduces the implementational burden, allowing users to easily move from papers to practical applications.
* Increases comparability and replicability of NLG research.
* Provides content-rich metric cards and static/interactive visualization tools to improve metric understanding and scoring interpretation.

## Table of Contents
- [Motivations](#-motivations)
- [Available Metrics](#-available-metrics-and-supported-features)
- [Installation](#-installation)
    - [Explore on Hugging Face Spaces](#explore-on-hugging-face-spaces)
- [Quickstart](#-quickstart)
    - [Metric Selection](#metric-selection)
        - [Metric Documentation](#metric-documentation)
        - [Metric Filtering](#metric-filtering)
    - [Metric Usage](#metric-usage)
        - [Prediction-Reference Cardinality](#prediction-reference-cardinality)
        - [Scorer Application](#scorer-application)
        - [Metric-specific Parameters](#metric-specific-parameters)
        - [Outputs](#outputs)
- [Code Style](#code-style)
- [Custom Metrics](#-custom-metrics)
- [Contributing](#-contributing)
- [Contact](#-contact)
- [License](#license)


## 💡 Motivations
* 📖 As Natural Language Generation (NLG) models are getting better over time, accurately evaluating them is becoming an increasingly pressing priority, asking researchers to deal with semantics, different plausible targets, and multiple intrinsic quality dimensions (e.g., informativeness, fluency, factuality).
* 🤖 Task examples: machine translation, abstractive question answering, single/multi-document summarization, data-to-text, chatbots, image/video captioning, etc.
* 📌 Human evaluation is often the best indicator of the quality of a system. However, designing crowdsourcing experiments is an expensive and high-latency process, which does not easily fit into a daily model development pipeline. Therefore, NLG researchers commonly use automatic evaluation metrics, which provide an acceptable proxy for quality and are very cheap to compute.
* 📌 NLG metrics automatically compute a holistic or dimension-specific score, an acceptable proxy for effectiveness and efficiency. However, they are becoming an important bottleneck for research in the field. As we know, areas can stagnate due to poor metrics, and we believe that you shouldn't feel confined to the most traditional overlap-based techniques like ROUGE.
* 💡 If you're working on an established problem, you'll feel pressure from readers to be conservative and use the metrics that have already been tested for the same task. However, this pressure can be limiting. Our view is that NLP engineers should enrich their evaluation toolkits with multiple metrics capturing different textual properties, being free to argue against cultural norms and motivate new ones, also exploring the latest contributions focused on semantics.
* ☠ New NLG metrics are constantly being proposed at top-tier conferences, but their implementations remain fragmented, with distinct environments, properties, settings, benchmarks, and features, making them difficult to compare or apply.
* ☠ The absence of a collective and continuously updated repository discourages the use of modern solutions and slows their understanding.
* 🎯 NLG Metricverse implements a large number of prominent evaluation metrics in NLG, seeking to articulate the textual properties they encode (e.g., fluency, grammatical correctness, informativeness), tasks, and limits. Understanding, using, and examining a metric has never been easier.


## 🪐 Available Metrics and Supported Features
NLG Metricverse supports 38 diverse evaluation metrics overall (last update: October 12, 2022). The code for these metrics will be progressively released in the coming weeks.

Some libraries have already tried to provide an integrated environment. To the best of our knowledge, [NLGEval](https://github.com/Maluuba/nlg-eval), [Hugging Face Datasets](https://huggingface.co/docs/datasets/index), [Evaluate](https://huggingface.co/docs/evaluate/index), [Torch-Metrics](https://torchmetrics.readthedocs.io/en/stable/), and [Jury](https://github.com/obss/jury) are the only comparable resources available.
However, none of them possesses all of the following properties: (i) a large number of heterogeneous NLG metrics, (ii) concurrent computation of multiple metrics at once, (iii) support for multiple references and/or predictions, (iv) meta-evaluation, and (v) visualization.

The following table summarizes the discrepancies between NLG Metricverse and related work.

| | NLG-Metricverse | NLGEval | Datasets | Evaluate | TorchMetrics | Jury |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| #NLG-specific metrics | 43 + Datasets | 8 | 22 | 53 | 13 | 19 + Datasets |
| More metrics at once | :white_check_mark: | :x: | :x: | :white_check_mark: | :x: | :white_check_mark: |
| Multiple refs/preds | :white_check_mark: | :white_check_mark: | :x: | :x: | :x: | :white_check_mark: |
| Meta-evaluation | :white_check_mark: | :x: | :x: | :x: | :x: | :x: |
| Visualization | :white_check_mark: | :x: | :x: | :x: | :x: | :x: |

πŸ” [Complete comparison and supported metrics](https://github.com/disi-unibo-nlp/nlg-metricverse/blob/main/comparison.md) 


## 🔌 Installation
Install from PyPI repository
```
pip install nlg-metricverse
```
or build from source
```
git clone https://github.com/disi-unibo-nlp/nlg-metricverse.git
cd nlg-metricverse
pip install -v .
```

#### Explore on Hugging Face Spaces
The **Spaces** edition of NLG Metricverse will be launched soon. Check it out here:
[![](./figures/spaces.png)](https://huggingface.co/spaces/disi-unibo-nlp/nlg-metricverse)

## 🚀 Quickstart

### Prepare your environment
For NLG Metricverse, we recommend using a virtual environment. If you are not familiar with virtual environments, you can read more about them [here](https://docs.python.org/3/tutorial/venv.html).
Within a library that spans numerous metrics, per-metric virtual environments are invaluable: encapsulating each metric in its own environment mitigates dependency conflicts, allows precise version pinning for each metric, and keeps behavior consistent and reproducible. As the set of metrics grows, this also simplifies collaboration, reduces the risk of contaminating the global environment, and eases deployment.

Before running any code, you need to create and activate a virtual environment for the desired metric and install the required dependencies.
```
# create the virtual environment for the desired metric (e.g., ROUGE)
python -m venv nlgmetricverse\env\rouge

# activate the virtual environment (Command Prompt)
nlgmetricverse\env\rouge\Scripts\activate.bat
# or, on PowerShell
nlgmetricverse\env\rouge\Scripts\Activate.ps1

# install the library
pip install -v . --quiet

# Packages only available from a git source must be installed separately:
# PyPI does not allow indexing a package that depends directly on non-PyPI
# packages, for security reasons. `requirements-dev.txt` lists packages that
# are only available from a git source, or PyPI packages with no recent
# release or that are incompatible with NLG Metricverse, pinned to git
# sources or specific commits.
pip install -r requirements-dev.txt

# if present, install the metric-specific requirements
pip install -r nlgmetricverse\metrics\rouge\requirements.txt
```
After that, you can run the code for the metric you want to use. Once you are done, deactivate the virtual environment:
```
deactivate
```

Evaluating generated outputs then takes only <b>two lines of code</b>: <b>(i)</b> instantiate your scorer by selecting the desired metric(s), and <b>(ii)</b> apply it!

### Metric Selection
Specify the metrics you want to use on instantiation,
```python
from nlgmetricverse import NLGMetricverse

# If you specify multiple metrics, each of them will be applied to your data
# (allowing for a fast prediction/efficiency comparison)
scorer = NLGMetricverse(metrics=["bleu", "rouge"])
```
or directly import metrics from `nlgmetricverse.metrics` as classes, then instantiate and use them as desired.
```python
from nlgmetricverse.metrics import BertScore

scorer = BertScore.construct()
```
You can seamlessly access both `nlgmetricverse` and Hugging Face `datasets` metrics through `nlgmetricverse.load_metric`.
NLG Metricverse falls back to the `datasets` implementation for metrics that are not currently supported natively; you can see the metrics available in `datasets` at [datasets/metrics](https://github.com/huggingface/datasets/tree/master/metrics).
```python
bleu = NLGMetricverse.load_metric("bleu")
# metrics not available in `nlgmetricverse` but in `datasets`
competition_math = NLGMetricverse.load_metric("competition_math")  # falls back to the `datasets` package with a warning
```

### Metric Usage

#### Prediction-Reference Cardinality

☠ NLG evaluation is also very challenging because the relationships between candidate and reference texts tend to be one-to-many or many-to-many. An artificial text predicted by a model might have multiple valid human references (i.e., there is more than one effective way to say most things), and a model can generate multiple distinct outputs. Such cardinality is crucial, but official implementations tend to neglect it. We do not.

<i>1:1</i>. One prediction, one reference ([p<sub>1</sub>, ..., p<sub>n</sub>] and [r<sub>1</sub>, ..., r<sub>n</sub>] syntax).
```python
predictions = ["Evaluating artificial text has never been so simple", "the cat is on the mat"]
references = ["Evaluating artificial text is not difficult", "The cat is playing on the mat."]
```
<i>1:M</i>. One prediction, many references ([p<sub>1</sub>, ..., p<sub>n</sub>] and [[r<sub>11</sub>, ..., r<sub>1m</sub>], ..., [r<sub>n1</sub>, ..., r<sub>nm</sub>]] syntax).
```python
predictions = ["Evaluating artificial text has never been so simple", "the cat is on the mat"]
references = [
    ["Evaluating artificial text is not difficult", "Evaluating artificial text is simple"],
    ["The cat is playing on the mat.", "The cat plays on the mat."]
]
```
<i>K:M</i>. Many predictions, many references ([[p<sub>11</sub>, ..., p<sub>1k</sub>], ..., [p<sub>n1</sub>, ..., p<sub>nk</sub>]] and [[r<sub>11</sub>, ..., r<sub>1m</sub>], ..., [r<sub>n1</sub>, ..., r<sub>nm</sub>]] syntax). This is helpful for language models with a decoding strategy focused on diversity (e.g., beam search, temperature sampling).
```python
predictions = [
    ["Evaluating artificial text has never been so simple", "The evaluation of automatically generated text is simple."],
    ["the cat is on the mat", "the cat likes playing on the mat"]
]
references = [
    ["Evaluating artificial text is not difficult", "Evaluating artificial text is simple"],
    ["The cat is playing on the mat.", "The cat plays on the mat."]
]
```

#### Scorer Application
```python
scores = scorer(predictions, references)
```
The `scorer` automatically selects the proper strategy for applying the selected metric(s) depending on the input format. In any case, if a prediction needs to be compared against multiple references, you can customize the reduction function (e.g., `reduce_fn="max"` chooses the prediction-reference pair with the highest score for each of the N items in the dataset).
```python
scores = scorer.compute(predictions, references, reduce_fn="max")
```
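
Putting the pieces above together, a minimal end-to-end run over 1:M data might look like the following sketch. The metric names and the `reduce_fn="max"` reduction are the ones shown above; the top-level `NLGMetricverse` import is assumed to be available, as in the metric-selection example.
```python
from nlgmetricverse import NLGMetricverse  # assumed top-level import

# one prediction per item, several acceptable references per item (1:M)
predictions = ["Evaluating artificial text has never been so simple", "the cat is on the mat"]
references = [
    ["Evaluating artificial text is not difficult", "Evaluating artificial text is simple"],
    ["The cat is playing on the mat.", "The cat plays on the mat."]
]

# instantiate a scorer with two metrics and keep, for each item,
# the best-scoring prediction-reference pair
scorer = NLGMetricverse(metrics=["bleu", "rouge"])
scores = scorer.compute(predictions, references, reduce_fn="max")
print(scores)
```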

#### Metric-specific Parameters
Additional metric-specific parameters can be specified on instantiation.
```python
from nlgmetricverse import NLGMetricverse, load_metric

metrics = [
    load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1}),
    load_metric("bleu", resulting_name="bleu_2", compute_kwargs={"max_order": 2}),
    load_metric("bertscore", resulting_name="bertscore_1", compute_kwargs={"model_type": "microsoft/deberta-large-mnli", "idf": True}),
    load_metric("rouge"),
]
scorer = NLGMetricverse(metrics=metrics)
```

### Code Style
To check the code style,
```
python tests/run_code_style.py check
```
To format the codebase,
```
python tests/run_code_style.py format
```

## 🎨 Custom Metrics
You can use custom metrics by inheriting `nlgmetricverse.metrics.Metric`.
You can see current metrics implemented on NLG Metricverse from [nlgmetricverse/metrics](https://github.com/disi-unibo-nlp/nlg-metricverse/tree/main/nlgmetricverse/metrics).
NLG Metricverse itself uses `datasets.Metric` as a base class to derive its own base class, `nlgmetricverse.metrics.Metric`. The interface is similar; however, NLG Metricverse makes metrics take a unified input type by handling metric-specific inputs and allowing multiple cardinalities (1:1, 1:M, K:M).
For implementing custom metrics, either base class can be used, but we strongly recommend using `nlgmetricverse.metrics.Metric` for its advantages.
To add a custom metric, you need to:
1. Create a folder inside `nlgmetricverse/metrics` with the name of your metric.
2. Create inside the folder `__init__.py`, `*metric*.py` and `*metric*_planet.py`.
3. Inside `__init__.py`, add the following code:
```python
from nlgmetricverse.metrics.*metric*.*metric* import *Metric*
```

4. Inside `*metric*.py`, add the following code:
```python
"""
*Metric* metric super class.
"""
from nlgmetricverse.metrics._core import MetricAlias
from nlgmetricverse.metrics.*metric*.*metric*_planet import *CustomMetric*

__main_class__ = "*Metric*"


class *Metric*(MetricAlias):
    """
    *Metric* metric superclass.
    """
    _SUBCLASS = *CustomMetric*
```
5. Inside `*metric*_planet.py`, add the following code:
```python
from nlgmetricverse.metrics import MetricForLanguageGeneration

class CustomMetric(MetricForLanguageGeneration):
    def _compute_single_pred_single_ref(
        self, predictions, references, reduce_fn = None, **kwargs
    ):
        raise NotImplementedError

    def _compute_single_pred_multi_ref(
        self, predictions, references, reduce_fn = None, **kwargs
    ):
        raise NotImplementedError

    def _compute_multi_pred_multi_ref(
            self, predictions, references, reduce_fn = None, **kwargs
    ):
        raise NotImplementedError
```
For more details, have a look at the base metric implementation: [nlgmetricverse.metrics.Metric](./nlgmetricverse/metrics/_core/base.py). A toy end-to-end sketch of such a module is shown after this list.

6. Inside your metric folder add a README.md file, following the [metric card guidelines](./metric_card_guidelines.md).
7. Add your metric to the [comparison table](./comparison.md) and to the [README.md](./README.md) file.
8. Add your metric to [nlgmetricverse/metrics/\_\_init\_\_.py](./nlgmetricverse/metrics/__init__.py) file.
9. Add your metric to metrics_list inside [nlgmetricverse/metrics/\_core/utils.py](./nlgmetricverse/metrics/_core/utils.py) file.
10. Add test cases for your metric inside [tests/nlgmetricverse/metrics](./tests/nlgmetricverse/metrics) folder, with its respective expected outputs, inside [tests/test_data/expected\_outputs/metrics](./tests/test_data/expected_outputs/metrics) folder.
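
To make the steps above concrete, here is a toy, purely hypothetical metric (call it `lexoverlap`, a simple word-overlap score) written as a sketch of the `*metric*_planet.py` module. It assumes the three `_compute_*` hooks may return a plain dictionary of scores and that no further boilerplate (e.g., a metric-info method) is required; check the existing implementations under `nlgmetricverse/metrics` for the exact contract.
```python
# lexoverlap_planet.py -- hypothetical toy metric, for illustration only
from nlgmetricverse.metrics import MetricForLanguageGeneration


class LexOverlap(MetricForLanguageGeneration):
    """Toy metric: fraction of prediction tokens that also appear in the reference."""

    @staticmethod
    def _overlap(pred, ref):
        pred_tokens, ref_tokens = pred.split(), set(ref.split())
        return sum(t in ref_tokens for t in pred_tokens) / max(len(pred_tokens), 1)

    def _compute_single_pred_single_ref(self, predictions, references, reduce_fn=None, **kwargs):
        scores = [self._overlap(p, r) for p, r in zip(predictions, references)]
        return {"score": sum(scores) / len(scores)}

    def _compute_single_pred_multi_ref(self, predictions, references, reduce_fn=None, **kwargs):
        # reduce over the multiple references of each item (e.g., max or mean)
        reduce_fn = reduce_fn or max
        scores = [reduce_fn([self._overlap(p, r) for r in refs]) for p, refs in zip(predictions, references)]
        return {"score": sum(scores) / len(scores)}

    def _compute_multi_pred_multi_ref(self, predictions, references, reduce_fn=None, **kwargs):
        # reduce over every prediction-reference pair of each item
        reduce_fn = reduce_fn or max
        scores = [
            reduce_fn([self._overlap(p, r) for p in preds for r in refs])
            for preds, refs in zip(predictions, references)
        ]
        return {"score": sum(scores) / len(scores)}
```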

## 🙌 Contributing
Thanks go to all these wonderful collaborators for their contributions to the NLG Metricverse library:

<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
<!-- prettier-ignore-start -->
<!-- markdownlint-disable -->
<table>
  <tr>
    <td align="center"><a href="https://giacomofrisoni.github.io/"><img src="https://github.com/giacomofrisoni.png" width="100px;" alt=""/><br /><sub><b>Giacomo Frisoni</b></sub></a></td>
    <td align="center"><a href="https://andreazammarchi3.github.io/"><img src="https://github.com/andreazammarchi3.png" width="100px;" alt=""/><br /><sub><b>Andrea Zammarchi</b></sub></a></td>
    <td align="center"><a href="https://github.com/ValentinaPieri"><img src="https://github.com/ValentinaPieri.png" width="100px;" alt=""/><br /><sub><b>Valentina Pieri</b></sub></td>
    <td align="center"><img src="https://github.com/marcoavagnano98.png" width="100px;" alt=""/><br /><sub><b>Marco Avagnano</b></sub></td>
</table>

> We are hoping that the open-source community will help us edit the code and make it better!
> Don't hesitate to open issues and contribute a fix or improvement! We can guide you if you're not sure where to start but want to help us out 🥇.
> To contribute a change to our code base, please submit a pull request (PR) via GitHub, and someone from our team will review it.

> If you have trouble, suggestions, or ideas, the [Discussions](https://github.com/disi-unibo-nlp/nlg-metricverse/discussions) board might have some relevant information. If not, you can post your questions there 💬🗨.

## ✉ Contact
Contact person: Giacomo Frisoni, [giacomo.frisoni@unibo.it](mailto:giacomo.frisoni@unibo.it).
This research work has been conducted within the Department of Computer Science and Engineering, University of Bologna, Italy.

## License

The code is released under the [MIT License](LICENSE). It should not be used to promote or profit from violence, hate, and division, environmental destruction, abuse of human rights, or the destruction of people's physical and mental health.

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=disi-unibo-nlp/nlg-metricverse&type=Date)](https://star-history.com/#disi-unibo-nlp/nlg-metricverse&Date)


            
