metric-eval


Name: metric-eval
Version: 1.0.2
Summary: a python package for evaluating evaluation metrics
Upload time: 2023-11-07 01:22:58
Author: Ziang Xiao, Susu Zhang
Keywords: python, metrics, evaluation, measurement, natural language processing, natural language generation
            
# MetricEval
MetricEval is a framework that conceptualizes and operationalizes key desiderata of metric evaluation in terms of reliability and validity. Please see [Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory](https://arxiv.org/abs/2305.14889) for more details.

## Summary

In this [Github repo](https://github.com/isle-dev/MetricEval), you will find the implementation of our framework, metric-eval, a Python package for evaluation metric analysis. If you use MetricEval in your work, please cite:

```
@article{xiao2023evaluating,
  title={Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory},
  author={Xiao, Ziang and Zhang, Susu and Lai, Vivian and Liao, Q Vera},
  journal={arXiv preprint arXiv:2305.14889},
  year={2023}
}
```

## metric-eval
### Quick Start

#### Install from PyPI
```bash
pip install metric-eval
```

#### Install from Repo
Clone the repository:
```bash
git clone git@github.com:isle-dev/MetricEval.git
cd MetricEval
```

Install Dependencies:
```bash
conda create --name metric-eval python=3.10
conda activate metric-eval
pip install -r requirements.txt
```

### Usage
Please refer to metric_eval/example.py in the [Github repo](https://github.com/isle-dev/MetricEval) for detailed usage of metric-eval. To interpret the evaluation results, see the [paper](https://arxiv.org/abs/2305.14889).

#### Import Module
```python
import metric_eval
import pandas as pd  # used below to load the example CSV files
```

#### Load Data
Example data can be found in metric_eval/data/* in the [Github repo](https://github.com/isle-dev/MetricEval). The data is a CSV file in long format with the following columns (a toy illustration follows below):

- `test_id`: ids of test examples
- `model_id`: ids of models
- one additional column per metric, containing that metric's scores (metric name as column name)
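
For concreteness, here is a minimal sketch of what such a long-format table looks like; the metric names and scores below are purely illustrative and are not part of the example data shipped with the package.

```python
import pandas as pd

# Hypothetical long-format scores: one row per (test example, model) pair,
# plus one column per metric. Names and values are illustrative only.
toy_data = pd.DataFrame({
    "test_id":   [1, 2, 1, 2],
    "model_id":  ["model_a", "model_a", "model_b", "model_b"],
    "ROUGE_1":   [0.42, 0.38, 0.51, 0.47],
    "BERTScore": [0.87, 0.84, 0.90, 0.89],
})
print(toy_data)
```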

```python
data = pd.read_csv("data/metric_scores_long.csv")
data_2nd_run = pd.read_csv("data/metric_scores_long_2nd_run.csv")
```

#### Metric Stability
Metric stability compares model-level average metric scores between two independent runs. For each metric, the function calculates the Pearson correlation coefficient between the average scores per model from the first run (`data`) and from the second run (`data_2nd_run`).


```python
rel_cor = metric_eval.metric_stability(data, data_2nd_run)
print(rel_cor)
```
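
As a rough cross-check of what this coefficient captures, the stability correlation can also be reproduced directly with pandas and SciPy. The sketch below assumes, as in the example data, that the first two columns are `test_id` and `model_id` and that both runs cover the same models; it approximates the described computation and is not the package's internal code.

```python
from scipy.stats import pearsonr

# Sketch: average each metric per model within each run, then correlate
# the model-level means of the two runs (one Pearson r per metric).
metric_cols = list(data.columns[2:])
means_run1 = data.groupby("model_id")[metric_cols].mean()
means_run2 = data_2nd_run.groupby("model_id")[metric_cols].mean()

for m in metric_cols:
    r, _ = pearsonr(means_run1[m], means_run2[m])
    print(f"{m}: stability r = {r:.3f}")
```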

#### Metric Consistency
Metric consistency describes how a metric's scores fluctuate within a benchmark dataset, i.e., across data points. The function returns consistency estimates (`alphas`) and the standard errors of measurement (`sems`) of each metric, computed from N randomly sampled data points. The default N (-1) uses all available samples in the dataset.

```python
alphas, sems = metric_eval.metric_consistency(data, N = -1)
print(alphas)
print(sems)
```
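
To make the measurement-theory quantities concrete, the sketch below computes a Cronbach's-alpha-style consistency coefficient and the corresponding standard error of measurement for a single metric, treating test examples as items and models as subjects. This is an illustrative reading of the construct, not the package's estimator; see metric_eval/example.py and the paper for the exact definitions. The commented `pivot` call uses a hypothetical metric column name.

```python
import numpy as np

def alpha_and_sem(score_matrix: np.ndarray):
    """Sketch: coefficient alpha and SEM for a models-by-examples score matrix.

    score_matrix[i, j] = score of model i on test example j. Illustrative only.
    """
    n_items = score_matrix.shape[1]
    item_vars = score_matrix.var(axis=0, ddof=1)   # variance of each example's scores across models
    totals = score_matrix.sum(axis=1)              # each model's total score over examples
    total_var = totals.var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1.0 - item_vars.sum() / total_var)
    sem = np.sqrt(total_var) * np.sqrt(1.0 - alpha)  # SEM of the total score
    return alpha, sem

# Hypothetical usage: pivot one metric into a models-by-examples matrix.
# matrix = data.pivot(index="model_id", columns="test_id", values="ROUGE_1").to_numpy()
# print(alpha_and_sem(matrix))
```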

#### MTMM Table
The multitrait-multimethod (MTMM) table presents a way to scrutinize whether observed metric scores act in concert with theory about what they intend to measure, when two or more constructs (traits) are measured using two or more methods. By convention, an MTMM table reports the pairwise correlations of the observed metric scores across raters and traits on the off-diagonals and the reliability coefficient of each score on the diagonal.

```python
metric_names = data.columns[2:14].tolist()
trait = ['COH', 'CON', 'FLU', 'REL'] * 3
method = ['Expert_1'] * 4 + ['Expert_2'] * 4 + ['Expert_3'] * 4

# Create the MTMM_design DataFrame
MTMM_design = pd.DataFrame({
    'trait': trait,
    'method': method,
    'metric': metric_names
})

MTMM_result = metric_eval.MTMM(data, MTMM_design, method = 'pearson')
print(MTMM_result)
```
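
To inspect the raw material that the MTMM table organizes, the plain pairwise correlations among the designated columns can also be computed directly with pandas. This is only a cross-check; unlike `metric_eval.MTMM`, it does not place reliability coefficients on the diagonal.

```python
# Pairwise Pearson correlations among the columns listed in the MTMM design.
raw_corr = data[metric_names].corr(method="pearson")
print(raw_corr.round(2))
```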

#### Metric Concurrent Validity
The function computes the concurrent validity of each metric against each criterion variable by calculating Kendall's tau correlation coefficient between the criterion variable and the metric.
```python
criterion = ['Expert.1.COH','Expert.1.CON','Expert.1.FLU','Expert.1.REL']
concurrent_val_table = metric_eval.concurrent_validity(data, criterion)
metric_eval.print_concurrent_validity_table(concurrent_val_table)
```
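
For a single criterion-metric pair, the same Kendall's tau can be checked directly with SciPy. The criterion column name below is taken from the example above, while `ROUGE_1` is a hypothetical stand-in for whichever metric column you want to validate.

```python
from scipy.stats import kendalltau

# Sketch: Kendall's tau between one criterion column and one metric column.
# "ROUGE_1" is a hypothetical metric column name; substitute one from your data.
tau, p_value = kendalltau(data["Expert.1.COH"], data["ROUGE_1"])
print(f"tau = {tau:.3f}, p = {p_value:.3f}")
```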

## Get Involved
We welcome contributions from the community! Please create a GitHub issue if you have any questions, suggestions, requests, or bug reports. If you would like to contribute to the codebase, please create a pull request.

## Contact
If you have any questions, please contact [Ziang Xiao](https://www.ziangxiao.com/) at ziang dot xiao at jhu dot edu or [Susu Zhang](https://sites.google.com/view/susuzhang/) at szhan105 at illinois dot edu.

            
