scale-score

Name	scale-score
Version	0.1.0
home_page	https://github.com/asappresearch/scale-score
Summary	Implementation of SCALE metric and ScreenEval
upload_time	2023-11-29 20:01:25
maintainer
docs_url	None
author	Barrett Lattimer, Patrick Chen, Xinyuan Zhang, Yi Yang
requires_python	>=3.6
license	MIT
keywords	hallucination detection inconsistency detection hallucination automatic evaluation metric long document efficient fast accurate long natural language generation task agnostic nlp nlg
requirements	No requirements were recorded.

# Fast and Accurate Factual Inconsistency Detection Over Long Documents
Barrett Martin Lattimer, Patrick Chen, Xinyuan Zhang, Yi Yang

blattimer@asapp.com

EMNLP 2023

https://arxiv.org/abs/2310.13189
## Overview
Introducing SCALE, a reference-free, NLI-based factual inconsistency detection method, and ScreenEval, the longest dialogue-based dataset for factual inconsistency detection presently available.
Both are described in our paper, Fast and Accurate Factual Inconsistency Detection Over Long Documents.

SCALE uses a novel chunking strategy to achieve state-of-the-art factual inconsistency detection performance across many NLG domains and tasks, and over long documents (>6k tokens). SCALE's chunking approach also enables fast retrieval of relevant source text from long documents.

## SCALE

This metric outputs the estimated probability that a hypothesis is supported by a given premise, written *SCALE(premise, hypothesis)*. Commonly, the hypothesis is generated text and the premise is some ground-truth text. For example, the premise may be a document and the hypothesis a summary sentence generated by a language model. The score is bounded: 0&le;*SCALE(premise, hypothesis)*&le;1. A higher score means the hypothesis is more likely to be factually consistent with the premise; a lower score means it is more likely to be factually inconsistent. For the best results, Flan-T5-XL is recommended as the base model.

### Install
To use the evaluation metric, first install the Python module with pip:
```
pip install scale-score
```
Alternatively, install from source (run from the root of a cloned copy of the repository):
```
pip install -e .
```
### Score
#### *Running the Metric*
Import the `score` function and load your premises and hypotheses. For scoring, `premise` is a list of entire document strings, while `hypothesis` is a list of lists of strings, each string being a single sentence. Each premise is mapped by index to its own list of hypotheses (premise_0 -> ['hypothesis_0_0', 'hypothesis_0_1'], premise_1 -> ['hypothesis_1_0', 'hypothesis_1_1', 'hypothesis_1_2']).
```python
from scale_score import score

premise = [
    'premise_0',
    'premise_1',
]
hypothesis = [
    ['hypothesis_0_0', 'hypothesis_0_1'],
    ['hypothesis_1_0', 'hypothesis_1_1', 'hypothesis_1_2']
]

results = score(premise, hypothesis)
```
The results form a flat list with one score per hypothesis, each scored against its respective premise:
```python
results = [
    SCALE(premise_0, hypothesis_0_0), 
    SCALE(premise_0, hypothesis_0_1), 
    SCALE(premise_1, hypothesis_1_0), 
    SCALE(premise_1, hypothesis_1_1),
    SCALE(premise_1, hypothesis_1_2),
]
```
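
Because the scores come back as one flat list, it can be handy to regroup them so that each premise has its own list of scores again. The following is a minimal sketch of that bookkeeping, not part of the package: `group_scores` is a hypothetical helper, and `example_scores` are placeholder numbers standing in for the output of `score(premise, hypothesis)`.

```python
from typing import List


def group_scores(flat_scores: List[float], hypothesis: List[List[str]]) -> List[List[float]]:
    """Split a flat score list back into one sub-list per premise."""
    expected = sum(len(hyps) for hyps in hypothesis)
    if len(flat_scores) != expected:
        raise ValueError(f"expected {expected} scores, got {len(flat_scores)}")

    grouped, start = [], 0
    for hyps in hypothesis:
        # Each premise owns as many consecutive scores as it has hypotheses.
        grouped.append(flat_scores[start:start + len(hyps)])
        start += len(hyps)
    return grouped


hypothesis = [
    ['hypothesis_0_0', 'hypothesis_0_1'],
    ['hypothesis_1_0', 'hypothesis_1_1', 'hypothesis_1_2'],
]
# Placeholder values, not real model output.
example_scores = [0.91, 0.12, 0.77, 0.55, 0.08]

grouped = group_scores(example_scores, hypothesis)
print(grouped)
```

This relies only on the ordering shown above: all hypotheses of premise_0 first, then all hypotheses of premise_1, and so on.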


You can also use a `SCALEScorer` object to avoid reloading the model on every call:
```python
from scale_score.scorer import SCALEScorer
scorer = SCALEScorer(size='small', device='cuda')
results = scorer.score(premise, hypothesis)
```
#### *Arguments*
`score` and `scorer.score` take the same arguments, except that `scorer.score` does not accept *size* or *device*, since those are set when the scorer object is built.
| Argument | Type | Default | Description |
| ------ | ------ | ------ | ------ |
| premise | List[str] | required | premise text, the ground truth |
| hypothesis | List[List[str]] | required | hypothesis text, usually the text predicted by a model being evaluated |
| chunk_size | int | 1000 | The size of the chunks used to perform chunking on the premise |
| window_size | float | 0.25 | The percentage of overlap between chunks. 0&le;window_size&lt;1 |
| size | str | 'xl' | Size of Flan-T5 model, options are 'small', 'base', 'large', 'xl', 'xxl' |
| device | str | 'cuda' | torch device to send the model to. |
| model_path | str | None | Optional path to a Flan-T5 model to load. Note the corresponding size must be specified in the *size* argument. |
| model | T5ForConditionalGeneration | None | Optional model to use for scoring |
| tokenizer | T5Tokenizer | None | Optional tokenizer to use for scoring |
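
To give a feel for how `chunk_size` and `window_size` interact, here is a rough sketch of overlapping chunking. It is one plausible reading of the two arguments, not the package's implementation: the helper name `chunk_with_overlap`, the use of a plain token list, and the stride formula are all assumptions, and the package may count and split text differently.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], chunk_size: int = 1000,
                       window_size: float = 0.25) -> List[List[str]]:
    """Split tokens into chunks of up to chunk_size that overlap by roughly window_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= window_size < 1:
        raise ValueError("window_size must satisfy 0 <= window_size < 1")

    # Assumed interpretation: each new chunk starts (1 - window_size) * chunk_size
    # tokens after the previous one, so neighbouring chunks share the rest.
    stride = max(1, int(chunk_size * (1 - window_size)))

    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


tokens = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, window_size=0.25)
for chunk in chunks:
    print(chunk)
```

Under this reading, a larger `window_size` gives more overlap and therefore more chunks to score, while `window_size=0` gives disjoint chunks.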

### Evaluation
After scoring, use the `evaluate_scale` function to evaluate the results. 
```python
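from scale_score.eval import evaluate_scale
from scale_score.scorer import SCALEScorer

scorer = SCALEScorer(size='small', device='cuda')
results = scorer.score(premise, hypothesis)
# Gold labels, one per hypothesis in the same flat order as `results`
# (1 = incorrect, 0 = correct). These values are placeholders.
incorrect = [0, 1, 0, 0, 1]
metrics = evaluate_scale(results, incorrect=incorrect)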
```
The arguments for `evaluate_scale` are as follows
| Argument | Type | Default | Description |
| ------ | ------ | ------ | ------ |
| results | List[float] | required | Output of `score` or `scorer.score` |
| incorrect | List[int] | required | List of labels for summary sentences, 1 for incorrect and 0 for correct |
| threshold | float | 0.5 | Threshold used to calculate binary, micro, macro, and weighted f1 scores |
| out_file | str | None | Optional json filepath to write the metrics to |
| print_outputs | bool | True | Whether to print the metrics |

The metrics that are output are described below. 
| Metric | Description |
| ------ |------ |
| pearson | Pearson correlation | 
| spearman | Spearman correlation | 
| kendalltau | Kendall Tau correlation |
| majority_class_accuracy | Accuracy obtained by always predicting "correct" |
| best_accuracy | Best accuracy achievable after threshold tuning |
| best_detection_precision | Precision at the threshold tuned for the best F1 score |
| best_detection_recall | Recall at the threshold tuned for the best F1 score |
| best_detection_f1 | Best F1 score achievable after threshold tuning |
| accuracy@90% | Accuracy achieved if we want to keep 90% of all correct sentences | 
| accuracy@70% | Accuracy achieved if we want to keep 70% of all correct sentences | 
| threshold_f1 | Threshold used to calculate best_detection_f1 | 
| threshold_@90% | Threshold used to calculate accuracy@90% | 
| threshold_@70% | Threshold used to calculate accuracy@70% | 
| f1_binary | F1 score of incorrect sentence detection | 
| f1_macro | Average F1 score between correct and incorrect sentence detection | 
| f1_micro | Calculate F1 globally by counting the total true positives, false negatives and false positives | 
| f1_weighted | Calculate F1 for each label, and find their average weighted by support | 
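
As an illustration of what a threshold-based metric such as `f1_binary` measures, the sketch below computes the F1 score for detecting incorrect sentences in plain Python. It is not the package's code: `f1_incorrect` is a hypothetical helper, and it assumes that a score below the threshold counts as a prediction of "incorrect". The scores and labels are placeholders.

```python
from typing import List


def f1_incorrect(scores: List[float], incorrect: List[int], threshold: float = 0.5) -> float:
    """F1 for detecting incorrect sentences, treating low scores as 'incorrect'."""
    if len(scores) != len(incorrect):
        raise ValueError("scores and incorrect must have the same length")

    # Assumption: a score below the threshold is a prediction of "incorrect" (1).
    predicted = [1 if s < threshold else 0 for s in scores]

    tp = sum(1 for p, y in zip(predicted, incorrect) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, incorrect) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, incorrect) if p == 0 and y == 1)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


scores = [0.91, 0.12, 0.77, 0.55, 0.08]   # placeholder SCALE scores
incorrect = [0, 1, 0, 1, 1]               # placeholder gold labels
f1 = f1_incorrect(scores, incorrect, threshold=0.5)
print(round(f1, 3))
```

Sweeping `threshold` over the observed scores and keeping the best value is one way to picture the "threshold tuning" behind the `best_*` metrics.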


### Retrieve
#### *Running Retrieval*
Import the `retrieve` function and load your premises and hypotheses.

**NOTE**: Premises are lists of lists in retrieval. Both premises and hypotheses are split down to the sentence or utterance level.

Each premise list has an associated hypothesis list, mapped one to one by index.
```python
from scale_score import retrieve

premise = [
    ['premise_0_utt_0', 'premise_0_utt_1', 'premise_0_utt_2'],
    ['premise_1_utt_0', 'premise_1_utt_1'],
]
hypothesis = [
    ['hypothesis_0_0', 'hypothesis_0_1'],
    ['hypothesis_1_0', 'hypothesis_1_1', 'hypothesis_1_2']
]

results = retrieve(premise, hypothesis)
```
The results are a list containing, for each hypothesis, the most relevant premise utterance or sentence and its corresponding score.
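
Since retrieval expects each premise as a list of utterances rather than a single string, a raw transcript has to be split first. Here is a minimal sketch, assuming one utterance per line; `split_utterances` is a hypothetical helper and the transcripts are invented examples, so a different splitting rule may suit your data better.

```python
from typing import List


def split_utterances(document: str) -> List[str]:
    """Split a transcript into utterances, one per non-empty line."""
    return [line.strip() for line in document.splitlines() if line.strip()]


# Invented example transcripts, one utterance per line.
documents = [
    "Agent: Hello.\nCustomer: Hi, my order is late.\n\nAgent: Let me check.",
    "Customer: Can I change my plan?\nAgent: Yes, of course.",
]

# List[List[str]]: one list of utterances per document.
premise = [split_utterances(doc) for doc in documents]
print(premise)
```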

You can also use a `SCALEScorer` object to avoid reloading the model on every call:
```python
from scale_score.scorer import SCALEScorer
scorer = SCALEScorer(size='small', device='cuda')
results = scorer.retrieve(premise, hypothesis)
```
#### *Arguments*
`retrieve` and `scorer.retrieve` take the same arguments, except that `scorer.retrieve` does not accept *size* or *device*, since those are set when the scorer object is built.
| Argument | Type | Default | Description |
| ------ | ------ | ------ | ------ |
| premise | List[List[str]] | required | premise text, the ground truth, split into sentences or utterances |
| hypothesis | List[List[str]] | required | hypothesis text, usually the text predicted by a model being evaluated |
| branches | int | 2 | The number of branches to have in the search tree |
| size | str | 'xl' | Size of Flan-T5 model, options are 'small', 'base', 'large', 'xl', 'xxl' |
| device | str | 'cuda' | torch device to send the model to. |
| model_path | str | None | Optional path to a Flan-T5 model to load. Note the corresponding size must be specified in the *size* argument. |
| model | T5ForConditionalGeneration | None | Optional model to use for scoring |
| tokenizer | T5Tokenizer | None | Optional tokenizer to use for scoring |


## ScreenEval
ScreenEval is stored as a JSON file in the `data` folder. The following keys are the most important when working with it.

| Key  | Type | Description |
| ------ | ------ | ------ |
| original_convo | List[str] | The source document that is to be summarized as a string |
| convo | List[List[str]] | The source document that is to be summarized split into a list of utterances |
| inferred_summary | List[str] | The summary sentence that is paired with the given source document |
| summary_id | List[str] | The source model for the summary sentence |
| convo_id | List[int] | The ID of the source document |
| annotated_summary | List[str] | The entire associated summary, with the focus summary sentence surrounded by `<mark><\mark>`|
| prediction_annotated_source_doc | List[str] | Raw source document |
| agreement | List[float] | Annotator agreement on summary sentence factual inconsistency label |
| agg_label | List[bool] | Factual inconsistency label (true -> factually consistent, false -> factually inconsistent) |
| rel_utt | List[List[int]] | The indices of related utterances in the corresponding `convo` list |
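
The keys above can be mapped onto the inputs of `score` and `evaluate_scale`. The sketch below is one way to do that, under the assumption (taken from the Type column) that each key holds a list with one entry per summary sentence. `to_scale_inputs` is a hypothetical helper, and `toy` is a tiny in-memory stand-in: with the real dataset you would first load the JSON file from the `data` folder with `json.load`.

```python
from typing import Dict, List, Tuple


def to_scale_inputs(data: Dict[str, list]) -> Tuple[List[str], List[List[str]], List[int]]:
    """Build (premise, hypothesis, incorrect) from ScreenEval-style column lists."""
    # One source document per summary sentence.
    premise = list(data["original_convo"])
    # score() expects a list of hypotheses per premise, so wrap each sentence.
    hypothesis = [[sentence] for sentence in data["inferred_summary"]]
    # agg_label is True for consistent sentences; evaluate_scale wants 1 = incorrect.
    incorrect = [0 if consistent else 1 for consistent in data["agg_label"]]
    return premise, hypothesis, incorrect


# Invented stand-in with the same key layout as the table above.
toy = {
    "original_convo": ["A: Hi.\nB: The meeting is on Monday.", "A: I sold the car."],
    "inferred_summary": ["The meeting is on Tuesday.", "A sold the car."],
    "agg_label": [False, True],
}

premise, hypothesis, incorrect = to_scale_inputs(toy)
print(hypothesis, incorrect)
```

With one hypothesis per premise, the flat score list returned by `score` lines up index by index with `incorrect`.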

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/asappresearch/scale-score",
    "name": "scale-score",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "Hallucination detection,Inconsistency detection,hallucination,automatic evaluation,metric,long document,efficient,fast,accurate,long,natural language generation,task agnostic,nlp,nlg",
    "author": "Barrett Lattimer, Patrick Chen, Xinyuan Zhang, Yi Yang",
    "author_email": "blattimer@asapp.com",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Implementation of SCALE metric and ScreenEval",
    "version": "0.1.0",
    "project_urls": {
        "Homepage": "https://github.com/asappresearch/scale-score"
    },
    "split_keywords": [
        "hallucination detection",
        "inconsistency detection",
        "hallucination",
        "automatic evaluation",
        "metric",
        "long document",
        "efficient",
        "fast",
        "accurate",
        "long",
        "natural language generation",
        "task agnostic",
        "nlp",
        "nlg"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1d31ca230281b67b7788ba774dbbfb88bd15676feb51da83f8111a2c7f3c4397",
                "md5": "88d95acb2749c8763e4e3cc5c08fe746",
                "sha256": "36fa60a539157b8f4acd97586c957b37bcfa4fd716177f9e2e5f2a6024d714eb"
            },
            "downloads": -1,
            "filename": "scale_score-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "88d95acb2749c8763e4e3cc5c08fe746",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 11331,
            "upload_time": "2023-11-29T20:01:25",
            "upload_time_iso_8601": "2023-11-29T20:01:25.536346Z",
            "url": "https://files.pythonhosted.org/packages/1d/31/ca230281b67b7788ba774dbbfb88bd15676feb51da83f8111a2c7f3c4397/scale_score-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-29 20:01:25",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "asappresearch",
    "github_project": "scale-score",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "scale-score"
}
