trust_eval

Name: trust_eval
Version: 0.1.5
Home page: https://github.com/shanghongsim/trust-eval
Summary: Metric to measure RAG responses with inline citations
Upload time: 2025-02-11 04:42:29
Author: Shang Hong Sim
Requires Python: >=3.10, <3.12
License: CC BY-NC 4.0
Keywords: rag, evaluation, metrics, citation

# Trust Eval

Welcome to **Trust Eval**! 🌟  

A comprehensive tool for evaluating the trustworthiness of inline-cited outputs generated by large language models (LLMs) within the Retrieval-Augmented Generation (RAG) framework. Our suite of metrics measures correctness, citation quality, and groundedness.

This is the official implementation of the metrics introduced in the paper *"Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse"* (accepted at ICLR '25).

## Installation 🛠️

### Prerequisites

- **OS:** Linux  
- **Python:** 3.10 – 3.11 (preferably 3.10.13); the package requires `>=3.10,<3.12`  
- **GPU:** Compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100)

### Steps

1. **Set up a Python environment**

   ```bash
   conda create -n trust_eval python=3.10.13
   conda activate trust_eval
   ```

2. **Install dependencies**

   ```bash
   pip install trust_eval
   ```

   > Note: vLLM will be installed with CUDA 12.1. Please ensure your CUDA setup is compatible.

3. **Set up NLTK**

   ```python
   import nltk
   nltk.download('punkt_tab')
   ```

## Quickstart 🔥

### Set up

Download `eval_data` from [Trust-Align Huggingface](https://huggingface.co/datasets/declare-lab/Trust-Score/tree/main/Trust-Score) and place it at the same level as the `prompts` folder. To use the default path configuration, do not rename the folders; if you do rename them, you will need to specify your own paths.

```text
quickstart/
├── eval_data/
├── prompts/
```
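
If you prefer to script the download, a minimal sketch using `huggingface_hub` is shown below. The `allow_patterns` filter and the resulting local layout are assumptions based on the repository link above; move or rename the downloaded files so they match the `eval_data/` layout shown.

```python
# Sketch: download the evaluation data from the Trust-Score dataset repo.
# The pattern filter and local layout are assumptions; adjust to match eval_data/ above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="declare-lab/Trust-Score",
    repo_type="dataset",
    allow_patterns=["Trust-Score/*"],  # only the evaluation data files
    local_dir="quickstart",
)
```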

### Quick look at the data

Here, we work with **ASQA**, where the questions are long-form factoid QA. Each sample has three fields: `question`, `answers`, and `docs`. Below is one example from the dataset:

```javascript
[ ...
    {   // The question asked.
        "question": "Who has the highest goals in world football?",

        // A list containing all correct (short) answers to the question, represented as arrays where each element contains variations of the answer. 
        "answers": [
            ["Daei", "Ali Daei"],                // Variations for Ali Daei
            ["Bican", "Josef Bican"],            // Variations for Josef Bican
            ["Sinclair", "Christine Sinclair"]   // Variations for Christine Sinclair
        ],

        // A list of 100 dictionaries where each dictionary contains one document.
        "docs": [
            {   
                // The title of the document being referenced.
                "title": "Argentina\u2013Brazil football rivalry",

                // A snippet of text from the document.
                "text": "\"Football Player of the Century\", ...",

                // A binary list where each element indicates if the respective answer was found in the document (1 for found, 0 for not found).
                "answers_found": [0,0,0],

                // A recall score calculated as the percentage of correct answers that the document entails.
                "rec_score": 0.0
            },
        ]
    },
...
]
    
```
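
For intuition, a document's `rec_score` is simply the share of gold answers it entails, so it can be derived from `answers_found`. A minimal illustration with hypothetical values (the all-zero example above does not reveal whether the score is stored as a fraction or a percentage):

```python
# Illustration only: how rec_score relates to answers_found for a single document.
answers_found = [1, 0, 0]  # hypothetical: the document entails 1 of the 3 gold answers
rec_score = sum(answers_found) / len(answers_found)
print(rec_score)  # 0.333..., or 33.33 if the dataset stores it as a percentage
```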

Please refer to the [datasets](./docs/docs/datasets.md) page for examples of ELI5 and QAMPARI samples.

### Configuring yaml files

For generator-related configuration, three fields are mandatory: `data_type`, `model`, and `max_length`. `data_type` selects the benchmark dataset to evaluate on, `model` specifies which model to evaluate, and `max_length` is the model's maximum context length. We use `Qwen2.5-3B-Instruct` in this tutorial, but you can replace it with the path to your own model checkpoint to evaluate your model.

```yaml
data_type: "asqa"
model: Qwen/Qwen2.5-3B-Instruct
max_length: 8192
```

For evaluation-related configuration, only `data_type` is mandatory.

```yaml
data_type: "asqa"
```

Your directory should now look like this:

```text
quickstart/
├── eval_data/
├── prompts/
├── generator_config.yaml
├── eval_config.yaml
```

### Running evals

Now define your main script, `example_usage.py`:

**Generating Responses**

```python
from config import EvaluationConfig, ResponseGeneratorConfig
from evaluator import Evaluator
from logging_config import logger
from response_generator import ResponseGenerator

# Configure the response generator
generator_config = ResponseGeneratorConfig.from_yaml(yaml_path="generator_config.yaml")

# Generate and save responses
generator = ResponseGenerator(generator_config)
generator.generate_responses()
generator.save_responses()
```

**Evaluating Responses**

```python
# Configure the evaluator
evaluation_config = EvaluationConfig.from_yaml(yaml_path="eval_config.yaml")

# Compute and save evaluation metrics
evaluator = Evaluator(evaluation_config)
evaluator.compute_metrics()
evaluator.save_results()
```

Your directory should look like this:

```text
quickstart/
├── eval_data/
├── prompts/
├── example_usage.py
├── generator_config.yaml
├── eval_config.yaml
```

```bash
CUDA_VISIBLE_DEVICES=0,1 python example_usage.py 
```

> Note: Specify the GPUs you wish to use via `CUDA_VISIBLE_DEVICES`. For reference, we are able to run models of up to 7B parameters on two A40s.

Sample output:

```javascript
{ // refusal response: "I apologize, but I couldn't find an answer..."
    
    // Basic statistics
    "num_samples": 948,
    "answered_ratio": 50.0, // Ratio of (# answered qns / total # qns)
    "answered_num": 5, // # of qns where response is not refusal response
    "answerable_num": 7, // # of qns that ground truth answerable, given the documents
    "overlapped_num": 5, // # of qns that are both answered and answerable
    "regular_length": 46.6, // Average length of all responses
    "answered_length": 28.0, // Average length of non-refusal responses

    // Refusal groundedness metrics

    // # qns where (model refused to respond & is ground-truth unanswerable) / # qns that are ground-truth unanswerable
    "reject_rec": 100.0,

    // # qns where (model refused to respond & is ground-truth unanswerable) / # qns where the model refused to respond
    "reject_prec": 60.0,

    // F1 of reject_rec and reject_prec
    "reject_f1": 75.0,

    // # qns where (model responded & is ground-truth answerable) / # qns that are ground-truth answerable
    "answerable_rec": 71.42857142857143,

    // # qns where (model responded & is ground-truth answerable) / # qns where the model responded
    "answerable_prec": 100.0,

    // F1 of answerable_rec and answerable_prec
    "answerable_f1": 83.33333333333333,

    // Avg of reject_rec and answerable_rec
    "macro_avg": 85.71428571428572,

    // Avg of reject_f1 and answerable_f1
    "macro_f1": 79.16666666666666,

    // Response correctness metrics

    // Regardless of response type (refusal or answered), check whether the ground-truth claim appears in the response.
    "regular_str_em": 41.666666666666664,

    // Only for qns with answered responses, check whether the ground-truth claim appears in the response.
    "answered_str_em": 66.66666666666666,

    // Calculate EM for all qns that are answered and answerable, avg by # of answered questions (EM_alpha)
    "calib_answered_str_em": 100.0,

    // Calculate EM for all qns that are answered and answerable, avg by # of answerable questions (EM_beta)
    "calib_answerable_str_em": 71.42857142857143,

    // F1 of calib_answered_str_em and calib_answerable_str_em
    "calib_str_em_f1": 83.33333333333333,

    // EM score of qns that are answered and ground truth unanswerable, indicating use of parametric knowledge
    "parametric_str_em": 0.0,

    // Citation quality metrics

    // (Avg across all qns) Does the set of citations support statement s_i? 
    "regular_citation_rec": 28.333333333333332,

    // (Avg across all qns) Any redundant citations? (1) Does citation c_i,j fully support statement s_i? (2) Is the set of citations without c_i,j insufficient to support statement s_i? 
    "regular_citation_prec": 35.0,

    // F1 of regular_citation_rec and regular_citation_prec
    "regular_citation_f1": 31.315789473684212,

    // Citation recall, averaged across answered qns only
    "answered_citation_rec": 50.0,

    // Citation precision, averaged across answered qns only
    "answered_citation_prec": 60.0,

    // F1 of answered_citation_rec and answered_citation_prec
    "answered_citation_f1": 54.54545454545455,

    // Avg of (macro_f1, calib_str_em_f1, answered_citation_f1)
    "trust_score": 72.34848484848486
}
```
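
As a sanity check on how the headline numbers compose: each F1 above is the harmonic mean of its precision/recall pair, and `trust_score` is the plain average of `macro_f1`, `calib_str_em_f1`, and `answered_citation_f1`. A quick cross-check in Python (values copied from the sample output; not library code):

```python
# Cross-check of the sample output above; the values are copied from it.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(60.0, 100.0))  # 75.0 -> reject_f1

macro_f1 = 79.16666666666666
calib_str_em_f1 = 83.33333333333333
answered_citation_f1 = 54.54545454545455
trust_score = (macro_f1 + calib_str_em_f1 + answered_citation_f1) / 3
print(trust_score)  # ~72.348, matching "trust_score" above up to floating-point rounding
```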

Please refer to [metrics](./docs/docs/metrics.md) for explanations of outputs when evaluating with ELI5 or QAMPARI.

### The end

Congratulations! You have reached the end of the quickstart tutorial and you are now ready to [benchmark your own RAG application](./docs/custom/) (running the evaluations with custom data instead of benchmark data) or [reproduce our experimental setup](./docs/experiments/)! 🥳

## Contact 📬

For questions or feedback, reach out to Shang Hong (`simshanghong@gmail.com`).

## Citation 📝

If you use this software in your research, please cite the [Trust-Eval](https://arxiv.org/abs/2409.11242) paper as follows.

```bibtex
@misc{song2024measuringenhancingtrustworthinessllms,
      title={Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse}, 
      author={Maojia Song and Shang Hong Sim and Rishabh Bhardwaj and Hai Leong Chieu and Navonil Majumder and Soujanya Poria},
      year={2024},
      eprint={2409.11242},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.11242}, 
}
```

            
