ai-rag-chat-evaluator

* Version: 0.2.0
* Summary: Tools for evaluation of RAG Chat Apps using Azure AI Evaluate SDK and OpenAI
* Home page: https://github.com/Azure-Samples/ai-rag-chat-evaluator
* Author: Pamela Fox
* Maintainer: Oleksis Fraga
* License: MIT
* Requires Python: >=3.10,<4.0
* Keywords: ai, rag, chat, azure, sdk, openai
* Upload time: 2024-03-17 15:40:57
# Evaluating a RAG Chat App

This repo contains scripts and tools for evaluating a chat app that uses the RAG architecture.
There are many parameters that affect the quality and style of answers generated by the chat app,
such as the system prompt, search parameters, and GPT model parameters.

Whenever you are making changes to a RAG chat with the goal of improving the answers, you should evaluate the results.
This repository offers tools to make it easier to run evaluations, plus examples of evaluations
that we've run on our [sample chat app](https://github.com/Azure-Samples/azure-search-openai-demo/).

[📺 Watch a video overview of this repo](https://www.youtube.com/watch?v=mM8pZAI2C5w)

Table of contents:

* [Setting up this project](#setting-up-this-project)
* [Deploying a GPT-4 model](#deploying-a-gpt-4-model)
* [Generating ground truth data](#generating-ground-truth-data)
* [Running an evaluation](#running-an-evaluation)
* [Viewing the results](#viewing-the-results)
* [Measuring app's ability to say "I don't know"](#measuring-apps-ability-to-say-i-dont-know)

## Setting up this project

If you open this project in a Dev Container or GitHub Codespaces, it will automatically set up the environment for you.
If not, then follow these steps:

1. Install Python 3.10 or higher.
2. Create a Python [virtual environment](https://learn.microsoft.com/azure/developer/python/get-started?tabs=cmd#configure-python-virtual-environment).
3. Inside that virtual environment, install the requirements:

    ```shell
    python -m pip install -r requirements.txt
    ```

## Deploying a GPT-4 model

It's best to use a GPT-4 model for performing the evaluation, even if your chat app uses GPT-3.5 or another model.
You can either use an Azure OpenAI instance or an openai.com instance.

### Using a new Azure OpenAI instance

To use a new Azure OpenAI instance, you'll need to provision the instance and deploy a GPT-4 model to it.
We've made that easy to do with the `azd` CLI tool.

1. Install the [Azure Developer CLI](https://aka.ms/azure-dev/install)
2. Run `azd auth login` to log in to your Azure account
3. Run `azd up` to deploy a new GPT-4 instance
4. Create a `.env` file based on the provisioned resources by running one of the following commands.

    Bash:

    ```shell
    azd env get-values > .env
    ```

    PowerShell:

    ```powershell
    $output = azd env get-values; Add-Content -Path .env -Value $output;
    ```

### Using an existing Azure OpenAI instance

If you already have an Azure OpenAI instance, you can use that instead of creating a new one.

1. Create `.env` file by copying `.env.sample`
2. Fill in the values for your instance:

    ```shell
    AZURE_OPENAI_EVAL_DEPLOYMENT="<deployment-name>"
    AZURE_OPENAI_SERVICE="<service-name>"
    ```
3. The scripts default to keyless access (via `DefaultAzureCredential`), but you can optionally use a key by setting `AZURE_OPENAI_KEY` in `.env`. A sketch of both options follows below.
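
For reference, here is a minimal sketch of how a script could construct the evaluation client either way, assuming the `openai` (v1+) and `azure-identity` packages; the endpoint format and API version are assumptions, and the repo's own code may differ:

```python
import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

endpoint = f"https://{os.environ['AZURE_OPENAI_SERVICE']}.openai.azure.com"
api_version = "2024-02-15-preview"  # an assumption; use whatever version your deployment supports

if os.environ.get("AZURE_OPENAI_KEY"):
    # Key-based access
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_key=os.environ["AZURE_OPENAI_KEY"],
        api_version=api_version,
    )
else:
    # Keyless access: tokens come from your logged-in Azure identity
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
    )
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        azure_ad_token_provider=token_provider,
        api_version=api_version,
    )
```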

### Using an openai.com instance

If you have an openai.com instance, you can use that instead of an Azure OpenAI instance.

1. Create `.env` file by copying `.env.sample`
2. Fill in the values for your OpenAI account. You might not have an organization, in which case you can leave that blank.

    ```shell
    OPENAICOM_KEY=""
    OPENAICOM_ORGANIZATION=""
    ```


## Generating ground truth data

In order to evaluate new answers, they must be compared to "ground truth" answers: the ideal answer for a particular question. See `example_input/qa.jsonl` for an example of the format.
We recommend at least 200 QA pairs if possible.
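
As a rough illustration of the JSONL format, each line is one JSON object pairing a question with its ideal, citation-bearing answer. The field names below (`question`, `truth`), the file path, and the sample content are assumptions for this sketch; confirm the exact schema against `example_input/qa.jsonl`:

```python
import json

# Hypothetical ground truth record: field names are an assumption here,
# so check example_input/qa.jsonl for the exact schema used by this repo.
record = {
    "question": "What does a product manager do?",
    "truth": "A product manager defines the product vision and roadmap [role_library.pdf].",
}

# Append one JSON object per line (JSONL) to a hypothetical ground truth file.
with open("my_ground_truth.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```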

There are a few ways to get this data:

1. Manually curate a set of questions and answers that you consider to be ideal. This is the most accurate, but also the most time-consuming. Make sure your answers include citations in the expected format. This approach requires domain expertise in the data.
2. Use the generator script to generate a set of questions and answers. This is the fastest, but may also be the least accurate. See below for details on how to run the generator script.
3. Use the generator script to generate a set of questions and answers, and then manually curate them, rewriting any answers that are subpar and adding missing citations. This is a good middle ground, and is what we recommend.

<details>
 <summary>Additional tips for ground truth data generation</summary>

* Generate more QA pairs than you need, then prune them down manually based on quality and overlap. Remove low quality answers, and remove questions that are too similar to other questions.
* Be aware of the knowledge distribution in the document set, so you effectively sample questions across the knowledge space.
* Once your chat application is live, continually sample live user questions (in accordance with your privacy policy) to make sure you're representing the sorts of questions that users are asking.
</details>

### Running the generator script

This repo includes a script for generating questions and answers from documents stored in Azure AI Search.

> [!IMPORTANT]
> The generator script can only generate English Q/A pairs right now, due to [limitations in the azure-ai-generative SDK](https://github.com/Azure/azure-sdk-for-python/issues/34099).

1. Create `.env` file by copying `.env.sample`
2. Fill in the values for your Azure AI Search instance:

    ```shell
    AZURE_SEARCH_SERVICE="<service-name>"
    AZURE_SEARCH_INDEX="<index-name>"
    AZURE_SEARCH_KEY=""
    ```

    The key may not be necessary if the search service is configured for keyless access from your account.
    If providing a key, it's best to provide a query key since the script only requires that level of access.

3. Run the generator script:

    ```shell
    python -m scripts generate --output=example_input/qa.jsonl --numquestions=200 --persource=5
    ```

    That script will generate 200 questions and answers, and store them in `example_input/qa.jsonl`. We've already provided an example based on the sample documents for this app.

    To further customize the generator beyond the `numquestions` and `persource` parameters, modify `scripts/generate.py`.


## Running an evaluation

We provide a script that loads in the current `azd` environment's variables, installs the requirements for the evaluation, and runs the evaluation against the local app. Run it like this:

```shell
python -m scripts evaluate --config=example_config.json
```

The config JSON should contain at least these fields:

```json
{
    "testdata_path": "example_input/qa.jsonl",
    "target_url": "http://localhost:50505/chat",
    "requested_metrics": ["groundedness", "relevance", "coherence", "latency", "answer_length"],
    "results_dir": "example_results/experiment<TIMESTAMP>"
}
```

### Running against a local container

If you're running this evaluator in a container and your app is running in a container on the same system, use a URL like this for the `target_url`:

"target_url": "http://host.docker.internal:50505/chat"

### Running against a deployed app

To run against a deployed endpoint, change the `target_url` to the chat endpoint of the deployed app:

"target_url": "https://app-backend-j25rgqsibtmlo.azurewebsites.net/chat"

### Running on a subset of questions

It's common to run the evaluation on a subset of the questions, to get a quick sense of how the changes are affecting the answers. To do this, use the `--numquestions` parameter:

```shell
python -m scripts evaluate --config=example_config.json --numquestions=2
```

### Specifying the evaluation metrics

The `evaluate` command will use the metrics specified in the `requested_metrics` field of the config JSON.
Some of those metrics are built into the evaluation SDK, and the rest are custom metrics that we've added.

#### Built-in metrics

These metrics are calculated by sending a call to the GPT model, asking it to provide a 1-5 rating, and storing that rating.

> [!IMPORTANT]
> The built-in metrics are only intended for use on evaluating English language answers, due to [limitations in the azure-ai-generative SDK](https://github.com/Azure/azure-sdk-for-python/issues/34099).

* [`gpt_coherence`](https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in#ai-assisted-coherence) measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.
* [`gpt_relevance`](https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in#ai-assisted-relevance) assesses the ability of answers to capture the key points of the context.
* [`gpt_groundedness`](https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in#ai-assisted-groundedness) assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context.

#### Custom metrics

##### Prompt metrics

The following metrics are implemented very similarly to the built-in metrics, but use a locally stored prompt. They're a great fit if you find that the built-in metrics are not working well for you or if you need to translate the prompt to another language.

* `coherence`: Measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language. Based on `scripts/evaluate_metrics/prompts/coherence.jinja2`.
* `relevance`: Assesses the ability of answers to capture the key points of the context. Based on `scripts/evaluate_metrics/prompts/relevance.jinja2`.
* `groundedness`: Assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context. Based on `scripts/evaluate_metrics/prompts/groundedness.jinja2`.
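
As an illustration of how a locally stored prompt can drive a GPT-rated metric, here is a hedged sketch that renders one of those Jinja2 templates and asks the evaluation model for a 1-5 rating. The template variables, placeholders, and API version are assumptions, not the repo's exact code:

```python
from jinja2 import Template
from openai import AzureOpenAI  # or OpenAI, depending on your .env

# Hypothetical QA pair; in the evaluator these come from the ground truth data and the app's answer.
question = "What does a product manager do?"
answer = "A product manager defines the product vision and roadmap [role_library.pdf]."

# Render the locally stored prompt. The variables passed here are assumptions:
# inspect the .jinja2 file to see which placeholders it actually expects.
with open("scripts/evaluate_metrics/prompts/coherence.jinja2", encoding="utf-8") as f:
    prompt = Template(f.read()).render(question=question, answer=answer)

# Client construction shortened with placeholders; see the keyless/key sketch earlier in this README.
client = AzureOpenAI(
    azure_endpoint="https://<service-name>.openai.azure.com",
    api_key="<key>",
    api_version="2024-02-15-preview",
)
response = client.chat.completions.create(
    model="<deployment-name>",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)  # expected to be a rating from 1 to 5
```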

##### Code metrics

These metrics are calculated with some local code based on the results of the chat app, and do not require a call to the GPT model.

* `latency`: The time it takes for the chat app to generate an answer, in seconds.
* `length`: The length of the generated answer, in characters.
* `answer_has_citation`: Whether the answer contains a correctly formatted citation to a source document, assuming citations are in square brackets.
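
For illustration, a minimal sketch of how a square-bracket citation check could be implemented (not necessarily the repo's exact logic):

```python
import re

def answer_has_citation(answer: str) -> bool:
    """Return True if the answer contains a citation like [somefile.pdf]."""
    # Assumes citations are source filenames in square brackets, e.g. [benefits.pdf#page=2]
    return re.search(r"\[[^\]]+\.\w+[^\]]*\]", answer) is not None

print(answer_has_citation("Standard health plans cover vision exams [benefits.pdf]."))  # True
print(answer_has_citation("I don't know."))  # False
```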

### Sending additional parameters to the app

This repo assumes that your chat app is following the [Chat App Protocol](https://github.com/Azure-Samples/ai-chat-app-protocol), which means that all POST requests look like this:

```json
{"messages": [{"content": "<Actual user question goes here>", "role": "user"}],
 "stream": False,
 "context": {...},
}
```
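
If you want to sanity-check that your app speaks this protocol before running a full evaluation, you could POST a request yourself. A hedged sketch using the `requests` package (not part of this repo's tooling; the question and overrides here are illustrative):

```python
import requests

# Hypothetical smoke test against a locally running chat app
payload = {
    "messages": [{"content": "What does a product manager do?", "role": "user"}],
    "stream": False,
    "context": {"overrides": {"semantic_ranker": False}},
}
response = requests.post("http://localhost:50505/chat", json=payload, timeout=60)
response.raise_for_status()
print(response.json())  # should contain the generated answer per the Chat App Protocol
```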

Any additional app parameters would be specified in the `context` of that JSON, such as temperature, search settings, prompt overrides, etc. To specify those parameters, add a `target_parameters` key to your config JSON. For example:

```json
    "target_parameters": {
        "overrides": {
            "semantic_ranker": false,
            "prompt_template": "<READFILE>example_input/prompt_refined.txt"
        }
    }
```

The `overrides` key is the same as the `overrides` key in the `context` of the POST request.
As a convenience, you can use the `<READFILE>` prefix to read in a file and use its contents as the value for the parameter.
That way, you can store potential (long) prompts separately from the config JSON file.
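
For illustration, a minimal sketch of how such a prefix substitution might work (illustrative only, not the repo's implementation):

```python
READFILE_PREFIX = "<READFILE>"

def resolve_parameter(value):
    """If a config value starts with <READFILE>, replace it with that file's contents."""
    if isinstance(value, str) and value.startswith(READFILE_PREFIX):
        path = value[len(READFILE_PREFIX):]
        with open(path, encoding="utf-8") as f:
            return f.read()
    return value

# Example: a prompt override stored in a separate file
prompt_template = resolve_parameter("<READFILE>example_input/prompt_refined.txt")
```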

## Viewing the results

The results of each evaluation are stored in a results folder (defaulting to `example_results`).
Inside each run's folder, you'll find:

- `eval_results.jsonl`: Each question and answer, along with the GPT metrics for each QA pair.
- `parameters.json`: The parameters used for the run, like the overrides.
- `summary.json`: The overall results, like the average GPT metrics.
- `config.json`: The original config used for the run. This is useful for reproducing the run.
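
If you'd like to post-process the results yourself, here is a hedged sketch that averages one metric from `eval_results.jsonl`; the run folder name and the exact field names are assumptions, so inspect the file for the actual keys:

```python
import json
from pathlib import Path

results_file = Path("example_results/experiment1/eval_results.jsonl")  # hypothetical run folder

ratings = []
with results_file.open(encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        # Field name is an assumption; inspect eval_results.jsonl for the actual keys
        if "gpt_groundedness" in row:
            ratings.append(float(row["gpt_groundedness"]))

if ratings:
    print(f"Mean groundedness over {len(ratings)} answers: {sum(ratings) / len(ratings):.2f}")
```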

To make it easier to view and compare results across runs, we've built a few tools,
located inside the `review_tools` folder.


### Using the summary tool

To view a summary across all the runs, use the `summary` command with the path to the results folder:

```bash
python -m review_tools summary example_results
```

This will display an interactive table with the results for each run, like this:

![Screenshot of CLI tool with table of results](docs/screenshot_summary.png)

To see the parameters used for a particular run, select the folder name.
A modal will appear with the parameters, including any prompt override.

### Using the compare tool

To compare the answers generated for each question across two runs, use the `diff` command with two paths:

```bash
python -m review_tools diff example_results/baseline_1 example_results/baseline_2
```

This will display each question, one at a time, with the two generated answers in scrollable panes,
and the GPT metrics below each answer.

![Screenshot of CLI tool for comparing a question with 2 answers](docs/screenshot_compare.png)

Use the buttons at the bottom to navigate to the next question or quit the tool.

## Measuring app's ability to say "I don't know"

The evaluation flow described above focuses on evaluating a model's answers for a set of questions that *could* be answered by the data. But what about all the questions that *can't* be answered by the data? Does your model know how to say "I don't know"? GPT models are trained to be helpful, so their tendency is to always give some sort of answer, especially for questions whose answers appeared in their training data. If you want to ensure your app can say "I don't know" when it should, you need to evaluate it on a different set of questions with a different metric.

### Generating ground truth data for answer-less questions

For this evaluation, our ground truth data needs to be a set of questions that should provoke an "I don't know" response from the app, because the data cannot answer them. There are several categories of such questions:

* **Unknowable**: Questions that are related to the sources but not actually in them (and not public knowledge).
* **Uncitable**: Questions whose answers are well known to the LLM from its training data, but are not in the sources. There are two flavors of these:
    * **Related**: Similar topics to the sources, so the LLM will be particularly tempted to think the sources know.
    * **Unrelated**: Completely unrelated to the sources, so the LLM shouldn't be as tempted to think the sources know.
* **Nonsensical**: Questions that are non-questions, that a human would scratch their head at and ask for clarification.

You can write these questions manually, but it’s also possible to generate them using a generator script in this repo,
assuming you already have ground truth data with answerable questions.

```shell
python -m scripts generate_dontknows --input=example_input/qa.jsonl --output=example_input/qa_dontknows.jsonl --numquestions=45
```

That script sends the current questions to the configured GPT-4 model along with prompts to generate questions of each kind.

When it's done, you should review and curate the resulting ground truth data. Pay special attention to the "unknowable" questions at the top of the file: you may decide that some of those are actually knowable, and you may want to reword or rewrite them entirely.

### Running an evaluation for answer-less questions

This repo contains a custom GPT metric called "dontknowness" that rates answers from 1-5, where 1 means the model answered the question completely, expressing no uncertainty, and 5 means it said it didn't know and attempted no answer. The goal is for all answers to be rated 4 or 5.

Here's an example configuration JSON that requests that metric, referencing the new ground truth data and a new output folder:

```json
{
    "testdata_path": "example_input/qa_dontknows.jsonl",
    "results_dir": "example_results_dontknows/baseline",
    "requested_metrics": ["dontknowness", "answer_length", "latency", "has_citation"],
    "target_url": "http://localhost:50505/chat",
    "target_parameters": {
    }
}
```

We recommend a separate output folder, as you'll likely want to make multiple runs and easily compare between those runs using the [review tools](#viewing-the-results).

Run the evaluation like this:

```shell
python -m scripts evaluate --config=example_config_dontknows.json
```

The results will be stored in the `results_dir` folder, and can be reviewed using the [review tools](#viewing-the-results).

### Improving the app's ability to say "I don't know"

If the app is not saying "I don't know" enough, you can use the `diff` tool to compare the answers for the "dontknows" questions across runs, and see if the answers are improving. Changes you can try:

* Adjust the prompt to encourage the model to say "I don't know" more often. Remove anything in the prompt that might be distracting or overly encouraging it to answer.
* Try using GPT-4 instead of GPT-3.5. The results will be slower (see the latency column) but it may be more likely to say "I don't know" when it should.
* Adjust the temperature of the model used by your app.
* Add an additional LLM step in your app after generating the answer, to have the LLM rate its own confidence that the answer is found in the sources. If the confidence is low, the app should say "I don't know".
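
A hedged sketch of that last idea, assuming an OpenAI-style client; the prompt wording and the confidence threshold are illustrative choices, not a prescribed implementation:

```python
CONFIDENCE_PROMPT = (
    "On a scale of 1 to 5, how confident are you that the answer below is fully "
    "supported by the provided sources? Reply with a single number.\n\n"
    "Sources:\n{sources}\n\nAnswer:\n{answer}"
)

def guarded_answer(client, deployment: str, answer: str, sources: str) -> str:
    """Ask the model to rate its own grounding; fall back to "I don't know" if confidence is low."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": CONFIDENCE_PROMPT.format(sources=sources, answer=answer)}],
        temperature=0,
    )
    text = (response.choices[0].message.content or "").strip()
    confidence = int(text[0]) if text[:1].isdigit() else 1  # default to low confidence if unparseable
    return answer if confidence >= 4 else "I don't know."
```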

            
