# NoMIRACL: A Multilingual Hallucination Evaluation Dataset for Robust RAGs
<p align="center">
<a href="https://github.com/project-miracl/nomiracl">
<img alt="Stars" src="https://img.shields.io/github/stars/project-miracl/nomiracl.svg?style=flat&logo=github&colorB=blue&label=stars">
</a>
<a href="https://www.python.org/">
<img alt="Build" src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple">
</a>
<a href="https://github.com/project-miracl/nomiracl/blob/main/LICENSE">
<img alt="License" src="https://img.shields.io/github/license/project-miracl/nomiracl.svg?style=flat&colorB=green">
</a>
<a href="https://arxiv.org/abs/2312.11361">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2312.11361-b31b1b.svg">
</a>
</p>
<h4 align="center">
<a href="./"><img style="float: middle;" width="800" height="570" src="./images/nomiracl-teaser.png" /></a>
<footer><br clear="all"/>The image has been generated using miramuseai.net and Adobe Photoshop.</footer>
</h4>
NoMIRACL is a multilingual hallucination evaluation dataset covering 18 diverse languages. It includes both a non-relevant and a relevant subset. The non-relevant subset contains queries whose passages were all manually judged as non-relevant, while the relevant subset contains queries with at least one passage judged relevant. LLM robustness is measured using two key metrics: `hallucination rate` and `error rate`.
**This repository provides easy code to implement and evaluate diverse LLM baselines using our prompt template on NoMIRACL.**
For more information, check out our publication:
- [NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation](https://arxiv.org/abs/2312.11361) (Thakur et al., ArXiv 2023)
## :wrench: Installation
You can install NoMIRACL via pip:
```bash
pip install nomiracl
```
If you want to build from source, use:
```bash
$ git clone https://github.com/project-miracl/nomiracl.git
$ cd nomiracl
$ pip install -e .
```
## :star: Getting Started
#### 1. Load NoMIRACL Dataset
- The sample below draws 50\% relevant and 50\% non-relevant examples, each capped at a maximum of 250.
- Full example available in [sample_load_no_miracl.py](./examples/sample_load_no_miracl.py).
```python
from nomiracl.dataset import NoMIRACLDataLoader

data_loader = NoMIRACLDataLoader(
    language="english",
    split="test",  # or 'dev'
    hf_dataset_name="miracl/nomiracl",
    load_from_huggingface=True,
)
# Sample relevant and non-relevant queries in equal proportion.
corpus, queries, qrels = data_loader.load_data_sample(
    relevant_ratio=0.5, non_relevant_ratio=0.5, max_sample_pool=250)
```
#### 2. LLM prompt generation
- Full example available in [sample_model_generation.py](./examples/sample_model_generation.py).
```python
from nomiracl.generation.utils import load_model
model_name = "zephyr-7b-beta"
weights_path = f"HuggingFaceH4/{model_name}"
model = load_model(model_name, weights_path=weights_path, cache_dir=None)
# Sample prompts
prompts = [
    "What is the capital of France?",
    "What is the capital of Germany?",
    "What is the capital of Italy?",
]
# Generate one response per prompt.
model_results = model.batch_call(prompts, batch_size=1)
for prompt, result in zip(prompts, model_results):
    print("Prompt: {}".format(prompt))
    print("{} result: {}".format(model_name, result))
```
#### 3. Loading Vanilla prompt template
- Full example available in [sample_vanilla_prompt_exploration.py](./examples/sample_vanilla_prompt_exploration.py).
```python
from nomiracl.prompts.utils import load_prompt_template
prompt_cls = load_prompt_template("vanilla", count = 10)
query = "Which is the best programming language?"
passages = [
    "Python is the best programming language.",
    "Javascript is the best programming language.",
    "Go is the best programming language.",
    "Java is the best programming language.",
    "C# is the best programming language.",
    "Ruby is the best programming language.",
    "R is the best programming language.",
    "C++ is the best programming language.",
    "C is the best programming language.",
    "Rust is the best programming language.",
]
prompt = prompt_cls(query=query, passages=passages)
print(prompt)
```
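To make the shape of such a prompt concrete, here is a minimal, self-contained sketch that numbers each passage and ends with an `OUTPUT:` cue. It is only an illustration: the function name `build_prompt_sketch`, the `CONTEXTS:` label, and the exact layout are assumptions pieced together from the example datapoints shown in this README, not the template shipped in `nomiracl.prompts`.

```python
from typing import List

# Hypothetical stand-in for illustration; not the template shipped with nomiracl.
# The opening sentence is copied from the example datapoints in this README.
INSTRUCTION = (
    "I will give you a question and several contexts containing "
    "information about the question."
)


def build_prompt_sketch(query: str, passages: List[str]) -> str:
    """Number each passage as [1], [2], ... and wrap it with the query."""
    contexts = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        f"{INSTRUCTION}\n\n"
        f"QUESTION:\n{query}\n\n"
        f"CONTEXTS:\n{contexts}\n\n"  # 'CONTEXTS:' label is an assumption
        "OUTPUT:\n"
    )


example = build_prompt_sketch(
    "Which is the best programming language?",
    [
        "Python is the best programming language.",
        "Rust is the best programming language.",
    ],
)
print(example)
```

Numbering the passages lets a model point at a specific context in its reply, as in the `llama-2-13b-chat` output `Yes, answer is present in [6]` shown in the results examples.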
## :hugs: NoMIRACL Dataset
The NoMIRACL dataset is available on Hugging Face under: `miracl/nomiracl`.
Languages covered: Arabic (ar), Bengali (bn), German (de), English (en), Spanish (es), Persian (fa), Finnish (fi), French (fr), Hindi (hi), Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), Swahili (sw), Telugu (te), Thai (th), Yoruba (yo), Chinese (zh).
Hugging Face page: [https://huggingface.co/datasets/miracl/nomiracl](https://huggingface.co/datasets/miracl/nomiracl)
```python
import datasets
language = 'german' # or any of the 18 languages
subset = 'relevant' # or 'non_relevant'
split = 'test' # or 'dev' for development split
# four combinations available: 'dev.relevant', 'dev.non_relevant', 'test.relevant' and 'test.non_relevant'
nomiracl = datasets.load_dataset('miracl/nomiracl', language, split=f'{split}.{subset}')
```
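The result tables in this README use two-letter language codes, while the loading examples use full names such as `'german'`. The sketch below shows one way to bridge the two and to validate the split string. The helper names `config_for` and `split_name` are hypothetical, and the mapping assumes every dataset configuration is the lowercase English language name; only `'german'` and `'english'` are confirmed by the examples here.

```python
# Two-letter codes from the result tables mapped to lowercase language names.
# Assumption: dataset configs follow this naming; only 'german' and 'english'
# are confirmed by the examples in this README.
CODE_TO_LANGUAGE = {
    "ar": "arabic", "bn": "bengali", "de": "german", "en": "english",
    "es": "spanish", "fa": "persian", "fi": "finnish", "fr": "french",
    "hi": "hindi", "id": "indonesian", "ja": "japanese", "ko": "korean",
    "ru": "russian", "sw": "swahili", "te": "telugu", "th": "thai",
    "yo": "yoruba", "zh": "chinese",
}

VALID_SPLITS = {"dev", "test"}
VALID_SUBSETS = {"relevant", "non_relevant"}


def config_for(code: str) -> str:
    """Return the (assumed) dataset config name for a two-letter code."""
    try:
        return CODE_TO_LANGUAGE[code]
    except KeyError:
        raise ValueError(f"Unknown language code: {code!r}") from None


def split_name(split: str, subset: str) -> str:
    """Build a split string such as 'test.non_relevant'."""
    if split not in VALID_SPLITS or subset not in VALID_SUBSETS:
        raise ValueError(f"Unsupported combination: {split!r}, {subset!r}")
    return f"{split}.{subset}"


print(config_for("de"), split_name("test", "relevant"))
```

The two return values can then be passed as the second positional argument and the `split` argument of `datasets.load_dataset`, as in the snippet above.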
### Model identifiers for evaluation in NoMIRACL
| Acronym | Model Name | Model Link |
| :-----: | :--------: | :--------: |
| GPT-4 | `gpt-4-azure`| [AzureAI](https://learn.microsoft.com/en-us/azure/ai-services/openai/) |
| GPT-3.5 | `gpt-3.5-azure`| [AzureAI](https://learn.microsoft.com/en-us/azure/ai-services/openai/) |
| Mixtral-8x7B | `Mixtral-8x7B-Instruct-v0.1`| :hugs: [model](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) |
| Mistral-7B | `Mistral-7B-Instruct-v0.2`| :hugs: [model](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) |
| Orca-2-13B | `Orca-2-13b`| :hugs: [model](https://huggingface.co/microsoft/Orca-2-13b) |
| Orca-2-7B | `Orca-2-7b`| :hugs: [model](https://huggingface.co/microsoft/Orca-2-7b) |
| Aya-101 | `aya-101`| :hugs: [model](https://huggingface.co/CohereForAI/aya-101) |
| LLAMA-2-70B | `llama-2-70b-chat`| :hugs: [model](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) |
| LLAMA-2-13B |`llama-2-13b-chat`| :hugs: [model](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) |
| LLAMA-2-7B | `llama-2-7b-chat`| :hugs: [model](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) |
| Flan-T5-XXL |`flan-t5-xxl`| :hugs: [model](https://huggingface.co/google/flan-t5-xxl) |
### Baseline Accuracy on NoMIRACL non-relevant subset (test split, maximum cap of 250 per language)
Baseline results (250 queries) are available within the repository under `./results/baselines/non_relevant`.
An example datapoint from `./results/baselines/non_relevant/en.test.vanilla_prompt.jsonl`:
```
{
"query_id": "842558#0",
"docids": ["2842207#5", "7004944#45", "3310762#14", "47220460#1", "36451733#7", "3310762#20", "4724576#4", "22373402#0", "52203230#0", "23126218#4"],
"prompt": "I will give you a question and several contexts containing information about the question. [ ... ] \n\nOUTPUT:\n",
"template": "vanilla",
"results": {"gpt-4-azure": "Yes, answer is present.",
"llama-2-13b-chat": "\nYes, answer is present in [6].\n\nNo answers found in the other contexts.",
[...]
"aya-101": "Wales"}
}
```
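Records like the one above hold free-form model outputs, so they have to be mapped to a binary decision before any metric can be computed. The following is a rough sketch of that step, assuming a simple prefix match on the two expected responses; the function name `classify_output` and the label strings are hypothetical, and the repository's actual parsing rules may differ. Only the three model outputs visible in the example are used.

```python
import json

# Expected response prefixes, lowercased (taken from the example outputs above).
YES_PREFIX = "yes, answer is present"
NO_PREFIX = "i don't know"


def classify_output(raw: str) -> str:
    """Map a raw model output to a coarse label (hypothetical scheme)."""
    text = raw.strip().lower()
    if text.startswith(YES_PREFIX):
        return "answer_present"
    if text.startswith(NO_PREFIX):
        return "dont_know"
    return "other"  # e.g. a direct answer that ignores the expected format


# One JSONL line, rebuilt from the fields shown in the example above.
record_line = json.dumps({
    "query_id": "842558#0",
    "template": "vanilla",
    "results": {
        "gpt-4-azure": "Yes, answer is present.",
        "llama-2-13b-chat": "\nYes, answer is present in [6].\n\n"
                            "No answers found in the other contexts.",
        "aya-101": "Wales",
    },
})

record = json.loads(record_line)
labels = {model: classify_output(out) for model, out in record["results"].items()}
print(labels)
```

Because this record comes from the non-relevant subset, an `answer_present` label would count as a false positive under the metrics defined later in this README. How outputs labeled `other` are scored is not specified here and is left open in this sketch.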
| Code | Set | \#Q | GPT-4 | GPT-3.5 | Mistral-7B | Mixtral-8x7B | LLAMA-2-70B | LLAMA-2-13B | LLAMA-2-7B | flan-t5-xxl | Orca-2-7B | Orca-2-13B | aya-101 |
|------|----- |------|-------|----------|---------------|--------|---------|-------|----------|-------|---------|---------|---------|
| ar | Test | 250 | 61.60% | 46.40% | 87.20% | 89.20% | 0.0% | 15.60% | 0.0% | 0.80% | 3.60% | 10.40% | 23.20% |
| bn | Test | 250 | 60.00% | 4.80% | 83.20% | 90.00% | 0.0% | 2.40% | 0.4% | 0.00% | 5.60% | 3.20% | 10.00% |
| de | Test | 217 | 63.59% | 53.00% | 87.56% | 68.20% | 0.5% | 5.07% | 0.9% | 3.23% | 5.07% | 12.90% | 29.03% |
| en | Test | 250 | 57.20% | 54.80% | 90.00% | 72.40% | 0.0% | 0.80% | 2.8% | 16.40% | 12.00% | 6.80% | 15.60% |
| es | Test | 250 | 87.20% | 64.80% | 92.00% | 90.80% | 0.8% | 0.40% | 11.2% | 10.80% | 14.40% | 10.40% | 3.20% |
| fa | Test | 250 | 57.20% | 23.60% | 82.40% | 90.40% | 0.0% | 4.80% | 0.0% | 0.40% | 0.40% | 14.00% | 14.40% |
| fr | Test | 250 | 52.40% | 44.00% | 82.40% | 58.40% | 0.0% | 0.00% | 0.4% | 2.40% | 6.00% | 9.20% | 22.00% |
| fi | Test | 124 | 60.48% | 65.32% | 87.90% | 89.52% | 0.0% | 4.84% | 0.0% | 0.00% | 2.42% | 27.42% | 33.06% |
| hi | Test | 250 | 78.80% | 29.60% | 91.60% | 95.60% | 0.0% | 3.20% | 0.8% | 0.00% | 0.40% | 9.20% | 17.60% |
| id | Test | 250 | 63.20% | 56.80% | 89.20% | 83.20% | 0.4% | 4.80% | 1.6% | 6.80% | 2.80% | 14.40% | 19.60% |
| ja | Test | 250 | 56.80% | 32.40% | 89.20% | 82.80% | 0.0% | 4.00% | 0.0% | 0.80% | 7.60% | 24.00% | 10.40% |
| ko | Test | 250 | 59.60% | 40.00% | 88.40% | 90.00% | 0.0% | 0.80% | 1.2% | 0.00% | 3.60% | 10.80% | 14.40% |
| ru | Test | 250 | 58.00% | 34.80% | 90.00% | 78.40% | 0.8% | 4.00% | 0.4% | 1.60% | 11.20% | 9.20% | 31.60% |
| sw | Test | 250 | 91.20% | 66.40% | 95.20% | 88.00% | 0.0% | 0.80% | 0.4% | 7.60% | 4.40% | 14.00% | 27.60% |
| te | Test | 250 | 74.80% | 6.80% | 81.20% | 84.80% | 0.0% | 0.40% | 0.0% | 1.60% | 6.80% | 8.00% | 24.00% |
| th | Test | 250 | 46.80% | 4.00% | 90.40% | 67.20% | 0.0% | 16.40% | 0.0% | 0.80% | 4.40% | 5.60% | 11.20% |
| yo | Test | 250 | 75.20% | 74.80% | 95.20% | 89.20% | 0.0% | 1.20% | 0.4% | 12.80% | 13.60% | 14.80% | 20.00% |
| zh | Test | 250 | 56.40% | 43.60% | 86.80% | 78.40% | 0.0% | 6.00% | 0.0% | 5.20% | 4.40% | 10.80% | 9.20% |
| Avg. | Test | - | **64.5%** | **41.44%** | **88.33%** | **82.6%** | **0.1%** | **4.2%** | **1.14%** | **3.96%** | **6.04%** | **11.95%** | **18.67%** |
### Baseline Accuracy on NoMIRACL relevant subset (test split, maximum cap of 250 per language)
Baseline results (250 queries) are available within the repository under `./results/baselines/relevant`.
An example datapoint from `./results/baselines/relevant/en.test.vanilla_prompt.jsonl`:
```
{
"query_id": "8706103#0",
"docids": ["42057469#2", "4998067#1", "29247933#0", "162619#81", "422315#13", "26790310#4", "41298602#18", "22816#16", "123427#61", "23576525#0"],
"prompt": "I will give you a question and several contexts containing information about the question. [ ... ] \n\nQUESTION:\nWhat is the course that will be discontinued as defined by the National Education Policy? [ ... ] \n\nOUTPUT:\n",
"template": "vanilla",
"results": {"gpt-4-azure": "I don't know.",
"llama-2-13b-chat": "Please answer the question based on the given contexts.",
[...]
"aya-101": "I don't know"}
}
```
| Code | Set | \#Q | GPT-4 | GPT-3.5 | Mistral-7B-v0.2 | Mixtral-8x7B | LLAMA-2-70B | LLAMA-2-13B | LLAMA-2-7B | flan-t5-xxl | Orca-2-7B | Orca-2-13B | aya-101 |
|------|----- |------|-------|----------|---------------|--------|---------|-------|----------|-------|---------|---------|---------|
| ar | Test | 250 | 88.40% | 91.20% | 32.80% | 59.20% | 96.40% | 62.40% | 62.0% | 100.0% | 82.80% | 51.60% | 16.40% |
| bn | Test | 250 | 82.80% | 64.80% | 26.40% | 43.20% | 97.60% | 42.40% | 71.6% | 100.0% | 40.00% | 80.00% | 20.00% |
| de | Test | 217 | 88.40% | 93.60% | 26.80% | 74.40% | 88.80% | 52.80% | 56.4% | 99.2% | 86.00% | 92.00% | 25.60% |
| en | Test | 250 | 94.80% | 91.20% | 35.20% | 78.80% | 85.20% | 46.40% | 68.0% | 98.8% | 81.60% | 98.00% | 51.60% |
| es | Test | 250 | 77.60% | 90.00% | 26.80% | 67.20% | 77.20% | 34.40% | 48.8% | 99.2% | 64.00% | 95.60% | 78.00% |
| fa | Test | 250 | 86.40% | 95.60% | 30.80% | 46.80% | 99.20% | 85.60% | 72.0% | 100.0% | 84.40% | 61.60% | 24.40% |
| fr | Test | 250 | 88.40% | 90.40% | 37.60% | 81.60% | 92.00% | 54.80% | 55.6% | 99.6% | 81.60% | 98.00% | 57.20% |
| fi | Test | 124 | 84.00% | 87.60% | 24.80% | 65.20% | 98.40% | 6.00% | 93.2% | 100.0% | 79.20% | 75.20% | 23.60% |
| hi | Test | 250 | 78.80% | 92.80% | 28.00% | 38.40% | 94.80% | 66.00% | 74.8% | 100.0% | 57.60% | 57.60% | 25.20% |
| id | Test | 250 | 66.00% | 74.00% | 11.20% | 48.00% | 83.60% | 32.00% | 71.2% | 100.0% | 73.20% | 79.20% | 60.71% |
| ja | Test | 250 | 95.60% | 97.20% | 31.20% | 69.20% | 98.40% | 52.40% | 69.2% | 100.0% | 62.80% | 64.80% | 45.60% |
| ko | Test | 250 | 87.20% | 92.40% | 22.40% | 56.00% | 99.60% | 90.00% | 85.6% | 100.0% | 77.60% | 80.00% | 60.00% |
| ru | Test | 250 | 93.60% | 94.40% | 32.40% | 77.20% | 83.20% | 61.60% | 96.4% | 100.0% | 79.60% | 85.20% | 22.80% |
| sw | Test | 250 | 78.80% | 90.40% | 8.40% | 51.60% | 90.00% | 62.80% | 49.2% | 100.0% | 89.20% | 92.80% | 38.40% |
| te | Test | 250 | 58.00% | 45.60% | 14.80% | 33.20% | 99.60% | 74.40% | 97.2% | 100.0% | 44.80% | 73.60% | 10.80% |
| th | Test | 250 | 95.60% | 96.40% | 23.60% | 72.80% | 98.80% | 59.20% | 91.2% | 100.0% | 78.00% | 76.00% | 20.80% |
| yo | Test | 250 | 85.78% | 64.22% | 8.33% | 34.31% | 91.18% | 62.25% | 62.3% | 99.5% | 79.41% | 82.84% | 18.14% |
| zh | Test | 250 | 95.60% | 95.60% | 30.00% | 74.00% | 97.20% | 63.20% | 83.6% | 100.0% | 81.20% | 90.40% | 75.20% |
| Avg. | Test | - | **84.77%** | **85.97%** | **25.09%** | **59.51%** | **92.84%** | **56.04%** | **72.68%** | **99.8%** | **73.50%** | **79.69%** | **37.47%** |
## NoMIRACL Dataset Construction
<img src="./images/NoMIRACL-Flowchart.drawio.png" width="1013" height="179" />
Retrieval Augmented Generation (RAG) is a powerful approach to incorporate external knowledge into large language models (LLMs) to enhance the accuracy and faithfulness of generated responses. However, evaluating LLM robustness in RAG across different language families has been a challenge, leading to gaps in understanding the model's performance against errors in external retrieved knowledge. To address this, we present NoMIRACL, a human-annotated dataset designed for evaluating LLM robustness in RAG across 18 typologically diverse languages.
NoMIRACL is a multilingual dataset designed to evaluate LLM robustness against errors in first-stage retrieval. The dataset covers 18 typologically diverse languages and includes two subsets: non-relevant and relevant.
### Non-Relevant Subset (F)
- Queries with no known answers.
- All top-k passages manually judged as non-relevant (relevancy score = 0).
### Relevant Subset (T)
- Queries with known answers.
- At least one of the top-k passages manually judged as relevant (relevancy score = 1).
## Evaluation Metrics
<img src="./images/NoMIRACL-confusion-matrix.png" width="411" height="193"/>
We conduct a robustness evaluation using a binary classification task, comparing LLM predictions against the ground truth provided in NoMIRACL. The metrics used are hallucination rate and error rate.
- **Hallucination Rate:** `FP/(FP + TN)` Measures the model's tendency to hallucinate an answer when no answer is present in the non-relevant subset.
- **Error Rate:** `FN/(FN + TP)` Measures the model's inaccuracy in recognizing relevant passages in the relevant subset.
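The two formulas above can be sketched in a few lines of Python. This reading treats "answer is present" as the positive class, so FP and TN are counted on the non-relevant subset and FN and TP on the relevant subset. The function names and the counts in the example are illustrative assumptions, not results from the paper.

```python
def hallucination_rate(fp: int, tn: int) -> float:
    """FP / (FP + TN), computed over the non-relevant subset.

    fp: model claimed an answer is present although no passage is relevant.
    tn: model correctly responded that it does not know.
    """
    total = fp + tn
    if total == 0:
        raise ValueError("non-relevant subset is empty")
    return fp / total


def error_rate(fn: int, tp: int) -> float:
    """FN / (FN + TP), computed over the relevant subset.

    fn: model said it does not know although a relevant passage exists.
    tp: model correctly recognized that an answer is present.
    """
    total = fn + tp
    if total == 0:
        raise ValueError("relevant subset is empty")
    return fn / total


# Illustrative counts for 250 queries per subset (made up for this example).
h_rate = hallucination_rate(fp=50, tn=200)
e_rate = error_rate(fn=25, tp=225)
print(f"hallucination rate: {h_rate:.1%}, error rate: {e_rate:.1%}")
```

Under this definition, each rate is the complement of the fraction of correct responses on its subset, so a lower value is better for both.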
## :handshake: Collaboration and Acknowledgements
The NoMIRACL dataset has been made possible due to a collaborative effort of the following universities and organizations:
- University of Waterloo
- Huawei Noah's Ark Lab
Parts of the NoMIRACL code structure have been inspired by:
- [https://github.com/McGill-NLP/instruct-qa](https://github.com/McGill-NLP/instruct-qa)
## :scroll: Citations
If you use NoMIRACL or parts of it in a research paper, please cite our work as follows:
```
@article{thakur:2024,
author = {Nandan Thakur and
Luiz Bonifacio and
Xinyu Zhang and
Odunayo Ogundepo and
Ehsan Kamalloo and
David Alfonso{-}Hermelo and
Xiaoguang Li and
Qun Liu and
Boxing Chen and
Mehdi Rezagholizadeh and
Jimmy Lin},
title = {NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented
Generation},
journal = {CoRR},
volume = {abs/2312.11361},
year = {2023},
url = {https://doi.org/10.48550/arXiv.2312.11361},
doi = {10.48550/ARXIV.2312.11361},
eprinttype = {arXiv},
eprint = {2312.11361},
timestamp = {Tue, 16 Jan 2024 11:57:42 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2312-11361.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
---
Contact person: Nandan Thakur, [nandan.thakur@uwaterloo.ca](mailto:nandan.thakur@uwaterloo.ca)
> This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
Raw data
{
"_id": null,
"home_page": "",
"name": "nomiracl",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "Transformer Networks BERT PyTorch NLP deep learning LLM Hallucination",
"author": "Nandan Thakur",
"author_email": "nandant@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/cf/7a/762d18210e7c064032e311b85840f2462ef80d249fff1dae9497d23427e2/nomiracl-0.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": "Unanswerable questions for LLM hallucations",
"version": "0.0.1",
"project_urls": null,
"split_keywords": [
"transformer",
"networks",
"bert",
"pytorch",
"nlp",
"deep",
"learning",
"llm",
"hallucination"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "cf7a762d18210e7c064032e311b85840f2462ef80d249fff1dae9497d23427e2",
"md5": "de3866826959d45e233592491a7fca15",
"sha256": "191228bcf30ef124fe260316991296c3c9097e20d8f078c7ef5e3e52baa8432e"
},
"downloads": -1,
"filename": "nomiracl-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "de3866826959d45e233592491a7fca15",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 28548,
"upload_time": "2024-02-27T21:44:35",
"upload_time_iso_8601": "2024-02-27T21:44:35.934891Z",
"url": "https://files.pythonhosted.org/packages/cf/7a/762d18210e7c064032e311b85840f2462ef80d249fff1dae9497d23427e2/nomiracl-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-02-27 21:44:35",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "nomiracl"
}