# Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering
[![arXiv](https://img.shields.io/badge/arXiv-2307.16877-b31b1b.svg)](https://arxiv.org/abs/2307.16877)
[![License](https://img.shields.io/badge/License-Apache_2.0-yellowgreen.svg)](https://opensource.org/licenses/Apache-2.0)
[![PyPi](https://img.shields.io/pypi/v/instruct-qa)](https://pypi.org/project/instruct-qa/)
## Quick Start
### Installation
Make sure you have Python 3.7+ installed. It is also a good idea to use a virtual environment.
<details>
<summary>Show instructions for creating a Virtual Environment</summary>
<div>
```bash
python3 -m venv instruct-qa-venv
source instruct-qa-venv/bin/activate
```
</div>
</details>
You can install the library via `pip`:
```bash
# Install the latest release
pip3 install instruct-qa
# Install the latest version from GitHub
pip3 install git+https://github.com/McGill-NLP/instruct-qa
```
For development, you can install it in editable mode with:
```bash
git clone https://github.com/McGill-NLP/instruct-qa
cd instruct-qa/
pip3 install -e .
```
### Usage
Here is a simple example to get started. Using this library, you can easily leverage retrieval-augmented instruction-following models for question answering in ~25 lines of code. The source file for this example is [examples/get_started.py](examples/get_started.py).
```python
from instruct_qa.collections.utils import load_collection
from instruct_qa.retrieval.utils import load_retriever, load_index
from instruct_qa.prompt.utils import load_template
from instruct_qa.generation.utils import load_model
from instruct_qa.response_runner import ResponseRunner
collection = load_collection("dpr_wiki_collection")
index = load_index("dpr-nq-multi-hnsw")
retriever = load_retriever("facebook-dpr-question_encoder-multiset-base", index)
model = load_model("flan-t5-xxl")
prompt_template = load_template("qa")
queries = ["what is haleys comet"]
runner = ResponseRunner(
    model=model,
    retriever=retriever,
    document_collection=collection,
    prompt_template=prompt_template,
    queries=queries,
)
responses = runner()
print(responses[0]["response"])
# Halley's Comet Halley's Comet or Comet Halley, officially designated 1P/Halley, is a short-period comet visible from Earth every 75–76 years. Halley is the only known short-period comet that is regularly visible to the naked eye from Earth, and the only naked-eye comet that might appear twice in a human lifetime. Halley last appeared...
```
You can also inspect the input prompt given to the instruction-following model, which contains the instruction and the retrieved passages.
```python
print(responses[0]["prompt"])
"""
Please answer the following question given the following passages:
- Title: Bill Haley
then known as Bill Haley's Saddlemen...
- Title: C/2016 R2 (PANSTARRS)
(CO) with a blue coma. The blue color...
...
Question: what is haleys comet
Answer:
"""
```
Detailed documentation of the different modules of the library can be found [here](instruct_qa/README.md).
## Generating responses for entire datasets
Our library supports both question answering (QA) and conversational question answering (ConvQA) datasets. The following datasets are currently incorporated in the library:
- [Natural Questions (Open-domain)](https://huggingface.co/datasets/nq_open)
- [HotpotQA](https://huggingface.co/datasets/hotpot_qa)
- [TopiOCQA](https://huggingface.co/datasets/McGill-NLP/TopiOCQA)
<!-- It is easy to add any HuggingFace dataset to the library by providing a mapping, as demonstrated [here](). -->
Here is an example to generate responses for Natural Questions using DPR retriever and Flan-T5 generator.
```bash
python experiments/question_answering.py \
    --prompt_type qa \
    --dataset_name natural_questions \
    --document_collection_name dpr_wiki_collection \
    --index_name dpr-nq-multi-hnsw \
    --retriever_name facebook-dpr-question_encoder-multiset-base \
    --batch_size 1 \
    --model_name flan-t5-xxl \
    --k 8
```
By default, a `results` directory is created within the repository to store the model responses. The default location can be overridden by providing an additional command-line argument, `--persistent_dir <OUTPUT_DIR>`. More examples are present in the [examples](examples) directory.
## Download model responses and human evaluation data
We release the model responses generated using the above commands for all three datasets. The scores reported in the paper are based on these responses. The responses can be downloaded with the following command:
```bash
python download_data.py --resource results
```
The responses are automatically unzipped and stored as JSON lines in the following directory structure:
```
results
├── {dataset_name}
│   ├── response
│   │   ├── {dataset}_{split}_c-{collection}_m-{model}_r-{retriever}_prompt-{prompt}_p-{top_p}_t-{temperature}_s-{seed}.jsonl
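Since the responses are stored as JSON lines (one record per line, with at least a `response` field, as shown in the quick-start example above), they can be loaded with the standard library. A minimal sketch — the path below is hypothetical and should be replaced with an actual file following the naming pattern above:

```python
import json

def load_responses(path):
    """Read one model-response record per line from a JSON-lines file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical path following the naming pattern above; substitute your own run's file.
# responses = load_responses("results/natural_questions/response/<run_name>.jsonl")
# print(responses[0]["response"])
```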
Currently, the following models are included:
- `fid` (Fusion-in-Decoder, separately fine-tuned on each dataset)
- `gpt-3.5-turbo` (GPT-3.5)
- `alpaca-7b` (Alpaca)
- `llama-2-7b-chat` (Llama-2)
- `flan-t5-xxl` (Flan-T5)
We also release the human annotations for correctness and faithfulness on a subset of responses for all datasets. The annotations can be downloaded with the following command:
```bash
python download_data.py --resource human_eval_annotations
```
The annotations are automatically unzipped into the following directory structure:
```
human_eval_annotations
├── correctness
│   ├── {dataset_name}
│   │   ├── {model}_human_eval_results.json
│
├── faithfulness
│   ├── {dataset_name}
│   │   ├── {model}_human_eval_results.json
```
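Each annotation file is a single JSON document, so it can be inspected with the standard library. A minimal sketch — the path below is a hypothetical example of the layout above, and since the record schema is not documented here, only generic loading is shown:

```python
import json

def load_annotations(path):
    """Load a human-evaluation annotation file (a single JSON document)."""
    with open(path) as f:
        return json.load(f)

# Hypothetical path following the directory layout above.
# annotations = load_annotations(
#     "human_eval_annotations/correctness/{dataset_name}/{model}_human_eval_results.json"
# )
```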
## Evaluating model responses (Coming soon!)
Documentation on evaluating model responses and adding your own evaluation criteria is coming soon. Stay tuned!
## License
This work is licensed under the Apache 2 license. See [LICENSE](LICENSE) for details.
## Citation
To cite this work, please use the following citation:
```
@article{adlakha2023evaluating,
  title={Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering},
  author={Vaibhav Adlakha and Parishad BehnamGhader and Xing Han Lu and Nicholas Meade and Siva Reddy},
  year={2023},
  journal={arXiv:2307.16877},
}
```
## Contact
For queries and clarifications, please contact **vaibhav.adlakha (at) mila (dot) quebec**