| Field | Value |
|---|---|
| Name | factscore |
| Version | 0.2.0 |
| Summary | FactScore is an automatic evaluation metric for factual precision in long-form text generation. It uses large language models and retrieval to break down generations into atomic facts and then measure the correctness with respect to a knowledge source (like Wikipedia). |
| author | Sewon Min |
| requires_python | >=3.7.1,<4.0.0 |
| license | MIT |
| upload_time | 2023-10-14 22:28:16 |
# FActScore
[Paper](https://arxiv.org/abs/2305.14251) | [PyPI](https://pypi.python.org/pypi/factscore/) | [Downloads](https://pepy.tech/project/factscore)
This is the official release accompanying our EMNLP 2023 paper, [FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation](https://arxiv.org/abs/2305.14251). FActScore is available as a PIP package as well.
If you find FActScore useful, please cite:
```
@inproceedings{ factscore,
title={ {FActScore}: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation },
author={ Min, Sewon and Krishna, Kalpesh and Lyu, Xinxi and Lewis, Mike and Yih, Wen-tau and Koh, Pang Wei and Iyyer, Mohit and Zettlemoyer, Luke and Hajishirzi, Hannaneh },
year={ 2023 },
booktitle = { EMNLP },
url={ https://arxiv.org/abs/2305.14251 }
}
```
## Install
<!-- ```
conda create -n fs-env python=3.9
conda activate fs-env
pip install -r requirements.txt
``` -->
Make a new Python 3.7+ environment using `virtualenv` or `conda`.
```bash
pip install --upgrade factscore
python -m spacy download en_core_web_sm
```
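As a quick sanity check (not part of the official instructions), you can confirm that both the package and the spaCy model load:
```python
# Minimal installation check: both imports should succeed after the steps above.
import spacy
import factscore

nlp = spacy.load("en_core_web_sm")  # fails if the spaCy model download was skipped
print(factscore.__name__, "and en_core_web_sm loaded successfully")
```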
## Download the data
```bash
python -m factscore.download_data --llama_7B_HF_path "llama-7B"
```
This command does the following:
1. Downloads the knowledge source and example data.
2. Takes the LLAMA 7B model and reconstructs Inst-LLAMA. This requires access to the HuggingFace weights of LLAMA-7B, whose path is passed via the `--llama_7B_HF_path` flag. Follow [this guide](https://huggingface.co/docs/transformers/main/model_doc/llama) to obtain those weights. Omit `--llama_7B_HF_path` if you only want to use the ChatGPT version of FActScore.
**Optional flags**:
- `--data_dir`: directory to store the knowledge source and example data. `.cache/factscore` by default.
- `--model_dir`: directory to store Inst-LLAMA weights. `.cache/factscore` by default.
**Troubleshooting**:
- If you get an `ERROR 429: Too Many Requests` error while downloading the DB file, download the DB from [this Google Drive link](https://drive.google.com/file/d/1mekls6OGOKLmt7gYtHs0WGf5oTamTNat/view?usp=sharing) and place it under `--data_dir` (`.cache/factscore` by default).
- If everything else fails, consider downloading the files manually from [this link](https://drive.google.com/drive/folders/1bLHGu_imkZVtX6O0mpZ-G0-4ofTLM1ZA?usp=share_link) and placing them in `--data_dir` and `--model_dir`; see [`factscore/download_data.py`](factscore/download_data.py) for more details.
## Running FActScore using a command line
We estimate that running FActScore costs about $1 in API fees per 100 sentences. For instance, if you have 100 generations with 5 sentences each on average, it costs about $5 in total.
```bash
python -m factscore.factscorer --input_path {input_path} --model_name {estimator_name} --openai_key {openai_key}
```
- `--input_path` can be something like `data/unlabeled/InstructGPT.jsonl`. It should be in `.jsonl` format, where each line contains `topic` (a topic entity that corresponds to the Wikipedia title) and `output` (a generation from the model); a minimal example of writing such a file is shown after this list.
- `--model_name`: `retrieval+ChatGPT` or `retrieval+llama+npm` (you can also use `retrieval+ChatGPT+npm` or `retrieval+llama`, but we recommend the former two).
- `--openai_key`: a file containing your OpenAI API key.
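For reference, each input line is a JSON object with a `topic` and an `output` key. The snippet below writes a minimal example file; the topics, generation text, and file name are made-up placeholders:
```python
import json

# Hypothetical example input: two generations to score, one JSON object per line.
examples = [
    {"topic": "Marie Curie", "output": "Marie Curie was a physicist and chemist who ..."},
    {"topic": "Alan Turing", "output": "Alan Turing was a British mathematician who ..."},
]

with open("my_generations.jsonl", "w") as f:  # pass this path to --input_path
    for example in examples:
        f.write(json.dumps(example) + "\n")
```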
**Optional flags**:
- `--data_dir`: Directory containing knowledge source, etc. `.cache/factscore` by default.
- `--model_dir`: Directory containing Inst-LLAMA weights. Skip if your `model_name` doesn't include `llama`. `.cache/factscore` by default.
- `--cache_dir`: Directory containing cache from API/models. `.cache/factscore` by default.
- `--use_atomic_facts`: If specified, it uses the model-generated atomic facts released as part of our data instead of running the atomic fact generator. This allows reproducing our results at no cost (or little cost if it still uses ChatGPT). You cannot specify it when scoring new model generations.
- `--gamma`: A hyperparameter for the length penalty. `10` by default. It penalizes the score if the number of facts is less than `gamma`. `10` roughly corresponds to 2 sentences, so the score would be penalized if the generation has fewer than 2 sentences. Usually, this does not change the ranking between systems unless some systems generate overly short responses all the time (e.g., models trained on NLP datasets without long-form generation tasks may do so). To turn off the length penalty completely, specify `--gamma 0`.
- `--n_samples`: If specified, it runs the model on a subset of the data.
- `--verbose`: If specified, it shows the progress bar.
- `--print_rate_limit_error`: If specified, it prints out rate limit errors from the OpenAI API.
- `--cost_estimate`: This flag decides the type of OpenAI API cost estimation that we provide before calling it. It can be `"consider_cache"` (default) or `"ignore_cache"`.
- `--abstain_detection`: This flag optionally enables automatic detection of abstained responses. By default this is disabled, but we recommend adding your own function tailored to your model. The currently supported detectors are `"generic"` and `"perplexity_ai"`, and their implementations can be found in [`factscore/abstain_detection.py`](factscore/abstain_detection.py). There are two ways to add your own abstain function: (a) clone our GitHub repository to install `factscore` locally (`pip install --editable .`), and then add your function to [`factscore/abstain_detection.py`](factscore/abstain_detection.py) directly; or (b) run your abstain detection outside our package and use empty strings in the `output` key of the JSONL file passed to `--input_path` (a sketch of option (b) appears after this list).
- `--knowledge_source`: If you do not want to use the default knowledge source (the Wikipedia dump from 2023/04/01), preprocess your own using the [instructions below](#To-use-a-custom-knowledge-source), and then pass its name with this flag.
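As an illustration of option (b) for `--abstain_detection`, here is a minimal, hypothetical sketch that screens an input file outside the package: it blanks the `output` of any line matching a simple refusal pattern so that those responses are treated as abstentions. The pattern and file names are placeholders, not part of the package.
```python
import json
import re

# Hypothetical refusal pattern; tailor it to your own model's abstention style.
ABSTAIN_PATTERN = re.compile(
    r"(i (do not|don't) know|i could not find|no information available)", re.IGNORECASE
)

def blank_abstentions(in_path: str, out_path: str) -> None:
    """Copy a FActScore input .jsonl file, replacing abstained outputs with empty strings."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            example = json.loads(line)
            if ABSTAIN_PATTERN.search(example["output"]):
                example["output"] = ""  # an empty output marks the response as an abstention
            fout.write(json.dumps(example) + "\n")

blank_abstentions("my_generations.jsonl", "my_generations_screened.jsonl")
```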
## To evaluate your own LM
There are two sets of prompt entities, `data/labeled/prompt_entities.txt` (183 entities) and `data/unlabeled/prompt_entities.txt` (500 entities). Each line contains the name of a person (which is also the corresponding Wikipedia title). Use the labeled version if you want to be compatible with the data under `data/labeled` (Section 3 and Section 4.2 in the paper), and the unlabeled version if you want to be compatible with the data under `data/unlabeled` (Section 4.3 in the paper).
You can prompt your LM with your own prompt (we used `Question: Tell me a bio of <entity>.`) and use the following code.
```python
from factscore.factscorer import FactScorer
fs = FactScorer(openai_key="...")
# topics: list of strings (human entities used to generate bios)
# generations: list of strings (model generations)
out = fs.get_score(topics, generations, gamma=10)
print (out["score"]) # FActScore
print (out["init_score"]) # FActScore w/o length penalty
print (out["respond_ratio"]) # % of responding (not abstaining from answering)
print (out["num_facts_per_response"]) # average number of atomic facts per response
```
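If you do not already have `topics` and `generations` in memory, here is a minimal sketch of building them from one of the prompt-entity files; `generate_bio` is a hypothetical stand-in for your own LM call.
```python
# Hypothetical sketch: build the inputs to `fs.get_score` from a prompt-entity file.

def generate_bio(prompt: str) -> str:
    # Placeholder: replace with a call to your own LM.
    return "..."

with open("data/labeled/prompt_entities.txt") as f:
    topics = [line.strip() for line in f if line.strip()]

# We used `Question: Tell me a bio of <entity>.` as the prompt.
prompts = [f"Question: Tell me a bio of {entity}." for entity in topics]
generations = [generate_bio(prompt) for prompt in prompts]
```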
Alternatively, you can create a `.jsonl` file, where each line has `topic` (the entity name, exactly the same as in the `.txt` file) and `output` (the generation from the LM), and then use the command line [above](#Running-FActScore-using-a-command-line).
We recommend using (A) `FactScorer(model_name="retrieval+ChatGPT")` (default) or (B) `FactScorer(model_name="retrieval+llama+npm")`. Their scores have a Pearson correlation of 0.99. Here are results for a range of models, which you can easily reproduce through [these command lines](#Running-FActScore-using-a-command-line).
| Model | % respond | # facts | FActScore from (A) | FActScore from (B) |
|---|---|---|---|---|
| [GPT-4](https://arxiv.org/abs/2303.08774) | 88.2 | 60.8 | 73.1 | 59.9 |
| [ChatGPT](https://openai.com/blog/chatgpt) | 84.2 | 37.0 | 71.6 | 60.4 |
| [Alpaca 65B](https://crfm.stanford.edu/2023/03/13/alpaca.html) | 100.0 | 17.1 | 55.6 | 46.3 |
| [InstructGPT](https://openai.com/research/instruction-following) | 99.8 | 27.7 | 52.8 | 41.7 |
| [Alpaca 13B](https://crfm.stanford.edu/2023/03/13/alpaca.html) | 100.0 | 16.6 | 47.7 | 40.3 |
| [Vicuna 13B](https://lmsys.org/blog/2023-03-30-vicuna/) | 76.6 | 50.9 | 46.6 | 40.7 |
| [Alpaca 7B](https://crfm.stanford.edu/2023/03/13/alpaca.html) | 100.0 | 17.4 | 39.7 | 36.5 |
| [Vicuna 7B](https://lmsys.org/blog/2023-03-30-vicuna/) | 91.0 | 45.6 | 38.9 | 36.9 |
| [MPT Chat 7B](https://www.mosaicml.com/blog/mpt-7b) | 88.8 | 37.3 | 30.1 | 27.9 |
| [Oasst Pythia 12B](https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b) | 100.0 | 39.7 | 25.1 | 20.8 |
| [Dolly 12B](https://huggingface.co/databricks/dolly-v2-12b) | 100.0 | 24.6 | 21.7 | 17.1 |
| [StableLM tuned 7B](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b) | 66.6 | 38.0 | 17.3 | 16.3 |
`% respond` (the percentage of prompts the model responds to instead of abstaining) and `# facts` (the number of atomic facts per valid response) indicate "factual recall" (how many pieces of information the model gives), while FActScore indicates "factual precision" (how accurate each piece of information is).
## To use a custom knowledge source
By default, FActScore uses the Wikipedia dump from 2023/04/01, but you can also use your own knowledge source!
The knowledge source should be provided in `.jsonl` format, where each line is a dictionary containing `title` and `text`. `text` can either be a string or a list of strings (e.g., sections).
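For example, the snippet below writes a small knowledge source in that format (the titles and texts are made-up placeholders). You can then register it with `register_knowledge_source` and query it as shown below.
```python
import json

# Hypothetical knowledge source with two documents; `text` may be a single string
# or a list of strings (e.g., one entry per section).
documents = [
    {"title": "Example Person", "text": "Example Person is a fictional researcher ..."},
    {"title": "Example Lab", "text": ["Example Lab is a fictional institute.", "It was founded in 2020."]},
]

with open("my_knowledge_source.jsonl", "w") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")
```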
```python
from factscore.factscorer import FactScorer
fs = FactScorer()
# this will create a database using your file
# for English Wikipedia (18GB), it takes ~8 hours
# once the DB file is created, you can reuse it by only specifying `db_path`
fs.register_knowledge_source(name_of_your_knowledge_source,
                             data_path=path_to_jsonl_file,
                             db_path=path_to_output_db_file)
# now, when you compute a score, specify knowledge source to use
out = fs.get_score(topics, generations, knowledge_source=name_of_your_knowledge_source)
print (out["score"]) # FActScore
print (out["respond_ratio"]) # % of responding (not abstaining from answering)
print (out["num_facts_per_response"]) # average number of atomic facts per response
```
To see an example of constructing the ACL anthology knowledge source, see [`preprocessing/preprocess_acl.py`](preprocessing/preprocess_acl.py).