Name | lm-polygraph |
Version | 0.4.0 |
home_page | None |
Summary | Uncertainty Estimation Toolkit for Transformer Language Models |
upload_time | 2024-10-17 13:09:02 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.10 |
license | Copyright (c) 2023 MBZUAI Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords |
nlp
deep learning
transformer
pytorch
uncertainty estimation
|
VCS |
|
bugtrack_url |
|
requirements |
datasets
rouge-score
nlpaug
scikit-learn
tqdm
matplotlib
pandas
torch
bs4
transformers
nltk
sacrebleu
sentencepiece
hf-lfs
pytest
pytreebank
setuptools
numpy
dill
scipy
flask
protobuf
fschat
hydra-core
einops
accelerate
bitsandbytes
openai
wget
sentence-transformers
bert-score
unbabel-comet
nltk
evaluate
spacy
fastchat
diskcache
|
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/IINemo/isanlp_srl_framebank/blob/master/LICENSE)
![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)
# LM-Polygraph: Uncertainty estimation for LLMs
[Installation](#installation) | [Basic usage](#basic_usage) | [Overview](#overview_of_methods) | [Benchmark](#benchmark) | [Demo application](#demo_web_application) | [Documentation](https://lm-polygraph.readthedocs.io/)
LM-Polygraph provides a battery of state-of-the-art uncertainty estimation (UE) methods for LMs in text generation tasks. High uncertainty can indicate the presence of hallucinations, and a score that estimates uncertainty can help make LLM applications safer.
The framework also introduces an extendable benchmark for consistent evaluation of UE techniques by researchers and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses.
## Installation
### From GitHub
To install the latest version from the main branch, clone the repository and install it with pip. Using a virtual environment is recommended:
```shell
git clone https://github.com/IINemo/lm-polygraph.git
python3 -m venv env # Substitute this with your virtual environment creation command
source env/bin/activate
cd lm-polygraph
pip install .
```
Installing from GitHub is recommended if you want to explore the example notebooks or use the default benchmarking configurations, since they are included in the repository but not in the PyPI package. However, code from the main branch may be unstable, so it is recommended to check out the latest stable release before installing:
```shell
git clone https://github.com/IINemo/lm-polygraph.git
cd lm-polygraph
git checkout tags/v0.3.0
python3 -m venv env # Substitute this with your virtual environment creation command
source env/bin/activate
pip install .
```
### From PyPI
To install the latest stable version from PyPI, run:
```shell
python3 -m venv env # Substitute this with your virtual environment creation command
source env/bin/activate
pip install lm-polygraph
```
To install a specific version, run:
```shell
python3 -m venv env # Substitute this with your virtual environment creation command
source env/bin/activate
pip install lm-polygraph==0.3.0
```
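After installing, it can be useful to confirm which version ended up in the active environment. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and not part of LM-Polygraph.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The package metadata declares requires_python >= 3.10.
python_ok = sys.version_info >= (3, 10)
print("Python >= 3.10:", python_ok)
print("lm-polygraph version:", installed_version("lm-polygraph"))
```

If the second line prints `None`, the package is not installed in the interpreter that ran the script, which usually means the virtual environment was not activated.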
## <a name="basic_usage"></a>Basic usage
1. Initialize the base model (encoder-decoder or decoder-only) and tokenizer from HuggingFace or a local file, and use them to initialize the WhiteboxModel for evaluation. For example, with bigscience/bloomz-560m:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from lm_polygraph.utils.model import WhiteboxModel
model_path = "bigscience/bloomz-560m"
base_model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = WhiteboxModel(base_model, tokenizer, model_path=model_path)
```
Alternatively, you can use the `WhiteboxModel.from_pretrained` method to let LM-Polygraph download the model and tokenizer for you. However, this approach is deprecated and will be removed in the next major release.
```python
from lm_polygraph.utils.model import WhiteboxModel
model = WhiteboxModel.from_pretrained(
    "bigscience/bloomz-3b",
    device_map="cuda:0",
)
```
2. Specify UE method:
```python
from lm_polygraph.estimators import MeanPointwiseMutualInformation
ue_method = MeanPointwiseMutualInformation()
```
3. Get predictions and their uncertainty scores:
```python
from lm_polygraph.utils.manager import estimate_uncertainty
input_text = "Who is George Bush?"
ue = estimate_uncertainty(model, ue_method, input_text=input_text)
print(ue)
# UncertaintyOutput(uncertainty=-6.504108926902215, input_text='Who is George Bush?', generation_text=' President of the United States', model_path='bigscience/bloomz-560m')
```
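The score from step 3 is easiest to use in relative terms: among several generations, those with higher uncertainty are the ones more likely to need review. The sketch below ranks a few generations and flags the most uncertain ones. It is self-contained and does not call LM-Polygraph; the `Scored` record, the `flag_most_uncertain` helper, and all scores except the first (rounded from the output above) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    """Hypothetical stand-in holding the two fields used below."""
    generation_text: str
    uncertainty: float


def flag_most_uncertain(items, fraction=0.25):
    """Return the `fraction` of items with the highest uncertainty (at least one)."""
    if not items:
        return []
    k = max(1, round(len(items) * fraction))
    ranked = sorted(items, key=lambda s: s.uncertainty, reverse=True)
    return ranked[:k]


outputs = [
    Scored(" President of the United States", -6.50),
    Scored(" A city in France", -2.10),  # made-up score
    Scored(" 42", -9.30),                # made-up score
    Scored(" The Moon", -4.75),          # made-up score
]
for item in flag_most_uncertain(outputs, fraction=0.25):
    print(f"review: {item.generation_text!r} (uncertainty={item.uncertainty})")
```

Scores from different UE methods are on different scales, so a ranking like this is only meaningful when every item was scored with the same method and model.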
### Other examples:
* [basic_example.ipynb](https://github.com/IINemo/lm-polygraph/blob/main/examples/basic_example.ipynb): simple examples of scoring individual queries;
* [claim_level_example.ipynb](https://github.com/IINemo/lm-polygraph/blob/main/examples/claim_level_example.ipynb): an example of scoring individual claims;
* [qa_example.ipynb](https://github.com/IINemo/lm-polygraph/blob/main/examples/qa_example.ipynb): an example of scoring the `bigscience/bloomz-3b` model on the `TriviaQA` dataset;
* [mt_example.ipynb](https://github.com/IINemo/lm-polygraph/blob/main/examples/mt_example.ipynb): an example of scoring the `facebook/wmt19-en-de` model on the `WMT14 En-De` dataset;
* [ats_example.ipynb](https://github.com/IINemo/lm-polygraph/blob/main/examples/ats_example.ipynb): an example of scoring the `facebook/bart-large-cnn` model on the `XSUM` summarization dataset;
* [colab](https://colab.research.google.com/drive/1JS-NG0oqAVQhnpYY-DsoYWhz35reGRVJ?usp=sharing): demo web application in Colab (`bloomz-560m`, `gpt-3.5-turbo`, and `gpt-4` fit the default memory limit; other models require Colab-pro).
## <a name="overview_of_methods"></a>Overview of methods
| Uncertainty Estimation Method | Type | Category | Compute | Memory | Need Training Data? | Level |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| ----------- | ------------------- |---------|--------| ------------------- |----------------|
| Maximum sequence probability | White-box | Information-based | Low | Low | No | sequence/claim |
| Perplexity [(Fomicheva et al., 2020a)](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00330/96475/Unsupervised-Quality-Estimation-for-Neural-Machine) | White-box | Information-based | Low | Low | No | sequence/claim |
| Mean/max token entropy [(Fomicheva et al., 2020a)](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00330/96475/Unsupervised-Quality-Estimation-for-Neural-Machine) | White-box | Information-based | Low | Low | No | sequence/claim |
| Monte Carlo sequence entropy [(Kuhn et al., 2023)](https://openreview.net/forum?id=VD-AYtP0dve) | White-box | Information-based | High | Low | No | sequence |
| Pointwise mutual information (PMI) [(Takayama and Arase, 2019)](https://aclanthology.org/W19-4115/) | White-box | Information-based | Medium | Low | No | sequence/claim |
| Conditional PMI [(van der Poel et al., 2022)](https://aclanthology.org/2022.emnlp-main.399/) | White-box | Information-based | Medium | Medium | No | sequence |
| Rényi divergence [(Darrin et al., 2023)](https://aclanthology.org/2023.emnlp-main.357/) | White-box | Information-based | Low | Low | No | sequence |
| Fisher-Rao distance [(Darrin et al., 2023)](https://aclanthology.org/2023.emnlp-main.357/) | White-box | Information-based | Low | Low | No | sequence |
| Semantic entropy [(Kuhn et al., 2023)](https://openreview.net/forum?id=VD-AYtP0dve) | White-box | Meaning diversity | High | Low | No | sequence |
| Claim-Conditioned Probability [(Fadeeva et al., 2024)](https://arxiv.org/abs/2403.04696) | White-box | Meaning diversity | Low | Low | No | sequence/claim |
| TokenSAR [(Duan et al., 2023)](https://arxiv.org/abs/2307.01379) | White-box | Meaning diversity | High | Low | No | sequence |
| SentenceSAR [(Duan et al., 2023)](https://arxiv.org/abs/2307.01379) | White-box | Meaning diversity | High | Low | No | sequence |
| SAR [(Duan et al., 2023)](https://arxiv.org/abs/2307.01379) | White-box | Meaning diversity | High | Low | No | sequence |
| Sentence-level ensemble-based measures [(Malinin and Gales, 2020)](https://arxiv.org/abs/2002.07650) | White-box | Ensembling | High | High | Yes | sequence |
| Token-level ensemble-based measures [(Malinin and Gales, 2020)](https://arxiv.org/abs/2002.07650) | White-box | Ensembling | High | High | Yes | sequence |
| Mahalanobis distance (MD) [(Lee et al., 2018)](https://proceedings.neurips.cc/paper/2018/hash/abdeb6f575ac5c6676b747bca8d09cc2-Abstract.html) | White-box | Density-based | Low | Low | Yes | sequence |
| Robust density estimation (RDE) [(Yoo et al., 2022)](https://aclanthology.org/2022.findings-acl.289/) | White-box | Density-based | Low | Low | Yes | sequence |
| Relative Mahalanobis distance (RMD) [(Ren et al., 2023)](https://openreview.net/forum?id=kJUS5nD0vPB) | White-box | Density-based | Low | Low | Yes | sequence |
| Hybrid Uncertainty Quantification (HUQ) [(Vazhentsev et al., 2023a)](https://aclanthology.org/2023.acl-long.652/) | White-box | Density-based | Low | Low | Yes | sequence |
| p(True) [(Kadavath et al., 2022)](https://arxiv.org/abs/2207.05221) | White-box | Reflexive | Medium | Low | No | sequence/claim |
| Number of semantic sets (NumSets) [(Lin et al., 2023)](https://arxiv.org/abs/2305.19187) | Black-box | Meaning Diversity | High | Low | No | sequence |
| Sum of eigenvalues of the graph Laplacian (EigV) [(Lin et al., 2023)](https://arxiv.org/abs/2305.19187) | Black-box | Meaning Diversity | High | Low | No | sequence |
| Degree matrix (Deg) [(Lin et al., 2023)](https://arxiv.org/abs/2305.19187) | Black-box | Meaning Diversity | High | Low | No | sequence |
| Eccentricity (Ecc) [(Lin et al., 2023)](https://arxiv.org/abs/2305.19187) | Black-box | Meaning Diversity | High | Low | No | sequence |
| Lexical similarity (LexSim) [(Fomicheva et al., 2020a)](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00330/96475/Unsupervised-Quality-Estimation-for-Neural-Machine) | Black-box | Meaning Diversity | High | Low | No | sequence |
| Verbalized Uncertainty 1S [(Tian et al., 2023)](https://arxiv.org/abs/2305.14975) | Black-box | Reflexive | Low | Low | No | sequence |
| Verbalized Uncertainty 2S [(Tian et al., 2023)](https://arxiv.org/abs/2305.14975) | Black-box | Reflexive | Medium | Low | No | sequence |
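
To make the simplest information-based rows of the table concrete, the sketch below computes maximum sequence probability, perplexity, and mean token entropy from per-token probability distributions. It follows the textbook definitions and is not LM-Polygraph's implementation; the function names, the sign conventions (all three are oriented so that a larger value means more uncertainty), and the toy distributions are assumptions made for illustration.

```python
import math


def sequence_neg_log_prob(token_probs):
    """Negative log-probability of the whole sequence (sum over chosen tokens)."""
    return -sum(math.log(p) for p in token_probs)


def perplexity(token_probs):
    """exp of the mean negative log-probability per token."""
    return math.exp(sequence_neg_log_prob(token_probs) / len(token_probs))


def mean_token_entropy(distributions):
    """Average Shannon entropy (in nats) of the per-step next-token distributions."""
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0.0)
    return sum(entropy(d) for d in distributions) / len(distributions)


# Toy example: a 3-token generation over a 4-word vocabulary.
distributions = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],  # the model is maximally unsure at this step
    [0.90, 0.05, 0.03, 0.02],
]
chosen = [0, 2, 0]  # index of the generated token at each step
token_probs = [d[i] for d, i in zip(distributions, chosen)]

print("negative log-prob:", sequence_neg_log_prob(token_probs))
print("perplexity:", perplexity(token_probs))
print("mean token entropy:", mean_token_entropy(distributions))
```

All three need only the probabilities produced during a single generation pass, which is consistent with the "Low" compute and memory entries in the table. The sampling-based methods are marked "High" because they require several generations per input.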
## Benchmark
To evaluate the performance of uncertainty estimation methods, consider this quick example:
```
HYDRA_CONFIG=../examples/configs/polygraph_eval_coqa.yaml python ./scripts/polygraph_eval \
    dataset="coqa" \
    model.path="databricks/dolly-v2-3b" \
    save_path="./workdir/output" \
    "seed=[1,2,3,4,5]"
```
Use [`visualization_tables.ipynb`](https://github.com/IINemo/lm-polygraph/blob/main/notebooks/vizualization_tables.ipynb) or [`result_tables.ipynb`](https://github.com/IINemo/lm-polygraph/blob/main/notebooks/result_tables.ipynb) to generate the summarizing tables for an experiment.
A detailed description of the benchmark is in the [documentation](https://lm-polygraph.readthedocs.io/en/latest/usage.html#benchmarks).
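A UE method is useful when its scores track generation quality. One common way to check this is a rejection curve: discard the most uncertain fraction of generations and see whether the average quality of the rest improves. The sketch below illustrates the idea only; it is not the benchmark's actual metric or code, and all names and numbers in it are hypothetical.

```python
def quality_after_rejection(uncertainties, qualities, reject_fraction):
    """Mean quality of the items kept after dropping the most uncertain ones."""
    if len(uncertainties) != len(qualities):
        raise ValueError("uncertainties and qualities must have the same length")
    n_reject = int(len(uncertainties) * reject_fraction)
    # Sort by uncertainty ascending, so the most uncertain items come last.
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    kept = order[: len(order) - n_reject]
    if not kept:
        raise ValueError("reject_fraction leaves no items to score")
    return sum(qualities[i] for i in kept) / len(kept)


# Hypothetical scores: quality could be, e.g., a ROUGE or accuracy value in [0, 1].
uncertainties = [0.9, 0.1, 0.5, 0.3]
qualities = [0.2, 1.0, 0.6, 0.8]

for fraction in (0.0, 0.25, 0.5):
    print(fraction, quality_after_rejection(uncertainties, qualities, fraction))
```

With these toy numbers the mean quality rises as more uncertain items are rejected, which is the behaviour expected from an informative uncertainty score.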
## <a name="demo_web_application"></a>Demo web application
<img width="850" alt="gui7" src="https://github.com/IINemo/lm-polygraph/assets/21058413/51aa12f7-f996-4257-b1bc-afbec6db4da7">
### Start with Docker
```sh
docker run -p 3001:3001 -it \
    -v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
    --gpus all mephodybro/polygraph_demo:0.0.17 polygraph_server
```
The server should be available at `http://localhost:3001`.
A more detailed description of the demo is available in the [documentation](https://lm-polygraph.readthedocs.io/en/latest/web_demo.html).
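Once the container is running, a quick reachability check can confirm that the port mapping works before you open a browser. This is a minimal sketch using only the Python standard library; the helper name `server_is_up` is hypothetical, and the check assumes only that something answers HTTP requests on port 3001, not any particular demo endpoint.

```python
import urllib.error
import urllib.request


def server_is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url`, whatever the status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx/3xx status.
        return True
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return False


print("demo reachable:", server_is_up("http://localhost:3001"))
```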
## Cite
```
@inproceedings{fadeeva-etal-2023-lm,
title = "{LM}-Polygraph: Uncertainty Estimation for Language Models",
author = "Fadeeva, Ekaterina and
Vashurin, Roman and
Tsvigun, Akim and
Vazhentsev, Artem and
Petrakov, Sergey and
Fedyanin, Kirill and
Vasilev, Daniil and
Goncharova, Elizaveta and
Panchenko, Alexander and
Panov, Maxim and
Baldwin, Timothy and
Shelmanov, Artem",
editor = "Feng, Yansong and
Lefever, Els",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-demo.41",
doi = "10.18653/v1/2023.emnlp-demo.41",
pages = "446--461",
abstract = "Recent advancements in the capabilities of large language models (LLMs) have paved the way for a myriad of groundbreaking applications in various fields. However, a significant challenge arises as these models often {``}hallucinate{''}, i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of LLMs. However, to date, research on UE methods for LLMs has been focused primarily on theoretical rather than engineering contributions. In this work, we tackle this issue by introducing LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python. Additionally, it introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses. LM-Polygraph is compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and GPT-4, and is designed to support future releases of similarly-styled LMs.",
}
```
## Acknowledgements
The chat GUI implementation is based on the [chatgpt-web-application](https://github.com/ioanmo226/chatgpt-web-application) project.
Raw data
{
"_id": null,
"home_page": null,
"name": "lm-polygraph",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "NLP, deep learning, transformer, pytorch, uncertainty estimation",
"author": null,
"author_email": "\"List of contributors: https://github.com/IINemo/lm-polygraph/graphs/contributors\" <artemshelmanov@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/61/7e/299a3c3c4f084172a43e3d71366198dd3e1e5c0b20eb1158c9db6639f709/lm_polygraph-0.4.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Copyright (c) 2023 MBZUAI Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
"summary": "Uncertainty Estimation Toolkit for Transformer Language Models",
"version": "0.4.0",
"project_urls": {
"Bug Tracker": "https://github.com/IINemo/lm-polygraph/issues",
"Documentation": "https://lm-polygraph.readthedocs.io",
"Homepage": "https://github.com/IINemo/lm-polygraph",
"Repository": "https://github.com/IINemo/lm-polygraph"
},
"split_keywords": [
"nlp",
" deep learning",
" transformer",
" pytorch",
" uncertainty estimation"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f1b200dc9201db0c25030ab9dddd5fbbf2e1b979b5ad54460ddd3765363b3132",
"md5": "6435bff646b70f7ee4461352c8c46085",
"sha256": "893f90f3cbdac1b45f4c449887187dfef8e0bba29a5e8a6068c632bd7e2acaf8"
},
"downloads": -1,
"filename": "lm_polygraph-0.4.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "6435bff646b70f7ee4461352c8c46085",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 178344,
"upload_time": "2024-10-17T13:08:59",
"upload_time_iso_8601": "2024-10-17T13:08:59.558701Z",
"url": "https://files.pythonhosted.org/packages/f1/b2/00dc9201db0c25030ab9dddd5fbbf2e1b979b5ad54460ddd3765363b3132/lm_polygraph-0.4.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "617e299a3c3c4f084172a43e3d71366198dd3e1e5c0b20eb1158c9db6639f709",
"md5": "214260f429225b0665d3e6e0f21ab1b6",
"sha256": "50e5b610043258e0c86b0e60eacf27c16a56096448595afb3a27ca7aef6796b7"
},
"downloads": -1,
"filename": "lm_polygraph-0.4.0.tar.gz",
"has_sig": false,
"md5_digest": "214260f429225b0665d3e6e0f21ab1b6",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 124272,
"upload_time": "2024-10-17T13:09:02",
"upload_time_iso_8601": "2024-10-17T13:09:02.438392Z",
"url": "https://files.pythonhosted.org/packages/61/7e/299a3c3c4f084172a43e3d71366198dd3e1e5c0b20eb1158c9db6639f709/lm_polygraph-0.4.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-17 13:09:02",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "IINemo",
"github_project": "lm-polygraph",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "datasets",
"specs": [
[
">=",
"2.14.2"
]
]
},
{
"name": "rouge-score",
"specs": [
[
">=",
"0.0.4"
]
]
},
{
"name": "nlpaug",
"specs": [
[
">=",
"1.1.10"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
">=",
"1.5.1"
]
]
},
{
"name": "tqdm",
"specs": [
[
">=",
"4.64.1"
]
]
},
{
"name": "matplotlib",
"specs": [
[
">=",
"3.6"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"1.3.5"
]
]
},
{
"name": "torch",
"specs": [
[
">=",
"1.13.0"
]
]
},
{
"name": "bs4",
"specs": []
},
{
"name": "transformers",
"specs": [
[
">=",
"4.40"
]
]
},
{
"name": "nltk",
"specs": [
[
">=",
"3.6.5"
]
]
},
{
"name": "sacrebleu",
"specs": [
[
">=",
"1.5.0"
]
]
},
{
"name": "sentencepiece",
"specs": [
[
">=",
"0.1.97"
]
]
},
{
"name": "hf-lfs",
"specs": [
[
">=",
"0.0.3"
]
]
},
{
"name": "pytest",
"specs": [
[
">=",
"4.4.1"
]
]
},
{
"name": "pytreebank",
"specs": [
[
">=",
"0.2.7"
]
]
},
{
"name": "setuptools",
"specs": [
[
">=",
"60.2.0"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.23.5"
]
]
},
{
"name": "dill",
"specs": [
[
">=",
"0.3.5.1"
]
]
},
{
"name": "scipy",
"specs": [
[
">=",
"1.9.3"
]
]
},
{
"name": "flask",
"specs": [
[
">=",
"2.3.2"
]
]
},
{
"name": "protobuf",
"specs": [
[
">=",
"4.23"
]
]
},
{
"name": "fschat",
"specs": [
[
">=",
"0.2.3"
]
]
},
{
"name": "hydra-core",
"specs": [
[
">=",
"1.3.2"
]
]
},
{
"name": "einops",
"specs": []
},
{
"name": "accelerate",
"specs": [
[
">=",
"0.32.1"
]
]
},
{
"name": "bitsandbytes",
"specs": []
},
{
"name": "openai",
"specs": [
[
">=",
"0.28.0"
]
]
},
{
"name": "wget",
"specs": []
},
{
"name": "sentence-transformers",
"specs": []
},
{
"name": "bert-score",
"specs": [
[
">=",
"0.3.13"
]
]
},
{
"name": "unbabel-comet",
"specs": [
[
"==",
"2.2.1"
]
]
},
{
"name": "nltk",
"specs": [
[
"<",
"4"
],
[
">=",
"3.7"
]
]
},
{
"name": "evaluate",
"specs": [
[
">=",
"0.4.2"
]
]
},
{
"name": "spacy",
"specs": [
[
"<",
"4"
],
[
">=",
"3.4.0"
]
]
},
{
"name": "fastchat",
"specs": []
},
{
"name": "diskcache",
"specs": [
[
">=",
"5.6.3"
]
]
}
],
"lcname": "lm-polygraph"
}