<p align="center">
<img src="https://raw.githubusercontent.com/cvs-health/langfair/main/assets/images/langfair-logo.png" />
</p>
# LangFair: Use-Case Level LLM Bias and Fairness Assessments
[CI](https://github.com/cvs-health/langfair/actions) | [Documentation](https://cvs-health.github.io/langfair/latest/index.html) | [PyPI](https://pypi.org/project/langfair/) | [Ruff](https://github.com/astral-sh/ruff) | [Technical Playbook](https://arxiv.org/abs/2407.10853)
LangFair is a comprehensive Python library designed for conducting bias and fairness assessments of large language model (LLM) use cases. This repository includes a framework for [choosing bias and fairness metrics](https://github.com/cvs-health/langfair/tree/main#-choosing-bias-and-fairness-metrics-for-an-llm-use-case), along with [demo notebooks](https://github.com/cvs-health/langfair/tree/main/examples) and a [technical playbook](https://arxiv.org/abs/2407.10853) that discusses LLM bias and fairness risks, evaluation metrics, and best practices.
Explore our [documentation site](https://cvs-health.github.io/langfair/) for detailed instructions on using LangFair.
## 🚀 Why Choose LangFair?
Static benchmark assessments, which are typically assumed to be sufficiently representative, often fall short in capturing the risks associated with all possible use cases of LLMs. These models are increasingly used in various applications, including recommendation systems, classification, text generation, and summarization. However, evaluating these models without considering use-case-specific prompts can lead to misleading assessments of their performance, especially regarding bias and fairness risks.
LangFair addresses this gap by adopting a Bring Your Own Prompts (BYOP) approach, allowing users to tailor bias and fairness evaluations to their specific use cases. This ensures that the metrics computed reflect the true performance of the LLMs in real-world scenarios, where prompt-specific risks are critical. Additionally, LangFair's focus is on output-based metrics that are practical for governance audits and real-world testing, without needing access to internal model states.
## ⚡ Quickstart Guide
### (Optional) Create a virtual environment for using LangFair
We recommend creating a new virtual environment using venv before installing LangFair. To do so, please follow the instructions [here](https://docs.python.org/3/library/venv.html).
### Installing LangFair
The latest version can be installed from PyPI:
```bash
pip install langfair
```
### Usage Examples
Below are code samples illustrating how to use LangFair to assess bias and fairness risks in text generation and summarization use cases. The examples below assume the user has already defined a list of prompts from their use case, `prompts`.
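For illustration only, such a list might look like the following. These placeholder prompts are not from LangFair; replace them with prompts sampled from your actual use case.
```python
# Illustrative placeholders; replace with prompts sampled from your use case
prompts = [
    "Summarize the following customer message: ...",
    "Draft a reply to this member inquiry: ...",
    "Summarize the following meeting notes: ...",
]
```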
##### Generate LLM responses
To generate responses, we can use LangFair's `ResponseGenerator` class. First, we must create a `langchain` LLM object. Below we use `ChatVertexAI`, but **any of [LangChain's LLM classes](https://js.langchain.com/docs/integrations/chat/) may be used instead**. Note that `InMemoryRateLimiter` is used to avoid rate limit errors.
```python
from langchain_google_vertexai import ChatVertexAI
from langchain_core.rate_limiters import InMemoryRateLimiter
rate_limiter = InMemoryRateLimiter(
    requests_per_second=4.5, check_every_n_seconds=0.5, max_bucket_size=280,
)
llm = ChatVertexAI(
    model_name="gemini-pro", temperature=0.3, rate_limiter=rate_limiter
)
```
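Because any LangChain chat model can be substituted, the same pattern works with other providers. As a hedged sketch, here is the equivalent setup with `ChatOpenAI`, assuming the `langchain-openai` package is installed, an `OPENAI_API_KEY` environment variable is set, and the model name is only an example:
```python
# Alternative provider sketch (assumes langchain-openai is installed and OPENAI_API_KEY is set)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",  # example model name; substitute your own
    temperature=0.3,
    rate_limiter=rate_limiter,
)
```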
We can use `ResponseGenerator.generate_responses` to generate 25 responses for each prompt, as is convention for toxicity evaluation.
```python
from langfair.generator import ResponseGenerator
rg = ResponseGenerator(langchain_llm=llm)
generations = await rg.generate_responses(prompts=prompts, count=25)
responses = generations["data"]["response"]
duplicated_prompts = generations["data"]["prompt"] # so prompts correspond to responses
```
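Note that `generate_responses` is asynchronous, so the `await` above assumes an active event loop (e.g., a Jupyter notebook). In a plain Python script, one option is to wrap the call with `asyncio.run`, as in this sketch:
```python
# In a script without a running event loop (in notebooks, `await` directly instead)
import asyncio

generations = asyncio.run(rg.generate_responses(prompts=prompts, count=25))
responses = generations["data"]["response"]
duplicated_prompts = generations["data"]["prompt"]
```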
##### Compute toxicity metrics
Toxicity metrics can be computed with `ToxicityMetrics`. Note that passing a `torch.device` is optional; it should be used when a GPU is available to speed up toxicity computation.
```python
# import torch # uncomment if GPU is available
# device = torch.device("cuda") # uncomment if GPU is available
from langfair.metrics.toxicity import ToxicityMetrics
tm = ToxicityMetrics(
    # device=device, # uncomment if GPU is available
)
tox_result = tm.evaluate(
    prompts=duplicated_prompts,
    responses=responses,
    return_data=True
)
tox_result['metrics']
# # Output is below
# {'Toxic Fraction': 0.0004,
# 'Expected Maximum Toxicity': 0.013845130120171235,
# 'Toxicity Probability': 0.01}
```
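Because `return_data=True` was passed, the returned dictionary also carries per-response records alongside the aggregate metrics. A minimal sketch for inspecting them, assuming `tox_result["data"]` holds column-oriented, equal-length lists that pandas can consume:
```python
# Sketch: assumes tox_result["data"] is a dict of equal-length lists
import pandas as pd

toxicity_data = pd.DataFrame(tox_result["data"])
print(toxicity_data.head())  # per-response scores, useful for reviewing the worst offenders
```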
##### Compute stereotype metrics
Stereotype metrics can be computed with `StereotypeMetrics`.
```python
from langfair.metrics.stereotype import StereotypeMetrics
sm = StereotypeMetrics()
stereo_result = sm.evaluate(responses=responses, categories=["gender"])
stereo_result['metrics']
# # Output is below
# {'Stereotype Association': 0.3172750176745329,
# 'Cooccurrence Bias': 0.44766333654278373,
# 'Stereotype Fraction - gender': 0.08}
```
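The `categories` argument controls which protected attributes the stereotype classifier metrics are reported for; the example above requests gender only. Assuming race is also a supported category and relevant to the use case, both can be requested in one call:
```python
# Sketch: assumes "race" is also a supported category for the stereotype classifier
stereo_result_multi = sm.evaluate(responses=responses, categories=["gender", "race"])
stereo_result_multi["metrics"]
```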
##### Generate counterfactual responses and compute metrics
We can generate counterfactual responses with `CounterfactualGenerator`.
```python
from langfair.generator.counterfactual import CounterfactualGenerator
cg = CounterfactualGenerator(langchain_llm=llm)
cf_generations = await cg.generate_responses(
    prompts=prompts, attribute='gender', count=25
)
male_responses = cf_generations['data']['male_response']
female_responses = cf_generations['data']['female_response']
```
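A quick, purely illustrative sanity check confirms that the two response lists are aligned one-to-one before computing pairwise metrics:
```python
# Illustrative check: counterfactual metrics compare responses pairwise
assert len(male_responses) == len(female_responses)
print(f"{len(male_responses)} counterfactual response pairs generated")
```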
Counterfactual metrics can be easily computed with `CounterfactualMetrics`.
```python
from langfair.metrics.counterfactual import CounterfactualMetrics
cm = CounterfactualMetrics()
cf_result = cm.evaluate(
    texts1=male_responses,
    texts2=female_responses,
    attribute='gender'
)
cf_result['metrics']
# # Output is below
# {'Cosine Similarity': 0.8318708,
# 'RougeL Similarity': 0.5195852482361165,
# 'Bleu Similarity': 0.3278433712872481,
# 'Sentiment Bias': 0.0009947145187601957}
```
##### Alternative approach: Semi-automated evaluation with `AutoEval`
To streamline assessments for text generation and summarization use cases, the `AutoEval` class runs all of the steps above with two lines of code.
```python
from langfair.auto import AutoEval
auto_object = AutoEval(
    prompts=prompts,
    langchain_llm=llm,
    # toxicity_device=device # uncomment if GPU is available
)
results = await auto_object.evaluate()
results['metrics']
# # Output is below
# {'Toxicity': {'Toxic Fraction': 0.0004,
# 'Expected Maximum Toxicity': 0.013845130120171235,
# 'Toxicity Probability': 0.01},
# 'Stereotype': {'Stereotype Association': 0.3172750176745329,
# 'Cooccurrence Bias': 0.44766333654278373,
# 'Stereotype Fraction - gender': 0.08,
# 'Expected Maximum Stereotype - gender': 0.60355167388916,
# 'Stereotype Probability - gender': 0.27036},
# 'Counterfactual': {'male-female': {'Cosine Similarity': 0.8318708,
# 'RougeL Similarity': 0.5195852482361165,
# 'Bleu Similarity': 0.3278433712872481,
# 'Sentiment Bias': 0.0009947145187601957}}}
```
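The returned `results` dictionary is plain Python data, so the aggregate metrics can be persisted for reporting with the standard library. A minimal sketch (the file name is arbitrary):
```python
# Persist the aggregate metrics for later reporting (file name is arbitrary)
import json

with open("langfair_metrics.json", "w") as f:
    # default=float guards against non-JSON-serializable NumPy scalars
    json.dump(results["metrics"], f, indent=2, default=float)
```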
## 📚 Example Notebooks
Explore the following demo notebooks to see how to use LangFair for various bias and fairness evaluation metrics:
- [Toxicity Evaluation](https://github.com/cvs-health/langfair/blob/main/examples/evaluations/text_generation/toxicity_metrics_demo.ipynb): A notebook demonstrating toxicity metrics.
- [Counterfactual Fairness Evaluation](https://github.com/cvs-health/langfair/blob/main/examples/evaluations/text_generation/counterfactual_metrics_demo.ipynb): A notebook illustrating how to generate counterfactual datasets and compute counterfactual fairness metrics.
- [Stereotype Evaluation](https://github.com/cvs-health/langfair/blob/main/examples/evaluations/text_generation/stereotype_metrics_demo.ipynb): A notebook demonstrating stereotype metrics.
- [AutoEval for Text Generation / Summarization (Toxicity, Stereotypes, Counterfactual)](https://github.com/cvs-health/langfair/blob/main/examples/evaluations/text_generation/auto_eval_demo.ipynb): A notebook illustrating how to use LangFair's `AutoEval` class for a comprehensive fairness assessment of text generation / summarization use cases. This assessment includes toxicity, stereotype, and counterfactual metrics.
- [Classification Fairness Evaluation](https://github.com/cvs-health/langfair/blob/main/examples/evaluations/classification/classification_metrics_demo.ipynb): A notebook demonstrating classification fairness metrics.
- [Recommendation Fairness Evaluation](https://github.com/cvs-health/langfair/blob/main/examples/evaluations/recommendation/recommendation_metrics_demo.ipynb): A notebook demonstrating recommendation fairness metrics.
## 🛠 Choosing Bias and Fairness Metrics for an LLM Use Case
Selecting the appropriate bias and fairness metrics is essential for accurately assessing the performance of large language models (LLMs) in specific use cases. Instead of attempting to compute all possible metrics, practitioners should focus on a relevant subset that aligns with their specific goals and the context of their application.
Our decision framework for selecting appropriate evaluation metrics is illustrated in the diagram below. For more details, refer to our [technical playbook](https://arxiv.org/abs/2407.10853).
<p align="center">
<img src="https://raw.githubusercontent.com/cvs-health/langfair/main/assets/images/use_case_framework.PNG" />
</p>
**Note:** Fairness through unawareness means none of the prompts for an LLM use case include any mention of protected attribute words.
## 📊 Supported Bias and Fairness Metrics
Bias and fairness metrics offered by LangFair are grouped into several categories. The full suite of metrics is displayed below.
##### Toxicity Metrics
* Expected Maximum Toxicity ([Gehman et al., 2020](https://arxiv.org/abs/2009.11462))
* Toxicity Probability ([Gehman et al., 2020](https://arxiv.org/abs/2009.11462))
* Toxic Fraction ([Liang et al., 2023](https://arxiv.org/abs/2211.09110))
##### Counterfactual Fairness Metrics
* Strict Counterfactual Sentiment Parity ([Huang et al., 2020](https://arxiv.org/abs/1911.03064))
* Weak Counterfactual Sentiment Parity ([Bouchard, 2024](https://arxiv.org/abs/2407.10853))
* Counterfactual Cosine Similarity Score ([Bouchard, 2024](https://arxiv.org/abs/2407.10853))
* Counterfactual BLEU ([Bouchard, 2024](https://arxiv.org/abs/2407.10853))
* Counterfactual ROUGE-L ([Bouchard, 2024](https://arxiv.org/abs/2407.10853))
##### Stereotype Metrics
* Stereotypical Associations ([Liang et al., 2023](https://arxiv.org/abs/2211.09110))
* Co-occurrence Bias Score ([Bordia & Bowman, 2019](https://arxiv.org/abs/1904.03035))
* Stereotype classifier metrics ([Zekun et al., 2023](https://arxiv.org/abs/2311.14126), [Bouchard, 2024](https://arxiv.org/abs/2407.10853))
##### Recommendation (Counterfactual) Fairness Metrics
* Jaccard Similarity ([Zhang et al., 2023](https://dl.acm.org/doi/10.1145/3604915.3608860))
* Search Result Page Misinformation Score ([Zhang et al., 2023](https://dl.acm.org/doi/10.1145/3604915.3608860))
* Pairwise Ranking Accuracy Gap ([Zhang et al., 2023](https://dl.acm.org/doi/10.1145/3604915.3608860))
##### Classification Fairness Metrics
* Predicted Prevalence Rate Disparity ([Feldman et al., 2015](https://arxiv.org/abs/1412.3756); [Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))
* False Negative Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))
* False Omission Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))
* False Positive Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))
* False Discovery Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))
## 📖 Associated Research
A technical description and a practitioner's guide for selecting evaluation metrics are contained in **[this paper](https://arxiv.org/abs/2407.10853)**. If you use our evaluation approach, we would appreciate citations to the following paper:
```bibtex
@misc{bouchard2024actionableframeworkassessingbias,
title={An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases},
author={Dylan Bouchard},
year={2024},
eprint={2407.10853},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.10853},
}
```
A high-level description of LangFair's functionality is contained in **[this paper](https://arxiv.org/abs/2501.03112)**. If you use LangFair, we would appreciate citations to the following paper:
```bibtex
@misc{bouchard2025langfairpythonpackageassessing,
title={LangFair: A Python Package for Assessing Bias and Fairness in Large Language Model Use Cases},
author={Dylan Bouchard and Mohit Singh Chauhan and David Skarbrevik and Viren Bajaj and Zeya Ahmad},
year={2025},
eprint={2501.03112},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.03112},
}
```
## 📄 Code Documentation
Please refer to our [documentation site](https://cvs-health.github.io/langfair/) for more details on how to use LangFair.
## 🤝 Development Team
The open-source version of LangFair is the culmination of extensive work carried out by a dedicated team of developers. While the internal commit history will not be made public, we believe it's essential to acknowledge the significant contributions of our development team who were instrumental in bringing this project to fruition:
- [Dylan Bouchard](https://github.com/dylanbouchard)
- [Mohit Singh Chauhan](https://github.com/mohitcek)
- [David Skarbrevik](https://github.com/dskarbrevik)
- [Viren Bajaj](https://github.com/virenbajaj)
- [Zeya Ahmad](https://github.com/zeya30)
## 🤗 Contributing
Contributions are welcome. Please refer [here](https://github.com/cvs-health/langfair/tree/main/CONTRIBUTING.md) for instructions on how to contribute to LangFair.