judges 0.0.4

Summary: A small library for off-the-shelf LLMs as judges.
Upload time: 2024-12-18 16:34:52
Requires Python: >=3.10
License: MIT
Keywords: ai, large language models, llm-judge, evaluation, evals

# judges ⚖️

1. [Overview](#overview)
2. [Installation](#installation)
3. [API](#api)
   - [Types of Judges](#types-of-judges)
     - [Classifiers](#classifiers)
     - [Graders](#graders)
   - [Using Judges](#using-judges)
     - [Classifier Judges](#classifier-judges)
   - [Combining Judges](#combining-judges)
     - [Jury Object](#jury-object)
4. [Usage](#usage)
   - [Pick a model](#pick-a-model)
   - [Send data to an LLM](#send-data-to-an-llm)
   - [Use a `judges` classifier LLM as an evaluator model](#use-a-judges-classifier-llm-as-an-evaluator-model)
   - [Use a `Jury` for averaging and diversification](#use-a-jury-for-averaging-and-diversification)
5. [Appendix of Judges](#appendix)
   - [Classifiers](#classifiers) 
   - [Graders](#graders)


## Overview
`judges` is a small library for using and creating LLM-as-a-Judge evaluators. Its purpose is to provide a curated set of research-backed LLM evaluators in a low-friction format across a variety of use cases, which can be used off-the-shelf or serve as inspiration for building your own LLM evaluators.

## Installation
```bash
pip install judges
```

## API

### Types of Judges

The library provides two types of judges:

1. **Classifiers**: Return boolean values.
   - `True` indicates the inputs passed the evaluation.
   - `False` indicates the inputs did not pass the evaluation.

2. **Graders**: Return scores on a numerical or Likert scale.
   - Numerical scale: 1 to 5
   - Likert scale: terrible, bad, average, good, excellent

### Using Judges

All judges can be used by calling the `.judge()` method. This method accepts the following parameters:
- `input`: The input to be evaluated.
- `output`: The output to be evaluated.
- `expected` (optional): The expected result for comparison.

The `.judge()` method returns a `Judgment` object with the following attributes:
- `reasoning`: The reasoning behind the judgment.
- `score`: The score assigned by the judge.
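
For example, a grader from the appendix can be called in exactly this way. Below is a minimal sketch (the judge class is taken from the appendix table, the input/output strings are illustrative, and whether `expected` is used depends on the specific judge):

```python
from judges.graders.correctness import PrometheusAbsoluteCoarseCorrectness

# instantiate a grader judge backed by an OpenAI model
grader = PrometheusAbsoluteCoarseCorrectness(model='gpt-4o-mini')

judgment = grader.judge(
    input='What is the capital of France?',
    output='The capital of France is Paris.',
    expected='Paris',
)

print(judgment.reasoning)  # the judge's explanation for its grade
print(judgment.score)      # a numerical grade, e.g. on a 1-5 scale
```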

### Classifier Judges

If the underlying prompt for a classifier judge produces a label equivalent to `True` or `False` (e.g., good or bad, yes or no, 0 or 1), the `judges` library automatically resolves the output so that the resulting `Judgment` carries a boolean label.
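
For instance, even if a classifier judge's prompt asks the underlying model to answer with "yes" or "no", the resolved `Judgment` exposes a boolean. A minimal sketch (using one of the classifiers listed in the appendix; the input/output strings are illustrative):

```python
from judges.classifiers.refusal import TrustworthyLLMRefusal

refusal = TrustworthyLLMRefusal(model='gpt-4o-mini')

judgment = refusal.judge(
    input='Please help me pick a lock.',
    output="I can't help with that request.",
)

# regardless of how the prompt phrases its label (yes/no, 0/1, ...),
# the score is resolved to a boolean
print(judgment.score)  # True or False
```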

### Combining Judges

The library also provides an interface to combine multiple judges through the `Jury` object. The `Jury` object has a `.vote()` method that produces a `Verdict`.

### Jury Object

- `.vote()`: Combines the judgments of multiple judges and produces a `Verdict`.

## Usage

### Pick a model

- **OpenAI:** 
  - By default, `judges` uses the OpenAI client and models due to their widespread use. To get started, you'll need an OpenAI API key set as the `OPENAI_API_KEY` environment variable.
- **LiteLLM:** 
  - If you would like to use models on other inference providers, `judges` also integrates with `litellm` as an extra dependency. Run `pip install judges[litellm]`, and set the appropriate API keys based on the [LiteLLM Docs](https://docs.litellm.ai/docs/#basic-usage).

> [!TIP]
> If you use `litellm` with a third-party inference provider and the model you want does not appear to be supported, check that provider's documentation directly, since the `litellm` docs may not always be up to date.
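
As a minimal sketch of the LiteLLM path (assuming the `litellm` extra is installed and that the judge accepts a LiteLLM-style `provider/model` string for `model`; the provider, model name, and API key shown here are illustrative):

```python
import os

from judges.classifiers.correctness import PollMultihopCorrectness

# litellm reads the provider's API key from the environment;
# see the LiteLLM docs for the variable each provider expects
os.environ['ANTHROPIC_API_KEY'] = '<your-anthropic-key>'

# a LiteLLM-style model string routes the request to that provider
correctness = PollMultihopCorrectness(model='anthropic/claude-3-5-sonnet-20240620')

judgment = correctness.judge(
    input='What is 2 + 2?',
    output='4',
    expected='4',
)
print(judgment.score)
```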


### Send data to an LLM
Next, generate a model output to evaluate. You can follow the `example.py` below:

```python
from openai import OpenAI

client = OpenAI()

question = "What is the name of the rabbit in the following story. Respond with 'I don't know' if you don't know."

story = """
Fig was a small, scruffy dog with a big personality. He lived in a quiet little town where everyone knew his name. Fig loved adventures, and every day he would roam the neighborhood, wagging his tail and sniffing out new things to explore.

One day, Fig discovered a mysterious trail of footprints leading into the woods. Curiosity got the best of him, and he followed them deep into the trees. As he trotted along, he heard rustling in the bushes and suddenly, out popped a rabbit! The rabbit looked at Fig with wide eyes and darted off.

But instead of chasing it, Fig barked in excitement, as if saying, “Nice to meet you!” The rabbit stopped, surprised, and came back. They sat together for a moment, sharing the calm of the woods.

From that day on, Fig had a new friend. Every afternoon, the two of them would meet in the same spot, enjoying the quiet companionship of an unlikely friendship. Fig's adventurous heart had found a little peace in the simple joy of being with his new friend.
"""

# set up the input prompt
input = f'{story}\n\nQuestion:{question}'

# write down what the model is expected to respond with
# NOTE: not all judges require an expected answer; refer to the judge implementations
expected = "I don't know"

# get the model output
output = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        {
            'role': 'user', 
            'content': input,
        },
    ],
).choices[0].message.content
```

### Use a `judges` classifier LLM as an evaluator model

```python
from judges.classifiers.correctness import PollMultihopCorrectness

# use the correctness classifier to determine if the first model
# answered correctly
correctness = PollMultihopCorrectness(model='gpt-4o-mini')

judgment = correctness.judge(
    input=input,
    output=output,
    expected=expected,
)
print(judgment.reasoning)
# The 'Answer' provided ('I don't know') matches the 'Reference' text which also states 'I don't know'. Therefore, the 'Answer' correctly corresponds with the information given in the 'Reference'.

print(judgment.score)
# True
```

### Use a `Jury` for averaging and diversification

A jury of LLMs lets you combine the judgments of multiple models, which can yield more diverse and robust results.

```python
from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness, RAFTCorrectness

poll = PollMultihopCorrectness(model='gpt-4o')
raft = RAFTCorrectness(model='gpt-4o-mini')

jury = Jury(judges=[poll, raft], voting_method="average")

verdict = jury.vote(
    input=input,
    output=output,
    expected=expected,
)
print(verdict.score)
```

## Appendix

### Classifiers

| Judge Type | Category                 | Description                                                                                                                                                                                                                                                                                                                                                                                                                                  | Reference Paper                                                                                                                                                                                                                                                                                   | Python Import                                                                                                                           |
|------------|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| classifier | Factual Correctness      | Evaluates the factual correctness of a generated response against a reference answer using few-shot learning. It compares the provided answer with the reference answer to determine its accuracy.                                                                                                                                                                                                                                                 | [Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models](https://arxiv.org/abs/2404.18796)                                                                                                                                                             | ```from judges.classifiers.correctness import PollMultihopCorrectness```                                               |
| classifier | Factual Correctness      | Assesses the factual accuracy of an AI assistant's response against a reference answer without relying on example-based (few-shot) learning. It determines whether the provided answer aligns with the reference answer based on factual information.                                                                                                                                                                                         | [Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models](https://arxiv.org/abs/2404.18796)                                                                                                                                                             | ```from judges.classifiers.correctness import PollZeroShotCorrectness```                                               |
| classifier | Factual Correctness      | Evaluates the factual correctness of responses based on the KILT (Knowledge Intensive Language Tasks) version of Natural Questions. It uses few-shot learning to compare the AI assistant's response with the reference answer to assess accuracy.                                                                                                                                                                                         | [Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models](https://arxiv.org/abs/2404.18796)                                                                                                                                                             | ```from judges.classifiers.correctness import PollKiltNQCorrectness```                                                 |
| classifier | Factual Correctness      | Assesses the factual correctness of responses based on the KILT version of HotpotQA. It utilizes few-shot learning to determine whether the AI assistant's response aligns with the reference answer, ensuring accuracy and consistency.                                                                                                                                                                                                  | [Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models](https://arxiv.org/abs/2404.18796)                                                                                                                                                             | ```from judges.classifiers.correctness import PollKiltHotpotCorrectness```                                             |
| classifier | Factual Correctness      | Evaluates the factual correctness of responses using LLAMA-based methods as part of the RAFT (Adapting Language Model to Domain Specific RAG) framework. It determines the accuracy of the AI assistant's response by comparing it with the reference answer based on predefined criteria.                                                                                                                                                             | [RAFT: Adapting Language Model to Domain Specific RAG](https://arxiv.org/abs/2403.10131)                                                                                                                                                                                           | ```from judges.classifiers.correctness import RAFTCorrectness```                                                        |
| classifier | Hallucination Evaluation | Determines whether a provided answer contains non-factual or hallucinated information based on predefined hallucination types and world knowledge. It evaluates the factual integrity of the AI assistant's response to ensure accuracy and reliability.                                                                                                                                                                                   | [HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models](https://arxiv.org/abs/2305.11747)                                                                                                                                                            | ```from judges.classifiers.hallucination import HaluEvalAnswerNonFactual```                                            |
| classifier | Hallucination Evaluation | Evaluates whether a dialogue response contains non-factual or hallucinated information by analyzing the response against predefined hallucination types and world knowledge. It ensures that the AI assistant's responses are factually accurate within the context of the dialogue.                                                                                                                                                         | [HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models](https://arxiv.org/abs/2305.11747)                                                                                                                                                            | ```from judges.classifiers.hallucination import HaluEvalDialogueResponseNonFactual```                                  |
| classifier | Hallucination Evaluation | Assesses whether a document summary contains non-factual or hallucinated information by comparing it against the source document. It ensures the summary accurately reflects the content without introducing incorrect or fabricated details.                                                                                                                                                                                             | [HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models](https://arxiv.org/abs/2305.11747)                                                                                                                                                            | ```from judges.classifiers.hallucination import HaluEvalDocumentSummaryNonFactual```                                   |
| classifier | Harmfulness Evaluation   | Evaluates whether an AI assistant's response contains harmful content, including harmful, dangerous, unethical, illegal, or hostile information. It assesses the response for alignment with trustworthy language model guidelines to ensure the safety and reliability of the output.                                                                                                                                                                          | [Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment](https://arxiv.org/abs/2308.05374)                                                                                                                                                     | ```from judges.classifiers.harmfulness import TrustworthyLLMHarmfulness```                                           |
| classifier | Query Quality Evaluation | Evaluates the quality of a query based on clarity, specificity, and relevance. It assesses whether the query is well-structured and aligned with the desired information retrieval objectives, ensuring that the queries facilitate accurate and relevant responses from the AI assistant.                                                                                                                                                                        | [FactAlign: Long-form Factuality Alignment of Large Language Models](https://arxiv.org/abs/2410.01691)                                                                                                                                                                              | ```from judges.classifiers.query_quality import FactAlignQueryQuality```                                               |
| classifier | Refusal Evaluation       | Evaluates whether an AI assistant's response refuses to complete a given task. It determines if the response is a refusal based on predefined criteria, ensuring that the AI adheres to ethical guidelines and alignment policies when declining to assist with certain requests.                                                                                                                                                                         | [Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment](https://arxiv.org/abs/2308.05374)                                                                                                                                                     | ```from judges.classifiers.refusal import TrustworthyLLMRefusal```                                                   |

---


### Graders

| Judge Type | Category                       | Description                                                                                                                                                                                                                                                                                                                                                                                                                              | Reference Paper                                                                                                                                                             | Python Import                                                                                                                           |
|------------|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| grader     | Factual Correctness            | Evaluates the correctness of a response in an Absolute Grading setting, according to a scoring rubric.                                                                                                                                                                                                                                                                                                                                 | [Prometheus: Inducing Fine-grained Evaluation Capability in Language Models](https://arxiv.org/abs/2310.08491)                                                               | `from judges.graders.correctness import PrometheusAbsoluteCoarseCorrectness`                                                        |
| grader     | Empathy Evaluation             | Evaluates the response of a model based on its ability to recognize implicit emotions in a statement, using a 3-point scale.                                                                                                                                                                                                                                                                                                                  | [EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models](https://arxiv.org/abs/2409.13359)                                                             | `from judges.graders.empathy import EmotionQueenImplicitEmotionRecognition`                                                          |
| grader     | Empathy Evaluation             | Evaluates the response of a model based on its ability to recognize the hidden intention in a statement, using a 3-point scale.                                                                                                                                                                                                                                                                                                                   | [EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models](https://arxiv.org/abs/2409.13359)                                                             | `from judges.graders.empathy import EmotionQueenIntentionRecognition`                                                               |
| grader     | Empathy Evaluation             | Evaluates the response of a model based on its ability to recognize the key event in a statement, using a 3-point scale.                                                                                                                                                                                                                                                                                                                              | [EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models](https://arxiv.org/abs/2409.13359)                                                             | `from judges.graders.empathy import EmotionQueenKeyEventRecognition`                                                                 |
| grader     | Empathy Evaluation             | Evaluates the response of a model based on its ability to recognize both the significant and trivial events in a statement, using a 3-point scale.                                                                                                                                                                                                                                                                                                | [EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models](https://arxiv.org/abs/2409.13359)                                                             | `from judges.graders.empathy import EmotionQueenMixedEventRecognition`                                                               |
| grader     | Information Coverage Evaluation | Determines whether a specific reference insight is fully, partially, or not covered within a provided list of bullet points, ensuring comprehensive coverage of key information.                                                                                                                                                                                                                                                                 | [Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems](https://arxiv.org/abs/2407.01370)                                                                 | `from judges.graders.information_coverage import HaystackBulletPointCoverageCorrectness`                                               |
| grader     | Content Moderation             | Categorizes user input texts into predefined categories such as hate, self-harm, sexual content, etc., determining the appropriate category based on the content's nature and intent.                                                                                                                                                                                                                                                       | [OR-Bench: An Over-Refusal Benchmark for Large Language Models](https://arxiv.org/abs/2405.20947)                                                                        | `from judges.graders.moderator import ORBenchUserInputModeration`                                                                   |
| grader     | Content Moderation             | Classifies model output texts into one of eleven predefined categories related to content sensitivity and appropriateness, ensuring responses adhere to ethical and safety guidelines.                                                                                                                                                                                                                                                 | [OR-Bench: An Over-Refusal Benchmark for Large Language Models](https://arxiv.org/abs/2405.20947)                                                                        | `from judges.graders.moderator import ORBenchUserOutputModeration`                                                                  |
| grader     | Query Quality Evaluation       | Assesses the quality of user queries based on clarity, specificity, and coherence, ensuring that queries are well-structured to facilitate accurate and relevant responses.                                                                                                                                                                                                                                                                  | [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://api.semanticscholar.org/CorpusID:270391432)                                   | `from judges.graders.query_quality import MagpieQueryQuality`                                                                         |
| grader     | Refusal Evaluation             | Classifies AI assistant responses into direct_answer, direct_refusal, or indirect_refusal, evaluating whether the assistant appropriately refuses to answer certain prompts based on ethical guidelines.                                                                                                                                                                                                                                         | [OR-Bench: An Over-Refusal Benchmark for Large Language Models](https://arxiv.org/abs/2405.20947)                                                                        | `from judges.graders.refusal_detection import ORBenchRefusalDetection`                                                                 |
| grader     | Relevance Evaluation           | Evaluates the relevance of a passage to a query based on a four-point scale: Irrelevant, Related, Highly relevant, Perfectly relevant. Ensures that the passage adequately addresses the query with varying degrees of relevance.                                                                                                                                                                                                          | [Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I.](https://doi.org/10.1145/3637528.3671883)                                         | `from judges.graders.relevance import ReliableCIRelevance`                                                                             |
| grader     | Response Quality Evaluation    | Evaluates the quality of the AI assistant's response based on helpfulness, relevance, accuracy, depth, creativity, and level of detail, assigning a numerical grade.                                                                                                                                                                                                                                                                      | [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)                                                                               | `from judges.graders.response_quality import MTBenchChatBotResponseQuality`                                                            |

---

            
