judgezoo 0.1.4

Summary: A collection of judges for evaluating LLM model output for safety & toxicity with a standardized API.
Author: Tim Beyer
Requires-Python: >=3.8
Keywords: llm, safety, evaluation, ai-safety, machine-learning
Project URL: https://github.com/LLM-QC/judgezoo
Uploaded: 2025-09-08 16:14:41

            # <img src="logo.png" alt="JudgeZoo Logo" width="50" height="50" style="vertical-align: middle;"> JudgeZoo

This repo provides access to a set of commonly used LLM-based safety judges via a simple and consistent API.
Our main focus is ease of use, correctness, and reproducibility.


## The Problem

Evaluations in the safety/robustness literature are highly fragmented, with researchers using different judges, prompting strategies, and model versions.
This fragmentation makes reproducibility nearly impossible and creates barriers to fair comparison between methods.
JudgeZoo addresses this by providing standardized, tested implementations of commonly used safety judges, ensuring consistent evaluation and reducing the likelihood of implementation errors that could compromise results.


## How to use

You can create a judge model instance with a single line of code:
```python3
from judgezoo import Judge

judge = Judge.from_name("strong_reject")
```

To get safety scores, just pass a list of conversations to score:
```python3
harmless_conversation = [
    {"role": "user", "content": "How do I make a birthday cake?"},
    {"role": "assistant", "content": "Step 1: Collect ingredients..."}
]

scores = judge([harmless_conversation])
print(scores)
>>> {"p_harmful": [0.02496337890625]}
```
All judges return `"p_harmful"`, a normalized harmfulness score between 0 and 1.
Depending on the original setup, a judge may also produce discrete scores or harm categories (e.g., on a Likert scale).
In these cases, the raw scores are returned as well:

```python3
judge = Judge.from_name("adaptive_attacks")

# `harmful_conversation` follows the same format as `harmless_conversation` above
scores = judge([harmful_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}
```
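
Since a judge takes a list of conversations, several responses can be scored in a single call. Below is a minimal sketch, assuming (as in the single-conversation example above) that the entries of `"p_harmful"` line up with the input order; the second conversation is purely illustrative:

```python3
from judgezoo import Judge

judge = Judge.from_name("strong_reject")

conversations = [
    [
        {"role": "user", "content": "How do I make a birthday cake?"},
        {"role": "assistant", "content": "Step 1: Collect ingredients..."},
    ],
    [
        {"role": "user", "content": "How do I store leftover cake?"},
        {"role": "assistant", "content": "Wrap it tightly and refrigerate..."},
    ],
]

# One score per conversation, in input order.
scores = judge(conversations)
print(scores["p_harmful"])
```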


## Included judges
| Name                   | Argument              | Creator (Org/Researcher)     | Link to Paper                                       | Type         | Fine-tuned from        |
| ---------------------- | --------------------- | ---------------------------- | ----------------------------------------------------| ------------ | ---------------------- |
| Adaptive Attacks       | `adaptive_attacks`    | Andriushchenko et al. (2024) | [arXiv:2404.02151](https://arxiv.org/abs/2404.02151)| prompt-based | —                      |
| AdvPrefix              | `advprefix`           | Zhu et al. (2024)            | [arXiv:2412.10321](https://arxiv.org/abs/2412.10321)| prompt-based | —                      |
| AegisGuard*            | `aegis_guard`         | Ghosh et al. (2024)          | [arXiv:2404.05993](https://arxiv.org/abs/2404.05993)| fine-tuned   | LlamaGuard 7B          |
| HarmBench              | `harmbench`           | Mazeika et al. (2024)        | [arXiv:2402.04249](https://arxiv.org/abs/2402.04249)| fine-tuned   | Gemma 2B               |
| JailJudge              | `jail_judge`          | Liu et al. (2024)            | [arXiv:2410.12855](https://arxiv.org/abs/2410.12855)| fine-tuned   | Llama 2 7B             |
| Llama Guard 3          | `llama_guard_3`       | Llama Team, AI @ Meta (2024) | [arXiv:2407.21783](https://arxiv.org/abs/2407.21783)| fine-tuned   | Llama 3 8B             |
| Llama Guard 4          | `llama_guard_4`       | Llama Team, AI @ Meta (2024) | [Meta blog](https://ai.meta.com/blog/llama-4-multimodal-intelligence/) | fine-tuned | Llama 4 12B |
| MD-Judge (v0.1 & v0.2) | `md_judge`            | Li, Lijun et al. (2024)      | [arXiv:2402.05044](https://arxiv.org/abs/2402.05044)| fine-tuned   | Mistral 7B / InternLM2 7B |
| StrongREJECT           | `strong_reject`       | Souly et al. (2024)          | [arXiv:2402.10260](https://arxiv.org/abs/2402.10260)| fine-tuned   | Gemma 2B               |
| StrongREJECT (rubric)  | `strong_reject_rubric`| Souly et al. (2024)          | [arXiv:2402.10260](https://arxiv.org/abs/2402.10260)| prompt-based | —                      |
| XSTestJudge            | `xstest`              | Röttger et al. (2023)        | [arXiv:2308.01263](https://arxiv.org/abs/2308.01263)| prompt-based | —                      |

\* There are two versions of this judge (permissive and defensive). You can switch between them using `Judge.from_name("aegis_guard", defensive=True)` or `defensive=False`, as in the sketch below.
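
For example, the two AegisGuard variants can be instantiated side by side to compare their judgments. This sketch only uses the call documented above and reuses `harmless_conversation` from the usage example:

```python3
from judgezoo import Judge

# Compare the permissive and defensive AegisGuard variants on the same input.
permissive_judge = Judge.from_name("aegis_guard", defensive=False)
defensive_judge = Judge.from_name("aegis_guard", defensive=True)

# `harmless_conversation` as defined in the usage example above.
print(permissive_judge([harmless_conversation]))
print(defensive_judge([harmless_conversation]))
```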


## Other
### Prompt-based judges

While some judges (such as the HarmBench classifier) are fine-tuned local models, others rely on prompted foundation models.
Currently, we support local foundation models and OpenAI models:

```python3
judge = Judge.from_name("adaptive_attacks", use_local_model=False, remote_foundation_model="gpt-4o")

scores = judge([harmless_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}
```

```python3
judge = Judge.from_name("adaptive_attacks", use_local_model=True)

scores = judge([harmless_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}
```

When not specified, the defaults in `config.py` are used.

### Multi-turn interaction

Judges vary in how much of a conversation they can evaluate; many models only work for single-turn interactions.
In these cases, we treat the first user message as the prompt and the final assistant message as the response to be judged.
If this behavior is not what you want, pass only single-turn conversations.
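
For illustration, that reduction corresponds roughly to the following sketch; the helper is hypothetical and not part of the JudgeZoo API:

```python3
def to_single_turn(conversation):
    """Hypothetical helper: keep only the first user message and the final
    assistant message, mirroring the assumption described above for
    single-turn judges."""
    prompt = next(m for m in conversation if m["role"] == "user")
    response = next(m for m in reversed(conversation) if m["role"] == "assistant")
    return [prompt, response]
```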

### Reproducibility

Wherever possible, we use official code directly provided by the original authors to ensure correctness.

We also warn if a user's setup diverges from the original implementation:

```python3
from judgezoo import Judge

judge = Judge.from_name("intention_analysis")
>>> WARNING:root:IntentionAnalysisJudge originally used gpt-3.5-turbo-0613, you are using gpt-4o. Results may differ from the original paper.
```


## Installation
```pip install judgezoo```


## Tests
To run all tests:

```pytest tests/ --runslow```


## Citation

If you use JudgeZoo in your research, please cite:

```bibtex
@software{judgezoo,
  title = {JudgeZoo: A Standardized Library for LLM Safety Judges},
  author = {Tim Beyer},
  year = {2025},
  url = {https://github.com/LLM-QC/judgezoo},
}
```

            
