judgezoo

Name: judgezoo
Version: 0.1.1
Home page: None
Summary: A collection of judges for evaluating LLM model output for safety & toxicity with a standardized API.
Upload time: 2025-07-16 13:49:03
Maintainer: None
Docs URL: None
Author: Tim Beyer
Requires Python: >=3.8
License: None
Keywords: llm, safety, evaluation, ai-safety, machine-learning
Project URLs: https://github.com/LLM-QC/judgezoo (Homepage/Repository), https://github.com/LLM-QC/judgezoo/issues (Issues)
Requirements: No requirements were recorded.

# JudgeZoo

This repo provides access to a set of commonly used LLM-based safety judges via a simple and consistent API.
Our main focus is ease-of-use, correctness, and reproducibility.


## How to use

You can create a judge model instance with a single line of code:
```python3
from judgezoo import Judge

judge = Judge.from_name("strong_reject")
```

To get safety scores, just pass a list of conversations to score:
```python3
harmless_conversation = [
    {"role": "user", "content": "How do I make a birthday cake?"},
    {"role": "assistant", "content": "Step 1: Collect ingredients..."}
]

scores = judge([harmless_conversation])
print(scores)
>>> {"p_harmful": [0.02496337890625]}
```
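
Since the judge accepts a list, several conversations can be scored in one call. A minimal sketch (the second conversation is made up here purely to illustrate batching):
```python3
another_conversation = [
    {"role": "user", "content": "What's an easy bread recipe for beginners?"},
    {"role": "assistant", "content": "A no-knead loaf is a good start: mix flour, water, salt, and yeast..."}
]

scores = judge([harmless_conversation, another_conversation])
# Expect one p_harmful entry per conversation, in the same order
```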
All judges return `"p_harmful"`, which is a normalized score from 0 to 1.
Depending on the original setup, the judge may also return discrete scores or harm categories (e.g. on a Likert scale).
In these cases, the raw scores are also returned:

```python3
judge = Judge.from_name("adaptive_attacks")

scores = judge([harmless_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}
```


## Included judges
| Name                   | Argument              | Creator (Org/Researcher)     | Link to Paper                                       | Type         | Fine-tuned from        |
| ---------------------- | --------------------- | ---------------------------- | ----------------------------------------------------| ------------ | ---------------------- |
| Adaptive Attacks       | `adaptive_attacks`    | Andriushchenko et al. (2024) | [arXiv:2404.02151](https://arxiv.org/abs/2404.02151)| prompt-based | —                      |
| AdvPrefix              | `advprefix`           | Zhu et al. (2024)            | [arXiv:2412.10321](https://arxiv.org/abs/2412.10321)| prompt-based | —                      |
| AegisGuard*            | `aegis_guard`         | Ghosh et al. (2024)          | [arXiv:2404.05993](https://arxiv.org/abs/2404.05993)| fine-tuned   | LlamaGuard 7B          |
| HarmBench              | `harmbench`           | Mazeika et al. (2024)        | [arXiv:2402.04249](https://arxiv.org/abs/2402.04249)| fine-tuned   | Gemma 2B               |
| JailJudge              | `jail_judge`          | Liu et al. (2024)            | [arXiv:2410.12855](https://arxiv.org/abs/2410.12855)| fine-tuned   | Llama 2 7B             |
| Llama Guard 3          | `llama_guard_3`       | Llama Team, AI @ Meta (2024) | [arXiv:2407.21783](https://arxiv.org/abs/2407.21783)| fine-tuned   | Llama 3 8B             |
| Llama Guard 4          | `llama_guard_4`       | Llama Team, AI @ Meta (2024) | [Meta blog](https://ai.meta.com/blog/llama-4-multimodal-intelligence/) | fine-tuned | Llama 4 12B |
| MD-Judge (v0.1 & v0.2) | `md_judge`            | Li, Lijun et al. (2024)      | [arXiv:2402.05044](https://arxiv.org/abs/2402.05044)| fine-tuned   | Mistral-7B/InternLM2 7B|
| StrongREJECT           | `strong_reject`       | Souly et al. (2024)          | [arXiv:2402.10260](https://arxiv.org/abs/2402.10260)| fine-tuned   | Gemma 2B               |
| StrongREJECT (rubric)  | `strong_reject_rubric`| Souly et al. (2024)          | [arXiv:2402.10260](https://arxiv.org/abs/2402.10260)| prompt-based | —                      |
| XSTestJudge            | `xstest`              | Röttger et al. (2023)        | [arXiv:2308.01263](https://arxiv.org/abs/2308.01263)| prompt-based | —                      |

\* There are two versions of this judge (permissive and defensive). You can switch between them using `Judge.from_name("aegis_guard", defensive=[True/False])`, as shown below.
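
Concretely, a minimal sketch of loading each variant (assuming `defensive=True` selects the defensive variant and `defensive=False` the permissive one):
```python3
# Defensive variant of AegisGuard
defensive_judge = Judge.from_name("aegis_guard", defensive=True)

# Permissive variant of AegisGuard
permissive_judge = Judge.from_name("aegis_guard", defensive=False)
```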


## Other
### Prompt-based judges

While some judges (such as the HarmBench classifier) are fine-tuned local models, others rely on prompted foundation models.
Currently, we support local foundation models and OpenAI models:

```python3
judge = Judge.from_name("adaptive_attacks", use_local_model=False, remote_foundation_model="gpt-4o")

scores = judge([harmless_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}
```

```python3
judge = Judge.from_name("adaptive_attacks", use_local_model=True)

scores = judge([harmless_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}
```

When not specified, the defaults in `config.py` are used.
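
Purely as an illustration (these names and values are assumptions, not the real defaults; the actual options are whatever `config.py` defines):
```python3
# Hypothetical config.py contents; illustrative only, check the actual file
USE_LOCAL_MODEL = False              # analogue of the use_local_model argument
REMOTE_FOUNDATION_MODEL = "gpt-4o"   # analogue of the remote_foundation_model argument
```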

### Multi-turn interaction

Judges vary in how much of a conversation they can evaluate; many models only work on single-turn interactions.
In these cases, we take the first user message as the prompt and the final assistant message as the response to be judged.
If you prefer a different setup, pass only single-turn conversations.
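
If you want that truncation to be explicit in your own code, a small helper along these lines (a sketch, not part of the library) reproduces the behavior described above:
```python3
def to_single_turn(conversation):
    # Hypothetical helper, not part of judgezoo: keep only the first user
    # message and the final assistant message of a multi-turn conversation.
    first_user = next(m for m in conversation if m["role"] == "user")
    last_assistant = next(m for m in reversed(conversation) if m["role"] == "assistant")
    return [first_user, last_assistant]

# long_conversation: any multi-turn chat history in the same message format
scores = judge([to_single_turn(long_conversation)])
```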

### Reproducibility

Wherever possible, we use official code directly provided by the original authors to ensure correctness.

We also warn when a user's setup diverges from the original implementation:

```python3
from judgezoo import Judge

judge = Judge.from_name("intention_analysis")
>>> WARNING:root:IntentionAnalysisJudge originally used gpt-3.5-turbo-0613, you are using gpt-4o. Results may differ from the original paper.
```


## Installation
```pip install judgezoo```


## Tests
To run all tests, use

```pytest tests/ --runslow```
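
To run only the tests for a single judge, pytest's standard `-k` keyword filter should work, assuming the test names mention the judge (an assumption about the test layout, not something documented here):

```pytest tests/ --runslow -k strong_reject```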

            
