| Field | Value |
| --- | --- |
| Name | judgezoo |
| Version | 0.1.1 |
| Summary | A collection of judges for evaluating LLM model output for safety & toxicity with a standardized API. |
| Author | Tim Beyer |
| Maintainer | None |
| Home page | None |
| Docs URL | None |
| Requires Python | >=3.8 |
| License | None |
| Keywords | llm, safety, evaluation, ai-safety, machine-learning |
| Requirements | None recorded |
| Upload time | 2025-07-16 13:49:03 |
# JudgeZoo
This repo provides access to a set of commonly used LLM-based safety judges via a simple and consistent API.
Our main focus is ease of use, correctness, and reproducibility.
## How to use
You can create a judge model instance with a single line of code:
```python3
from judgezoo import Judge

judge = Judge.from_name("strong_reject")
```
To get safety scores, just pass a list of conversations to score:
```python3
harmless_conversation = [
    {"role": "user", "content": "How do I make a birthday cake?"},
    {"role": "assistant", "content": "Step 1: Collect ingredients..."}
]

scores = judge([harmless_conversation])
print(scores)
>>> {"p_harmful": [0.02496337890625]}
```
All judges return `"p_harmful"`, which is a normalized score from 0 to 1.
Depending on the original setup, the judge may also return discrete scores or harm categories (e.g. on a Likert scale).
In these cases, the raw scores are also returned:
```python3
judge = Judge.from_name("adaptive_attacks")
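# harmful_conversation is assumed to be defined like harmless_conversation
# above, but containing a request/response pair the judge should flag as harmful.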
scores = judge([harmful_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}
```
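Since judges take a list of conversations, you can score several in one call; the values in `p_harmful` line up with the inputs (assuming the list-valued output shown in the first example). The 0.5 cutoff below is just an illustrative threshold, not something the library prescribes:

```python3
# Score a batch of conversations in one call.
batch = [harmless_conversation, harmful_conversation]
scores = judge(batch)

# One p_harmful value per input conversation; 0.5 is an arbitrary example cutoff.
flagged = [p > 0.5 for p in scores["p_harmful"]]
print(flagged)
```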
## Included judges
| Name | Argument | Creator (Org/Researcher) | Link to Paper | Type | Fine-tuned from |
| ---------------------- | --------------------- | ---------------------------- | ----------------------------------------------------| ------------ | ---------------------- |
| Adaptive Attacks | `adaptive_attacks` | Andriushchenko et al. (2024) | [arXiv:2404.02151](https://arxiv.org/abs/2404.02151)| prompt-based | — |
| AdvPrefix | `advprefix` | Zhu et al. (2024) | [arXiv:2412.10321](https://arxiv.org/abs/2412.10321)| prompt-based | — |
| AegisGuard* | `aegis_guard` | Ghosh et al. (2024) | [arXiv:2404.05993](https://arxiv.org/abs/2404.05993)| fine-tuned | LlamaGuard 7B |
| HarmBench | `harmbench` | Mazeika et al. (2024) | [arXiv:2402.04249](https://arxiv.org/abs/2402.04249)| fine-tuned | Gemma 2B |
| JailJudge | `jail_judge` | Liu et al. (2024) | [arXiv:2410.12855](https://arxiv.org/abs/2410.12855) | fine-tuned | Llama 2 7B |
| Llama Guard 3 | `llama_guard_3` | Llama Team, AI @ Meta (2024) | [arXiv:2407.21783](https://arxiv.org/abs/2407.21783)| fine-tuned | Llama 3 8B |
| Llama Guard 4 | `llama_guard_4` | Llama Team, AI @ Meta (2024) | [Meta blog](https://ai.meta.com/blog/llama-4-multimodal-intelligence/) | fine-tuned | Llama 4 12B |
| MD-Judge (v0.1 & v0.2) | `md_judge` | Li et al. (2024) | [arXiv:2402.05044](https://arxiv.org/abs/2402.05044) | fine-tuned | Mistral 7B / InternLM2 7B |
| StrongREJECT | `strong_reject` | Souly et al. (2024) | [arXiv:2402.10260](https://arxiv.org/abs/2402.10260) | fine-tuned | Gemma 2B |
| StrongREJECT (rubric) | `strong_reject_rubric` | Souly et al. (2024) | [arXiv:2402.10260](https://arxiv.org/abs/2402.10260) | prompt-based | — |
| XSTestJudge | `xstest` | Röttger et al. (2023) | [arXiv:2308.01263](https://arxiv.org/abs/2308.01263) | prompt-based | — |
\* There are two versions of this judge (permissive and defensive). You can switch between them with `Judge.from_name("aegis_guard", defensive=True)` or `defensive=False`, as in the example below.
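For example, to load the two AegisGuard variants side by side:

```python3
# The two AegisGuard variants, selected via the `defensive` flag
aegis_permissive = Judge.from_name("aegis_guard", defensive=False)
aegis_defensive = Judge.from_name("aegis_guard", defensive=True)
```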
## Other
### Prompt-based judges
While some judges (such as the HarmBench classifier) are fine-tuned local models, others rely on prompted foundation models.
Currently, we support local foundation models and OpenAI models:
```python3
judge = Judge.from_name("adaptive_attacks", use_local_model=False, remote_foundation_model="gpt-4o")
scores = judge([harmless_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}
```
```python3
judge = Judge.from_name("adaptive_attacks", use_local_model=True)
scores = judge([harmless_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}
```
When not specified, the defaults in `config.py` are used.
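The exact contents of `config.py` are not shown here; conceptually, it holds the fallback values for these arguments. The names below are hypothetical placeholders rather than the package's real constants:

```python3
# Hypothetical sketch of the kind of defaults config.py might define.
# These constant names are illustrative placeholders, not the actual API.
USE_LOCAL_MODEL = True                # prefer a local foundation model unless told otherwise
REMOTE_FOUNDATION_MODEL = "gpt-4o"    # OpenAI model used when use_local_model=False
```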
### Multi-turn interaction
Judges vary in how much of a conversation they can evaluate; many models only work on single-turn interactions.
In those cases, we treat the first user message as the prompt and the final assistant message as the response to be judged.
If you prefer different behavior, pass only single-turn conversations.
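In other words, for a single-turn-only judge a longer conversation collapses to one (prompt, response) pair. Here is a minimal sketch of that convention (illustrative only, not the library's internal code):

```python3
def reduce_to_single_turn(conversation):
    """Illustrative helper mirroring the convention described above: the first
    user message is the prompt, the final assistant message is the response."""
    prompt = next(m["content"] for m in conversation if m["role"] == "user")
    response = next(m["content"] for m in reversed(conversation) if m["role"] == "assistant")
    return prompt, response

multi_turn = [
    {"role": "user", "content": "How do I make a birthday cake?"},
    {"role": "assistant", "content": "Step 1: Collect ingredients..."},
    {"role": "user", "content": "How long should I bake it?"},
    {"role": "assistant", "content": "Roughly 30 minutes at 180 °C."},
]
print(reduce_to_single_turn(multi_turn))
```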
### Reproducibility
Wherever possible, we use official code directly provided by the original authors to ensure correctness.
In addition, we warn when a user's setup diverges from the original implementation:
```python3
from judgezoo import Judge
judge = Judge.from_name("intention_analysis")
>>> WARNING:root:IntentionAnalysisJudge originally used gpt-3.5-turbo-0613, you are using gpt-4o. Results may differ from the original paper.
```
## Installation
```pip install judgezoo```
## Tests
To run all tests, run:
```pytest tests/ --runslow```
Raw data

```json
{
    "_id": null,
    "home_page": null,
    "name": "judgezoo",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "llm, safety, evaluation, ai-safety, machine-learning",
    "author": "Tim Beyer",
    "author_email": "Tim Beyer <tim.beyer@tum.de>",
    "download_url": "https://files.pythonhosted.org/packages/b5/44/499aab1522e99dc6c3e7a4ebbfebda627019672140b666359d9ca9b6f095/judgezoo-0.1.1.tar.gz",
    "platform": null,
"description": "# JudgeZoo\n\nThis repo provides access to a set of commonly used LLM-based safety judges via a simple and consistent API.\nOur main focus is ease-of-use, correctness, and reproducibility.\n\n\n## How to use\n\nYou can create a judge model instance with a single line of code:\n```python3\njudge = Judge.from_name(\"strong_reject\")\n```\n\nTo get safety scores, just pass a list of conversations to score:\n```python3\nharmless_conversation = [\n {\"role\": \"user\", \"content\": \"How do I make a birthday cake?\"},\n {\"role\": \"assistant\", \"content\": \"Step 1: Collect ingredients...\"}\n]\n\nscores = judge([harmless_conversation])\nprint(scores)\n>>> {\"p_harmful\": [0.02496337890625]}\n```\nAll judges return `\"p_harmful\"`, which is a normalized score from 0 to 1.\nDepending on the original setup, the judge may also return discrete scores or harm categories (e.g. on a Likert scale).\nIn these cases, the raw scores are also returned:\n\n```python3\njudge = Judge.from_name(\"adaptive_attacks\")\n\nscores = judge([harmful_conversation])\nprint(scores)\n>>> {\"p_harmful\": 0.0, \"rating\": \"1\"}\n```\n\n\n## Included judges\n| Name | Argument | Creator (Org/Researcher) | Link to Paper | Type | Fine-tuned from |\n| ---------------------- | --------------------- | ---------------------------- | ----------------------------------------------------| ------------ | ---------------------- |\n| Adaptive Attacks | `adaptive_attacks` | Andriushchenko et al. (2024) | [arXiv:2404.02151](https://arxiv.org/abs/2404.02151)| prompt-based | \u2014 |\n| AdvPrefix | `advprefix` | Zhu et al. (2024) | [arXiv:2412.10321](https://arxiv.org/abs/2412.10321)| prompt-based | \u2014 |\n| AegisGuard* | `aegis_guard` | Ghosh et al. (2024) | [arXiv:2404.05993](https://arxiv.org/abs/2404.05993)| fine-tuned | LlamaGuard 7B |\n| HarmBench | `harmbench` | Mazeika et al. (2024) | [arXiv:2402.04249](https://arxiv.org/abs/2402.04249)| fine-tuned | Gemma 2B |\n| JailJudge | `jail_judge` | Liu et al. (2024) | [arXiv:2410.12855](https://arxiv.org/abs/2410.12855)| fine-tuned | Llama 2 7B\n| Llama Guard 3 | `llama_guard_3` | Llama Team, AI @ Meta (2024) | [arXiv:2407.21783](https://arxiv.org/abs/2407.21783)| fine-tuned | Llama 3 8B |\n| Llama Guard 4 | `llama_guard_4` | Llama Team, AI @ Meta (2024) | [Meta blog](https://ai.meta.com/blog/llama-4-multimodal-intelligence/) | fine-tuned | Llama 4 12B |\n| MD-Judge (v0.1 & v0.2) | `md_judge` | Li, Lijun et al. (2024) | [arXiv:2402.05044](https://arxiv.org/abs/2402.05044)| fine-tuned | Mistral-7B/LMintern2 7B|\n| StrongREJECT | `strong_reject` | Souly et al. (2024) | [arXiv:2402.10260](https://arxiv.org/abs/2402.10260)| fine-tuned | Gemma 2b |\n| StrongREJECT (rubric) | `strong_reject_rubric`| Souly et al. (2024) | [arXiv:2402.10260](https://arxiv.org/abs/2402.10260)| prompt-based | - |\n| XSTestJudge | `xstest` | R\u00f6ttge et al. (2023) | [arXiv:2308.01263](https://arxiv.org/abs/2308.01263)| prompt-based | \u2014 |\n\n\\* there are two versions of this judge (permissive and defensive). 
You can switch between them using `Judge.from_name(\"aegis_guard\", defensive=[True/False])`\n\n\n## Other\n### Prompt-based judges\n\nWhile some judges (such as the HarmBench classifier) are finetuned local models, others rely on prompted foundation models.\nCurrently, we support local foundation models and OpenAI models:\n\n```python3\njudge = Judge.from_name(\"adaptive_attacks\", use_local_model=False, remote_foundation_model=\"gpt-4o\")\n\nscores = judge([harmless_conversation])\nprint(scores)\n>>> {\"p_harmful\": 0.0, \"rating\": \"1\"}\n```\n\n```python3\njudge = Judge.from_name(\"adaptive_attacks\", use_local_model=True)\n\nscores = judge([harmless_conversation])\nprint(scores)\n>>> {\"p_harmful\": 0.0, \"rating\": \"1\"}\n```\n\nWhen not specified, the defaults in `config.py` are used.\n\n### Multi-turn interaction\n\nJudges vary in how much of a conversation they can evaluate - many models only work for single-turn interactions.\nIn these cases, we assume the first user message to be the prompt and the final assistant message to be the response to be judged.\nIf you prefer a different setup, you can pass only single-turn conversations.\n\n### Reproducibility\n\nWherever possible, we use official code directly provided by the original authors to ensure correctness.\n\nFinally, we warn if a user's setup diverges from the original implementation:\n\n```python3\nfrom judgezoo import Judge\n\njudge = Judge.from_name(\"intention_analysis\")\n>>> WARNING:root:IntentionAnalysisJudge originally used gpt-3.5-turbo-0613, you are using gpt-4o. Results may differ from the original paper.\n```\n\n\n## Installation\n```pip install judgezoo```\n\n\n## Tests\nTo run all tests, run\n\n```pytest tests/ --runslow```\n",
"bugtrack_url": null,
"license": null,
"summary": "A collection of judges for evaluating LLM model output for safety & toxicity with a standardized API.",
"version": "0.1.1",
"project_urls": {
"Homepage": "https://github.com/LLM-QC/judgezoo",
"Issues": "https://github.com/LLM-QC/judgezoo/issues",
"Repository": "https://github.com/LLM-QC/judgezoo"
},
"split_keywords": [
"llm",
" safety",
" evaluation",
" ai-safety",
" machine-learning"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "ad4a3254518c20ed1bc6242587d2b12b03cf65ec19d58c011c5bab7a16da3722",
"md5": "481c6ed623a69c05e3197b647b5dbeff",
"sha256": "0c88c161f11efbcab4d356d1ed9cd952e09c8cc99401782e22e34211321fce3d"
},
"downloads": -1,
"filename": "judgezoo-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "481c6ed623a69c05e3197b647b5dbeff",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 37947,
"upload_time": "2025-07-16T13:49:00",
"upload_time_iso_8601": "2025-07-16T13:49:00.677168Z",
"url": "https://files.pythonhosted.org/packages/ad/4a/3254518c20ed1bc6242587d2b12b03cf65ec19d58c011c5bab7a16da3722/judgezoo-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b544499aab1522e99dc6c3e7a4ebbfebda627019672140b666359d9ca9b6f095",
"md5": "111dc0690ffd1f63219648d32abf6d8d",
"sha256": "3d49918e57bbc4cd2833eae848a606f9e6a3ec64f0f16d8049ea1df1d2125d8f"
},
"downloads": -1,
"filename": "judgezoo-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "111dc0690ffd1f63219648d32abf6d8d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 54103,
"upload_time": "2025-07-16T13:49:03",
"upload_time_iso_8601": "2025-07-16T13:49:03.056176Z",
"url": "https://files.pythonhosted.org/packages/b5/44/499aab1522e99dc6c3e7a4ebbfebda627019672140b666359d9ca9b6f095/judgezoo-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-16 13:49:03",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "LLM-QC",
"github_project": "judgezoo",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "judgezoo"
}