# OpenJury 🏛️
**A Python SDK for evaluating and comparing multiple model outputs using configurable LLM-based jurors.**
---
## Overview
**OpenJury** is a post-inference ensemble framework that evaluates and compares multiple model outputs using configurable LLM-based jurors. It brings structured model assessment, ranking, and A/B testing directly into your Python apps, research workflows, or ML platforms.
At its core, OpenJury is a decision-level, LLM-driven evaluation system that aggregates juror scores using flexible voting strategies (e.g., weighted, ranked, or consensus). This makes it a powerful and extensible solution for nuanced, post-inference comparison of generated outputs across models, prompts, versions, or datasets.
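To make decision-level aggregation concrete, here is a minimal, library-independent sketch of weighted voting. The juror names, scores, and `weighted_winner` helper are illustrative only and do not reflect OpenJury's internal data structures or exact formula.

```python
# Illustrative only: plain-Python weighted aggregation, not OpenJury's internals.
def weighted_winner(
    scores: dict[str, dict[str, float]],  # juror name -> {response id -> score}
    weights: dict[str, float],            # juror name -> juror weight
) -> str:
    totals: dict[str, float] = {}
    for juror, per_response in scores.items():
        for response_id, score in per_response.items():
            totals[response_id] = totals.get(response_id, 0.0) + weights[juror] * score
    return max(totals, key=totals.get)


# Two hypothetical jurors scoring two candidate responses.
scores = {
    "senior_dev": {"response_1": 3.0, "response_2": 5.0},
    "reviewer": {"response_1": 5.0, "response_2": 2.0},
}
weights = {"senior_dev": 2.0, "reviewer": 1.0}
print(weighted_winner(scores, weights))  # response_2 (12.0 beats 11.0)
```

Here the higher-weighted juror's preference carries the vote; OpenJury's other voting methods (majority, average, ranked, consensus) trade off this kind of influence differently.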
### Why use an LLM Jury?
AI models can generate fluent, convincing outputs, but fluency is not correctness. Whether you're building a customer service agent, a code review assistant, or a content generator, you need to know which response is best, whether it is correct, and how models compare in quality and consistency. Human evaluation doesn't scale, which is why LLM-based jurors are widely used.
But relying on a single LLM (like GPT-4o) to evaluate model outputs, although common, is expensive and can introduce [intra-model bias](https://arxiv.org/abs/2404.13076). Research by Cohere [shows](https://arxiv.org/abs/2404.18796) that using a panel of smaller, diverse models not only cuts cost but also leads to more reliable and less biased evaluations.
OpenJury puts this into practice: instead of a single judge, it uses multiple jurors to score and explain outputs. The result? Better evaluations at lower cost, all configurable through a declarative interface.
---
## Key Features
- **Python SDK:** Simple integration, flexible configuration
- **Multi-Criteria Evaluation:** Define custom criteria with weights and scoring
- **Advanced Voting Methods:** Majority, average, weighted, ranked, consensus, or your own
- **Parallel Processing:** Evaluate at scale, concurrently (see the batch sketch after this list)
- **Rich Output:** Scores, explanations, voting breakdowns, and confidence metrics
- **Extensible:** Plug in your own jurors, voting logic, and evaluation strategies
- **Dev Experience:** One-command setup, Makefile workflow, and modern code quality tools
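The sketch below shows one way to fan a whole batch of prompts out from your own code using the documented `jury.evaluate` call from the Quick Start. The `ThreadPoolExecutor`, the `batch` variable, and the assumption that a single `OpenJury` instance can be shared across threads are additions for illustration, not SDK features.

```python
from concurrent.futures import ThreadPoolExecutor

from openjury import JuryConfig, OpenJury

# Assumes a jury_config.json like the one shown later in this README.
jury = OpenJury(JuryConfig.from_json_file("jury_config.json"))

# Hypothetical batch: each item pairs a prompt with its candidate responses.
batch = [
    ("Write a Python function to reverse a string",
     ["def reverse(s): return s[::-1]", "def reverse(s): return ''.join(reversed(s))"]),
    ("Write a Python function to check whether a string is a palindrome",
     ["def pal(s): return s == s[::-1]", "def pal(s): return list(s) == list(reversed(s))"]),
]

# Fan the batch out across threads; each call is the documented jury.evaluate.
# Sharing one OpenJury instance across threads is an assumption here -- create
# one instance per thread if you are unsure it is thread-safe.
with ThreadPoolExecutor(max_workers=4) as pool:
    verdicts = list(pool.map(lambda item: jury.evaluate(prompt=item[0], responses=item[1]), batch))

for (prompt, _), verdict in zip(batch, verdicts):
    print(prompt, "->", verdict.final_verdict.winner)
```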
---
## Installation
**Requirements:** Python 3.11 or newer
### Recommended (PyPI)
```bash
pip install openjury
```
### From Source (for development/contribution)
```bash
git clone https://github.com/robiscoding/openjury.git
cd openjury
pip install -e .
uv pip install -e ".[dev]" # (optional) dev dependencies
```
## Quick Start
### Set Environment Variables
```bash
export OPENROUTER_API_KEY="your-api-key"
```
Or, if you're using OpenAI:
```bash
export LLM_PROVIDER="openai"
export OPENAI_API_KEY="your-api-key"
```
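If you prefer to set credentials from Python (for example in a notebook) instead of exporting shell variables, a standard-library sketch like this is equivalent; the variable names match the exports above.

```python
import os

# Equivalent to the shell exports above; set these before constructing the jury.
os.environ["LLM_PROVIDER"] = "openai"          # omit to use the OpenRouter setup shown first
os.environ["OPENAI_API_KEY"] = "your-api-key"  # or set OPENROUTER_API_KEY instead
```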
### Basic Usage
```python
from openjury import OpenJury, JuryConfig
config = JuryConfig.from_json_file("jury_config.json")
jury = OpenJury(config)
verdict = jury.evaluate(
    prompt="Write a Python function to reverse a string",
    responses=[
        "def reverse(s): return s[::-1]",
        "def reverse(s): return ''.join(reversed(s))"
    ]
)
print(f"Winner: {verdict.final_verdict.winner}")
print(f"Confidence: {verdict.final_verdict.confidence:.2%}")
```
### Configuration Example (jury_config.json)
```json
{
  "name": "Code Quality Jury",
  "criteria": [
    {"name": "correctness", "weight": 2.0, "max_score": 5},
    {"name": "readability", "weight": 1.5, "max_score": 5}
  ],
  "jurors": [
    {
      "name": "Senior Developer",
      "system_prompt": "You are a senior developer. You are tasked with reviewing the code and providing a score and explanation for the correctness and readability of the code.",
      "model_name": "qwen/qwen-2.5-coder-32b",
      "weight": 2.0
    },
    {
      "name": "Code Reviewer",
      "system_prompt": "You are a code reviewer. You are tasked with reviewing the code and providing a score and explanation for the correctness and readability of the code.",
      "model_name": "llama3/llama-3.1-8b-instruct",
      "weight": 1.0
    }
  ],
  "voting_method": "weighted"
}
```
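If you would rather generate the config file from code than maintain it by hand, a plain-dict sketch like the one below writes the same JSON using only the standard library; nothing here assumes additional SDK surface, and the field values simply mirror the example above.

```python
import json

# Mirrors the jury_config.json example above; adjust names, weights, and models as needed.
config = {
    "name": "Code Quality Jury",
    "criteria": [
        {"name": "correctness", "weight": 2.0, "max_score": 5},
        {"name": "readability", "weight": 1.5, "max_score": 5},
    ],
    "jurors": [
        {
            "name": "Senior Developer",
            "system_prompt": (
                "You are a senior developer. You are tasked with reviewing the code and providing "
                "a score and explanation for the correctness and readability of the code."
            ),
            "model_name": "qwen/qwen-2.5-coder-32b",
            "weight": 2.0,
        },
        {
            "name": "Code Reviewer",
            "system_prompt": (
                "You are a code reviewer. You are tasked with reviewing the code and providing "
                "a score and explanation for the correctness and readability of the code."
            ),
            "model_name": "llama3/llama-3.1-8b-instruct",
            "weight": 1.0,
        },
    ],
    "voting_method": "weighted",
}

with open("jury_config.json", "w") as f:
    json.dump(config, f, indent=2)
```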
### Examples
You can find more examples in the [examples](examples) directory.
### Use Cases
#### Model Evaluation & Comparison
- Compare outputs from different models (e.g., GPT-4 vs Claude vs custom models)
- Run A/B tests across prompt variations, fine-tuned models, or versions (see the sketch below)
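As a concrete illustration of the A/B-testing case, the sketch below feeds two freshly generated candidates into the jury. The `call_model_a` / `call_model_b` helpers are placeholders for your own model clients and are not part of OpenJury; only `OpenJury`, `JuryConfig.from_json_file`, and `jury.evaluate` are the documented calls from the Quick Start.

```python
from openjury import JuryConfig, OpenJury

# Placeholder generators -- swap in real calls to the two systems you are comparing.
def call_model_a(prompt: str) -> str:
    return "def reverse(s): return s[::-1]"                # e.g. output of model/prompt variant A

def call_model_b(prompt: str) -> str:
    return "def reverse(s): return ''.join(reversed(s))"   # e.g. output of model/prompt variant B

jury = OpenJury(JuryConfig.from_json_file("jury_config.json"))

prompt = "Write a Python function to reverse a string"
responses = [call_model_a(prompt), call_model_b(prompt)]

verdict = jury.evaluate(prompt=prompt, responses=responses)
print(f"Winner: {verdict.final_verdict.winner}")
```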
#### Content & Response Quality
- Evaluate generated code for correctness and readability
- Score long-form content (blogs, papers, explanations) for clarity, tone, or coherence
#### Automated Grading & Assessment
- Grade student answers or interview responses at scale
- Score generated outputs against rubric-style criteria
#### Production Monitoring & QA
- Monitor output quality in production systems
- Detect degradation or drift between model versions
#### Custom Evaluation Workflows
- Integrate LLM-based judgment into human-in-the-loop pipelines
- Use configurable jurors and voting for domain-specific tasks
---
## License
OpenJury is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.
---
## Contributing
Contributions are welcome! Please see the [CONTRIBUTING.md](CONTRIBUTING.md) file for details.