# OpenJury 🏛️
**A Python SDK for evaluating and comparing multiple model outputs using configurable LLM-based jurors.**
---
## Overview
**OpenJury** is a post-inference ensemble framework that evaluates and compares multiple model outputs using configurable LLM-based jurors. It brings structured model assessment, ranking, and A/B testing directly into your Python apps, research workflows, or ML platforms.
At its core, OpenJury is a decision-level, LLM-driven evaluation system that aggregates juror scores using flexible voting strategies (e.g., weighted, ranked, or consensus). This makes it a powerful and extensible solution for nuanced, post-inference comparison of generated outputs across models, prompts, versions, or datasets.
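To make decision-level aggregation concrete, here is a minimal, library-independent sketch of weighted voting. The juror names, scores, and `weighted_winner` helper are illustrative only and do not reflect OpenJury's internal data structures or exact formula.

```python
# Illustrative only: plain-Python weighted aggregation, not OpenJury's internals.
def weighted_winner(
    scores: dict[str, dict[str, float]],  # juror name -> {response id -> score}
    weights: dict[str, float],            # juror name -> juror weight
) -> str:
    totals: dict[str, float] = {}
    for juror, per_response in scores.items():
        for response_id, score in per_response.items():
            totals[response_id] = totals.get(response_id, 0.0) + weights[juror] * score
    return max(totals, key=totals.get)


# Two hypothetical jurors scoring two candidate responses.
scores = {
    "senior_dev": {"response_1": 3.0, "response_2": 5.0},
    "reviewer": {"response_1": 5.0, "response_2": 2.0},
}
weights = {"senior_dev": 2.0, "reviewer": 1.0}
print(weighted_winner(scores, weights))  # response_2 (12.0 beats 11.0)
```

Here the higher-weighted juror's preference carries the vote; OpenJury's other voting methods (majority, average, ranked, consensus) trade off this kind of influence differently.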
### Why use an LLM Jury?
AI models can generate fluent, convincing outputs, but fluency is not correctness. Whether you're building a customer service agent, a code review assistant, or a content generator, you need to know which response is best, whether it is correct, and how models compare in quality and consistency. Human evaluation doesn't scale, which is why LLM-based jurors are widely used.
But relying on a single LLM (like GPT-4o) to evaluate model outputs, although common, is expensive and can introduce [intra-model bias](https://arxiv.org/abs/2404.13076). Research by Cohere [shows](https://arxiv.org/abs/2404.18796) that using a panel of smaller, diverse models not only cuts cost but also leads to more reliable and less biased evaluations.
OpenJury puts this into practice: instead of a single judge, it uses multiple jurors to score and explain outputs. The result? Better evaluations at lower cost, all configurable through a declarative interface.
---
## Key Features
- **Python SDK:** Simple integration, flexible configuration
- **Multi-Criteria Evaluation:** Define custom criteria with weights and scoring
- **Advanced Voting Methods:** Majority, average, weighted, ranked, consensus, or your own
- **Parallel Processing:** Evaluate at scale, concurrently (see the batch sketch after this list)
- **Rich Output:** Scores, explanations, voting breakdowns, and confidence metrics
- **Extensible:** Plug in your own jurors, voting logic, and evaluation strategies
- **Dev Experience:** One-command setup, Makefile workflow, and modern code quality tools
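The sketch below shows one way to fan a whole batch of prompts out from your own code using the documented `jury.evaluate` call from the Quick Start. The `ThreadPoolExecutor`, the `batch` variable, and the assumption that a single `OpenJury` instance can be shared across threads are additions for illustration, not SDK features.

```python
from concurrent.futures import ThreadPoolExecutor

from openjury import JuryConfig, OpenJury

# Assumes a jury_config.json like the one shown later in this README.
jury = OpenJury(JuryConfig.from_json_file("jury_config.json"))

# Hypothetical batch: each item pairs a prompt with its candidate responses.
batch = [
    ("Write a Python function to reverse a string",
     ["def reverse(s): return s[::-1]", "def reverse(s): return ''.join(reversed(s))"]),
    ("Write a Python function to check whether a string is a palindrome",
     ["def pal(s): return s == s[::-1]", "def pal(s): return list(s) == list(reversed(s))"]),
]

# Fan the batch out across threads; each call is the documented jury.evaluate.
# Sharing one OpenJury instance across threads is an assumption here -- create
# one instance per thread if you are unsure it is thread-safe.
with ThreadPoolExecutor(max_workers=4) as pool:
    verdicts = list(pool.map(lambda item: jury.evaluate(prompt=item[0], responses=item[1]), batch))

for (prompt, _), verdict in zip(batch, verdicts):
    print(prompt, "->", verdict.final_verdict.winner)
```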
---
## Installation
**Requirements:** Python 3.11 or newer
### Recommended (PyPI)
```bash
pip install openjury
```
### From Source (for development/contribution)
```bash
git clone https://github.com/robiscoding/openjury.git
cd openjury
pip install -e .
uv pip install -e ".[dev]" # (optional) dev dependencies
```
## Quick Start
### Set Environment Variables
```bash
export OPENROUTER_API_KEY="your-api-key"
```
Or, if you're using OpenAI:
```bash
export LLM_PROVIDER="openai"
export OPENAI_API_KEY="your-api-key"
```
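If you prefer to set credentials from Python (for example in a notebook) instead of exporting shell variables, a standard-library sketch like this is equivalent; the variable names match the exports above.

```python
import os

# Equivalent to the shell exports above; set these before constructing the jury.
os.environ["LLM_PROVIDER"] = "openai"          # omit to use the OpenRouter setup shown first
os.environ["OPENAI_API_KEY"] = "your-api-key"  # or set OPENROUTER_API_KEY instead
```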
### Basic Usage
```python
from openjury import OpenJury, JuryConfig
config = JuryConfig.from_json_file("jury_config.json")
jury = OpenJury(config)
verdict = jury.evaluate(
    prompt="Write a Python function to reverse a string",
    responses=[
        "def reverse(s): return s[::-1]",
        "def reverse(s): return ''.join(reversed(s))"
    ]
)
print(f"Winner: {verdict.final_verdict.winner}")
print(f"Confidence: {verdict.final_verdict.confidence:.2%}")
```
### Configuration Example (jury_config.json)
```json
{
  "name": "Code Quality Jury",
  "criteria": [
    {"name": "correctness", "weight": 2.0, "max_score": 5},
    {"name": "readability", "weight": 1.5, "max_score": 5}
  ],
  "jurors": [
    {
      "name": "Senior Developer",
      "system_prompt": "You are a senior developer. You are tasked with reviewing the code and providing a score and explanation for the correctness and readability of the code.",
      "model_name": "qwen/qwen-2.5-coder-32b",
      "weight": 2.0
    },
    {
      "name": "Code Reviewer",
      "system_prompt": "You are a code reviewer. You are tasked with reviewing the code and providing a score and explanation for the correctness and readability of the code.",
      "model_name": "llama3/llama-3.1-8b-instruct",
      "weight": 1.0
    }
  ],
  "voting_method": "weighted"
}
```
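If you would rather generate the config file from code than maintain it by hand, a plain-dict sketch like the one below writes the same JSON using only the standard library; nothing here assumes additional SDK surface, and the field values simply mirror the example above.

```python
import json

# Mirrors the jury_config.json example above; adjust names, weights, and models as needed.
config = {
    "name": "Code Quality Jury",
    "criteria": [
        {"name": "correctness", "weight": 2.0, "max_score": 5},
        {"name": "readability", "weight": 1.5, "max_score": 5},
    ],
    "jurors": [
        {
            "name": "Senior Developer",
            "system_prompt": (
                "You are a senior developer. You are tasked with reviewing the code and providing "
                "a score and explanation for the correctness and readability of the code."
            ),
            "model_name": "qwen/qwen-2.5-coder-32b",
            "weight": 2.0,
        },
        {
            "name": "Code Reviewer",
            "system_prompt": (
                "You are a code reviewer. You are tasked with reviewing the code and providing "
                "a score and explanation for the correctness and readability of the code."
            ),
            "model_name": "llama3/llama-3.1-8b-instruct",
            "weight": 1.0,
        },
    ],
    "voting_method": "weighted",
}

with open("jury_config.json", "w") as f:
    json.dump(config, f, indent=2)
```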
### Examples
You can find more examples in the [examples](examples) directory.
### Use Cases
#### Model Evaluation & Comparison
- Compare outputs from different models (e.g., GPT-4 vs Claude vs custom models)
- Run A/B tests across prompt variations, fine-tuned models, or versions (see the sketch below)
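As a concrete illustration of the A/B-testing case, the sketch below feeds two freshly generated candidates into the jury. The `call_model_a` / `call_model_b` helpers are placeholders for your own model clients and are not part of OpenJury; only `OpenJury`, `JuryConfig.from_json_file`, and `jury.evaluate` are the documented calls from the Quick Start.

```python
from openjury import JuryConfig, OpenJury

# Placeholder generators -- swap in real calls to the two systems you are comparing.
def call_model_a(prompt: str) -> str:
    return "def reverse(s): return s[::-1]"                # e.g. output of model/prompt variant A

def call_model_b(prompt: str) -> str:
    return "def reverse(s): return ''.join(reversed(s))"   # e.g. output of model/prompt variant B

jury = OpenJury(JuryConfig.from_json_file("jury_config.json"))

prompt = "Write a Python function to reverse a string"
responses = [call_model_a(prompt), call_model_b(prompt)]

verdict = jury.evaluate(prompt=prompt, responses=responses)
print(f"Winner: {verdict.final_verdict.winner}")
```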
#### Content & Response Quality
- Evaluate generated code for correctness and readability
- Score long-form content (blogs, papers, explanations) for clarity, tone, or coherence
#### Automated Grading & Assessment
- Grade student answers or interview responses at scale
- Score generated outputs against rubric-style criteria
#### Production Monitoring & QA
- Monitor output quality in production systems
- Detect degradation or drift between model versions
#### Custom Evaluation Workflows
- Integrate LLM-based judgment into human-in-the-loop pipelines
- Use configurable jurors and voting for domain-specific tasks
---
## License
OpenJury is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.
---
## Contributing
Contributions are welcome! Please see the [CONTRIBUTING.md](CONTRIBUTING.md) file for details.