# OmniBench
<p align="center">
<img src="https://raw.githubusercontent.com/BrainGnosis/OmniBench/main/assets/OmniBench_Logo.png" alt="OmniBench Logo" width="400">
</p>
<p align="center">
<em>A Customizable, Multi-Objective AI Agent Benchmarking Framework for Agentic Reliability and Mediation (ARM)</em>
</p>
[Python 3.10+](https://www.python.org/downloads/)
[Apache 2.0 License](https://opensource.org/licenses/Apache-2.0)
[PyPI](https://pypi.org/project/omnibench/)
> **🚀 Quick Start in 2 minutes:** Clone → Install → Run → Get comprehensive agent evaluation results!
**Agentic Reliability and Mediation (ARM)** is a research and development area at BrainGnosis. We study how to measure and improve the reliability of AI agents and how they mediate conflicts during autonomous decision making. Our goal is to establish clear principles, metrics, and evaluation protocols that transfer across domains, so agents remain dependable, aligned, and resilient under varied operating conditions.
From this work we are releasing **OmniBench**, an open-source, flexible, multi-objective benchmarking framework for evaluating AI agents across both standard suites and highly customized use cases. OmniBench looks beyond output-only checks: it assesses decision quality, adaptability, conflict handling, and reliability in single-agent and multi-agent settings. Its modular design lets teams add scenarios, metrics, reward and constraint definitions, and integrations with tools and simulators. The result is domain-relevant testing with reproducible reports that reflect the demands of real-world applications.
> **⚠️ Development Version Notice**
> OmniBench is currently in active development. While we strive for stability, you may encounter bugs, breaking changes, or incomplete features. We recommend thorough testing in your specific use case and welcome bug reports and feedback to help us improve the framework.
## Table of Contents
- [About Us: BrainGnosis](#about-us-braingnosis)
- [Why OmniBench?](#why-omnibench)
- [Why OmniBench is Different](#why-omnibench-is-different)
- [How It Works](#how-it-works)
- [Installation](#installation)
- [30-Second Demo](#30-second-demo)
- [Quick Start](#quick-start)
- [Common Use Cases](#common-use-cases)
- [Core Concepts](#core-concepts)
- [Examples](#examples)
- [Framework Integrations](#framework-integrations)
- [Advanced Usage](#advanced-usage)
- [Development](#development)
- [Contributing](#contributing)
- [FAQ](#faq)
- [License](#license)
- [Support](#support)
## About Us: BrainGnosis
<div align="left">
<img src="https://raw.githubusercontent.com/BrainGnosis/OmniBench/main/assets/BrainGnosis.png" alt="BrainGnosis" width="400">
</div>
<br>
[**BrainGnosis**](https://www.braingnosis.com) is dedicated to making AI smarter for humans through structured intelligence and reliable AI systems. We are developing **AgentOS**, an enterprise operating system for intelligent AI agents that think, adapt, and collaborate to enhance organizational performance.
**Our Mission:** Build reliable, adaptable, and deeply human-aligned AI that transforms how businesses operate.
**🔗 Learn more:** [www.braingnosis.com](https://www.braingnosis.com)
---
## Why OmniBench?
Traditional benchmarking approaches evaluate AI systems through simple input-output comparisons, missing the complex decision-making processes that modern AI agents employ.
<p align="center">
<img src="https://raw.githubusercontent.com/BrainGnosis/OmniBench/main/assets/General%20Benchmarking%20Process.png" alt="General Benchmarking Process" width="500">
</p>
**OmniBench's Comprehensive Approach**
OmniBench captures the full spectrum of agentic behavior by evaluating multiple dimensions simultaneously - from reasoning chains to action sequences to system state changes.
<p align="center">
<img src="https://raw.githubusercontent.com/BrainGnosis/OmniBench/main/assets/OmniBench%20Benchmarking%20Process.png" alt="OmniBench Benchmarking Process" width="500">
</p>
## Why OmniBench is Different
- **📊 Multi-Dimensional Evaluation**: Assess outputs, reasoning, actions, and states simultaneously with native support for output-based, path-based, state-based, and LLM-as-a-judge evaluations
- **🔄 Agentic Loop Awareness**: Understands the iterative thought-action-observation cycles that modern AI agents employ
- **🎯 Objective-Specific Analysis**: Different aspects evaluated by specialized objectives with comprehensive evaluation criteria
- **🔗 Comprehensive Coverage**: No blind spots in agent behavior assessment - captures the full decision-making process
- **⚡ High-Performance Execution**: Async support enables rapid concurrent evaluation for faster benchmarking cycles
- **📊 Advanced Analytics**: Built-in AI summarization and customizable evaluation metrics for actionable insights
- **🔧 Extensible Architecture**: Modular design allowing custom objectives, evaluation criteria, and result types
- **🔄 Framework Agnostic**: Works seamlessly with any Python-based agent framework (LangChain, Pydantic AI, custom agents)
## How It Works
OmniBench follows a clean, modular architecture that makes it easy to understand and extend:
```text
omnibench/
├── core/                  # Core benchmarking engine
│   ├── benchmarker.py     # Main OmniBenchmarker class
│   └── types.py           # Type definitions and result classes
├── objectives/            # Evaluation objectives
│   ├── base.py            # Base objective class
│   ├── llm_judge.py       # LLM-based evaluation
│   ├── output.py          # Output comparison objectives
│   ├── path.py            # Path/action sequence evaluation
│   ├── state.py           # State-based evaluation
│   └── combined.py        # Multi-objective evaluation
├── integrations/          # Framework-specific integrations
│   └── pydantic_ai/       # Pydantic AI integration
└── logging/               # Logging and analytics
    ├── logger.py          # Comprehensive logging system
    └── evaluator.py       # Auto-evaluation and analysis
```
**Evaluation Flow:**
1. **Agent Execution**: Your agent processes input and generates output
2. **Multi-Objective Assessment**: Different objectives evaluate different aspects
3. **Comprehensive Logging**: Results are logged with detailed analytics
4. **Performance Insights**: Get actionable feedback on agent behavior
## Installation
### Prerequisites
- **Python 3.10+** (Required)
- **"API Keys"**: OpenAI, Anthropic (for LLM Judge objectives)
- **5 minutes** for setup and first benchmark
### Core Package
**Recommended Installation (Most Reliable):**
```bash
# Clone the repository
git clone https://github.com/BrainGnosis/OmniBench.git
cd OmniBench
# Install dependencies
pip install -r omnibench/requirements.txt
# Install in development mode
pip install -e .
```
**Alternative: PyPI Installation (Beta)**
> **โ ๏ธ Beta Notice:** PyPI installation is available but currently in beta testing. Cross-platform compatibility is being actively improved. For the most reliable experience, we recommend the git installation above.
```bash
# Install from PyPI (beta - may have platform-specific issues)
pip install omnibench
```
### Environment Setup
Create a `.env` file in your project root with your API keys:
```bash
# .env
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here
```
**✅ That's it!** OmniBench automatically loads environment variables when you import it.
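If you prefer to load keys explicitly in your own scripts (as the Quick Start example below does), here is a minimal sketch using `python-dotenv`; the fail-fast check on `OPENAI_API_KEY` is just an illustration:

```python
import os

from dotenv import load_dotenv

# Explicitly load the .env file (OmniBench also does this on import).
load_dotenv()

# Fail fast if a key needed by LLM Judge objectives is missing.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")
```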
### Requirements
Core dependencies:
```txt
python>=3.10
langchain==0.3.27
langchain_core==0.3.75
langchain_openai==0.3.32
pydantic==2.11.7
rich==14.1.0
numpy==2.3.2
tqdm==4.67.1
```
## 30-Second Demo
Want to see OmniBench in action immediately? Here's the minimal example:
```python
from omnibench import OmniBenchmarker, Benchmark
from omnibench.objectives import StringEqualityObjective
# 1. Define a simple agent
class SimpleAgent:
    def invoke(self, query: str) -> dict:
        return {"answer": "Paris"}

def create_agent():
    return SimpleAgent()

# 2. Create benchmark
benchmark = Benchmark(
    name="Geography Test",
    input_kwargs={"query": "What's the capital of France?"},
    objective=StringEqualityObjective(name="exact_match", output_key="answer", goal="Paris"),
    iterations=1
)

# 3. Run evaluation
benchmarker = OmniBenchmarker(
    executor_fn=create_agent,
    executor_kwargs={},
    initial_input=[benchmark]
)
results = benchmarker.benchmark()

# 4. View results
benchmarker.print_logger_summary()
```
**Output:**
```
✅ Geography Test: PASSED (100% accuracy)
📊 1/1 benchmarks passed | Runtime: 0.1s
```
## Quick Start
Here's a complete example demonstrating OmniBench's core capabilities:
```python
import asyncio
from dotenv import load_dotenv
from omnibench import OmniBenchmarker, Benchmark
from omnibench.objectives import LLMJudgeObjective, StringEqualityObjective, CombinedBenchmarkObjective
from omnibench.core.types import BoolEvalResult, FloatEvalResult
# Load environment variables
load_dotenv()
# Define your agent (works with any Python callable)
class SimpleAgent:
    def invoke(self, query: str) -> dict:
        if "capital" in query.lower() and "france" in query.lower():
            return {"response": "The capital of France is Paris."}
        return {"response": "I'm not sure about that."}

def create_agent():
    return SimpleAgent()

# Create evaluation objectives
accuracy_objective = StringEqualityObjective(
    name="exact_accuracy",
    output_key="response",
    goal="The capital of France is Paris."
)

quality_objective = LLMJudgeObjective(
    name="response_quality",
    output_key="response",
    goal="The agent identified the capital of France correctly",
    valid_eval_result_type=FloatEvalResult  # 0.0-1.0 scoring
)

# Combine multiple objectives
combined_objective = CombinedBenchmarkObjective(
    name="comprehensive_evaluation",
    objectives=[accuracy_objective, quality_objective]
)

# Create and run benchmark
async def main():
    benchmark = Benchmark(
        name="Geography Knowledge Test",
        input_kwargs={"query": "What is the capital of France?"},
        objective=combined_objective,
        iterations=5
    )

    benchmarker = OmniBenchmarker(
        executor_fn=create_agent,
        executor_kwargs={},
        initial_input=[benchmark]
    )

    # Execute with concurrency control
    results = await benchmarker.benchmark_async(max_concurrent=3)

    # View results
    benchmarker.print_logger_summary()
    return results

# Run the benchmark
if __name__ == "__main__":
    results = asyncio.run(main())
```
### 🎯 **Next Steps**
**Got the basic example working?** Here's your learning path:
1. **🔍 Explore Examples:** Check out the `examples/` directory for real-world use cases
2. **🎛️ Try Different Objectives:** Experiment with LLM Judge and Combined objectives
3. **⚡ Scale Up:** Use async benchmarking with `benchmark_async()` for faster evaluation
4. **🔧 Customize:** Create your own evaluation objectives for domain-specific needs
5. **📊 Analyze:** Dive deeper with `print_logger_details()` for comprehensive insights
**Need help?** Check our [FAQ](#faq) or join the [community discussions](https://github.com/BrainGnosis/OmniBench/discussions)!
## Common Use Cases
Here are real-world scenarios where OmniBench excels:
### 🏢 **Enterprise AI Validation**
**Scenario:** Validating customer service chatbots before deployment
- **Objectives:** LLM Judge for helpfulness + StringEquality for policy compliance
- **Benefit:** Ensure agents are both helpful AND follow company guidelines
### 🔬 **Research & Development**
**Scenario:** Comparing different agent architectures or prompting strategies
- **Objectives:** Combined objectives measuring accuracy, reasoning quality, and efficiency
- **Benefit:** Rigorous A/B testing with statistical significance
### 🚀 **Production Monitoring**
**Scenario:** Continuous evaluation of deployed agents
- **Objectives:** State-based objectives tracking system changes + output quality
- **Benefit:** Early detection of performance degradation
### 🎓 **Educational AI Assessment**
**Scenario:** Evaluating AI tutoring systems
- **Objectives:** Path-based objectives tracking learning progression + content accuracy
- **Benefit:** Comprehensive assessment of both teaching method and content quality
### 🤖 **Multi-Agent System Testing**
**Scenario:** Testing collaborative agent teams
- **Objectives:** State-based objectives for system coordination + individual agent performance
- **Benefit:** Holistic evaluation of complex agent interactions
### 💡 **When to Choose Each Objective Type**
| Objective Type | Best For | Example Use Case | Key Benefit |
|---|---|---|---|
| **LLM Judge** | Subjective qualities | "Is this explanation clear?" | Human-like evaluation |
| **Output-Based** | Exact requirements | "Does output match format?" | Precise validation |
| **Path-Based** | Process evaluation | "Did agent use tools correctly?" | Workflow assessment |
| **State-Based** | System changes | "Was database updated properly?" | State verification |
| **Combined** | Comprehensive testing | "All of the above" | Complete coverage |
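For example, the enterprise validation scenario from the table maps directly onto the objectives shown in the Quick Start; a hedged sketch, where the output keys and the policy disclaimer text are illustrative:

```python
from omnibench.objectives import (
    CombinedBenchmarkObjective,
    LLMJudgeObjective,
    StringEqualityObjective,
)

# Subjective check: was the reply actually helpful?
helpfulness = LLMJudgeObjective(
    name="helpfulness",
    output_key="response",
    goal="The agent resolves the customer's issue politely and completely",
)

# Deterministic check: the mandated policy disclaimer must appear verbatim.
policy_compliance = StringEqualityObjective(
    name="policy_disclaimer",
    output_key="disclaimer",
    goal="Refunds are processed within 5-7 business days.",
)

# Evaluate both dimensions in a single benchmark run.
enterprise_objective = CombinedBenchmarkObjective(
    name="enterprise_validation",
    objectives=[helpfulness, policy_compliance],
)
```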
## Core Concepts
### Evaluation Objectives
OmniBench provides multiple evaluation objective types, each designed to address different evaluation challenges:
#### LLM Judge Objective
**When to use:** *"How do I evaluate subjective qualities like helpfulness, creativity, or nuanced correctness that can't be captured by exact matching?"*
Perfect for assessing complex, subjective criteria where human-like judgment is needed.
```python
# Imports as used in the Quick Start example
from omnibench.objectives import LLMJudgeObjective
from omnibench.core.types import FloatEvalResult

# Boolean evaluation (pass/fail)
binary_objective = LLMJudgeObjective(
    name="correctness_check",
    output_key="response",
    goal="Provide a factually correct answer"
)

# Numerical evaluation (0.0-1.0 scoring)
scoring_objective = LLMJudgeObjective(
    name="quality_score",
    output_key="response",
    goal="Provide comprehensive and helpful information",
    valid_eval_result_type=FloatEvalResult
)
```
#### Output-Based Objectives
**When to use:** *"How do I verify that my agent produces the exact output I expect, or matches specific patterns?"*
Ideal for deterministic evaluations where you need precise output matching or format validation.
```python
# StringEqualityObjective import is shown in the Quick Start; the
# RegexMatchObjective import path is assumed to mirror it.
from omnibench.objectives import StringEqualityObjective, RegexMatchObjective

# Exact string matching
exact_objective = StringEqualityObjective(
    name="exact_match",
    output_key="answer",
    goal="Paris"
)

# Regex pattern matching
pattern_objective = RegexMatchObjective(
    name="pattern_match",
    output_key="response",
    goal=r"Paris|paris"
)
```
#### Path-Based and State-Based Objectives
**When to use:** *"How do I evaluate not just what my agent outputs, but HOW it gets there and what changes it makes?"*
Essential for evaluating agent reasoning processes, tool usage sequences, and system state modifications.
```python
# Import paths assumed to mirror the other objectives; SearchTool and
# ExpectedState are placeholders for your own tool class and expected state.
from omnibench.objectives import PathEqualityObjective, StateEqualityObjective

# Evaluate action sequences
path_objective = PathEqualityObjective(
    name="tool_usage",
    output_key="agent_path",
    goal=[[("search", SearchTool), ("summarize", None)]]
)

# Evaluate state changes
state_objective = StateEqualityObjective(
    name="final_state",
    output_key="agent_state",
    goal=ExpectedState
)
```
## Examples
> **📝 AI-Generated Content Notice**
> The examples and tests in this repository were developed with assistance from AI coding tools and IDEs. While we have reviewed and tested the code, please validate the examples thoroughly in your own environment and adapt them to your specific needs.
### Complete Example Files
The `examples/` directory contains comprehensive examples:
- **`pydantic_ai_example.py`** - Model parity comparison (Claude 3.5 vs GPT-4)
- **`document_extraction_evolution.py`** - Document extraction prompt evolution (4 iterative improvements)
- **`langchain_embedding_example.py`** - LangChain embedding benchmarks
- **`inventory_management_example.py`** - Complex inventory management agent evaluation
**📋 Full Example List:**
- `output_evaluation.py` - Basic string/regex evaluation (no API keys needed)
- `custom_agent_example.py` - Framework-agnostic agent patterns
- `bool_vs_float_results.py` - Boolean vs scored result comparison
- `document_extraction_evolution.py` - Document extraction prompt evolution
See `examples/README.md` for detailed descriptions and setup instructions.
### Logging and Analytics
```python
# Print summary with key metrics
benchmarker.print_logger_summary()
# Detailed results with full evaluation data
benchmarker.print_logger_details(detail_level="detailed")
# Access raw logs for custom processing
logs = benchmarker.logger.get_all_logs()
```
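The structure of individual log entries is framework-specific, so a conservative sketch for exporting the raw logs (using `str()` as a fallback serializer) might look like this:

```python
import json

# Raw logs from the benchmarker's logger (see the snippet above).
logs = benchmarker.logger.get_all_logs()

# Persist a human-readable dump for later inspection; str() is used as a
# safe fallback because the exact log entry types are framework-specific.
with open("benchmark_logs.json", "w") as f:
    json.dump([str(entry) for entry in logs], f, indent=2)
```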
## Framework Integrations
OmniBench works seamlessly with popular AI agent frameworks:
<details>
<summary><strong>LangChain Integration</strong></summary>
```python
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

def create_langchain_agent():
    llm = ChatOpenAI(temperature=0, model="gpt-4")
    tools = []  # Add your tools here
    # create_openai_functions_agent requires a prompt with an
    # `agent_scratchpad` messages placeholder.
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant."),
        ("human", "{input}"),
        MessagesPlaceholder("agent_scratchpad"),
    ])
    agent = create_openai_functions_agent(llm, tools, prompt)
    # Wrap in an AgentExecutor so .invoke() runs the full agent loop.
    return AgentExecutor(agent=agent, tools=tools)

benchmarker = OmniBenchmarker(
    executor_fn=create_langchain_agent,
    executor_kwargs={},
    agent_invoke_method_name="invoke",
    initial_input=[benchmark]
)
```
</details>
<details>
<summary><strong>Pydantic AI Integration</strong></summary>
```python
from omnibench.integrations.pydantic_ai import PydanticAIOmniBenchmarker
from pydantic_ai import Agent
def create_pydantic_agent():
    return Agent(model="openai:gpt-4", result_type=str)

benchmarker = PydanticAIOmniBenchmarker(
    executor_fn=create_pydantic_agent,
    initial_input=[benchmark]
)
```
</details>
<details>
<summary><strong>Custom Agent Integration</strong></summary>
```python
class MyCustomAgent:
    def run(self, input_data: dict) -> dict:
        # Your custom agent logic
        return {"response": "Custom agent response"}

def create_custom_agent():
    return MyCustomAgent()

benchmarker = OmniBenchmarker(
    executor_fn=create_custom_agent,
    executor_kwargs={},
    agent_invoke_method_name="run",  # Specify your agent's method
    initial_input=[benchmark]
)
```
</details>
## Advanced Usage
### Custom LLM Judge Prompts
```python
custom_objective = LLMJudgeObjective(
    name="factual_correctness",
    output_key="response",
    goal="Correctly identify the author",
    prompt="""
    Evaluate this response for factual correctness.

    Expected: {expected_output}
    Agent Response: {input}

    Return true if the information is factually correct.
    {format_instructions}
    """,
    valid_eval_result_type=BoolEvalResult
)
```
**Required Placeholders:** `{input}`, `{expected_output}`, `{format_instructions}`
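The same placeholders work for scored judges; a sketch of a 0.0-1.0 scoring prompt, assuming `FloatEvalResult` is imported as in the Quick Start and the objective name is illustrative:

```python
scored_objective = LLMJudgeObjective(
    name="answer_completeness",
    output_key="response",
    goal="Fully answer the user's question",
    prompt="""
    Rate how completely the response satisfies the expectation.

    Expected: {expected_output}
    Agent Response: {input}

    Return a score between 0.0 (not at all) and 1.0 (fully satisfied).
    {format_instructions}
    """,
    valid_eval_result_type=FloatEvalResult
)
```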
### Custom Evaluation Functions
```python
def custom_evaluation_function(input_dict: dict) -> dict:
    agent_output = input_dict["input"]

    # Your custom logic here
    if "paris" in agent_output.lower():
        score = 0.9
        message = "Correctly identified Paris"
    else:
        score = 0.1
        message = "Failed to identify correct answer"

    return {"result": score, "message": message}

custom_objective = LLMJudgeObjective(
    name="custom_evaluation",
    output_key="response",
    invoke_method=custom_evaluation_function,
    valid_eval_result_type=FloatEvalResult
)
```
### Custom Objectives and Result Types
```python
from omnibench.core.types import ValidEvalResult
from omnibench.objectives.base import BaseBenchmarkObjective
class ScoreWithReason(ValidEvalResult):
    result: float
    reason: str

class CustomObjective(BaseBenchmarkObjective):
    valid_eval_result_type = ScoreWithReason

    def _eval_fn(self, goal, formatted_output, **kwargs):
        # Your evaluation logic
        score = 0.8
        reason = "Custom evaluation completed"
        return ScoreWithReason(result=score, reason=reason)
```
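A custom objective plugs into a `Benchmark` like any built-in one; a hedged sketch, assuming `CustomObjective` accepts the same `name`/`output_key`/`goal` constructor arguments as the built-in objectives and reusing the `create_agent` factory from the Quick Start:

```python
from omnibench import Benchmark, OmniBenchmarker

custom_benchmark = Benchmark(
    name="Custom Scoring Test",
    input_kwargs={"query": "What is the capital of France?"},
    # Constructor arguments assumed to mirror the built-in objectives.
    objective=CustomObjective(
        name="score_with_reason",
        output_key="response",
        goal="Paris",
    ),
    iterations=3,
)

benchmarker = OmniBenchmarker(
    executor_fn=create_agent,  # Any agent factory, as in the Quick Start
    executor_kwargs={},
    initial_input=[custom_benchmark],
)
results = benchmarker.benchmark()
```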
## Development
### Development Setup
```bash
git clone https://github.com/BrainGnosis/OmniBench.git
cd OmniBench
# Install development dependencies
pip install -e .[dev]
# Install pre-commit hooks
pre-commit install
```
### Testing
```bash
cd tests/
# Quick development tests
python run_tests.py fast # ~4s, fast tests only
python run_tests.py imports # ~1s, smoke test
# Run by category
python run_tests.py logging # Test logging components
python run_tests.py core # Core benchmarker tests
python run_tests.py objectives # Evaluation objectives
# Comprehensive testing with rich output
python test_all.py --fast # Skip slow tests
python test_all.py # Everything (~5min)
python test_all.py --verbose # Detailed failure info
```
See `tests/README.md` for detailed information about the test suite structure and available options.
## Contributing
We welcome contributions to OmniBench! Here's how you can help:
### Ways to Contribute
- 🐛 **Bug Reports**: Found an issue? [Open an issue](https://github.com/BrainGnosis/OmniBench/issues)
- 💡 **Feature Requests**: Have an idea? [Start a discussion](https://github.com/BrainGnosis/OmniBench/discussions)
- 🔧 **Code Contributions**: Submit pull requests for bug fixes and new features
- 📚 **Documentation**: Help improve our docs and examples
- 🧪 **Testing**: Add test cases and improve test coverage
### Development Workflow
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes and add tests
4. Run tests and ensure they pass (`pytest`)
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to your branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
### Code Style
- Follow PEP 8 for Python code
- Use type hints where appropriate
- Add docstrings for public functions and classes
- Run `pre-commit install` to enable automatic formatting
## FAQ
### General Questions
**Q: What makes OmniBench different from other benchmarking tools?**
A: OmniBench evaluates the full agentic loop (reasoning, actions, state changes) rather than just input-output comparisons. It supports multi-objective evaluation and works with any Python-based agent framework.
**Q: Can I use OmniBench with my existing agent framework?**
A: Yes! OmniBench is framework-agnostic and works with LangChain, Pydantic AI, AutoGen, or custom agents. Just provide a callable that takes input and returns output.
**Q: How do I create custom evaluation objectives?**
A: Extend `BaseBenchmarkObjective` and implement the `_eval_fn` method. See the [Custom Objectives examples](#custom-objectives-and-result-types) for details.
### Technical Questions
**Q: Does OmniBench support async execution?**
A: Yes! Use `benchmarker.benchmark_async()` with concurrency control via the `max_concurrent` parameter.
**Q: How do I integrate with different LLM providers?**
A: OmniBench uses your agent's LLM configuration. For LLM Judge objectives, set your API keys in the `.env` file and they'll be loaded automatically.
**Q: Can I benchmark multi-agent systems?**
A: Absolutely! Create benchmarks for each agent or use Combined objectives to evaluate multi-agent interactions.
### Troubleshooting
**Q: I'm getting import errors when using OmniBench**
A: Ensure you've installed all dependencies: `pip install -r omnibench/requirements.txt`. Check that your Python version is 3.10+.
**Q: My custom evaluation isn't working**
A: Verify your `_eval_fn` returns the correct result type (BoolEvalResult, FloatEvalResult, etc.) and that required placeholders are included in custom prompts.
**Q: How do I debug failed benchmarks?**
A: Use `benchmarker.print_logger_details(detail_level="detailed")` to see full evaluation traces and error messages.
## License
Licensed under the Apache License 2.0. See [LICENSE](LICENSE) for details.
## Support
- **Issues**: [GitHub Issues](https://github.com/BrainGnosis/OmniBench/issues)
<!-- - **Discussions**: [GitHub Discussions](https://github.com/BrainGnosis/OmniBench/discussions) -->
- **Email**: [dev@braingnosis.ai](mailto:dev@braingnosis.ai)
---
<div align="center">
**Built with ❤️ by** [**BrainGnosis**](https://www.braingnosis.com)
<a href="https://www.braingnosis.com">
<img src="https://raw.githubusercontent.com/BrainGnosis/OmniBench/main/assets/BrainGnosis.png" alt="BrainGnosis" width="220">
</a>
*Making AI Smarter for Humans*
</div>