rebel-eval

Name: rebel-eval
Version: 0.2.0
Summary: RAG Evaluation Benchmark and Evaluation Library
Author: Alexander Ploskin
Requires Python: <4.0,>=3.9
License: not specified
Upload time: 2025-08-08 10:17:25
Requirements: none recorded

# REBEL Framework

**REBEL** is an evaluation framework for Large Language Model (LLM) assistants that provides comprehensive benchmarking with support for both deterministic and AI-judge-based metrics.

## Description

REBEL enables developers to create robust evaluation pipelines for LLM applications through:

- **Flexible Test Definition**: Decorator-based test case creation with parameter grids and retry mechanisms
- **Multi-Metric Support**: Both rule-based and LLM-judge evaluation methods
- **Parallel Execution**: Concurrent API calls and evaluations for efficient benchmarking
- **DeepEval Integration**: Seamless integration with the DeepEval ecosystem
- **Comprehensive Results**: Detailed scoring with aggregation strategies and execution metadata

## How to Use?

### Installation

```bash
pip install rebel-eval[deepeval]
```

### Define Tests and Metrics

Create your test files using REBEL's decorator pattern. See our [complete example](https://github.com/tensorsearchcom/rebel/example/openrouter/) for a detailed implementation.

```python
from rebel import test_case
from rebel.models import Message, RoleEnum, TestGroup, RetryParams

@test_case(
    messages=[
        Message(role=RoleEnum.system, content="You are a helpful assistant."),
        Message(role=RoleEnum.user, content="Count the letter 'r' in this text.")
    ]
)
def test_counting_accuracy():
    yield TestGroup(
        retry_params=RetryParams(count=3, aggregation_strategy="mean"),
        metrics=[MyCustomMetric()]  # MyCustomMetric is defined in the Metrics section below
    )
```

### Run Benchmarks

Execute your benchmark using the CLI:

```bash
# Using configuration file
rebel run --test-dir tests/ --output-folder results/ --api-config model_config.json

# Using custom client
rebel run --test-dir tests/ --output-folder results/ \
  --api-client-module my_module \
  --api-client-class MyAPIClient \
  --api-client-args '{"api_key": "your-key"}'
```
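
The `--api-client-args` value is a JSON string; presumably REBEL decodes it and forwards the resulting dictionary to your client's constructor. The sketch below only illustrates that pattern: `MyAPIClient` is your own (here hypothetical) class, and its constructor arguments are assumptions, not a documented interface.

```python
import json

class MyAPIClient:
    """Hypothetical custom client; the interface REBEL expects is not
    documented in this README, so treat this as a shape sketch only."""

    def __init__(self, api_key: str, base_url: str = "https://example.invalid/v1"):
        self.api_key = api_key
        self.base_url = base_url

# The CLI argument string decodes to constructor keyword arguments.
cli_args = '{"api_key": "your-key"}'
client = MyAPIClient(**json.loads(cli_args))
```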

## Metrics

### Implement Custom Metrics

Create deterministic metrics by inheriting from the `Metric` base class:

```python
from rebel.models import Metric, AssistantInput, AssistantOutput, EvaluationResult, EvaluationVerdict

class MyCustomMetric(Metric):
    def measure(self, input: AssistantInput, expected: AssistantOutput, actual: AssistantOutput) -> EvaluationResult:
        # Your evaluation logic here
        score = compute_score(actual.output, expected.output)  # compute_score is a user-supplied helper (sketch below)
        
        return EvaluationResult(
            score=score,
            verdict=EvaluationVerdict.PASSED if score > 0.5 else EvaluationVerdict.FAILED,
            reason=f"Score: {score}"
        )
    
    def get_name(self) -> str:
        return "My Custom Metric"
```
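
The `compute_score` call above is a placeholder for whatever deterministic comparison you need. As a minimal sketch, a similarity-based scorer can be built from the standard library alone (the function name and the 0-to-1 scale are carried over from the example; they are not part of REBEL's API):

```python
from difflib import SequenceMatcher

def compute_score(actual_text: str, expected_text: str) -> float:
    """Return a similarity score in [0, 1] based on matching character blocks."""
    if not expected_text:
        return 1.0 if not actual_text else 0.0
    return SequenceMatcher(None, actual_text, expected_text).ratio()

# Example: similarity between a model answer and its reference text.
print(compute_score("There are 3 r's.", "There are 3 letters 'r'."))
```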

### Built-in REBEL Metrics

REBEL provides several ready-to-use metrics:

- **ContextualFScore**: RAG evaluation with precision/recall analysis
- **ToolCallsAccuracy**: Function calling evaluation with flexible matching
- **Custom Distance Metrics**: Configurable similarity measurements

Example usage:

```python
from rebel.metrics import ContextualFScore, ToolCallsAccuracy

# RAG evaluation
contextual_metric = ContextualFScore(
    beta=1.0,
    threshold=0.7,
    model=your_judge_model,
    template=your_template
)

# Tool calling evaluation
tool_metric = ToolCallsAccuracy(
    threshold=0.8,
    strict_mode=True
)
```
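
The `beta` parameter of `ContextualFScore` presumably weights recall against precision in the standard F-beta sense (an assumption based on the metric's name and the precision/recall analysis noted above); with `beta=1.0` this reduces to the balanced F1 score. A quick reference implementation of that formula:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_beta(0.8, 0.6))  # balanced F1, roughly 0.686
```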

## Tests

### Define Test Cases

Use the `@test_case` decorator to create comprehensive test suites. Our [test examples](https://github.com/tensorsearchcom/rebel/example/openrouter/openrouter/tests) show various patterns:


```python
from rebel import test_case
from rebel.models import Message, RoleEnum, TestGroup, RetryParams, ParameterGrid

@test_case(
    messages=[Message(role=RoleEnum.user, content="Test query")],
    tags=["accuracy", "basic"],
    api_params={"temperature": 0.7},
    param_grid=ParameterGrid(parameters={"max_tokens": [100, 200, 500]})
)
def test_comprehensive_evaluation():
    # Multiple test groups with different configurations
    yield TestGroup(
        metrics=[AccuracyMetric()],
        retry_params=RetryParams(count=3, aggregation_strategy="mean"),
        tags=["primary"]
    )
    
    yield TestGroup(
        metrics=[LatencyMetric()],
        retry_params=RetryParams(count=5, aggregation_strategy="median"),
        tags=["performance"]
    )
```

### Test Organization Features

- **Parameter Grids**: Automatic test expansion across parameter combinations (see the sketch after this list)
- **Retry Mechanisms**: Configurable retry counts with aggregation strategies (mean, min, max, median)
- **Tagging System**: Flexible test filtering and organization
- **Expected Outputs**: Optional ground truth specification for comparison
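
REBEL expands parameter grids and aggregates retries internally; the sketch below does not call REBEL at all, it only illustrates the semantics one would expect from the features listed above (the second grid parameter is hypothetical, added to show multi-parameter expansion):

```python
from itertools import product
from statistics import mean, median

# Expansion: one test variant per combination of grid values.
param_grid = {"max_tokens": [100, 200, 500], "temperature": [0.0, 0.7]}
variants = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(variants))  # 6 combinations

# Aggregation: retry scores collapse to one value per configured strategy.
retry_scores = [0.80, 0.90, 0.85]
strategies = {"mean": mean, "min": min, "max": max, "median": median}
print({name: fn(retry_scores) for name, fn in strategies.items()})
```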

## DeepEval Integration

### Integrate DeepEval Metrics


Extend `DeepevalMetric` to use DeepEval's advanced evaluation capabilities. Check out our [China Alignment Metric example](https://github.com/tensorsearchcom/rebel/example/openrouter/openrouter/metrics/china_alignment.py) for a complete implementation:

```python
from rebel.deepeval.metric import DeepevalMetric
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

class MyDeepevalMetric(DeepevalMetric):
    threshold: float = 0.7
    
    def get_name(self):
        return "My DeepEval Metric"
    
    def get_deepeval_metric(self):
        return GEval(
            name="Custom Evaluation",
            criteria="Evaluate response quality and accuracy",
            evaluation_steps=[
                "Check factual accuracy",
                "Assess response completeness",
                "Verify appropriate tone"
            ],
            evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
            threshold=self.threshold,
            model=self.judge_llm
        )
```

### Judge Model Configuration

Configure your judge models using the DeepEval client:

```python
from rebel.deepeval.client import OpenAIClientLLM

judge_config = {
    "model": "gpt-4",
    "api_key": "your-key",
    "base_url": "https://api.openai.com/v1",
    "temperature": 0.1
}

judge_llm = OpenAIClientLLM(judge_config)
```
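
The `MyDeepevalMetric` example above reads its judge from `self.judge_llm`, so presumably the `judge_llm` configured here is what backs the `GEval` evaluation; the exact wiring between the client and the metric instance is not shown in this README.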

## Results

### Investigate Test Results

REBEL generates comprehensive JSON reports with detailed execution metadata:

```json
{
  "metadata": {
    "timestamp": "20250722_113301",
    "total_test_cases": 18
  },
  "test_cases": [
    {
      "name": "test_example_[]",
      "actual_outputs": [
        {
          "output": "Response text",
          "execution_time": 0.625
        }
      ],
      "evaluation_results": [
        {
          "score": 0.85,
          "verdict": "passed",
          "reason": "High quality response"
        }
      ],
      "aggregated_result": {
        "score": 0.85,
        "verdict": "passed"
      }
    }
  ]
}
```
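
Because the report is plain JSON, it can be post-processed with standard tooling. A minimal sketch that computes a pass rate from a report file, assuming the field layout shown above (the file path is illustrative):

```python
import json
from pathlib import Path

report = json.loads(Path("results/report.json").read_text())

passed = sum(
    1
    for case in report["test_cases"]
    if case.get("aggregated_result", {}).get("verdict") == "passed"
)
print(f"{passed}/{report['metadata']['total_test_cases']} test cases passed")
```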

### Result Analysis Features

- **Individual Attempt Tracking**: Complete execution history for each retry
- **Aggregated Scores**: Statistical summaries based on configured strategies
- **Execution Metadata**: Performance metrics including response times
- **Detailed Reasoning**: Comprehensive failure analysis and success explanations
- **Structured Output**: Machine-readable JSON format for automated processing

Results are automatically organized by model name and timestamp in your specified output directory, enabling easy comparison and historical analysis.

            
