Name | rebel-eval |
Version | 0.2.0 |
home_page | None |
Summary | RAG Evaluation Benchmark and Evaluation Library |
upload_time | 2025-08-08 10:17:25 |
maintainer | None |
docs_url | None |
author | Alexander Ploskin |
requires_python | <4.0,>=3.9 |
license | None |
keywords | |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# REBEL Framework
**REBEL** is an evaluation framework for Large Language Model (LLM) assistants that provides comprehensive benchmarking capabilities with support for both deterministic and AI-judge-based metrics.
## Description
REBEL enables developers to create robust evaluation pipelines for LLM applications through:
- **Flexible Test Definition**: Decorator-based test case creation with parameter grids and retry mechanisms
- **Multi-Metric Support**: Both rule-based and LLM-judge evaluation methods
- **Parallel Execution**: Concurrent API calls and evaluations for efficient benchmarking
- **DeepEval Integration**: Seamless integration with the DeepEval ecosystem
- **Comprehensive Results**: Detailed scoring with aggregation strategies and execution metadata
## How to Use?
### Installation
```bash
pip install rebel-eval[deepeval]
```
### Define Tests and Metrics
Create your test files using REBEL's decorator pattern. See our [complete example](https://github.com/tensorsearchcom/rebel/example/openrouter/) for a detailed implementation.
```python
from rebel import test_case
from rebel.models import Message, RoleEnum, TestGroup, RetryParams
@test_case(
    messages=[
        Message(role=RoleEnum.system, content="You are a helpful assistant."),
        Message(role=RoleEnum.user, content="Count the letter 'r' in this text.")
    ]
)
def test_counting_accuracy():
    # MyCustomMetric is a user-defined metric; see "Implement Custom Metrics" below.
    yield TestGroup(
        retry_params=RetryParams(count=3, aggregation_strategy="mean"),
        metrics=[MyCustomMetric()]
    )
```
### Run Benchmarks
Execute your benchmark using the CLI:
```bash
# Using configuration file
rebel run --test-dir tests/ --output-folder results/ --api-config model_config.json

# Using custom client
rebel run --test-dir tests/ --output-folder results/ \
--api-client-module my_module \
--api-client-class MyAPIClient \
--api-client-args '{"api_key": "your-key"}'
```
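The `--api-client-*` flags load your own client class and construct it with the keyword arguments from `--api-client-args`. The exact interface REBEL expects from that class is not documented in this README, so the sketch below is a placeholder shape only: the `generate` method name, its signature, and the `AssistantOutput(output=...)` constructor call are assumptions made for illustration.

```python
# my_module.py -- loaded via --api-client-module my_module --api-client-class MyAPIClient
from rebel.models import AssistantOutput, Message

class MyAPIClient:
    def __init__(self, api_key: str):
        # Receives the keyword arguments passed via --api-client-args.
        self.api_key = api_key

    def generate(self, messages: list[Message], **api_params) -> AssistantOutput:
        # Placeholder method name and body: call your model/API here, then wrap
        # its reply in REBEL's AssistantOutput (the `output=` keyword is assumed
        # from the `actual.output` access in the metric example further down).
        reply = f"echo: {messages[-1].content}"  # stand-in for a real model response
        return AssistantOutput(output=reply)
```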
## Metrics
### Implement Custom Metrics
Create deterministic metrics by inheriting from the `Metric` base class:
```python
from rebel.models import Metric, AssistantInput, AssistantOutput, EvaluationResult, EvaluationVerdict
class MyCustomMetric(Metric):
    def measure(self, input: AssistantInput, expected: AssistantOutput, actual: AssistantOutput) -> EvaluationResult:
        # Your evaluation logic here; compute_score is any scoring function you provide
        score = compute_score(actual.output, expected.output)

        return EvaluationResult(
            score=score,
            verdict=EvaluationVerdict.PASSED if score > 0.5 else EvaluationVerdict.FAILED,
            reason=f"Score: {score}"
        )

    def get_name(self) -> str:
        return "My Custom Metric"
```
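The `compute_score` helper above is left abstract. As a minimal stand-in (the function name and the 0-1 scale are this example's choices, not part of REBEL), a string-similarity scorer built on the standard library could look like this:

```python
from difflib import SequenceMatcher

def compute_score(actual_text: str, expected_text: str) -> float:
    # 0-1 similarity between actual and expected output text; swap in any
    # domain-specific scoring (exact match, regex checks, embeddings, ...).
    if not expected_text:
        return 1.0 if not actual_text else 0.0
    return SequenceMatcher(None, actual_text.strip(), expected_text.strip()).ratio()
```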
### Built-in REBEL Metrics
REBEL provides several ready-to-use metrics:
- **ContextualFScore**: RAG evaluation with precision/recall analysis
- **ToolCallsAccuracy**: Function calling evaluation with flexible matching
- **Custom Distance Metrics**: Configurable similarity measurements
Example usage:
```python
from rebel.metrics import ContextualFScore, ToolCallsAccuracy
# RAG evaluation
contextual_metric = ContextualFScore(
    beta=1.0,
    threshold=0.7,
    model=your_judge_model,
    template=your_template
)

# Tool calling evaluation
tool_metric = ToolCallsAccuracy(
    threshold=0.8,
    strict_mode=True
)
```
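For reference, the `beta` argument above presumably follows the standard F-beta definition, where `beta > 1` weights recall more heavily and `beta < 1` weights precision more heavily; a quick reference computation (not REBEL's internal code):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    # Standard F-beta score combining contextual precision and recall.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.8, 0.6))  # beta=1.0 -> harmonic mean, ~0.686
```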
## Tests
### Define Test Cases
Use the `@test_case` decorator to create comprehensive test suites. Our [test examples](https://github.com/tensorsearchcom/rebel/example/openrouter/openrouter/tests) show various patterns:
```python
from rebel import test_case
from rebel.models import Message, RoleEnum, TestGroup, RetryParams, ParameterGrid
@test_case(
    messages=[Message(role=RoleEnum.user, content="Test query")],
    tags=["accuracy", "basic"],
    api_params={"temperature": 0.7},
    param_grid=ParameterGrid(parameters={"max_tokens": [100, 200, 500]})
)
def test_comprehensive_evaluation():
    # Multiple test groups with different configurations
    # (AccuracyMetric and LatencyMetric stand for your own Metric subclasses)
    yield TestGroup(
        metrics=[AccuracyMetric()],
        retry_params=RetryParams(count=3, aggregation_strategy="mean"),
        tags=["primary"]
    )

    yield TestGroup(
        metrics=[LatencyMetric()],
        retry_params=RetryParams(count=5, aggregation_strategy="median"),
        tags=["performance"]
    )
```
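Conceptually, a `ParameterGrid` expands a test case across every combination of the supplied parameter values, so the `param_grid` above yields one variant per `max_tokens` value. The standalone sketch below illustrates that Cartesian expansion (it is not REBEL's internal code):

```python
from itertools import product

def expand_grid(parameters: dict) -> list:
    # Expand {name: [values, ...]} into one dict per combination of values.
    names = list(parameters)
    return [dict(zip(names, combo)) for combo in product(*(parameters[n] for n in names))]

print(expand_grid({"max_tokens": [100, 200, 500], "temperature": [0.0, 0.7]}))
# -> 6 combinations, e.g. {'max_tokens': 100, 'temperature': 0.0}, ...
```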
### Test Organization Features
- **Parameter Grids**: Automatic test expansion across parameter combinations
- **Retry Mechanisms**: Configurable retry counts with aggregation strategies (mean, min, max, median); a sketch of what these strategies compute follows this list
- **Tagging System**: Flexible test filtering and organization
- **Expected Outputs**: Optional ground truth specification for comparison
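The sketch below shows what the four aggregation strategies compute over the per-retry scores, using only the standard library (an illustration of the strategy names, not REBEL's implementation):

```python
import statistics

# Plain functions behind the aggregation_strategy names used above.
AGGREGATORS = {
    "mean": statistics.mean,
    "min": min,
    "max": max,
    "median": statistics.median,
}

retry_scores = [0.7, 0.9, 0.8]  # e.g. three attempts from RetryParams(count=3, ...)
print(AGGREGATORS["mean"](retry_scores))    # ~0.8
print(AGGREGATORS["median"](retry_scores))  # 0.8
```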
## DeepEval Integration
### Integrate DeepEval Metrics
Extend `DeepevalMetric` to use DeepEval's advanced evaluation capabilities. Check out our [China Alignment Metric example](https://github.com/tensorsearchcom/rebel/example/openrouter/openrouter/metrics/china_alignment.py) for a complete implementation:
```python
from rebel.deepeval.metric import DeepevalMetric
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
class MyDeepevalMetric(DeepevalMetric):
    threshold: float = 0.7

    def get_name(self):
        return "My DeepEval Metric"

    def get_deepeval_metric(self):
        return GEval(
            name="Custom Evaluation",
            criteria="Evaluate response quality and accuracy",
            evaluation_steps=[
                "Check factual accuracy",
                "Assess response completeness",
                "Verify appropriate tone"
            ],
            evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
            threshold=self.threshold,
            model=self.judge_llm
        )
```
### Judge Model Configuration
Configure your judge models using the DeepEval client:
```python
from rebel.deepeval.client import OpenAIClientLLM
judge_config = {
    "model": "gpt-4",
    "api_key": "your-key",
    "base_url": "https://api.openai.com/v1",
    "temperature": 0.1
}

judge_llm = OpenAIClientLLM(judge_config)
```
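Since the `get_deepeval_metric` example above reads `self.judge_llm`, a `DeepevalMetric` subclass presumably carries a judge model field; exactly how it is set is not shown in this README, so the constructor keywords below are hypothetical. The instance then plugs into a `TestGroup` just like the deterministic metrics earlier:

```python
# Hypothetical wiring: field names follow the attributes referenced above.
metric = MyDeepevalMetric(threshold=0.8, judge_llm=judge_llm)

# Used inside a test case exactly like MyCustomMetric:
# yield TestGroup(metrics=[metric], retry_params=RetryParams(count=3, aggregation_strategy="mean"))
```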
## Results
### Investigate Test Results
REBEL generates comprehensive JSON reports with detailed execution metadata:
```json
{
  "metadata": {
    "timestamp": "20250722_113301",
    "total_test_cases": 18
  },
  "test_cases": [
    {
      "name": "test_example_[]",
      "actual_outputs": [
        {
          "output": "Response text",
          "execution_time": 0.625
        }
      ],
      "evaluation_results": [
        {
          "score": 0.85,
          "verdict": "passed",
          "reason": "High quality response"
        }
      ],
      "aggregated_result": {
        "score": 0.85,
        "verdict": "passed"
      }
    }
  ]
}
```
### Result Analysis Features
- **Individual Attempt Tracking**: Complete execution history for each retry
- **Aggregated Scores**: Statistical summaries based on configured strategies
- **Execution Metadata**: Performance metrics including response times
- **Detailed Reasoning**: Comprehensive failure analysis and success explanations
- **Structured Output**: Machine-readable JSON format for automated processing
Results are automatically organized by model name and timestamp in your specified output directory, enabling easy comparison and historical analysis.
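Because each report is plain JSON with the structure shown above, post-processing needs no special tooling. A minimal sketch that computes an overall pass rate from one report file (the path and file name are placeholders; use whatever REBEL wrote into your `--output-folder`):

```python
import json
from pathlib import Path

def pass_rate(report_path: str) -> float:
    # Fraction of test cases whose aggregated verdict is "passed";
    # field names follow the report structure shown above.
    report = json.loads(Path(report_path).read_text())
    cases = report["test_cases"]
    passed = sum(1 for case in cases if case["aggregated_result"]["verdict"] == "passed")
    return passed / len(cases) if cases else 0.0

print(pass_rate("results/my-model/20250722_113301.json"))  # placeholder path
```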
Raw data
{
"_id": null,
"home_page": null,
"name": "rebel-eval",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": null,
"author": "Alexander Ploskin",
"author_email": "ploskin0107@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/1e/f8/9000e7f24414bf229bc969d2f0f30378f76c78c00f96987fcdff1c174998/rebel_eval-0.2.0.tar.gz",
"platform": null,
"description": "# REBEL Framework\n\n**REBEL** is a powerful evaluation framework for Large Language Model (LLM) assistants that provides comprehensive benchmarking capabilities with support for both deterministic and AI-judge based metrics.\n\n## Description\n\nREBEL enables developers to create robust evaluation pipelines for LLM applications through:\n\n- **Flexible Test Definition**: Decorator-based test case creation with parameter grids and retry mechanisms\n- **Multi-Metric Support**: Both rule-based and LLM-judge evaluation methods\n- **Parallel Execution**: Concurrent API calls and evaluations for efficient benchmarking\n- **DeepEval Integration**: Seamless integration with the DeepEval ecosystem\n- **Comprehensive Results**: Detailed scoring with aggregation strategies and execution metadata\n\n## How to Use?\n\n### Installation\n\n```bash\npip install rebel-eval[deepeval]\n```\n\n### Define Tests and Metrics\n\nCreate your test files using REBEL's decorator pattern. See our [complete example](https://github.com/tensorsearchcom/rebel/example/openrouter/) for detailed implementation.\n\n```python\nfrom rebel import test_case\nfrom rebel.models import Message, RoleEnum, TestGroup, RetryParams\n\n@test_case(\n messages=[\n Message(role=RoleEnum.system, content=\"You are a helpful assistant.\"),\n Message(role=RoleEnum.user, content=\"Count the letter 'r' in this text.\")\n ]\n)\ndef test_counting_accuracy():\n yield TestGroup(\n retry_params=RetryParams(count=3, aggregation_strategy=\"mean\"),\n metrics=[MyCustomMetric()]\n )\n```\n\n### Run Benchmarks\n\nExecute your benchmark using the CLI:\n\n```bash\n# Using configuration file\nrebel run --test-dir tests/ --output-folder results/ --api-config model_config.json\n\n# Using custom client\nrebel run --test-dir tests/ --output-folder results/ \\\n --api-client-module my_module \\\n --api-client-class MyAPIClient \\\n --api-client-args '{\"api_key\": \"your-key\"}'\n```\n\n## Metrics\n\n### Implement Custom Metrics\n\nCreate deterministic metrics by inheriting from the `Metric` base class:\n\n```python\nfrom rebel.models import Metric, AssistantInput, AssistantOutput, EvaluationResult, EvaluationVerdict\n\nclass MyCustomMetric(Metric):\n def measure(self, input: AssistantInput, expected: AssistantOutput, actual: AssistantOutput) -> EvaluationResult:\n # Your evaluation logic here\n score = compute_score(actual.output, expected.output)\n \n return EvaluationResult(\n score=score,\n verdict=EvaluationVerdict.PASSED if score > 0.5 else EvaluationVerdict.FAILED,\n reason=f\"Score: {score}\"\n )\n \n def get_name(self) -> str:\n return \"My Custom Metric\"\n```\n\n### Built-in REBEL Metrics\n\nREBEL provides several ready-to-use metrics:\n\n- **ContextualFScore**: RAG evaluation with precision/recall analysis\n- **ToolCallsAccuracy**: Function calling evaluation with flexible matching\n- **Custom Distance Metrics**: Configurable similarity measurements\n\nExample usage:\n\n```python\nfrom rebel.metrics import ContextualFScore, ToolCallsAccuracy\n\n# RAG evaluation\ncontextual_metric = ContextualFScore(\n beta=1.0,\n threshold=0.7,\n model=your_judge_model,\n template=your_template\n)\n\n# Tool calling evaluation\ntool_metric = ToolCallsAccuracy(\n threshold=0.8,\n strict_mode=True\n)\n```\n\n## Tests\n\n### Define Test Cases\n\nUse the `@test_case` decorator to create comprehensive test suites. 
Our [test examples](https://github.com/tensorsearchcom/rebel/example/openrouter/openrouter/tests) show various patterns:\n\n\n```python\nfrom rebel import test_case\nfrom rebel.models import Message, RoleEnum, TestGroup, RetryParams, ParameterGrid\n\n@test_case(\n messages=[Message(role=RoleEnum.user, content=\"Test query\")],\n tags=[\"accuracy\", \"basic\"],\n api_params={\"temperature\": 0.7},\n param_grid=ParameterGrid(parameters={\"max_tokens\": [100, 200, 500]})\n)\ndef test_comprehensive_evaluation():\n # Multiple test groups with different configurations\n yield TestGroup(\n metrics=[AccuracyMetric()],\n retry_params=RetryParams(count=3, aggregation_strategy=\"mean\"),\n tags=[\"primary\"]\n )\n \n yield TestGroup(\n metrics=[LatencyMetric()],\n retry_params=RetryParams(count=5, aggregation_strategy=\"median\"),\n tags=[\"performance\"]\n )\n```\n\n### Test Organization Features\n\n- **Parameter Grids**: Automatic test expansion across parameter combinations\n- **Retry Mechanisms**: Configurable retry counts with aggregation strategies (mean, min, max, median)\n- **Tagging System**: Flexible test filtering and organization\n- **Expected Outputs**: Optional ground truth specification for comparison\n\n## DeepEval Integration\n\n### Integrate DeepEval Metrics\n\n\nExtend `DeepevalMetric` to use DeepEval's advanced evaluation capabilities. Check out our [China Alignment Metric example](https://github.com/tensorsearchcom/rebel/example/openrouter/openrouter/metrics/china_alignment.py) for a complete implementation:\n\n```python\nfrom rebel.deepeval.metric import DeepevalMetric\nfrom deepeval.metrics import GEval\nfrom deepeval.test_case import LLMTestCaseParams\n\nclass MyDeepevalMetric(DeepevalMetric):\n threshold: float = 0.7\n \n def get_name(self):\n return \"My DeepEval Metric\"\n \n def get_deepeval_metric(self):\n return GEval(\n name=\"Custom Evaluation\",\n criteria=\"Evaluate response quality and accuracy\",\n evaluation_steps=[\n \"Check factual accuracy\",\n \"Assess response completeness\",\n \"Verify appropriate tone\"\n ],\n evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],\n threshold=self.threshold,\n model=self.judge_llm\n )\n```\n\n### Judge Model Configuration\n\nConfigure your judge models using the DeepEval client:\n\n```python\nfrom rebel.deepeval.client import OpenAIClientLLM\n\njudge_config = {\n \"model\": \"gpt-4\",\n \"api_key\": \"your-key\",\n \"base_url\": \"https://api.openai.com/v1\",\n \"temperature\": 0.1\n}\n\njudge_llm = OpenAIClientLLM(judge_config)\n```\n\n## Results\n\n### Investigate Test Results\n\nREBEL generates comprehensive JSON reports with detailed execution metadata:\n\n```json\n{\n \"metadata\": {\n \"timestamp\": \"20250722_113301\",\n \"total_test_cases\": 18\n },\n \"test_cases\": [\n {\n \"name\": \"test_example_[]\",\n \"actual_outputs\": [\n {\n \"output\": \"Response text\",\n \"execution_time\": 0.625\n }\n ],\n \"evaluation_results\": [\n {\n \"score\": 0.85,\n \"verdict\": \"passed\",\n \"reason\": \"High quality response\"\n }\n ],\n \"aggregated_result\": {\n \"score\": 0.85,\n \"verdict\": \"passed\"\n }\n }\n ]\n}\n```\n\n### Result Analysis Features\n\n- **Individual Attempt Tracking**: Complete execution history for each retry\n- **Aggregated Scores**: Statistical summaries based on configured strategies\n- **Execution Metadata**: Performance metrics including response times\n- **Detailed Reasoning**: Comprehensive failure analysis and success explanations\n- **Structured Output**: 
Machine-readable JSON format for automated processing\n\nResults are automatically organized by model name and timestamp in your specified output directory, enabling easy comparison and historical analysis.\n",
"bugtrack_url": null,
"license": null,
"summary": "RAG Evaluation Benchmark and Evaluation Library",
"version": "0.2.0",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "82eddafac8da24f88edd590fdbf43da4a2381ab23e970c8ad1b27b492afdf180",
"md5": "d178a9a40e53e52b0e02c4731c46ecfa",
"sha256": "53f7254a0f7f9edfc138a6d72af0e1dd65ddc19c8b3b95b406e022453cf2db97"
},
"downloads": -1,
"filename": "rebel_eval-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d178a9a40e53e52b0e02c4731c46ecfa",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 36661,
"upload_time": "2025-08-08T10:17:23",
"upload_time_iso_8601": "2025-08-08T10:17:23.493153Z",
"url": "https://files.pythonhosted.org/packages/82/ed/dafac8da24f88edd590fdbf43da4a2381ab23e970c8ad1b27b492afdf180/rebel_eval-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "1ef89000e7f24414bf229bc969d2f0f30378f76c78c00f96987fcdff1c174998",
"md5": "cccfcf0eed631d7a40caf2f431b5f434",
"sha256": "ba2bb8f45559bb674f682a1b7ddb0c9fec37d9968a157d9f49c2f81a049a581d"
},
"downloads": -1,
"filename": "rebel_eval-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "cccfcf0eed631d7a40caf2f431b5f434",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 29141,
"upload_time": "2025-08-08T10:17:25",
"upload_time_iso_8601": "2025-08-08T10:17:25.280682Z",
"url": "https://files.pythonhosted.org/packages/1e/f8/9000e7f24414bf229bc969d2f0f30378f76c78c00f96987fcdff1c174998/rebel_eval-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-08 10:17:25",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "rebel-eval"
}