monitoring-rag

Name: monitoring-rag
Version: 0.0.2
Home page: None
Summary: A comprehensive, framework-agnostic library for evaluating Retrieval-Augmented Generation (RAG) pipelines.
Upload time: 2025-07-24 11:25:53
Maintainer: None
Docs URL: None
Author: None
Requires Python: >=3.8
License: None
Keywords: rag, evaluation, llm, ai, machine-learning, nlp, retrieval, generation, langchain, azure, openai
Requirements: No requirements were recorded.
# šŸŽÆ Monitoring-RAG

**Comprehensive RAG Evaluation with LangChain Integration**

[![PyPI version](https://badge.fury.io/py/monitoring-rag.svg)](https://pypi.org/project/monitoring-rag/)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A streamlined Python library for evaluating Retrieval-Augmented Generation (RAG) systems using LangChain's powerful framework. Evaluate your RAG pipeline with a single `evaluate()` call.

## šŸš€ Key Features

- **šŸŽÆ Unified Interface**: Single `RAGEvaluator` class for all metrics
- **⚔ LangChain Powered**: Built on industry-standard LangChain framework
- **šŸ“Š Comprehensive Metrics**: 12+ evaluation metrics covering generation, retrieval, and composite evaluation
- **šŸ”„ Async Support**: Full asynchronous evaluation for high performance
- **šŸ› ļø Flexible Configuration**: Support for OpenAI and Azure OpenAI
- **šŸ“¦ Simple API**: Evaluate with just query, context, and generated text

## šŸ“‹ Quick Reference

```python
# Import the main evaluator
from rag_evals import RAGEvaluator

# Initialize
evaluator = RAGEvaluator(llm_provider="openai", model="gpt-4")

# Evaluate all metrics
results = evaluator.evaluate(
    query="Your question",
    answer="RAG system's answer",
    retrieved_contexts=["Retrieved context"]
)

# Evaluate specific metrics
results = evaluator.evaluate(
    query="...",
    answer="...",
    retrieved_contexts=["..."],
    metrics=["faithfulness", "answer_relevance"]
)

# Available metrics
generation_metrics = ["faithfulness", "answer_relevance", "answer_correctness", 
                     "completeness", "coherence", "helpfulness"]
retrieval_metrics = ["context_recall", "context_relevance", "context_precision"]
composite_metrics = ["llm_judge", "rag_certainty", "trust_score"]
```

## šŸš€ Installation

```bash
pip install monitoring-rag
```

> **Note**: Install as `monitoring-rag` but import as `rag_evals`
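
A quick sanity check confirms the two names line up after installation (nothing library-specific is assumed here beyond the import name):

```python
# The distribution name is `monitoring-rag`; the importable package is `rag_evals`.
import rag_evals

print("rag_evals imported successfully")
```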

## šŸ”§ Quick Start

### Basic Usage

```python
from rag_evals import RAGEvaluator

# Initialize evaluator
evaluator = RAGEvaluator(
    llm_provider="openai",  # or "azure"
    model="gpt-4",
    api_key="your-api-key"  # or set OPENAI_API_KEY environment variable
)

# Evaluate a RAG response
results = evaluator.evaluate(
    query="What is machine learning?",
    answer="Machine learning is a subset of artificial intelligence that uses algorithms to learn patterns from data.",
    retrieved_contexts=["Machine learning is a method of data analysis that automates analytical model building using algorithms that iteratively learn from data."]
)

print(results)
# Output: {
#     "faithfulness": 0.95,
#     "answer_relevance": 0.92,
#     "context_relevance": 0.88,
#     "completeness": 0.85,
#     "coherence": 0.90,
#     ...
# }
```

### Azure OpenAI Configuration

```python
from rag_evals import RAGEvaluator

# Using Azure OpenAI
evaluator = RAGEvaluator(
    llm_provider="azure",
    model="gpt-4",
    azure_config={
        "api_key": "your-azure-key",
        "azure_endpoint": "https://your-resource.openai.azure.com/",
        "azure_deployment": "gpt-4-deployment",
        "api_version": "2024-02-01"
    }
)

# Or using environment variables
# AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT
evaluator = RAGEvaluator(llm_provider="azure", model="gpt-4")
```

### Batch Evaluation

```python
# Evaluate multiple responses at once
inputs = [
    {
        "query": "What is Python?",
        "generated_text": "Python is a programming language.",
        "context": "Python is a high-level programming language known for its simplicity."
    },
    {
        "query": "What is machine learning?",
        "generated_text": "ML uses algorithms to find patterns.",
        "context": "Machine learning involves training algorithms on data."
    }
]

results = evaluator.evaluate_batch(inputs)
# Returns list of result dictionaries
```

### Selective Metric Evaluation

```python
# Evaluate only specific metrics
results = evaluator.evaluate(
    query="What is AI?",
    generated_text="AI is artificial intelligence.",
    context="Artificial intelligence refers to machine intelligence.",
    metrics=["faithfulness", "answer_relevance", "coherence"]
)

# Configure evaluator with specific metrics
evaluator = RAGEvaluator(
    llm_provider="openai",
    model="gpt-4",
    metrics=["faithfulness", "completeness", "trust_score"]
)
```

## šŸ“Š Available Metrics

### Generation Metrics
Evaluate the quality of generated responses:

- **Faithfulness**: How well the answer is grounded in the provided contexts
- **Answer Relevance**: How relevant the answer is to the user's query  
- **Answer Correctness**: Factual accuracy of the generated answer
- **Completeness**: How thoroughly the answer addresses the query
- **Coherence**: Logical flow and readability of the answer
- **Helpfulness**: Practical value and actionability of the answer

### Retrieval Metrics
Evaluate the quality of retrieved contexts:

- **Context Recall**: How well contexts support generating the ground truth
- **Context Relevance**: How relevant retrieved contexts are to the query
- **Context Precision**: Quality and ranking of retrieved contexts

### Composite Metrics
Holistic evaluation combining multiple aspects:

- **LLM Judge**: Comprehensive multi-dimensional evaluation
- **RAG Certainty**: Confidence and reliability assessment
- **Trust Score**: Overall trustworthiness evaluation
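
As a rough way to see where a pipeline is weakest, the three groups above can be scored separately. The sketch below assumes only the `evaluate(query, answer, context, metrics=...)` call shown in the Quick Reference; the grouping helper itself (`METRIC_GROUPS`, `category_averages`) is illustrative and not part of the library.

```python
# Metric groups as listed above (illustrative constant, not a library export).
METRIC_GROUPS = {
    "generation": ["faithfulness", "answer_relevance", "answer_correctness",
                   "completeness", "coherence", "helpfulness"],
    "retrieval": ["context_recall", "context_relevance", "context_precision"],
    "composite": ["llm_judge", "rag_certainty", "trust_score"],
}

def category_averages(evaluator, query: str, answer: str, context: str) -> dict:
    """Average score per metric category, to spot the weakest stage of the pipeline."""
    averages = {}
    for group, metrics in METRIC_GROUPS.items():
        scores = evaluator.evaluate(query, answer, context, metrics=metrics)
        averages[group] = sum(scores.values()) / len(scores)
    return averages
```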

## šŸ“– Step-by-Step Tutorial

### Step 1: Installation and Setup

```bash
# Install the library
pip install monitoring-rag

# Create a .env file for your API keys (optional)
echo "OPENAI_API_KEY=your-api-key-here" > .env
```

### Step 2: Basic Single Evaluation

```python
from rag_evals import RAGEvaluator

# Initialize the evaluator
evaluator = RAGEvaluator(
    llm_provider="openai",
    model="gpt-4"
    # API key will be read from OPENAI_API_KEY environment variable
)

# Your RAG system output
query = "What are the benefits of renewable energy?"
context = """
Renewable energy sources like solar and wind power offer numerous benefits:
1. They reduce greenhouse gas emissions
2. They provide energy independence
3. They create jobs in the green economy
4. Operating costs are lower than fossil fuels
"""
generated_text = """
Renewable energy provides several key benefits including reducing carbon emissions,
increasing energy independence, creating green jobs, and offering lower operating costs
compared to traditional fossil fuels.
"""

# Evaluate all metrics
results = evaluator.evaluate(
    query=query,
    generated_text=generated_text,
    context=context
)

# Display results
for metric, score in results.items():
    print(f"{metric}: {score:.2f}")
```

### Step 3: Evaluating Specific Metrics

```python
# Evaluate only generation metrics
generation_results = evaluator.evaluate(
    query=query,
    generated_text=generated_text,
    context=context,
    metrics=["faithfulness", "answer_relevance", "completeness"]
)

print("Generation Metrics:")
for metric, score in generation_results.items():
    print(f"  {metric}: {score:.2f}")

# Evaluate only retrieval metrics (requires ground truth)
retrieval_results = evaluator.evaluate(
    query=query,
    generated_text=generated_text,
    context=context,
    ground_truth="Renewable energy reduces emissions, provides energy independence, creates jobs, and has lower operating costs.",
    metrics=["context_recall", "context_precision"]
)

print("\nRetrieval Metrics:")
for metric, score in retrieval_results.items():
    print(f"  {metric}: {score:.2f}")
```

### Step 4: Batch Evaluation for Multiple Samples

```python
# Prepare multiple samples
samples = [
    {
        "query": "What is machine learning?",
        "generated_text": "Machine learning is a subset of AI that enables systems to learn from data.",
        "context": "Machine learning is a branch of artificial intelligence that focuses on building systems that learn from data."
    },
    {
        "query": "Explain quantum computing",
        "generated_text": "Quantum computing uses quantum mechanics principles to process information.",
        "context": "Quantum computing is a type of computation that harnesses quantum mechanical phenomena like superposition and entanglement."
    },
    {
        "query": "What are neural networks?",
        "generated_text": "Neural networks are computing systems inspired by biological neural networks.",
        "context": "Artificial neural networks are computing systems vaguely inspired by the biological neural networks in animal brains."
    }
]

# Evaluate batch
batch_results = evaluator.evaluate_batch(samples)

# Analyze results
for i, result in enumerate(batch_results):
    print(f"\nSample {i+1}:")
    avg_score = sum(result.values()) / len(result)
    print(f"  Average Score: {avg_score:.2f}")
    print(f"  Faithfulness: {result.get('faithfulness', 0):.2f}")
    print(f"  Relevance: {result.get('answer_relevance', 0):.2f}")
```

### Step 5: Working with Azure OpenAI

```python
# Method 1: Using environment variables
import os
os.environ["AZURE_OPENAI_API_KEY"] = "your-azure-key"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource.openai.azure.com/"
os.environ["AZURE_OPENAI_DEPLOYMENT"] = "gpt-4-deployment"

evaluator = RAGEvaluator(
    llm_provider="azure",
    model="gpt-4"
)

# Method 2: Direct configuration
evaluator = RAGEvaluator(
    llm_provider="azure",
    model="gpt-4",
    azure_config={
        "api_key": "your-azure-key",
        "azure_endpoint": "https://your-resource.openai.azure.com/",
        "azure_deployment": "gpt-4-deployment",
        "api_version": "2024-02-01"
    }
)
```

### Step 6: Interpreting Results

```python
def interpret_score(score: float, metric: str) -> str:
    """Interpret metric scores with recommendations."""
    if score >= 0.9:
        return "Excellent - No improvement needed"
    elif score >= 0.7:
        return "Good - Minor improvements possible"
    elif score >= 0.5:
        return "Moderate - Consider improvements"
    else:
        return "Poor - Significant improvements needed"

# Evaluate and interpret
results = evaluator.evaluate(query, generated_text, context)

print("RAG System Evaluation Report")
print("=" * 40)
for metric, score in results.items():
    interpretation = interpret_score(score, metric)
    print(f"{metric:20} {score:.2f} - {interpretation}")

# Identify weakest areas
weak_metrics = {m: s for m, s in results.items() if s < 0.7}
if weak_metrics:
    print("\nāš ļø Areas needing improvement:")
    for metric, score in weak_metrics.items():
        print(f"  - {metric}: {score:.2f}")
```

### Step 7: Production Usage Pattern

```python
from typing import Dict, Any

from rag_evals import RAGEvaluator

class RAGSystem:
    def __init__(self):
        self.evaluator = RAGEvaluator(
            llm_provider="openai",
            model="gpt-4",
            metrics=["faithfulness", "answer_relevance", "completeness"]
        )
        self.threshold = 0.7  # Minimum acceptable score
        
    def generate_and_evaluate(self, query: str) -> Dict[str, Any]:
        # Your RAG pipeline here
        context = self.retrieve_context(query)
        generated_text = self.generate_answer(query, context)
        
        # Evaluate the response
        scores = self.evaluator.evaluate(
            query=query,
            generated_text=generated_text,
            context=context
        )
        
        # Check if response meets quality threshold
        avg_score = sum(scores.values()) / len(scores)
        
        return {
            "answer": generated_text,
            "context": context,
            "evaluation": scores,
            "average_score": avg_score,
            "meets_threshold": avg_score >= self.threshold
        }
    
    def retrieve_context(self, query: str) -> str:
        # Your retrieval logic here
        return "Retrieved context..."
    
    def generate_answer(self, query: str, context: str) -> str:
        # Your generation logic here
        return "Generated answer..."

# Usage
rag_system = RAGSystem()
result = rag_system.generate_and_evaluate("What is climate change?")

if result["meets_threshold"]:
    print(f"āœ… Response quality approved: {result['average_score']:.2f}")
else:
    print(f"āŒ Response below threshold: {result['average_score']:.2f}")
    print("Consider regenerating or improving retrieval")
```

## šŸ”§ Advanced Usage

### Custom LLM Configuration

```python
# Configure LLM parameters
evaluator = RAGEvaluator(
    llm_provider="openai",
    model="gpt-4",
    temperature=0.1,
    max_tokens=1000,
    top_p=0.9
)
```

### Dynamic Metric Management

```python
# List available metrics
print(evaluator.list_metrics())

# Add/remove metrics dynamically
evaluator.add_metric("trust_score")
evaluator.remove_metric("coherence")

# Check configured metrics
print(evaluator.get_configured_metrics())
```

### Async Evaluation

```python
import asyncio

async def evaluate_async():
    results = await evaluator.aevaluate(
        query="What is RAG?",
        generated_text="RAG combines retrieval and generation...",
        context="Retrieval-Augmented Generation enhances LLMs..."
    )
    return results

# Run async evaluation
results = asyncio.run(evaluate_async())
```

## šŸ’” Common Use Cases

### 1. RAG System Development
```python
# During development, evaluate different retrieval strategies
strategies = ["dense", "sparse", "hybrid"]
results = {}

for strategy in strategies:
    context = retrieve_with_strategy(query, strategy)
    answer = generate_answer(query, context)
    
    scores = evaluator.evaluate(query, answer, context)
    results[strategy] = scores
    
# Compare strategies
best_strategy = max(results.items(), key=lambda x: sum(x[1].values()))
print(f"Best strategy: {best_strategy[0]}")
```

### 2. A/B Testing Different Models
```python
# Compare different LLM models
models = ["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"]
model_performance = {}

for model in models:
    eval_temp = RAGEvaluator(llm_provider="openai", model=model)
    scores = eval_temp.evaluate(query, generated_text, context)
    model_performance[model] = sum(scores.values()) / len(scores)

# Find best performing model
best_model = max(model_performance.items(), key=lambda x: x[1])
print(f"Best model: {best_model[0]} (avg score: {best_model[1]:.2f})")
```

### 3. Quality Assurance Pipeline
```python
def qa_pipeline(query: str, answer: str, context: str) -> bool:
    """Quality assurance check before serving responses."""
    # Define minimum thresholds
    thresholds = {
        "faithfulness": 0.8,
        "answer_relevance": 0.7,
        "coherence": 0.75
    }
    
    # Evaluate critical metrics
    scores = evaluator.evaluate(
        query, answer, context, 
        metrics=list(thresholds.keys())
    )
    
    # Check all thresholds
    passed = all(scores[m] >= thresholds[m] for m in thresholds)
    
    if not passed:
        failed_metrics = [m for m in thresholds if scores[m] < thresholds[m]]
        print(f"QA Failed: {failed_metrics}")
    
    return passed
```

### 4. Continuous Monitoring
```python
import json
from datetime import datetime

def log_evaluation(query: str, answer: str, context: str, log_file: str):
    """Log evaluations for monitoring RAG system performance over time."""
    scores = evaluator.evaluate(query, answer, context)
    
    log_entry = {
        "timestamp": datetime.now().isoformat(),
        "query": query[:100],  # Truncate for logging
        "scores": scores,
        "average_score": sum(scores.values()) / len(scores)
    }
    
    with open(log_file, "a") as f:
        f.write(json.dumps(log_entry) + "\n")
    
    # Alert if performance drops
    if log_entry["average_score"] < 0.6:
        print(f"āš ļø Low performance alert: {log_entry['average_score']:.2f}")
```
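
To read that JSON-lines log back for trend monitoring, a small reader (illustrative, not part of the library) can compute a rolling average over the most recent entries:

```python
import json

def recent_average(log_file: str, window: int = 50) -> float:
    """Average of `average_score` over the last `window` logged evaluations."""
    with open(log_file) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    recent = entries[-window:]
    return sum(e["average_score"] for e in recent) / len(recent) if recent else 0.0
```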

## šŸ”§ Troubleshooting

### Common Issues and Solutions

#### 1. API Key Errors
```python
# Issue: "API key not found" error
# Solution 1: Set environment variable
import os
os.environ["OPENAI_API_KEY"] = "your-key"

# Solution 2: Pass directly
evaluator = RAGEvaluator(
    llm_provider="openai",
    model="gpt-4",
    api_key="your-key"
)

# Solution 3: Use .env file
# Create .env file with: OPENAI_API_KEY=your-key
from dotenv import load_dotenv
load_dotenv()
```

#### 2. Azure Configuration Issues
```python
# Issue: Azure endpoint errors
# Solution: Ensure all required fields are provided
evaluator = RAGEvaluator(
    llm_provider="azure",
    model="gpt-4",
    azure_config={
        "api_key": "required",
        "azure_endpoint": "required - must end with /",
        "azure_deployment": "required - your deployment name",
        "api_version": "2024-02-01"  # Use latest version
    }
)
```

#### 3. Memory Issues with Batch Processing
```python
# Issue: Out of memory with large batches
# Solution: Process in smaller chunks
import time

def evaluate_large_dataset(data, chunk_size=10):
    all_results = []
    
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        results = evaluator.evaluate_batch(chunk)
        all_results.extend(results)
        
        # Optional: Add delay to avoid rate limits
        time.sleep(1)
    
    return all_results
```

#### 4. Handling Rate Limits
```python
import time
from typing import Dict

def evaluate_with_retry(
    query: str, 
    generated_text: str, 
    context: str,
    max_retries: int = 3
) -> Dict[str, float]:
    """Evaluate with automatic retry on rate limit errors."""
    for attempt in range(max_retries):
        try:
            return evaluator.evaluate(query, generated_text, context)
        except Exception as e:
            if "rate limit" in str(e).lower() and attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Rate limited. Waiting {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                raise
```

#### 5. Debugging Low Scores
```python
def debug_low_scores(query: str, answer: str, context: str):
    """Analyze why scores are low."""
    results = evaluator.evaluate(query, answer, context)
    
    print("Debugging Low Scores:")
    print("-" * 40)
    
    # Check each component
    if results.get("faithfulness", 0) < 0.5:
        print("āŒ Faithfulness Issue: Answer may contain hallucinations")
        print("   Check if all claims are supported by context")
    
    if results.get("answer_relevance", 0) < 0.5:
        print("āŒ Relevance Issue: Answer doesn't address the query")
        print("   Ensure answer directly responds to the question")
    
    if results.get("context_relevance", 0) < 0.5:
        print("āŒ Context Issue: Retrieved context is not relevant")
        print("   Improve retrieval strategy or query processing")
    
    if results.get("completeness", 0) < 0.5:
        print("āŒ Completeness Issue: Answer is incomplete")
        print("   Add more relevant information to the answer")
```

## šŸ—ļø Architecture

```
RAGEvaluator
ā”œā”€ā”€ Generation Metrics
│   ā”œā”€ā”€ Faithfulness
│   ā”œā”€ā”€ Answer Relevance  
│   ā”œā”€ā”€ Answer Correctness
│   ā”œā”€ā”€ Completeness
│   ā”œā”€ā”€ Coherence
│   └── Helpfulness
ā”œā”€ā”€ Retrieval Metrics
│   ā”œā”€ā”€ Context Recall
│   ā”œā”€ā”€ Context Relevance
│   └── Context Precision
└── Composite Metrics
    ā”œā”€ā”€ LLM Judge
    ā”œā”€ā”€ RAG Certainty
    └── Trust Score
```

## šŸ”„ Migration from v1.x

RAG Evals 2.0 introduces breaking changes for a cleaner, more powerful API:

### Before (v1.x)
```python
from rag_evals.metrics import Faithfulness
from rag_evals.llm import OpenAIProvider

llm = OpenAIProvider(api_key="key", model="gpt-4")
metric = Faithfulness(llm_provider=llm)
result = await metric.evaluate(rag_input)
```

### After (v2.0)
```python
from rag_evals import RAGEvaluator

evaluator = RAGEvaluator(llm_provider="openai", model="gpt-4", api_key="key")
results = evaluator.evaluate(query="...", generated_text="...", context="...")
```

## šŸ› ļø Configuration

### Environment Variables

```bash
# OpenAI
export OPENAI_API_KEY="your-openai-key"

# Azure OpenAI  
export AZURE_OPENAI_API_KEY="your-azure-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
```

### Configuration File Support

```python
# Load from config file
import json

with open("rag_config.json") as f:
    config = json.load(f)

evaluator = RAGEvaluator(**config)
```

Example `rag_config.json`:
```json
{
    "llm_provider": "azure",
    "model": "gpt-4",
    "azure_config": {
        "azure_endpoint": "https://your-resource.openai.azure.com/",
        "azure_deployment": "gpt-4-deployment"
    },
    "metrics": ["faithfulness", "answer_relevance", "completeness"]
}
```
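
Because the config file above leaves out the API key, one common pattern (a sketch, assuming only the constructor arguments shown earlier) is to merge secrets in from the environment at load time:

```python
import json
import os

from rag_evals import RAGEvaluator

with open("rag_config.json") as f:
    config = json.load(f)

# Keep secrets out of version control: inject the key from the environment.
config.setdefault("azure_config", {})["api_key"] = os.environ["AZURE_OPENAI_API_KEY"]

evaluator = RAGEvaluator(**config)
```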

## šŸ“Š Example Results

```python
{
    "faithfulness": 0.95,        # 95% of claims supported by context
    "answer_relevance": 0.92,    # Highly relevant to the query
    "answer_correctness": 0.88,  # Factually accurate
    "completeness": 0.85,        # Covers most aspects of the query
    "coherence": 0.90,          # Well-structured and logical
    "helpfulness": 0.87,        # Practically useful
    "context_recall": 0.93,     # Context supports ground truth well
    "context_relevance": 0.89,  # Retrieved contexts are relevant
    "context_precision": 0.86,  # High-quality context ranking
    "llm_judge": 0.91,          # Overall high quality
    "rag_certainty": 0.88,      # High confidence in response
    "trust_score": 0.90         # Highly trustworthy
}
```

## šŸ” Understanding Scores

All metrics return scores between **0.0** (poor) and **1.0** (excellent):

- **0.9-1.0**: Excellent quality
- **0.7-0.9**: Good quality  
- **0.5-0.7**: Moderate quality
- **0.3-0.5**: Poor quality
- **0.0-0.3**: Very poor quality
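
The `interpret_score` helper in Step 6 uses coarser cut-offs; if you want labels that match the bands above exactly, a tiny (illustrative) mapper is enough:

```python
def quality_band(score: float) -> str:
    """Map a metric score to the quality bands listed above."""
    if score >= 0.9:
        return "Excellent quality"
    if score >= 0.7:
        return "Good quality"
    if score >= 0.5:
        return "Moderate quality"
    if score >= 0.3:
        return "Poor quality"
    return "Very poor quality"
```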

## šŸ¤ Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## šŸ“„ License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## šŸ™ Acknowledgments

- Built on [LangChain](https://github.com/langchain-ai/langchain) for robust LLM integration
- Inspired by research in RAG evaluation and LLM-as-a-judge methodologies
- Community feedback and contributions drive continuous improvement

## šŸ“š Citation

If you use RAG Evals in your research, please cite:

```bibtex
@software{rag_evals,
  title = {RAG Evals: Comprehensive RAG Evaluation with LangChain},
  author = {RAG Evals Team},
  year = {2025},
  url = {https://github.com/ragevals/rag_evals}
}
``` 

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "monitoring-rag",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Abhinandan Mukherjee <abhinandan.m.0401@gmail.com>",
    "keywords": "rag, evaluation, llm, ai, machine-learning, nlp, retrieval, generation, langchain, azure, openai",
    "author": null,
    "author_email": "Abhinandan Mukherjee <abhinandan.m.0401@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/07/20/5adea9018b5f5a1f1bbb23c440da13100965bbee22f94406d508e0dc700e/monitoring_rag-0.0.2.tar.gz",
    "platform": null,
    "description": "# \ud83c\udfaf Monitoring-RAG\r\n\r\n**Comprehensive RAG Evaluation with LangChain Integration**\r\n\r\n[![PyPI version](https://badge.fury.io/py/rag_evals.svg)](https://pypi.org/project/monitoring-rag/0.0.1/)\r\n[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\r\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\r\n\r\nA streamlined Python library for evaluating Retrieval-Augmented Generation (RAG) systems using LangChain's powerful framework. Evaluate your RAG pipeline with a single line of code.\r\n\r\n## \ud83d\ude80 Key Features\r\n\r\n- **\ud83c\udfaf Unified Interface**: Single `RAGEvaluator` class for all metrics\r\n- **\u26a1 LangChain Powered**: Built on industry-standard LangChain framework\r\n- **\ud83d\udcca Comprehensive Metrics**: 12+ evaluation metrics covering generation, retrieval, and composite evaluation\r\n- **\ud83d\udd04 Async Support**: Full asynchronous evaluation for high performance\r\n- **\ud83d\udee0\ufe0f Flexible Configuration**: Support for OpenAI and Azure OpenAI\r\n- **\ud83d\udce6 Simple API**: Evaluate with just query, context, and generated text\r\n\r\n## \ud83d\udccb Quick Reference\r\n\r\n```python\r\n# Import the main evaluator\r\nfrom rag_evals import RAGEvaluator\r\n\r\n# Initialize\r\nevaluator = RAGEvaluator(llm_provider=\"openai\", model=\"gpt-4\")\r\n\r\n# Evaluate all metrics\r\nresults = evaluator.evaluate(\r\n    query=\"Your question\",\r\n    answer=\"RAG system's answer\",\r\n    retrieved_contexts=[\"Retrieved context\"]\r\n)\r\n\r\n# Evaluate specific metrics\r\nresults = evaluator.evaluate(\r\n    query=\"...\",\r\n    answer=\"...\",\r\n    retrieved_contexts=[\"...\"],\r\n    metrics=[\"faithfulness\", \"answer_relevance\"]\r\n)\r\n\r\n# Available metrics\r\ngeneration_metrics = [\"faithfulness\", \"answer_relevance\", \"answer_correctness\", \r\n                     \"completeness\", \"coherence\", \"helpfulness\"]\r\nretrieval_metrics = [\"context_recall\", \"context_relevance\", \"context_precision\"]\r\ncomposite_metrics = [\"llm_judge\", \"rag_certainty\", \"trust_score\"]\r\n```\r\n\r\n## \ud83d\ude80 Installation\r\n\r\n```bash\r\npip install monitoring-rag\r\n```\r\n\r\n> **Note**: Install as `monitoring-rag` but import as `rag_evals`\r\n\r\n## \ud83d\udd27 Quick Start\r\n\r\n### Basic Usage\r\n\r\n```python\r\nfrom rag_evals import RAGEvaluator\r\n\r\n# Initialize evaluator\r\nevaluator = RAGEvaluator(\r\n    llm_provider=\"openai\",  # or \"azure\"\r\n    model=\"gpt-4\",\r\n    api_key=\"your-api-key\"  # or set OPENAI_API_KEY environment variable\r\n)\r\n\r\n# Evaluate a RAG response\r\nresults = evaluator.evaluate(\r\n    query=\"What is machine learning?\",\r\n    answer=\"Machine learning is a subset of artificial intelligence that uses algorithms to learn patterns from data.\",\r\n    retrieved_contexts=[\"Machine learning is a method of data analysis that automates analytical model building using algorithms that iteratively learn from data.\"]\r\n)\r\n\r\nprint(results)\r\n# Output: {\r\n#     \"faithfulness\": 0.95,\r\n#     \"answer_relevance\": 0.92,\r\n#     \"context_relevance\": 0.88,\r\n#     \"completeness\": 0.85,\r\n#     \"coherence\": 0.90,\r\n#     ...\r\n# }\r\n```\r\n\r\n### Azure OpenAI Configuration\r\n\r\n```python\r\nfrom rag_evals import RAGEvaluator\r\n\r\n# Using Azure OpenAI\r\nevaluator = RAGEvaluator(\r\n    llm_provider=\"azure\",\r\n    model=\"gpt-4\",\r\n    
azure_config={\r\n        \"api_key\": \"your-azure-key\",\r\n        \"azure_endpoint\": \"https://your-resource.openai.azure.com/\",\r\n        \"azure_deployment\": \"gpt-4-deployment\",\r\n        \"api_version\": \"2024-02-01\"\r\n    }\r\n)\r\n\r\n# Or using environment variables\r\n# AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT\r\nevaluator = RAGEvaluator(llm_provider=\"azure\", model=\"gpt-4\")\r\n```\r\n\r\n### Batch Evaluation\r\n\r\n```python\r\n# Evaluate multiple responses at once\r\ninputs = [\r\n    {\r\n        \"query\": \"What is Python?\",\r\n        \"generated_text\": \"Python is a programming language.\",\r\n        \"context\": \"Python is a high-level programming language known for its simplicity.\"\r\n    },\r\n    {\r\n        \"query\": \"What is machine learning?\",\r\n        \"generated_text\": \"ML uses algorithms to find patterns.\",\r\n        \"context\": \"Machine learning involves training algorithms on data.\"\r\n    }\r\n]\r\n\r\nresults = evaluator.evaluate_batch(inputs)\r\n# Returns list of result dictionaries\r\n```\r\n\r\n### Selective Metric Evaluation\r\n\r\n```python\r\n# Evaluate only specific metrics\r\nresults = evaluator.evaluate(\r\n    query=\"What is AI?\",\r\n    generated_text=\"AI is artificial intelligence.\",\r\n    context=\"Artificial intelligence refers to machine intelligence.\",\r\n    metrics=[\"faithfulness\", \"answer_relevance\", \"coherence\"]\r\n)\r\n\r\n# Configure evaluator with specific metrics\r\nevaluator = RAGEvaluator(\r\n    llm_provider=\"openai\",\r\n    model=\"gpt-4\",\r\n    metrics=[\"faithfulness\", \"completeness\", \"trust_score\"]\r\n)\r\n```\r\n\r\n## \ud83d\udcca Available Metrics\r\n\r\n### Generation Metrics\r\nEvaluate the quality of generated responses:\r\n\r\n- **Faithfulness**: How well the answer is grounded in the provided contexts\r\n- **Answer Relevance**: How relevant the answer is to the user's query  \r\n- **Answer Correctness**: Factual accuracy of the generated answer\r\n- **Completeness**: How thoroughly the answer addresses the query\r\n- **Coherence**: Logical flow and readability of the answer\r\n- **Helpfulness**: Practical value and actionability of the answer\r\n\r\n### Retrieval Metrics\r\nEvaluate the quality of retrieved contexts:\r\n\r\n- **Context Recall**: How well contexts support generating the ground truth\r\n- **Context Relevance**: How relevant retrieved contexts are to the query\r\n- **Context Precision**: Quality and ranking of retrieved contexts\r\n\r\n### Composite Metrics\r\nHolistic evaluation combining multiple aspects:\r\n\r\n- **LLM Judge**: Comprehensive multi-dimensional evaluation\r\n- **RAG Certainty**: Confidence and reliability assessment\r\n- **Trust Score**: Overall trustworthiness evaluation\r\n\r\n## \ud83d\udcd6 Step-by-Step Tutorial\r\n\r\n### Step 1: Installation and Setup\r\n\r\n```bash\r\n# Install the library\r\npip install rag_evals\r\n\r\n# Create a .env file for your API keys (optional)\r\necho \"OPENAI_API_KEY=your-api-key-here\" > .env\r\n```\r\n\r\n### Step 2: Basic Single Evaluation\r\n\r\n```python\r\nfrom rag_evals import RAGEvaluator\r\n\r\n# Initialize the evaluator\r\nevaluator = RAGEvaluator(\r\n    llm_provider=\"openai\",\r\n    model=\"gpt-4\"\r\n    # API key will be read from OPENAI_API_KEY environment variable\r\n)\r\n\r\n# Your RAG system output\r\nquery = \"What are the benefits of renewable energy?\"\r\ncontext = \"\"\"\r\nRenewable energy sources like solar and wind power offer numerous benefits:\r\n1. 
They reduce greenhouse gas emissions\r\n2. They provide energy independence\r\n3. They create jobs in the green economy\r\n4. Operating costs are lower than fossil fuels\r\n\"\"\"\r\ngenerated_text = \"\"\"\r\nRenewable energy provides several key benefits including reducing carbon emissions,\r\nincreasing energy independence, creating green jobs, and offering lower operating costs\r\ncompared to traditional fossil fuels.\r\n\"\"\"\r\n\r\n# Evaluate all metrics\r\nresults = evaluator.evaluate(\r\n    query=query,\r\n    generated_text=generated_text,\r\n    context=context\r\n)\r\n\r\n# Display results\r\nfor metric, score in results.items():\r\n    print(f\"{metric}: {score:.2f}\")\r\n```\r\n\r\n### Step 3: Evaluating Specific Metrics\r\n\r\n```python\r\n# Evaluate only generation metrics\r\ngeneration_results = evaluator.evaluate(\r\n    query=query,\r\n    generated_text=generated_text,\r\n    context=context,\r\n    metrics=[\"faithfulness\", \"answer_relevance\", \"completeness\"]\r\n)\r\n\r\nprint(\"Generation Metrics:\")\r\nfor metric, score in generation_results.items():\r\n    print(f\"  {metric}: {score:.2f}\")\r\n\r\n# Evaluate only retrieval metrics (requires ground truth)\r\nretrieval_results = evaluator.evaluate(\r\n    query=query,\r\n    generated_text=generated_text,\r\n    context=context,\r\n    ground_truth=\"Renewable energy reduces emissions, provides energy independence, creates jobs, and has lower operating costs.\",\r\n    metrics=[\"context_recall\", \"context_precision\"]\r\n)\r\n\r\nprint(\"\\nRetrieval Metrics:\")\r\nfor metric, score in retrieval_results.items():\r\n    print(f\"  {metric}: {score:.2f}\")\r\n```\r\n\r\n### Step 4: Batch Evaluation for Multiple Samples\r\n\r\n```python\r\n# Prepare multiple samples\r\nsamples = [\r\n    {\r\n        \"query\": \"What is machine learning?\",\r\n        \"generated_text\": \"Machine learning is a subset of AI that enables systems to learn from data.\",\r\n        \"context\": \"Machine learning is a branch of artificial intelligence that focuses on building systems that learn from data.\"\r\n    },\r\n    {\r\n        \"query\": \"Explain quantum computing\",\r\n        \"generated_text\": \"Quantum computing uses quantum mechanics principles to process information.\",\r\n        \"context\": \"Quantum computing is a type of computation that harnesses quantum mechanical phenomena like superposition and entanglement.\"\r\n    },\r\n    {\r\n        \"query\": \"What are neural networks?\",\r\n        \"generated_text\": \"Neural networks are computing systems inspired by biological neural networks.\",\r\n        \"context\": \"Artificial neural networks are computing systems vaguely inspired by the biological neural networks in animal brains.\"\r\n    }\r\n]\r\n\r\n# Evaluate batch\r\nbatch_results = evaluator.evaluate_batch(samples)\r\n\r\n# Analyze results\r\nfor i, result in enumerate(batch_results):\r\n    print(f\"\\nSample {i+1}:\")\r\n    avg_score = sum(result.values()) / len(result)\r\n    print(f\"  Average Score: {avg_score:.2f}\")\r\n    print(f\"  Faithfulness: {result.get('faithfulness', 0):.2f}\")\r\n    print(f\"  Relevance: {result.get('answer_relevance', 0):.2f}\")\r\n```\r\n\r\n### Step 5: Working with Azure OpenAI\r\n\r\n```python\r\n# Method 1: Using environment variables\r\nimport os\r\nos.environ[\"AZURE_OPENAI_API_KEY\"] = \"your-azure-key\"\r\nos.environ[\"AZURE_OPENAI_ENDPOINT\"] = \"https://your-resource.openai.azure.com/\"\r\nos.environ[\"AZURE_OPENAI_DEPLOYMENT\"] = 
\"gpt-4-deployment\"\r\n\r\nevaluator = RAGEvaluator(\r\n    llm_provider=\"azure\",\r\n    model=\"gpt-4\"\r\n)\r\n\r\n# Method 2: Direct configuration\r\nevaluator = RAGEvaluator(\r\n    llm_provider=\"azure\",\r\n    model=\"gpt-4\",\r\n    azure_config={\r\n        \"api_key\": \"your-azure-key\",\r\n        \"azure_endpoint\": \"https://your-resource.openai.azure.com/\",\r\n        \"azure_deployment\": \"gpt-4-deployment\",\r\n        \"api_version\": \"2024-02-01\"\r\n    }\r\n)\r\n```\r\n\r\n### Step 6: Interpreting Results\r\n\r\n```python\r\ndef interpret_score(score: float, metric: str) -> str:\r\n    \"\"\"Interpret metric scores with recommendations.\"\"\"\r\n    if score >= 0.9:\r\n        return \"Excellent - No improvement needed\"\r\n    elif score >= 0.7:\r\n        return \"Good - Minor improvements possible\"\r\n    elif score >= 0.5:\r\n        return \"Moderate - Consider improvements\"\r\n    else:\r\n        return \"Poor - Significant improvements needed\"\r\n\r\n# Evaluate and interpret\r\nresults = evaluator.evaluate(query, generated_text, context)\r\n\r\nprint(\"RAG System Evaluation Report\")\r\nprint(\"=\" * 40)\r\nfor metric, score in results.items():\r\n    interpretation = interpret_score(score, metric)\r\n    print(f\"{metric:20} {score:.2f} - {interpretation}\")\r\n\r\n# Identify weakest areas\r\nweak_metrics = {m: s for m, s in results.items() if s < 0.7}\r\nif weak_metrics:\r\n    print(\"\\n\u26a0\ufe0f Areas needing improvement:\")\r\n    for metric, score in weak_metrics.items():\r\n        print(f\"  - {metric}: {score:.2f}\")\r\n```\r\n\r\n### Step 7: Production Usage Pattern\r\n\r\n```python\r\nimport logging\r\nfrom typing import Dict, Any\r\n\r\nclass RAGSystem:\r\n    def __init__(self):\r\n        self.evaluator = RAGEvaluator(\r\n            llm_provider=\"openai\",\r\n            model=\"gpt-4\",\r\n            metrics=[\"faithfulness\", \"answer_relevance\", \"completeness\"]\r\n        )\r\n        self.threshold = 0.7  # Minimum acceptable score\r\n        \r\n    def generate_and_evaluate(self, query: str) -> Dict[str, Any]:\r\n        # Your RAG pipeline here\r\n        context = self.retrieve_context(query)\r\n        generated_text = self.generate_answer(query, context)\r\n        \r\n        # Evaluate the response\r\n        scores = self.evaluator.evaluate(\r\n            query=query,\r\n            generated_text=generated_text,\r\n            context=context\r\n        )\r\n        \r\n        # Check if response meets quality threshold\r\n        avg_score = sum(scores.values()) / len(scores)\r\n        \r\n        return {\r\n            \"answer\": generated_text,\r\n            \"context\": context,\r\n            \"evaluation\": scores,\r\n            \"average_score\": avg_score,\r\n            \"meets_threshold\": avg_score >= self.threshold\r\n        }\r\n    \r\n    def retrieve_context(self, query: str) -> str:\r\n        # Your retrieval logic here\r\n        return \"Retrieved context...\"\r\n    \r\n    def generate_answer(self, query: str, context: str) -> str:\r\n        # Your generation logic here\r\n        return \"Generated answer...\"\r\n\r\n# Usage\r\nrag_system = RAGSystem()\r\nresult = rag_system.generate_and_evaluate(\"What is climate change?\")\r\n\r\nif result[\"meets_threshold\"]:\r\n    print(f\"\u2705 Response quality approved: {result['average_score']:.2f}\")\r\nelse:\r\n    print(f\"\u274c Response below threshold: {result['average_score']:.2f}\")\r\n    print(\"Consider regenerating or improving 
retrieval\")\r\n```\r\n\r\n## \ud83d\udd27 Advanced Usage\r\n\r\n### Custom LLM Configuration\r\n\r\n```python\r\n# Configure LLM parameters\r\nevaluator = RAGEvaluator(\r\n    llm_provider=\"openai\",\r\n    model=\"gpt-4\",\r\n    temperature=0.1,\r\n    max_tokens=1000,\r\n    top_p=0.9\r\n)\r\n```\r\n\r\n### Dynamic Metric Management\r\n\r\n```python\r\n# List available metrics\r\nprint(evaluator.list_metrics())\r\n\r\n# Add/remove metrics dynamically\r\nevaluator.add_metric(\"trust_score\")\r\nevaluator.remove_metric(\"coherence\")\r\n\r\n# Check configured metrics\r\nprint(evaluator.get_configured_metrics())\r\n```\r\n\r\n### Async Evaluation\r\n\r\n```python\r\nimport asyncio\r\n\r\nasync def evaluate_async():\r\n    results = await evaluator.aevaluate(\r\n        query=\"What is RAG?\",\r\n        generated_text=\"RAG combines retrieval and generation...\",\r\n        context=\"Retrieval-Augmented Generation enhances LLMs...\"\r\n    )\r\n    return results\r\n\r\n# Run async evaluation\r\nresults = asyncio.run(evaluate_async())\r\n```\r\n\r\n## \ud83d\udca1 Common Use Cases\r\n\r\n### 1. RAG System Development\r\n```python\r\n# During development, evaluate different retrieval strategies\r\nstrategies = [\"dense\", \"sparse\", \"hybrid\"]\r\nresults = {}\r\n\r\nfor strategy in strategies:\r\n    context = retrieve_with_strategy(query, strategy)\r\n    answer = generate_answer(query, context)\r\n    \r\n    scores = evaluator.evaluate(query, answer, context)\r\n    results[strategy] = scores\r\n    \r\n# Compare strategies\r\nbest_strategy = max(results.items(), key=lambda x: sum(x[1].values()))\r\nprint(f\"Best strategy: {best_strategy[0]}\")\r\n```\r\n\r\n### 2. A/B Testing Different Models\r\n```python\r\n# Compare different LLM models\r\nmodels = [\"gpt-3.5-turbo\", \"gpt-4\", \"gpt-4-turbo\"]\r\nmodel_performance = {}\r\n\r\nfor model in models:\r\n    eval_temp = RAGEvaluator(llm_provider=\"openai\", model=model)\r\n    scores = eval_temp.evaluate(query, generated_text, context)\r\n    model_performance[model] = sum(scores.values()) / len(scores)\r\n\r\n# Find best performing model\r\nbest_model = max(model_performance.items(), key=lambda x: x[1])\r\nprint(f\"Best model: {best_model[0]} (avg score: {best_model[1]:.2f})\")\r\n```\r\n\r\n### 3. Quality Assurance Pipeline\r\n```python\r\ndef qa_pipeline(query: str, answer: str, context: str) -> bool:\r\n    \"\"\"Quality assurance check before serving responses.\"\"\"\r\n    # Define minimum thresholds\r\n    thresholds = {\r\n        \"faithfulness\": 0.8,\r\n        \"answer_relevance\": 0.7,\r\n        \"coherence\": 0.75\r\n    }\r\n    \r\n    # Evaluate critical metrics\r\n    scores = evaluator.evaluate(\r\n        query, answer, context, \r\n        metrics=list(thresholds.keys())\r\n    )\r\n    \r\n    # Check all thresholds\r\n    passed = all(scores[m] >= thresholds[m] for m in thresholds)\r\n    \r\n    if not passed:\r\n        failed_metrics = [m for m in thresholds if scores[m] < thresholds[m]]\r\n        print(f\"QA Failed: {failed_metrics}\")\r\n    \r\n    return passed\r\n```\r\n\r\n### 4. 
Continuous Monitoring\r\n```python\r\nimport json\r\nfrom datetime import datetime\r\n\r\ndef log_evaluation(query: str, answer: str, context: str, log_file: str):\r\n    \"\"\"Log evaluations for monitoring RAG system performance over time.\"\"\"\r\n    scores = evaluator.evaluate(query, answer, context)\r\n    \r\n    log_entry = {\r\n        \"timestamp\": datetime.now().isoformat(),\r\n        \"query\": query[:100],  # Truncate for logging\r\n        \"scores\": scores,\r\n        \"average_score\": sum(scores.values()) / len(scores)\r\n    }\r\n    \r\n    with open(log_file, \"a\") as f:\r\n        f.write(json.dumps(log_entry) + \"\\n\")\r\n    \r\n    # Alert if performance drops\r\n    if log_entry[\"average_score\"] < 0.6:\r\n        print(f\"\u26a0\ufe0f Low performance alert: {log_entry['average_score']:.2f}\")\r\n```\r\n\r\n## \ud83d\udd27 Troubleshooting\r\n\r\n### Common Issues and Solutions\r\n\r\n#### 1. API Key Errors\r\n```python\r\n# Issue: \"API key not found\" error\r\n# Solution 1: Set environment variable\r\nimport os\r\nos.environ[\"OPENAI_API_KEY\"] = \"your-key\"\r\n\r\n# Solution 2: Pass directly\r\nevaluator = RAGEvaluator(\r\n    llm_provider=\"openai\",\r\n    model=\"gpt-4\",\r\n    api_key=\"your-key\"\r\n)\r\n\r\n# Solution 3: Use .env file\r\n# Create .env file with: OPENAI_API_KEY=your-key\r\nfrom dotenv import load_dotenv\r\nload_dotenv()\r\n```\r\n\r\n#### 2. Azure Configuration Issues\r\n```python\r\n# Issue: Azure endpoint errors\r\n# Solution: Ensure all required fields are provided\r\nevaluator = RAGEvaluator(\r\n    llm_provider=\"azure\",\r\n    model=\"gpt-4\",\r\n    azure_config={\r\n        \"api_key\": \"required\",\r\n        \"azure_endpoint\": \"required - must end with /\",\r\n        \"azure_deployment\": \"required - your deployment name\",\r\n        \"api_version\": \"2024-02-01\"  # Use latest version\r\n    }\r\n)\r\n```\r\n\r\n#### 3. Memory Issues with Batch Processing\r\n```python\r\n# Issue: Out of memory with large batches\r\n# Solution: Process in smaller chunks\r\ndef evaluate_large_dataset(data, chunk_size=10):\r\n    all_results = []\r\n    \r\n    for i in range(0, len(data), chunk_size):\r\n        chunk = data[i:i + chunk_size]\r\n        results = evaluator.evaluate_batch(chunk)\r\n        all_results.extend(results)\r\n        \r\n        # Optional: Add delay to avoid rate limits\r\n        time.sleep(1)\r\n    \r\n    return all_results\r\n```\r\n\r\n#### 4. Handling Rate Limits\r\n```python\r\nimport time\r\nfrom typing import Dict, Any\r\n\r\ndef evaluate_with_retry(\r\n    query: str, \r\n    generated_text: str, \r\n    context: str,\r\n    max_retries: int = 3\r\n) -> Dict[str, float]:\r\n    \"\"\"Evaluate with automatic retry on rate limit errors.\"\"\"\r\n    for attempt in range(max_retries):\r\n        try:\r\n            return evaluator.evaluate(query, generated_text, context)\r\n        except Exception as e:\r\n            if \"rate limit\" in str(e).lower() and attempt < max_retries - 1:\r\n                wait_time = 2 ** attempt  # Exponential backoff\r\n                print(f\"Rate limited. Waiting {wait_time} seconds...\")\r\n                time.sleep(wait_time)\r\n            else:\r\n                raise\r\n```\r\n\r\n#### 5. 
Debugging Low Scores\r\n```python\r\ndef debug_low_scores(query: str, answer: str, context: str):\r\n    \"\"\"Analyze why scores are low.\"\"\"\r\n    results = evaluator.evaluate(query, answer, context)\r\n    \r\n    print(\"Debugging Low Scores:\")\r\n    print(\"-\" * 40)\r\n    \r\n    # Check each component\r\n    if results.get(\"faithfulness\", 0) < 0.5:\r\n        print(\"\u274c Faithfulness Issue: Answer may contain hallucinations\")\r\n        print(\"   Check if all claims are supported by context\")\r\n    \r\n    if results.get(\"answer_relevance\", 0) < 0.5:\r\n        print(\"\u274c Relevance Issue: Answer doesn't address the query\")\r\n        print(\"   Ensure answer directly responds to the question\")\r\n    \r\n    if results.get(\"context_relevance\", 0) < 0.5:\r\n        print(\"\u274c Context Issue: Retrieved context is not relevant\")\r\n        print(\"   Improve retrieval strategy or query processing\")\r\n    \r\n    if results.get(\"completeness\", 0) < 0.5:\r\n        print(\"\u274c Completeness Issue: Answer is incomplete\")\r\n        print(\"   Add more relevant information to the answer\")\r\n```\r\n\r\n## \ud83c\udfd7\ufe0f Architecture\r\n\r\n```\r\nRAGEvaluator\r\n\u251c\u2500\u2500 Generation Metrics\r\n\u2502   \u251c\u2500\u2500 Faithfulness\r\n\u2502   \u251c\u2500\u2500 Answer Relevance  \r\n\u2502   \u251c\u2500\u2500 Answer Correctness\r\n\u2502   \u251c\u2500\u2500 Completeness\r\n\u2502   \u251c\u2500\u2500 Coherence\r\n\u2502   \u2514\u2500\u2500 Helpfulness\r\n\u251c\u2500\u2500 Retrieval Metrics\r\n\u2502   \u251c\u2500\u2500 Context Recall\r\n\u2502   \u251c\u2500\u2500 Context Relevance\r\n\u2502   \u2514\u2500\u2500 Context Precision\r\n\u2514\u2500\u2500 Composite Metrics\r\n    \u251c\u2500\u2500 LLM Judge\r\n    \u251c\u2500\u2500 RAG Certainty\r\n    \u2514\u2500\u2500 Trust Score\r\n```\r\n\r\n## \ud83d\udd04 Migration from v1.x\r\n\r\nRAG Evals 2.0 introduces breaking changes for a cleaner, more powerful API:\r\n\r\n### Before (v1.x)\r\n```python\r\nfrom rag_evals.metrics import Faithfulness\r\nfrom rag_evals.llm import OpenAIProvider\r\n\r\nllm = OpenAIProvider(api_key=\"key\", model=\"gpt-4\")\r\nmetric = Faithfulness(llm_provider=llm)\r\nresult = await metric.evaluate(rag_input)\r\n```\r\n\r\n### After (v2.0)\r\n```python\r\nfrom rag_evals import RAGEvaluator\r\n\r\nevaluator = RAGEvaluator(llm_provider=\"openai\", model=\"gpt-4\", api_key=\"key\")\r\nresults = evaluator.evaluate(query=\"...\", generated_text=\"...\", context=\"...\")\r\n```\r\n\r\n## \ud83d\udee0\ufe0f Configuration\r\n\r\n### Environment Variables\r\n\r\n```bash\r\n# OpenAI\r\nexport OPENAI_API_KEY=\"your-openai-key\"\r\n\r\n# Azure OpenAI  \r\nexport AZURE_OPENAI_API_KEY=\"your-azure-key\"\r\nexport AZURE_OPENAI_ENDPOINT=\"https://your-resource.openai.azure.com/\"\r\n```\r\n\r\n### Configuration File Support\r\n\r\n```python\r\n# Load from config file\r\nimport json\r\n\r\nwith open(\"rag_config.json\") as f:\r\n    config = json.load(f)\r\n\r\nevaluator = RAGEvaluator(**config)\r\n```\r\n\r\nExample `rag_config.json`:\r\n```json\r\n{\r\n    \"llm_provider\": \"azure\",\r\n    \"model\": \"gpt-4\",\r\n    \"azure_config\": {\r\n        \"azure_endpoint\": \"https://your-resource.openai.azure.com/\",\r\n        \"azure_deployment\": \"gpt-4-deployment\"\r\n    },\r\n    \"metrics\": [\"faithfulness\", \"answer_relevance\", \"completeness\"]\r\n}\r\n```\r\n\r\n## \ud83d\udcca Example Results\r\n\r\n```python\r\n{\r\n    \"faithfulness\": 0.95,        # 95% of 
claims supported by context\r\n    \"answer_relevance\": 0.92,    # Highly relevant to the query\r\n    \"answer_correctness\": 0.88,  # Factually accurate\r\n    \"completeness\": 0.85,        # Covers most aspects of the query\r\n    \"coherence\": 0.90,          # Well-structured and logical\r\n    \"helpfulness\": 0.87,        # Practically useful\r\n    \"context_recall\": 0.93,     # Context supports ground truth well\r\n    \"context_relevance\": 0.89,  # Retrieved contexts are relevant\r\n    \"context_precision\": 0.86,  # High-quality context ranking\r\n    \"llm_judge\": 0.91,          # Overall high quality\r\n    \"rag_certainty\": 0.88,      # High confidence in response\r\n    \"trust_score\": 0.90         # Highly trustworthy\r\n}\r\n```\r\n\r\n## \ud83d\udd0d Understanding Scores\r\n\r\nAll metrics return scores between **0.0** (poor) and **1.0** (excellent):\r\n\r\n- **0.9-1.0**: Excellent quality\r\n- **0.7-0.9**: Good quality  \r\n- **0.5-0.7**: Moderate quality\r\n- **0.3-0.5**: Poor quality\r\n- **0.0-0.3**: Very poor quality\r\n\r\n## \ud83e\udd1d Contributing\r\n\r\nWe welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.\r\n\r\n1. Fork the repository\r\n2. Create your feature branch (`git checkout -b feature/amazing-feature`)\r\n3. Commit your changes (`git commit -m 'Add amazing feature'`)\r\n4. Push to the branch (`git push origin feature/amazing-feature`)\r\n5. Open a Pull Request\r\n\r\n## \ud83d\udcc4 License\r\n\r\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\r\n\r\n## \ud83d\ude4f Acknowledgments\r\n\r\n- Built on [LangChain](https://github.com/langchain-ai/langchain) for robust LLM integration\r\n- Inspired by research in RAG evaluation and LLM-as-a-judge methodologies\r\n- Community feedback and contributions drive continuous improvement\r\n\r\n## \ud83d\udcda Citation\r\n\r\nIf you use RAG Evals in your research, please cite:\r\n\r\n```bibtex\r\n@software{rag_evals,\r\n  title = {RAG Evals: Comprehensive RAG Evaluation with LangChain},\r\n  author = {RAG Evals Team},\r\n  year = {2025},\r\n  url = {https://github.com/ragevals/rag_evals}\r\n}\r\n``` \r\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A comprehensive, framework-agnostic library for evaluating Retrieval-Augmented Generation (RAG) pipelines.",
    "version": "0.0.2",
    "project_urls": {
        "Documentation": "https://rag-evals.readthedocs.io",
        "Homepage": "https://github.com/ragevals/rag_evals",
        "Issues": "https://github.com/ragevals/rag_evals/issues",
        "Repository": "https://github.com/ragevals/rag_evals"
    },
    "split_keywords": [
        "rag",
        " evaluation",
        " llm",
        " ai",
        " machine-learning",
        " nlp",
        " retrieval",
        " generation",
        " langchain",
        " azure",
        " openai"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a9e9d76e484136b5c2715ec1b4fc50bab4d5ebdf76e3cb1f8a5332ab90e6cd45",
                "md5": "7eb5e9c5572592ca327df08232ae293b",
                "sha256": "97cf7b5f9461fa7f872c142afbc591f74a474cc2a84fdeb157033c41d4e0d0db"
            },
            "downloads": -1,
            "filename": "monitoring_rag-0.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7eb5e9c5572592ca327df08232ae293b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 116169,
            "upload_time": "2025-07-24T11:25:52",
            "upload_time_iso_8601": "2025-07-24T11:25:52.114295Z",
            "url": "https://files.pythonhosted.org/packages/a9/e9/d76e484136b5c2715ec1b4fc50bab4d5ebdf76e3cb1f8a5332ab90e6cd45/monitoring_rag-0.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "07205adea9018b5f5a1f1bbb23c440da13100965bbee22f94406d508e0dc700e",
                "md5": "cfb3829f3a31339f6c448e75c7b2d40d",
                "sha256": "5110a72e4f4caa1bf4e438206905092df31ceafbcbdac727e9c6284a08eeace5"
            },
            "downloads": -1,
            "filename": "monitoring_rag-0.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "cfb3829f3a31339f6c448e75c7b2d40d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 100199,
            "upload_time": "2025-07-24T11:25:53",
            "upload_time_iso_8601": "2025-07-24T11:25:53.803140Z",
            "url": "https://files.pythonhosted.org/packages/07/20/5adea9018b5f5a1f1bbb23c440da13100965bbee22f94406d508e0dc700e/monitoring_rag-0.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-24 11:25:53",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ragevals",
    "github_project": "rag_evals",
    "github_not_found": true,
    "lcname": "monitoring-rag"
}
        