rag-evaluation

Name: rag-evaluation
Version: 0.2.2
Summary: A robust Python package for evaluating Retrieval-Augmented Generation (RAG) systems.
Upload time: 2025-07-17 08:30:01
Requires Python: >=3.10
Keywords: rag, nlp, evaluation
Requirements: numpy, pandas, python-dotenv, openai, google-genai, httpx, pytest
# RAG Evaluation

**RAG Evaluation** is a Python package for evaluating Retrieval-Augmented Generation (RAG) systems. It provides a systematic way to score and analyze the quality of responses generated by RAG systems, and it is particularly suited to projects that use Large Language Models (LLMs) such as GPT, Gemini, and Llama.

It integrates easily with the OpenAI API via the `openai` package and automatically handles environment variable-based API key loading through `python-dotenv`.

## Features

- **Multi-Metric Evaluation:** Evaluate RAG system output using the following metrics:
  - **Query Relevance:** Measures how well the RAG system output addresses the user’s query.
  - **Factual Accuracy:** Ensures the response is factually correct with respect to the source document.
  - **Coverage:** Checks that all key points from the source relevant to the query are included.
  - **Coherence:** Assesses the logical flow and organization of ideas in the RAG system output.
  - **Fluency:** Assesses readability, grammar, and naturalness of the language.
- **Standardized Prompting:** Uses a well-defined prompt template to consistently assess RAG system outputs.
- **Customizable Weighting:** Allows users to adjust the relative importance of each metric to tailor the overall accuracy to their specific priorities.
- **Easy Integration:** Provides a high-level function to integrate evaluation into your RAG system pipelines.

## Installation

```bash
pip install rag_evaluation
```

## API-Key Management
**1. Environment / .env file – zero code changes**

```bash
# .env  or exported in the shell
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AIzaSy.
```
*The package loads these automatically via python-dotenv.*
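
The package reads these variables for you, so no extra code is required. If you want to confirm the variables are actually visible in your session, a minimal check with `python-dotenv` might look like this (a sketch only; the variable names match the ones above):

```python
import os
from dotenv import load_dotenv

# Load variables from a local .env file into the process environment
load_dotenv()

# Confirm the keys are visible (prints True/False, never the key itself)
print("OPENAI_API_KEY set:", os.getenv("OPENAI_API_KEY") is not None)
print("GEMINI_API_KEY set:", os.getenv("GEMINI_API_KEY") is not None)
```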

**2. One-time, in-memory key (per Python session)**

```python
import rag_evaluation as rag_eval

rag_eval.set_api_key("openai", "sk-live...")
rag_eval.set_api_key("gemini", "AIzaSy...")

# Any subsequent call picks these keys up automatically;
# keys set this way take precedence over environment variables.

```

**3. Explicit lookup / fallback**

```python
from rag_eval.config import get_api_key

key = get_api_key("openai", default_key="sk-fallback...")

# Priority inside get_api_key
# cache (set_api_key) ➜ env/.env ➜ default_key ➜ ValueError.

```
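
Because the lookup chain ends in a `ValueError` when no key can be found anywhere, you can wrap the call to fail fast with a clearer message. The sketch below assumes `default_key` is optional, as the priority chain above suggests:

```python
from rag_eval.config import get_api_key

try:
    # No default_key here, so this relies on set_api_key() or the environment
    key = get_api_key("openai")
except ValueError:
    raise SystemExit(
        "No OpenAI key found: call set_api_key('openai', ...) or export OPENAI_API_KEY"
    )
```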


## Usage
### Open-Source Local Models (Ollama; no external API key required)
**Currently, the package supports Llama, Mistral, and Qwen.**

**Step 1**: Download Ollama. See [here](https://ollama.com/download) for installation instructions. <br>

**Step 2**: Browse the list of [models](https://ollama.com/search) and download one with `ollama pull <model_name>`.

```python

# Check the list of models available on your local machine
from openai import OpenAI

client = OpenAI(
    api_key='ollama',
    base_url="http://localhost:11434/v1"
)

# List all available models
models = client.models.list()
print(models.to_json())

```
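
Because Ollama exposes an OpenAI-compatible endpoint, you can also send a quick test prompt to a pulled model before evaluating with it. This is a minimal sanity-check sketch; `llama3.2:1b` is simply the example model used below:

```python
from openai import OpenAI

# Same local Ollama endpoint as above
client = OpenAI(api_key='ollama', base_url="http://localhost:11434/v1")

# Confirm the pulled model responds before running an evaluation
reply = client.chat.completions.create(
    model="llama3.2:1b",  # any model you have pulled locally
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(reply.choices[0].message.content)
```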
#### Usage with Open-Source Models (Ollama models)


```python
from rag_eval.evaluator import evaluate_response

# Define the inputs
query = "Which large language model is currently the largest and most capable?"

response_text = """The largest and most capable LLMs are the generative pretrained transformers (GPTs). These models are 
                designed to handle complex language tasks, and their vast number of parameters gives them the ability to 
                understand and generate human-like text."""
                 
document = """A large language model (LLM) is a type of machine learning model designed for natural language processing 
            tasks such as language generation. LLMs are language models with many parameters, and are trained with 
            self-supervised learning on a vast amount of text. The largest and most capable LLMs are 
            generative pretrained transformers (GPTs). Modern models can be fine-tuned for specific tasks or guided 
            by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies 
            inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained in."""

# Llama usage (ollama pull llama3.2:1b to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='llama3.2:1b',
)
print(report)

# Mistral usage (ollama pull mistral to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='mistral:latest',
    metric_weights=[0.1, 0., 0.9, 0., 0.] # optional metric_weights [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
)
print(report)

# Qwen usage (ollama pull qwen to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='qwen:latest',
)
print(report)

```

### For API-based Models (GPT and Gemini)
```python

# Define the inputs (same as above)

# OpenAI usage 
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="openai",
    model_name='gpt-4.1',
)
print(report)

# Gemini usage 
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="gemini",
    model_name='gemini-2.5-flash',
    metric_weights=[0.1, 0.4, 0.5, 0., 0.] # optional metric_weights [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
)
print(report)

```

## Customizing Metric Weights

By default, **RAG Evaluation** balances the five core metrics as follows:

| Metric            | Default Weight |
|-------------------|---------------:|
| Query Relevance   |          0.25  |
| Factual Accuracy  |          0.25  |
| Coverage          |          0.25  |
| Coherence         |          0.125 |
| Fluency           |          0.125 |

*These defaults reflect our view that modern LLMs (GPT, Llama, Gemini, etc.) already excel at coherence and fluency, so we place greater emphasis on the metrics that most impact RAG system accuracy.*

If you’d like to emphasize certain aspects of your RAG system’s output (say, you care twice as much about factual accuracy as about coherence), you can supply your own weights via the `metric_weights` parameter of `evaluate_response`. Your custom weights must (see the sanity-check sketch after this list):

1. Be a list of **five** floats.
2. Sum to **1.0** (ensuring they form a valid weighted average).
3. Follow this order:  
   `[Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]`
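
A quick pre-flight check of these constraints might look like the sketch below. The helper is illustrative only and not part of the package:

```python
# Illustrative check for custom weights (hypothetical helper, not a package API)
def check_metric_weights(weights, tol=1e-6):
    if len(weights) != 5:
        raise ValueError("metric_weights must contain exactly five floats")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("metric_weights must sum to 1.0")

check_metric_weights([0.1, 0.4, 0.4, 0.05, 0.05])  # passes silently
```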

### Example

```python
from rag_eval.evaluator import evaluate_response

# Define the inputs
query = "..."
response = "..."
document = "..."

# Emphasize factual accuracy and coverage
custom_weights = [0.1, 0.4, 0.4, 0.05, 0.05]

report = evaluate_response(
    query=query,
    response=response,
    document=document,
    model_type="openai",
    model_name="gpt-4.1",
    metric_weights=custom_weights
)

print(report)
```
*This simple mechanism lets users tailor the **Overall Accuracy** score to whatever aspects matter most in their evaluation scenario.*

## Output

The `evaluate_response` function returns a pandas DataFrame with:
- **Metric Names:** Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency.
- **Normalized Scores:** A 0–1 score for each metric.
- **Percentage Scores:** The normalized score expressed as a percentage.
- **Overall Accuracy:** A weighted average score across all metrics.
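
For reference, the snippet below illustrates the weighted-average arithmetic behind the Overall Accuracy figure, using made-up scores and the default weights (this is an illustration, not the package's internal code):

```python
# Placeholder normalized scores, in the metric order used throughout this README
scores  = [0.90, 0.85, 0.80, 0.95, 0.95]        # Query Relevance ... Fluency
weights = [0.25, 0.25, 0.25, 0.125, 0.125]      # default weights

overall_accuracy = sum(w * s for w, s in zip(weights, scores))
print(f"Overall Accuracy: {overall_accuracy:.3f}")  # 0.875 with these placeholder scores
```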

## Need help?
- **Open an issue or pull request on GitHub**  
- **For more examples of how to use the package, see the [example notebook](https://github.com/OlaAkindele/rag_evaluation/main/rag_evaluation_notebook.ipynb)**

            
