# RAG Evaluation
**RAG Evaluation** is a Python package for evaluating Retrieval-Augmented Generation (RAG) systems. It provides a systematic way to score and analyze the quality of responses generated by a RAG pipeline, and is particularly suited to projects that use Large Language Models (LLMs) such as GPT and Gemini.
It integrates easily with the OpenAI API via the `openai` package and automatically handles environment variable-based API key loading through `python-dotenv`.
## Features
- **Multi-Metric Evaluation:** Evaluate responses using the following metrics:
- **Query Relevance**
- **Factual Accuracy**
- **Coverage**
- **Coherence**
- **Fluency**
- **Standardized Prompting:** Uses a well-defined prompt template to consistently assess responses.
- **Customizable:** Easily extendable to add new metrics or evaluation criteria.
- **Easy Integration:** Provides a high-level function to integrate evaluation into your RAG pipelines.
## Installation
```bash
pip install rag_evaluation
```
## Usage
### Open-Source Local Models (Ollama; no external API required)
**Currently, the package supports Llama, Mistral, and Qwen.**
```python
from openai import OpenAI
client = OpenAI(
    api_key='ollama',
    base_url="http://localhost:11434/v1"
)
# List all available models
models = client.models.list()
print(models.to_json())
```
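If you want to confirm that one of the supported model families is already available locally before running an evaluation, a small filter over that model list works. This is a sketch built on the listing code above, not part of the package API:

```python
# Check which of the supported model families (Llama, Mistral, Qwen) are pulled locally.
available = [m.id for m in client.models.list().data]
supported_prefixes = ("llama", "mistral", "qwen")
ready = [name for name in available if name.lower().startswith(supported_prefixes)]
print("Locally available supported models:", ready)
```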
### Usage with Open-Source Models (Ollama)
```python
from rag_evaluation.evaluator import evaluate_response
# Define the inputs
query = "Which large language model is currently the largest and most capable?"
response_text = """The largest and most capable LLMs are the generative pretrained transformers (GPTs). These models are
designed to handle complex language tasks, and their vast number of parameters gives them the ability to
understand and generate human-like text."""
document = """A large language model (LLM) is a type of machine learning model designed for natural language processing
tasks such as language generation. LLMs are language models with many parameters, and are trained with
self-supervised learning on a vast amount of text. The largest and most capable LLMs are
generative pretrained transformers (GPTs). Modern models can be fine-tuned for specific tasks or guided
by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies
inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained in."""
# Llama usage (download the model first with `ollama pull llama3.2:1b`)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='llama3.2:1b',
)
print(report)

# Mistral usage (download the model first with `ollama pull mistral`)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='mistral:latest',
)
print(report)

# Qwen usage (download the model first with `ollama pull qwen`)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='qwen:latest',
)
print(report)
```
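To score more than one query/response pair, the same call can be wrapped in a loop. The sketch below assumes `evaluate_response` returns a pandas DataFrame (see Output below); the `examples` list and the use of `pd.concat` are illustrative, not part of the package API:

```python
import pandas as pd
from rag_evaluation.evaluator import evaluate_response

# Hypothetical batch: each item pairs a query with the generated response
# and the retrieved document it should be judged against.
examples = [
    {"query": query, "response": response_text, "document": document},
    # ... more (query, response, document) triples from your RAG pipeline
]

reports = []
for ex in examples:
    reports.append(
        evaluate_response(
            query=ex["query"],
            response=ex["response"],
            document=ex["document"],
            model_type="ollama",
            model_name="llama3.2:1b",
        )
    )

# Stack the per-example reports so metrics can be compared side by side.
combined = pd.concat(reports, keys=range(len(reports)))
print(combined)
```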
### For API-based Models (GPT and Gemini)
```python
import os
from rag_evaluation.config import get_api_key
from rag_evaluation.evaluator import evaluate_response
# Set the API key (either via an environment variable or directly in code)
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
# OR
api_key = get_api_key("OPENAI_API_KEY")

# Define your inputs (same as above)

# OpenAI usage (requires an OpenAI API key, set as shown above)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="openai",
    model_name='gpt-4.1',
)
print(report)

# Gemini usage (requires a Gemini API key, set via the corresponding environment variable)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="gemini",
    model_name='gemini-2.5-flash',
)
print(report)
```
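Instead of hard-coding the key as above, you can keep it in a local `.env` file; the package loads environment variables through `python-dotenv`. A minimal sketch (the explicit `load_dotenv()` call is shown for clarity and may be redundant if the package already loads the file for you):

```python
from dotenv import load_dotenv
from rag_evaluation.config import get_api_key

# Reads key=value pairs from a .env file in the working directory,
# e.g. a line such as: OPENAI_API_KEY=sk-...
load_dotenv()

# get_api_key then resolves the named key from the environment.
api_key = get_api_key("OPENAI_API_KEY")
```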
## Output
The `evaluate_response` function returns a pandas DataFrame with:
- **Metric Names:** Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency.
- **Normalized Scores:** A 0–1 score for each metric.
- **Percentage Scores:** The normalized score expressed as a percentage.
- **Overall Accuracy:** A weighted average score across all metrics.
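Because the report is a regular DataFrame, you can inspect or persist it like any other. A short sketch; the exact column and row labels may differ between versions, so check `report.columns` rather than relying on the names used here:

```python
# Inspect what the report actually contains before indexing into it.
print(report.columns.tolist())
print(report.to_string(index=False))

# Persist the report, e.g. to compare runs across different judge models.
report.to_csv("rag_eval_report.csv", index=False)
```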