| Name | Comprehensive-RAG-Evaluation-Metrics |
| Version | 0.11.0 |
| home_page | https://github.com/beekash222/RAG_EVAL |
| Summary | This library provides a comprehensive suite of metrics to evaluate the performance of Retrieval-Augmented Generation (RAG) systems. RAG systems, which combine information retrieval with text generation, present unique evaluation challenges beyond those found in standard language generation tasks. |
| upload_time | 2024-08-06 12:23:08 |
| maintainer | None |
| docs_url | None |
| author | Beekash Mohanty |
| requires_python | >=3.9 |
| license | None |
| keywords | None |
| VCS | https://github.com/beekash222/RAG_EVAL |
| bugtrack_url | None |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
RAG Evaluator
Overview
RAG Evaluator is a Python library for evaluating Retrieval-Augmented Generation (RAG) systems. It provides various metrics to evaluate the quality of generated text against reference text.
Installation
You can install the library using pip:
pip install Comprehensive_RAG_Evaluation_Metrics
Usage
Here's how to use the RAG Evaluator library:
from Comprehensive_RAG_Evaluation_Metrics import RAGEvaluator
# Initialize the evaluator
evaluator = RAGEvaluator()
# Input data
question = "What are the causes of difficulty in learning a new topic?"
response = "Difficulty in learning a new topic is often caused by a lack of understanding of the subject's structure."
reference = "Not knowing how to explain a topic to others can make it harder to learn, as it requires a deeper understanding of the subject's structure."
# Evaluate the response
metrics = evaluator.evaluate_all(question, response, reference)
# Print the results
print(metrics)
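The exact keys and value types returned by evaluate_all are not documented here; assuming it returns a dictionary mapping metric names to scores (which is what the print call above suggests), a minimal sketch for listing the results one metric per line:
# Assumption: `metrics` behaves like a dict of metric name -> score.
for name, score in metrics.items():
    print(f"{name}: {score}")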
Streamlit Web App
To run the web app:
cd into the Streamlit app folder.
Create a virtual environment.
Activate the virtual environment.
Install all dependencies.
Run the app:
streamlit run app.py
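For orientation, here is a minimal hypothetical app.py that wires the evaluator into a Streamlit form; the actual app.py in the repository may differ, and the widget labels are illustrative only:
import streamlit as st

from Comprehensive_RAG_Evaluation_Metrics import RAGEvaluator

# Hypothetical minimal UI: collect the three inputs, run all metrics, display the result.
st.title("RAG Evaluator")
question = st.text_area("Question")
response = st.text_area("Generated response")
reference = st.text_area("Reference answer")

if st.button("Evaluate"):
    evaluator = RAGEvaluator()
    metrics = evaluator.evaluate_all(question, response, reference)
    st.write(metrics)  # Streamlit renders a dict as an expandable JSON-style view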
Metrics
The RAG Evaluator provides the following metrics:
BLEU (0-100): Measures the overlap between the generated output and reference text based on n-grams.
0-20: Low similarity, 20-40: Medium-low, 40-60: Medium, 60-80: High, 80-100: Very high
ROUGE-1 (0-1): Measures the overlap of unigrams between the generated output and reference text.
0.0-0.2: Poor overlap, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent
BERT Score (0-1): Evaluates semantic similarity using BERT embeddings (precision, recall, F1).
0.0-0.5: Low similarity, 0.5-0.7: Moderate, 0.7-0.8: Good, 0.8-0.9: High, 0.9-1.0: Very high
Perplexity (1 to ∞, lower is better): Measures how well a language model predicts the text.
1-10: Excellent, 10-50: Good, 50-100: Moderate, 100+: High (potentially nonsensical)
Diversity (0-1): Measures the uniqueness of bigrams in the generated output (a worked sketch follows this list).
0.0-0.2: Very low, 0.2-0.4: Low, 0.4-0.6: Moderate, 0.6-0.8: High, 0.8-1.0: Very high
Racial Bias (0-1): Detects the presence of biased language in the generated output.
0.0-0.2: Low probability, 0.2-0.4: Moderate, 0.4-0.6: High, 0.6-0.8: Very high, 0.8-1.0: Extreme
METEOR (0-1): Calculates semantic similarity considering synonyms and paraphrases.
0.0-0.2: Poor, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent
CHRF (0-1): Computes the character n-gram F-score for fine-grained text similarity.
0.0-0.2: Low, 0.2-0.4: Moderate, 0.4-0.6: Good, 0.6-0.8: High, 0.8-1.0: Very high
Flesch Reading Ease (0-100): Assesses text readability.
0-30: Very difficult, 30-50: Difficult, 50-60: Fairly difficult, 60-70: Standard, 70-80: Fairly easy, 80-90: Easy, 90-100: Very easy
Flesch-Kincaid Grade (0-18+): Indicates the U.S. school grade level needed to understand the text.
1-6: Elementary, 7-8: Middle school, 9-12: High school, 13+: College level
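To make the Diversity metric concrete, here is a self-contained sketch of a distinct-bigram ratio (unique bigrams divided by total bigrams). It illustrates the general idea only and is not necessarily the exact formula the library implements:
def distinct_bigram_ratio(text: str) -> float:
    # Fraction of bigrams in `text` that are unique (0 = fully repetitive, 1 = all distinct).
    tokens = text.lower().split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)

# "the cat" occurs twice, so 9 of the 10 bigrams are unique -> 0.9
print(distinct_bigram_ratio("the cat sat on the mat because the cat was tired"))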
The library also covers the following evaluation dimensions:
Semantic Similarity: Evaluates similarity in meaning between two texts using word embeddings and cosine similarity for accurate context.
Factual Consistency: Verifies factual accuracy in responses using entity recognition and knowledge-graph-based methods for trustworthiness.
Question Relevance: Measures response relevance to user queries using keyword extraction and intent detection for effective answers.
Context Relevance: Assesses response appropriateness in a given situation using topic modeling and semantic role labeling for contextual fit.
Answer Relevance: Evaluates response clarity and directness in answering user queries using named entity recognition and dependency parsing.
Toxicity: Detects hate speech, profanity, and toxic content in responses using sentiment analysis and machine-learning-based classification.
Testing
To run the tests, use the following command:
python -m unittest discover -s rag_evaluator -p "test_*.py"
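For reference, a test module that this discovery pattern would pick up looks roughly like the following. The file name (for example rag_evaluator/test_metrics.py) and the assertion are illustrative assumptions, not the package's actual tests:
import unittest

from Comprehensive_RAG_Evaluation_Metrics import RAGEvaluator


class TestRAGEvaluator(unittest.TestCase):
    # Hypothetical smoke test: evaluate_all should return a result for simple inputs.
    def test_evaluate_all_returns_result(self):
        evaluator = RAGEvaluator()
        metrics = evaluator.evaluate_all(
            "What is RAG?",
            "RAG combines retrieval with text generation.",
            "Retrieval-Augmented Generation combines a retriever with a text generator.",
        )
        self.assertIsNotNone(metrics)


if __name__ == "__main__":
    unittest.main()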
Raw data
{
"_id": null,
"home_page": "https://github.com/beekash222/RAG_EVAL",
"name": "Comprehensive-RAG-Evaluation-Metrics",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": null,
"author": "Beekash Mohanty",
"author_email": "beekashmohanty222@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/1c/b4/84e83334e366924df070bfe6d1de57aaf3406c923e206c0df35825647a6f/comprehensive_rag_evaluation_metrics-0.11.0.tar.gz",
"platform": null,
"description": "RAG Evaluator\r\nOverview\r\nRAG Evaluator is a Python library for evaluating Retrieval-Augmented Generation (RAG) systems. It provides various metrics to evaluate the quality of generated text against reference text.\r\n\r\nInstallation\r\nYou can install the library using pip:\r\n\r\npip install Comprehensive_RAG_Evaluation_Metrics\r\nUsage\r\nHere's how to use the RAG Evaluator library:\r\n\r\nfrom Comprehensive_RAG_Evaluation_Metrics import RAGEvaluator\r\n\r\n# Initialize the evaluator\r\nevaluator = RAGEvaluator()\r\n\r\n# Input data\r\nquestion = \"What are the causes of difficulty in learning a new topic?\"\r\nresponse = \"Difficulty in learning a new topic is often caused by a lack of understanding of the subject's structure.\"\r\nreference = \"Not knowing how to explain a topic to others can make it harder to learn, as it requires a deeper understanding of the subject's structure.\"\r\n\r\n# Evaluate the response\r\nmetrics = evaluator.evaluate_all(question, response, reference)\r\n\r\n# Print the results\r\nprint(metrics)\r\nStreamlit Web App\r\nTo run the web app:\r\n\r\ncd into streamlit app folder.\r\nCreate a virtual env\r\nActivate the virtual env\r\nInstall all dependencies\r\nRun the app:\r\nstreamlit run app.py\r\nMetrics\r\nThe RAG Evaluator provides the following metrics:\r\n\r\nBLEU (0-100): Measures the overlap between the generated output and reference text based on n-grams.\r\n\r\n0-20: Low similarity, 20-40: Medium-low, 40-60: Medium, 60-80: High, 80-100: Very high\r\nROUGE-1 (0-1): Measures the overlap of unigrams between the generated output and reference text.\r\n\r\n0.0-0.2: Poor overlap, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent\r\nBERT Score (0-1): Evaluates the semantic similarity using BERT embeddings (Precision, Recall, F1).\r\n\r\n0.0-0.5: Low similarity, 0.5-0.7: Moderate, 0.7-0.8: Good, 0.8-0.9: High, 0.9-1.0: Very high\r\nPerplexity (1 to \u00e2\u02c6\u017e, lower is better): Measures how well a language model predicts the text.\r\n\r\n1-10: Excellent, 10-50: Good, 50-100: Moderate, 100+: High (potentially nonsensical)\r\nDiversity (0-1): Measures the uniqueness of bigrams in the generated output.\r\n\r\n0.0-0.2: Very low, 0.2-0.4: Low, 0.4-0.6: Moderate, 0.6-0.8: High, 0.8-1.0: Very high\r\nRacial Bias (0-1): Detects the presence of biased language in the generated output.\r\n\r\n0.0-0.2: Low probability, 0.2-0.4: Moderate, 0.4-0.6: High, 0.6-0.8: Very high, 0.8-1.0: Extreme\r\nMETEOR (0-1): Calculates semantic similarity considering synonyms and paraphrases.\r\n\r\n0.0-0.2: Poor, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent\r\nCHRF (0-1): Computes Character n-gram F-score for fine-grained text similarity.\r\n\r\n0.0-0.2: Low, 0.2-0.4: Moderate, 0.4-0.6: Good, 0.6-0.8: High, 0.8-1.0: Very high\r\nFlesch Reading Ease (0-100): Assesses text readability.\r\n\r\n0-30: Very difficult, 30-50: Difficult, 50-60: Fairly difficult, 60-70: Standard, 70-80: Fairly easy, 80-90: Easy, 90-100: Very easy\r\nFlesch-Kincaid Grade (0-18+): Indicates the U.S. 
school grade level needed to understand the text.\r\n\r\n1-6: Elementary, 7-8: Middle school, 9-12: High school, 13+: College level\r\nTesting\r\nTo run the tests, use the following command:\r\n\r\nSemantic Similarity Evaluates similarity in meaning between two texts using word embeddings and cosine similarity for accurate context.\r\n\r\nFactual Consistency Verifies factual accuracy in responses using entity recognition and knowledge graph-based methods for trustworthiness.\r\n\r\nQuestion Relevance Measures response relevance to user queries using keyword extraction and intent detection for effective answers.\r\n\r\nContext Relevance Assesses response appropriateness in a given situation using topic modeling and semantic role labeling for contextual fit.\r\n\r\nAnswer Relevance Evaluates response clarity and directness in answering user queries using named entity recognition and dependency parsing.\r\n\r\nToxicity Detects hate speech, profanity, and toxic content in responses using sentiment analysis and machine learning-based classification.\r\n\r\npython -m unittest discover -s rag_evaluator -p \"test_*.py\"\r\n",
"bugtrack_url": null,
"license": null,
"summary": "This library provides a comprehensive suite of metrics to evaluate the performance of Retrieval-Augmented Generation (RAG) systems. RAG systems, which combine information retrieval with text generation, present unique evaluation challenges beyond those found in standard language generation tasks",
"version": "0.11.0",
"project_urls": {
"Homepage": "https://github.com/beekash222/RAG_EVAL"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "69a559c9a5dded13731e804ce5226cfa7278609dcdcb51e7b595391f7c6be1b6",
"md5": "53e0f45ca88b7df9a5be4081a277d39b",
"sha256": "d41c23b962ca6562152df3579463567a08cd0bb4a628241cbb80adf03dc9562f"
},
"downloads": -1,
"filename": "Comprehensive_RAG_Evaluation_Metrics-0.11.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "53e0f45ca88b7df9a5be4081a277d39b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 8722,
"upload_time": "2024-08-06T12:23:06",
"upload_time_iso_8601": "2024-08-06T12:23:06.363633Z",
"url": "https://files.pythonhosted.org/packages/69/a5/59c9a5dded13731e804ce5226cfa7278609dcdcb51e7b595391f7c6be1b6/Comprehensive_RAG_Evaluation_Metrics-0.11.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "1cb484e83334e366924df070bfe6d1de57aaf3406c923e206c0df35825647a6f",
"md5": "9e48623fb394d7055d5a3f424ff648c0",
"sha256": "575c8d6e381e5af0851b571243378a4f9e9cc9b9b8a633df73873367db631217"
},
"downloads": -1,
"filename": "comprehensive_rag_evaluation_metrics-0.11.0.tar.gz",
"has_sig": false,
"md5_digest": "9e48623fb394d7055d5a3f424ff648c0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 7684,
"upload_time": "2024-08-06T12:23:08",
"upload_time_iso_8601": "2024-08-06T12:23:08.334973Z",
"url": "https://files.pythonhosted.org/packages/1c/b4/84e83334e366924df070bfe6d1de57aaf3406c923e206c0df35825647a6f/comprehensive_rag_evaluation_metrics-0.11.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-06 12:23:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "beekash222",
"github_project": "RAG_EVAL",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "comprehensive-rag-evaluation-metrics"
}