Name | continuous-eval
Version | 0.3.14.post2
home_page | None
Summary | Open-Source Evaluation for GenAI Applications.
upload_time | 2025-01-06 15:30:53
maintainer | None
docs_url | None
author | Yi Zhang
requires_python | <3.13,>=3.10
license | Apache-2.0
keywords | None
requirements | No requirements were recorded.
<h3 align="center">
<img
src="docs/public/continuous-eval-logo.png"
width="350"
>
</h3>
<div align="center">
<a href="https://docs.relari.ai/" target="_blank"><img src="https://img.shields.io/badge/docs-view-blue" alt="Documentation"></a>
<a href="https://pypi.python.org/pypi/continuous-eval"></a>
<a href="https://github.com/relari-ai/continuous-eval/releases"></a>
<a href="https://pypi.python.org/pypi/continuous-eval/"></a>
  <a href="https://github.com/relari-ai/continuous-eval/blob/main/LICENSE"></a>
</div>
<h2 align="center">
<p>Data-Driven Evaluation for LLM-Powered Applications</p>
</h2>
## Overview
`continuous-eval` is an open-source package created for the data-driven evaluation of LLM-powered applications.
<h1 align="center">
<img
src="docs/public/module-level-eval.png"
>
</h1>
## How is continuous-eval different?
- **Modularized Evaluation**: Measure each module in the pipeline with tailored metrics.
- **Comprehensive Metric Library**: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.
- **Probabilistic Evaluation**: Evaluate your pipeline with probabilistic metrics.
## Getting Started
This code is provided as a PyPI package. To install it, run the following command:
```bash
python3 -m pip install continuous-eval
```
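This release is `0.3.14.post2` and, per the package metadata, supports Python `>=3.10,<3.13`. If you need this exact version, you can pin it explicitly:
```bash
python3 -m pip install "continuous-eval==0.3.14.post2"
```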
If you want to install from source:
```bash
git clone https://github.com/relari-ai/continuous-eval.git && cd continuous-eval
poetry install --all-extras
```
To run LLM-based metrics, at least one LLM API key must be set in `.env`. Take a look at the example env file `.env.example`.
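For instance, a minimal `.env` could look like the sketch below; `OPENAI_API_KEY` is shown only as an illustration of a common provider key, so check `.env.example` for the exact variable names the package reads:
```bash
# Example .env (illustrative; see .env.example for the exact variable names)
OPENAI_API_KEY="<your-api-key>"
```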
## Run a single metric
Here's how to run a single metric on a single datum.
All available metrics are listed in the [documentation](https://continuous-eval.docs.relari.ai/).
```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1
datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
    "answer": "Paris",
    "ground_truths": ["Paris"],
}
metric = PrecisionRecallF1()
print(metric(**datum))
```
## Run an evaluation
If you want to run an evaluation on a dataset, you can use the `EvaluationRunner` class.
```python
from time import perf_counter
from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.eval import EvaluationRunner, SingleModulePipeline
from continuous_eval.eval.tests import GreaterOrEqualThan
from continuous_eval.metrics.retrieval import (
    PrecisionRecallF1,
    RankedRetrievalMetrics,
)


def main():
    # Let's download the retrieval dataset example
    dataset = example_data_downloader("retrieval")

    # Setup evaluation pipeline (i.e., dataset, metrics and tests)
    pipeline = SingleModulePipeline(
        dataset=dataset,
        eval=[
            PrecisionRecallF1().use(
                retrieved_context=dataset.retrieved_contexts,
                ground_truth_context=dataset.ground_truth_contexts,
            ),
            RankedRetrievalMetrics().use(
                retrieved_context=dataset.retrieved_contexts,
                ground_truth_context=dataset.ground_truth_contexts,
            ),
        ],
        tests=[
            GreaterOrEqualThan(
                test_name="Recall", metric_name="context_recall", min_value=0.8
            ),
        ],
    )

    # Start the evaluation manager and run the metrics (and tests)
    tic = perf_counter()
    runner = EvaluationRunner(pipeline)
    eval_results = runner.evaluate()
    toc = perf_counter()
    print("Evaluation results:")
    print(eval_results.aggregate())
    print(f"Elapsed time: {toc - tic:.2f} seconds\n")

    print("Running tests...")
    test_results = runner.test(eval_results)
    print(test_results)


if __name__ == "__main__":
    # It is important to run this script in a new process to avoid
    # multiprocessing issues
    main()
```
## Run evaluation on a pipeline (modular evaluation)
Sometimes the system is composed of multiple modules, each with its own metrics and tests.
Continuous-eval supports this use case by allowing you to define modules in your pipeline and select corresponding metrics.
```python
from typing import Any, Dict, List
from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.eval import (
    Dataset,
    EvaluationRunner,
    Module,
    ModuleOutput,
    Pipeline,
)
from continuous_eval.eval.result_types import PipelineResults
from continuous_eval.metrics.generation.text import AnswerCorrectness
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics


def page_content(docs: List[Dict[str, Any]]) -> List[str]:
    # Extract the content of the retrieved documents from the pipeline results
    return [doc["page_content"] for doc in docs]


def main():
    dataset: Dataset = example_data_downloader("graham_essays/small/dataset")
    results: Dict = example_data_downloader("graham_essays/small/results")

    # Simple 3-step RAG pipeline with Retriever->Reranker->Generation
    retriever = Module(
        name="retriever",
        input=dataset.question,
        output=List[str],
        eval=[
            PrecisionRecallF1().use(
                retrieved_context=ModuleOutput(page_content),  # specify how to extract what we need (i.e., page_content)
                ground_truth_context=dataset.ground_truth_context,
            ),
        ],
    )

    reranker = Module(
        name="reranker",
        input=retriever,
        output=List[Dict[str, str]],
        eval=[
            RankedRetrievalMetrics().use(
                retrieved_context=ModuleOutput(page_content),
                ground_truth_context=dataset.ground_truth_context,
            ),
        ],
    )

    llm = Module(
        name="llm",
        input=reranker,
        output=str,
        eval=[
            AnswerCorrectness().use(
                question=dataset.question,
                answer=ModuleOutput(),
                ground_truth_answers=dataset.ground_truth_answers,
            ),
        ],
    )

    pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
    print(pipeline.graph_repr())  # visualize the pipeline in Mermaid format

    runner = EvaluationRunner(pipeline)
    eval_results = runner.evaluate(PipelineResults.from_dict(results))
    print(eval_results.aggregate())


if __name__ == "__main__":
    main()
```
> Note: it is important to wrap your code in a main function (with the `if __name__ == "__main__":` guard) to make sure the parallelization works properly.
## Custom Metrics
There are several ways to create custom metrics; see the [Custom Metrics](https://continuous-eval.docs.relari.ai/v0.3/metrics/overview) section in the docs.
The simplest way is to leverage the `CustomMetric` class to create an LLM-as-a-Judge metric.
```python
from continuous_eval.metrics.base.metric import Arg, Field
from continuous_eval.metrics.custom import CustomMetric
from typing import List
criteria = "Check that the generated answer does not contain PII or other sensitive information."
rubric = """Use the following rubric to assign a score to the answer based on its conciseness:
- Yes: The answer contains PII or other sensitive information.
- No: The answer does not contain PII or other sensitive information.
"""
metric = CustomMetric(
    name="PIICheck",
    criteria=criteria,
    rubric=rubric,
    arguments={"answer": Arg(type=str, description="The answer to evaluate.")},
    response_format={
        "reasoning": Field(
            type=str,
            description="The reasoning for the score given to the answer",
        ),
        "score": Field(
            type=str, description="The score of the answer: Yes or No"
        ),
        "identifies": Field(
            type=List[str],
            description="The PII or other sensitive information identified in the answer",
        ),
    },
)
# Let's calculate the metric for the first datum
print(metric(answer="John Doe resides at 123 Main Street, Springfield."))
```
## 💡 Contributing
Interested in contributing? See our [Contribution Guide](CONTRIBUTING.md) for more details.
## Resources
- **Docs:** [link](https://continuous-eval.docs.relari.ai/)
- **Examples Repo**: [end-to-end example repo](https://github.com/relari-ai/examples)
- **Blog Posts:**
- Practical Guide to RAG Pipeline Evaluation: [Part 1: Retrieval](https://medium.com/relari/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893), [Part 2: Generation](https://medium.com/relari/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)
- How important is a Golden Dataset for LLM evaluation?
[(link)](https://medium.com/relari/how-important-is-a-golden-dataset-for-llm-pipeline-evaluation-4ef6deb14dc5)
- How to evaluate complex GenAI Apps: a granular approach [(link)](https://medium.com/relari/how-to-evaluate-complex-genai-apps-a-granular-approach-0ab929d5b3e2)
- How to Make the Most Out of LLM Production Data: Simulated User Feedback [(link)](https://medium.com/towards-data-science/how-to-make-the-most-out-of-llm-production-data-simulated-user-feedback-843c444febc7)
- Generate Synthetic Data to Test LLM Applications [(link)](https://medium.com/relari/generate-synthetic-data-to-test-llm-applications-4bffeb51b80e)
- **Discord:** Join our community of LLM developers on [Discord](https://discord.gg/GJnM8SRsHr)
- **Reach out to founders:** [Email](mailto:founders@relari.ai) or [Schedule a chat](https://cal.com/relari/intro)
## License
This project is licensed under Apache 2.0; see the [LICENSE](LICENSE) file for details.
## Open Analytics
We monitor basic anonymous usage statistics to understand our users' preferences, inform new features, and identify areas that might need improvement.
You can take a look at exactly what we track in the [telemetry code](continuous_eval/utils/telemetry.py).
To disable usage tracking, set the `CONTINUOUS_EVAL_DO_NOT_TRACK` flag to `true`.
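For example, assuming the flag is read from the environment, you can opt out for the current shell session with:
```bash
# Disable anonymous usage tracking for this shell session
export CONTINUOUS_EVAL_DO_NOT_TRACK=true
```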
Raw data
{
"_id": null,
"home_page": null,
"name": "continuous-eval",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.13,>=3.10",
"maintainer_email": null,
"keywords": null,
"author": "Yi Zhang",
"author_email": "yi@relari.ai",
"download_url": "https://files.pythonhosted.org/packages/af/d0/eb2ce2738ca392f43dc77b5c1186e0f0149dbcbcc4ae23ec33daa5ae2bc6/continuous_eval-0.3.14.post2.tar.gz",
"platform": null,
"description": "<h3 align=\"center\">\n <img\n src=\"docs/public/continuous-eval-logo.png\"\n width=\"350\"\n >\n</h3>\n\n<div align=\"center\">\n\n \n <a href=\"https://docs.relari.ai/\" target=\"_blank\"><img src=\"https://img.shields.io/badge/docs-view-blue\" alt=\"Documentation\"></a>\n <a href=\"https://pypi.python.org/pypi/continuous-eval\"></a>\n <a href=\"https://github.com/relari-ai/continuous-eval/releases\"></a>\n <a href=\"https://pypi.python.org/pypi/continuous-eval/\"></a>\n <a a href=\"https://github.com/relari-ai/continuous-eval/blob/main/LICENSE\"></a>\n\n\n</div>\n\n<h2 align=\"center\">\n <p>Data-Driven Evaluation for LLM-Powered Applications</p>\n</h2>\n\n\n\n## Overview\n\n`continuous-eval` is an open-source package created for data-driven evaluation of LLM-powered application.\n\n<h1 align=\"center\">\n <img\n src=\"docs/public/module-level-eval.png\"\n >\n</h1>\n\n## How is continuous-eval different?\n\n- **Modularized Evaluation**: Measure each module in the pipeline with tailored metrics.\n\n- **Comprehensive Metric Library**: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.\n\n- **Probabilistic Evaluation**: Evaluate your pipeline with probabilistic metrics\n\n## Getting Started\n\nThis code is provided as a PyPi package. To install it, run the following command:\n\n```bash\npython3 -m pip install continuous-eval\n```\n\nif you want to install from source:\n\n```bash\ngit clone https://github.com/relari-ai/continuous-eval.git && cd continuous-eval\npoetry install --all-extras\n```\n\nTo run LLM-based metrics, the code requires at least one of the LLM API keys in `.env`. Take a look at the example env file `.env.example`.\n\n## Run a single metric\n\nHere's how you run a single metric on a datum.\nCheck all available metrics here: [link](https://continuous-eval.docs.relari.ai/)\n\n```python\nfrom continuous_eval.metrics.retrieval import PrecisionRecallF1\n\ndatum = {\n \"question\": \"What is the capital of France?\",\n \"retrieved_context\": [\n \"Paris is the capital of France and its largest city.\",\n \"Lyon is a major city in France.\",\n ],\n \"ground_truth_context\": [\"Paris is the capital of France.\"],\n \"answer\": \"Paris\",\n \"ground_truths\": [\"Paris\"],\n}\n\nmetric = PrecisionRecallF1()\n\nprint(metric(**datum))\n```\n\n## Run an evaluation\n\nIf you want to run an evaluation on a dataset, you can use the `EvaluationRunner` class.\n\n```python\nfrom time import perf_counter\n\nfrom continuous_eval.data_downloader import example_data_downloader\nfrom continuous_eval.eval import EvaluationRunner, SingleModulePipeline\nfrom continuous_eval.eval.tests import GreaterOrEqualThan\nfrom continuous_eval.metrics.retrieval import (\n PrecisionRecallF1,\n RankedRetrievalMetrics,\n)\n\n\ndef main():\n # Let's download the retrieval dataset example\n dataset = example_data_downloader(\"retrieval\")\n\n # Setup evaluation pipeline (i.e., dataset, metrics and tests)\n pipeline = SingleModulePipeline(\n dataset=dataset,\n eval=[\n PrecisionRecallF1().use(\n retrieved_context=dataset.retrieved_contexts,\n ground_truth_context=dataset.ground_truth_contexts,\n ),\n RankedRetrievalMetrics().use(\n retrieved_context=dataset.retrieved_contexts,\n ground_truth_context=dataset.ground_truth_contexts,\n ),\n ],\n tests=[\n GreaterOrEqualThan(\n test_name=\"Recall\", metric_name=\"context_recall\", min_value=0.8\n ),\n ],\n )\n\n # 
Start the evaluation manager and run the metrics (and tests)\n tic = perf_counter()\n runner = EvaluationRunner(pipeline)\n eval_results = runner.evaluate()\n toc = perf_counter()\n print(\"Evaluation results:\")\n print(eval_results.aggregate())\n print(f\"Elapsed time: {toc - tic:.2f} seconds\\n\")\n\n print(\"Running tests...\")\n test_results = runner.test(eval_results)\n print(test_results)\n\n\nif __name__ == \"__main__\":\n # It is important to run this script in a new process to avoid\n # multiprocessing issues\n main()\n```\n\n## Run evaluation on a pipeline (modular evaluation)\n\nSometimes the system is composed of multiple modules, each with its own metrics and tests.\nContinuous-eval supports this use case by allowing you to define modules in your pipeline and select corresponding metrics.\n\n```python\nfrom typing import Any, Dict, List\n\nfrom continuous_eval.data_downloader import example_data_downloader\nfrom continuous_eval.eval import (\n Dataset,\n EvaluationRunner,\n Module,\n ModuleOutput,\n Pipeline,\n)\nfrom continuous_eval.eval.result_types import PipelineResults\nfrom continuous_eval.metrics.generation.text import AnswerCorrectness\nfrom continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics\n\n\ndef page_content(docs: List[Dict[str, Any]]) -> List[str]:\n # Extract the content of the retrieved documents from the pipeline results\n return [doc[\"page_content\"] for doc in docs]\n\n\ndef main():\n dataset: Dataset = example_data_downloader(\"graham_essays/small/dataset\")\n results: Dict = example_data_downloader(\"graham_essays/small/results\")\n\n # Simple 3-step RAG pipeline with Retriever->Reranker->Generation\n retriever = Module(\n name=\"retriever\",\n input=dataset.question,\n output=List[str],\n eval=[\n PrecisionRecallF1().use(\n retrieved_context=ModuleOutput(page_content), # specify how to extract what we need (i.e., page_content)\n ground_truth_context=dataset.ground_truth_context,\n ),\n ],\n )\n\n reranker = Module(\n name=\"reranker\",\n input=retriever,\n output=List[Dict[str, str]],\n eval=[\n RankedRetrievalMetrics().use(\n retrieved_context=ModuleOutput(page_content),\n ground_truth_context=dataset.ground_truth_context,\n ),\n ],\n )\n\n llm = Module(\n name=\"llm\",\n input=reranker,\n output=str,\n eval=[\n AnswerCorrectness().use(\n question=dataset.question,\n answer=ModuleOutput(),\n ground_truth_answers=dataset.ground_truth_answers,\n ),\n ],\n )\n\n pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)\n print(pipeline.graph_repr()) # visualize the pipeline in marmaid format\n\n runner = EvaluationRunner(pipeline)\n eval_results = runner.evaluate(PipelineResults.from_dict(results))\n print(eval_results.aggregate())\n\n\nif __name__ == \"__main__\":\n main()\n```\n\n> Note: it is important to wrap your code in a main function (with the `if __name__ == \"__main__\":` guard) to make sure the parallelization works properly.\n\n## Custom Metrics\n\nThere are several ways to create custom metrics, see the [Custom Metrics](https://continuous-eval.docs.relari.ai/v0.3/metrics/overview) section in the docs.\n\nThe simplest way is to leverage the `CustomMetric` class to create a LLM-as-a-Judge.\n\n```python\nfrom continuous_eval.metrics.base.metric import Arg, Field\nfrom continuous_eval.metrics.custom import CustomMetric\nfrom typing import List\n\ncriteria = \"Check that the generated answer does not contain PII or other sensitive information.\"\nrubric = \"\"\"Use the following rubric to assign a score to 
the answer based on its conciseness:\n- Yes: The answer contains PII or other sensitive information.\n- No: The answer does not contain PII or other sensitive information.\n\"\"\"\n\nmetric = CustomMetric(\n name=\"PIICheck\",\n criteria=criteria,\n rubric=rubric,\n arguments={\"answer\": Arg(type=str, description=\"The answer to evaluate.\")},\n response_format={\n \"reasoning\": Field(\n type=str,\n description=\"The reasoning for the score given to the answer\",\n ),\n \"score\": Field(\n type=str, description=\"The score of the answer: Yes or No\"\n ),\n \"identifies\": Field(\n type=List[str],\n description=\"The PII or other sensitive information identified in the answer\",\n ),\n },\n)\n\n# Let's calculate the metric for the first datum\nprint(metric(answer=\"John Doe resides at 123 Main Street, Springfield.\"))\n```\n\n## \ud83d\udca1 Contributing\n\nInterested in contributing? See our [Contribution Guide](CONTRIBUTING.md) for more details.\n\n## Resources\n\n- **Docs:** [link](https://continuous-eval.docs.relari.ai/)\n- **Examples Repo**: [end-to-end example repo](https://github.com/relari-ai/examples)\n- **Blog Posts:**\n - Practical Guide to RAG Pipeline Evaluation: [Part 1: Retrieval](https://medium.com/relari/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893), [Part 2: Generation](https://medium.com/relari/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)\n - How important is a Golden Dataset for LLM evaluation?\n [(link)](https://medium.com/relari/how-important-is-a-golden-dataset-for-llm-pipeline-evaluation-4ef6deb14dc5)\n - How to evaluate complex GenAI Apps: a granular approach [(link)](https://medium.com/relari/how-to-evaluate-complex-genai-apps-a-granular-approach-0ab929d5b3e2)\n - How to Make the Most Out of LLM Production Data: Simulated User Feedback [(link)](https://medium.com/towards-data-science/how-to-make-the-most-out-of-llm-production-data-simulated-user-feedback-843c444febc7)\n - Generate Synthetic Data to Test LLM Applications [(link)](https://medium.com/relari/generate-synthetic-data-to-test-llm-applications-4bffeb51b80e)\n- **Discord:** Join our community of LLM developers [Discord](https://discord.gg/GJnM8SRsHr)\n- **Reach out to founders:** [Email](mailto:founders@relari.ai) or [Schedule a chat](https://cal.com/relari/intro)\n\n## License\n\nThis project is licensed under the Apache 2.0 - see the [LICENSE](LICENSE) file for details.\n\n## Open Analytics\n\nWe monitor basic anonymous usage statistics to understand our users' preferences, inform new features, and identify areas that might need improvement.\nYou can take a look at exactly what we track in the [telemetry code](continuous_eval/utils/telemetry.py)\n\nTo disable usage-tracking you set the `CONTINUOUS_EVAL_DO_NOT_TRACK` flag to `true`.\n",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Open-Source Evaluation for GenAI Applications.",
"version": "0.3.14.post2",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "d9f8aa33cec694733597ac87a519319d9cb89b2734aa52ef08285f6d7bb9968d",
"md5": "73b5a8097ac79154d91240dd870315f5",
"sha256": "faa857991a75595c68dd44bfaf4c2387304de9c30bcd647f7af7962ae44d6f76"
},
"downloads": -1,
"filename": "continuous_eval-0.3.14.post2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "73b5a8097ac79154d91240dd870315f5",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.13,>=3.10",
"size": 80083,
"upload_time": "2025-01-06T15:30:50",
"upload_time_iso_8601": "2025-01-06T15:30:50.391750Z",
"url": "https://files.pythonhosted.org/packages/d9/f8/aa33cec694733597ac87a519319d9cb89b2734aa52ef08285f6d7bb9968d/continuous_eval-0.3.14.post2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "afd0eb2ce2738ca392f43dc77b5c1186e0f0149dbcbcc4ae23ec33daa5ae2bc6",
"md5": "65fb4211686d8dabba6d397d1455fb54",
"sha256": "333fc3da8d9fd0f0ea406a6bc91f8049f7ee6e1d3ce4d61fa07ce681f05aa92e"
},
"downloads": -1,
"filename": "continuous_eval-0.3.14.post2.tar.gz",
"has_sig": false,
"md5_digest": "65fb4211686d8dabba6d397d1455fb54",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.13,>=3.10",
"size": 56534,
"upload_time": "2025-01-06T15:30:53",
"upload_time_iso_8601": "2025-01-06T15:30:53.096106Z",
"url": "https://files.pythonhosted.org/packages/af/d0/eb2ce2738ca392f43dc77b5c1186e0f0149dbcbcc4ae23ec33daa5ae2bc6/continuous_eval-0.3.14.post2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-06 15:30:53",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "continuous-eval"
}