| Field | Value |
|---|---|
| Name | continuous-eval |
| Version | 0.3.13 |
| home_page | None |
| Summary | Open-Source Evaluation for GenAI Application Pipelines. |
| upload_time | 2024-08-04 23:00:44 |
| maintainer | None |
| docs_url | None |
| author | Yi Zhang |
| requires_python | <3.12,>=3.9 |
| license | Apache-2.0 |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
<h3 align="center">
<img
src="docs/public/continuous-eval-logo.png"
width="350"
>
</h3>
<div align="center">
<a href="https://docs.relari.ai/" target="_blank"><img src="https://img.shields.io/badge/docs-view-blue" alt="Documentation"></a>
<a href="https://pypi.python.org/pypi/continuous-eval">![https://pypi.python.org/pypi/continuous-eval/](https://img.shields.io/pypi/pyversions/continuous-eval.svg)</a>
<a href="https://github.com/relari-ai/continuous-eval/releases">![https://GitHub.com/relari-ai/continuous-eval/releases](https://img.shields.io/github/release/relari-ai/continuous-eval)</a>
<a href="https://pypi.python.org/pypi/continuous-eval/">![https://github.com/Naereen/badges/](https://badgen.net/badge/Open%20Source%20%3F/Yes%21/blue?icon=github)</a>
  <a href="https://github.com/relari-ai/continuous-eval/blob/main/LICENSE">![https://pypi.python.org/pypi/continuous-eval/](https://img.shields.io/pypi/l/continuous-eval.svg)</a>
</div>
<h2 align="center">
<p>Data-Driven Evaluation for LLM-Powered Applications</p>
</h2>
## Overview
`continuous-eval` is an open-source package created for data-driven evaluation of LLM-powered applications.
<h1 align="center">
<img
src="docs/public/module-level-eval.png"
>
</h1>
## How is continuous-eval different?
- **Modularized Evaluation**: Measure each module in the pipeline with tailored metrics.
- **Comprehensive Metric Library**: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.
- **Leverage User Feedback in Evaluation**: Easily build a close-to-human ensemble evaluation pipeline with mathematical guarantees.
- **Synthetic Dataset Generation**: Generate large-scale synthetic datasets to test your pipeline.
## Getting Started
This code is provided as a PyPI package. To install it, run the following command:
```bash
python3 -m pip install continuous-eval
```
If you want to install from source:
```bash
git clone https://github.com/relari-ai/continuous-eval.git && cd continuous-eval
poetry install --all-extras
```
To run LLM-based metrics, you need at least one LLM API key set in `.env`; see the example env file `.env.example`.
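As a quick sanity check, you can confirm a key is visible to your process before running an LLM-based metric. This is only a sketch: `OPENAI_API_KEY` is one example of a provider key, and `python-dotenv` is an assumed helper rather than a requirement of `continuous-eval`.
```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # read variables from .env into the process environment
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set at least one LLM API key (see .env.example) before using LLM-based metrics")
```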
## Run a single metric
Here's how you run a single metric on a datum.
Check all available metrics here: [link](https://continuous-eval.docs.relari.ai/)
```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1
datum = {
"question": "What is the capital of France?",
"retrieved_context": [
"Paris is the capital of France and its largest city.",
"Lyon is a major city in France.",
],
"ground_truth_context": ["Paris is the capital of France."],
"answer": "Paris",
"ground_truths": ["Paris"],
}
metric = PrecisionRecallF1()
print(metric(**datum))
```
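The call returns a dictionary of scores for the datum; for `PrecisionRecallF1` these are precision, recall, and F1 of the retrieved context against the ground-truth context (exact key names may differ between versions).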
### Available Metrics
<table border="0">
<tr>
<th>Module</th>
<th>Category</th>
<th>Metrics</th>
</tr>
<tr>
<td rowspan="2">Retrieval</td>
<td>Deterministic</td>
<td>PrecisionRecallF1, RankedRetrievalMetrics, TokenCount</td>
</tr>
<tr>
<td>LLM-based</td>
<td>LLMBasedContextPrecision, LLMBasedContextCoverage</td>
</tr>
<tr>
<td rowspan="3">Text Generation</td>
<td>Deterministic</td>
<td>DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability</td>
</tr>
<tr>
<td>Semantic</td>
<td>DebertaAnswerScores, BertAnswerRelevance, BertAnswerSimilarity</td>
</tr>
<tr>
<td>LLM-based</td>
<td>LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency</td>
</tr>
<tr>
<td rowspan="1">Classification</td>
<td>Deterministic</td>
<td>ClassificationAccuracy</td>
</tr>
<tr>
<td rowspan="2">Code Generation</td>
<td>Deterministic</td>
<td>CodeStringMatch, PythonASTSimilarity, SQLSyntaxMatch, SQLASTSimilarity</td>
</tr>
<tr>
<td>LLM-based</td>
<td>LLMBasedCodeGeneration</td>
</tr>
<tr>
<td>Agent Tools</td>
<td>Deterministic</td>
<td>ToolSelectionAccuracy</td>
</tr>
<tr>
<td>Custom</td>
<td></td>
<td>Define your own metrics</td>
</tr>
</table>
To define your own metrics, you only need to extend the [Metric](continuous_eval/metrics/base.py#L23C7-L23C13) class and implement the `__call__` method.
Optional methods are `batch` (for batch-processing optimizations) and `aggregate` (to aggregate metric results over multiple samples).
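For example, a minimal custom metric might look like the sketch below; the class name, argument names, and returned keys are illustrative, so check the `Metric` base class linked above for the exact interface.
```python
from continuous_eval.metrics.base import Metric


class ExactMatch(Metric):
    """Illustrative custom metric: exact string match against any ground-truth answer."""

    def __call__(self, answer: str, ground_truths: list, **kwargs):
        hit = any(answer.strip() == gt.strip() for gt in ground_truths)
        # Return a dictionary of scores, as the built-in metrics do
        return {"exact_match": float(hit)}
```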
## Run evaluation on a pipeline
Define modules in your pipeline and select corresponding metrics.
```python
from continuous_eval.eval import Module, ModuleOutput, Pipeline, Dataset, EvaluationRunner
from continuous_eval.eval.logger import PipelineLogger
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics
from continuous_eval.metrics.generation.text import DeterministicAnswerCorrectness, FleschKincaidReadability
from typing import List, Dict
dataset = Dataset("dataset_folder")
# Simple 3-step RAG pipeline with Retriever->Reranker->Generation
retriever = Module(
name="Retriever",
input=dataset.question,
output=List[str],
eval=[
PrecisionRecallF1().use(
retrieved_context=ModuleOutput(),
ground_truth_context=dataset.ground_truth_context,
),
],
)
reranker = Module(
name="reranker",
input=retriever,
output=List[Dict[str, str]],
eval=[
RankedRetrievalMetrics().use(
retrieved_context=ModuleOutput(),
ground_truth_context=dataset.ground_truth_context,
),
],
)
llm = Module(
name="answer_generator",
input=reranker,
output=str,
eval=[
FleschKincaidReadability().use(answer=ModuleOutput()),
DeterministicAnswerCorrectness().use(
answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths
),
],
)
pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
print(pipeline.graph_repr()) # optional: visualize the pipeline
```
Now you can run the evaluation on your pipeline:
```python
pipelog = PipelineLogger(pipeline=pipeline)
# Now run your LLM application pipeline and, for each module, log its output.
# (`sample_uid`, "module_name", and `data` below are placeholders for your own values.)
pipelog.log(uid=sample_uid, module="module_name", value=data)
# Once you finish logging the data, you can use the EvaluationRunner to evaluate the logs
evalrunner = EvaluationRunner(pipeline)
metrics = evalrunner.evaluate(pipelog)
metrics.results() # returns a dictionary with the results
```
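As an illustration, a logging loop could look like the sketch below. Here `my_retriever`, `my_reranker`, and `my_llm` stand in for your own application code, and iterating `dataset.data` for samples with a `uid` field is an assumption about the dataset layout.
```python
for sample in dataset.data:  # assumption: samples are dicts carrying a "uid"
    uid = sample["uid"]
    retrieved = my_retriever(sample["question"])
    pipelog.log(uid=uid, module="Retriever", value=retrieved)
    reranked = my_reranker(retrieved)
    pipelog.log(uid=uid, module="reranker", value=reranked)
    answer = my_llm(sample["question"], reranked)
    pipelog.log(uid=uid, module="answer_generator", value=answer)
```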
To run evaluation over an existing dataset (BYODataset), you can run the following:
```python
dataset = Dataset(...)
evalrunner = EvaluationRunner(pipeline)
metrics = evalrunner.evaluate(dataset)
```
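As before, `metrics.results()` returns the evaluation results as a dictionary.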
## Synthetic Data Generation
Ground truth data, or reference data, is important for evaluation as it can offer a comprehensive and consistent measurement of system performance. However, it is often costly and time-consuming to manually curate such a golden dataset.
We have created a synthetic data pipeline that generates custom user-interaction data for a variety of use cases such as RAG, agents, and copilots. These datasets can serve as a starting point for a golden evaluation dataset or for other training purposes.
To generate custom synthetic data, create a free account at [Relari](https://www.relari.ai/) and generate custom synthetic golden datasets through the Relari Cloud.
## 💡 Contributing
Interested in contributing? See our [Contribution Guide](CONTRIBUTING.md) for more details.
## Resources
- **Docs:** [link](https://continuous-eval.docs.relari.ai/)
- **Examples Repo**: [end-to-end example repo](https://github.com/relari-ai/examples)
- **Blog Posts:**
- Practical Guide to RAG Pipeline Evaluation: [Part 1: Retrieval](https://medium.com/relari/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893), [Part 2: Generation](https://medium.com/relari/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)
- How important is a Golden Dataset for LLM evaluation?
[(link)](https://medium.com/relari/how-important-is-a-golden-dataset-for-llm-pipeline-evaluation-4ef6deb14dc5)
- How to evaluate complex GenAI Apps: a granular approach [(link)](https://medium.com/relari/how-to-evaluate-complex-genai-apps-a-granular-approach-0ab929d5b3e2)
- How to Make the Most Out of LLM Production Data: Simulated User Feedback [(link)](https://medium.com/towards-data-science/how-to-make-the-most-out-of-llm-production-data-simulated-user-feedback-843c444febc7)
- Generate Synthetic Data to Test LLM Applications [(link)](https://medium.com/relari/generate-synthetic-data-to-test-llm-applications-4bffeb51b80e)
- **Discord:** Join our community of LLM developers on [Discord](https://discord.gg/GJnM8SRsHr)
- **Reach out to founders:** [Email](mailto:founders@relari.ai) or [Schedule a chat](https://cal.com/relari/demo)
## License
This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.
## Open Analytics
We monitor basic anonymous usage statistics to understand our users' preferences, inform new features, and identify areas that might need improvement.
You can see exactly what we track in the [telemetry code](continuous_eval/utils/telemetry.py).
To disable usage tracking, set the `CONTINUOUS_EVAL_DO_NOT_TRACK` flag to `true`.
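For instance, one way to opt out for a single run, assuming the flag is read from the process environment (exporting it in your shell works equally well):
```python
import os

# Opt out of anonymous usage tracking for this process
# (assumption: set before importing/using continuous_eval)
os.environ["CONTINUOUS_EVAL_DO_NOT_TRACK"] = "true"
```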
Raw data
{
"_id": null,
"home_page": null,
"name": "continuous-eval",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.12,>=3.9",
"maintainer_email": null,
"keywords": null,
"author": "Yi Zhang",
"author_email": "yi@relari.ai",
"download_url": "https://files.pythonhosted.org/packages/eb/8e/a6096693a6a19c797c309e31ab2824d3bb205a7b9eb94eff240410d260b0/continuous_eval-0.3.13.tar.gz",
"platform": null,
"description": "<h3 align=\"center\">\n <img\n src=\"docs/public/continuous-eval-logo.png\"\n width=\"350\"\n >\n</h3>\n\n<div align=\"center\">\n\n \n <a href=\"https://docs.relari.ai/\" target=\"_blank\"><img src=\"https://img.shields.io/badge/docs-view-blue\" alt=\"Documentation\"></a>\n <a href=\"https://pypi.python.org/pypi/continuous-eval\">![https://pypi.python.org/pypi/continuous-eval/](https://img.shields.io/pypi/pyversions/continuous-eval.svg)</a>\n <a href=\"https://github.com/relari-ai/continuous-eval/releases\">![https://GitHub.com/relari-ai/continuous-eval/releases](https://img.shields.io/github/release/relari-ai/continuous-eval)</a>\n <a href=\"https://pypi.python.org/pypi/continuous-eval/\">![https://github.com/Naereen/badges/](https://badgen.net/badge/Open%20Source%20%3F/Yes%21/blue?icon=github)</a>\n <a a href=\"https://github.com/relari-ai/continuous-eval/blob/main/LICENSE\">![https://pypi.python.org/pypi/continuous-eval/](https://img.shields.io/pypi/l/continuous-eval.svg)</a>\n\n\n</div>\n\n<h2 align=\"center\">\n <p>Data-Driven Evaluation for LLM-Powered Applications</p>\n</h2>\n\n\n\n## Overview\n\n`continuous-eval` is an open-source package created for data-driven evaluation of LLM-powered application.\n\n<h1 align=\"center\">\n <img\n src=\"docs/public/module-level-eval.png\"\n >\n</h1>\n\n## How is continuous-eval different?\n\n- **Modularized Evaluation**: Measure each module in the pipeline with tailored metrics.\n\n- **Comprehensive Metric Library**: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.\n\n- **Leverage User Feedback in Evaluation**: Easily build a close-to-human ensemble evaluation pipeline with mathematical guarantees.\n\n- **Synthetic Dataset Generation**: Generate large-scale synthetic dataset to test your pipeline.\n\n## Getting Started\n\nThis code is provided as a PyPi package. To install it, run the following command:\n\n```bash\npython3 -m pip install continuous-eval\n```\n\nif you want to install from source:\n\n```bash\ngit clone https://github.com/relari-ai/continuous-eval.git && cd continuous-eval\npoetry install --all-extras\n```\n\nTo run LLM-based metrics, the code requires at least one of the LLM API keys in `.env`. 
Take a look at the example env file `.env.example`.\n\n## Run a single metric\n\nHere's how you run a single metric on a datum.\nCheck all available metrics here: [link](https://continuous-eval.docs.relari.ai/)\n\n```python\nfrom continuous_eval.metrics.retrieval import PrecisionRecallF1\n\ndatum = {\n \"question\": \"What is the capital of France?\",\n \"retrieved_context\": [\n \"Paris is the capital of France and its largest city.\",\n \"Lyon is a major city in France.\",\n ],\n \"ground_truth_context\": [\"Paris is the capital of France.\"],\n \"answer\": \"Paris\",\n \"ground_truths\": [\"Paris\"],\n}\n\nmetric = PrecisionRecallF1()\n\nprint(metric(**datum))\n```\n\n### Available Metrics\n\n<table border=\"0\">\n <tr>\n <th>Module</th>\n <th>Category</th>\n <th>Metrics</th>\n </tr>\n <tr>\n <td rowspan=\"2\">Retrieval</td>\n <td>Deterministic</td>\n <td>PrecisionRecallF1, RankedRetrievalMetrics, TokenCount</td>\n </tr>\n <tr>\n <td>LLM-based</td>\n <td>LLMBasedContextPrecision, LLMBasedContextCoverage</td>\n </tr>\n <tr>\n <td rowspan=\"3\">Text Generation</td>\n <td>Deterministic</td>\n <td>DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability</td>\n </tr>\n <tr>\n <td>Semantic</td>\n <td>DebertaAnswerScores, BertAnswerRelevance, BertAnswerSimilarity</td>\n </tr>\n <tr>\n <td>LLM-based</td>\n <td>LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency</td>\n </tr>\n <tr>\n <td rowspan=\"1\">Classification</td>\n <td>Deterministic</td>\n <td>ClassificationAccuracy</td>\n </tr>\n <tr>\n <td rowspan=\"2\">Code Generation</td>\n <td>Deterministic</td>\n <td>CodeStringMatch, PythonASTSimilarity, SQLSyntaxMatch, SQLASTSimilarity</td>\n </tr>\n <tr>\n <td>LLM-based</td>\n <td>LLMBasedCodeGeneration</td>\n </tr>\n <tr>\n <td>Agent Tools</td>\n <td>Deterministic</td>\n <td>ToolSelectionAccuracy</td>\n </tr>\n <tr>\n <td>Custom</td>\n <td></td>\n <td>Define your own metrics</td>\n </tr>\n</table>\n\nTo define your own metrics, you only need to extend the [Metric](continuous_eval/metrics/base.py#L23C7-L23C13) class implementing the `__call__` method.\nOptional methods are `batch` (if it is possible to implement optimizations for batch processing) and `aggregate` (to aggregate metrics results over multiple samples_).\n\n## Run evaluation on a pipeline\n\nDefine modules in your pipeline and select corresponding metrics.\n\n```python\nfrom continuous_eval.eval import Module, ModuleOutput, Pipeline, Dataset, EvaluationRunner\nfrom continuous_eval.eval.logger import PipelineLogger\nfrom continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics\nfrom continuous_eval.metrics.generation.text import DeterministicAnswerCorrectness\nfrom typing import List, Dict\n\ndataset = Dataset(\"dataset_folder\")\n\n# Simple 3-step RAG pipeline with Retriever->Reranker->Generation\nretriever = Module(\n name=\"Retriever\",\n input=dataset.question,\n output=List[str],\n eval=[\n PrecisionRecallF1().use(\n retrieved_context=ModuleOutput(),\n ground_truth_context=dataset.ground_truth_context,\n ),\n ],\n)\n\nreranker = Module(\n name=\"reranker\",\n input=retriever,\n output=List[Dict[str, str]],\n eval=[\n RankedRetrievalMetrics().use(\n retrieved_context=ModuleOutput(),\n ground_truth_context=dataset.ground_truth_context,\n ),\n ],\n)\n\nllm = Module(\n name=\"answer_generator\",\n input=reranker,\n output=str,\n eval=[\n FleschKincaidReadability().use(answer=ModuleOutput()),\n DeterministicAnswerCorrectness().use(\n 
answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths\n ),\n ],\n)\n\npipeline = Pipeline([retriever, reranker, llm], dataset=dataset)\nprint(pipeline.graph_repr()) # optional: visualize the pipeline\n```\n\nNow you can run the evaluation on your pipeline\n\n```python\npipelog = PipelineLogger(pipeline=pipeline)\n\n# now run your LLM application pipeline, and for each module, log the results:\npipelog.log(uid=sample_uid, module=\"module_name\", value=data)\n\n# Once you finish logging the data, you can use the EvaluationRunner to evaluate the logs\nevalrunner = EvaluationRunner(pipeline)\nmetrics = evalrunner.evaluate(pipelog)\nmetrics.results() # returns a dictionary with the results\n```\n\nTo run evaluation over an existing dataset (BYODataset), you can run the following:\n\n```python\ndataset = Dataset(...)\nevalrunner = EvaluationRunner(pipeline)\nmetrics = evalrunner.evaluate(dataset)\n```\n\n## Synthetic Data Generation\n\nGround truth data, or reference data, is important for evaluation as it can offer a comprehensive and consistent measurement of system performance. However, it is often costly and time-consuming to manually curate such a golden dataset.\nWe have created a synthetic data pipeline that can custom generate user interaction data for a variety of use cases such as RAG, agents, copilots. They can serve a starting point for a golden dataset for evaluation or for other training purposes.\n\nTo generate custom synthetic data, please visit [Relari](https://www.relari.ai/) to create a free account and you can then generate custom synthetic golden datasets through the Relari Cloud.\n\n## \ud83d\udca1 Contributing\n\nInterested in contributing? See our [Contribution Guide](CONTRIBUTING.md) for more details.\n\n## Resources\n\n- **Docs:** [link](https://continuous-eval.docs.relari.ai/)\n- **Examples Repo**: [end-to-end example repo](https://github.com/relari-ai/examples)\n- **Blog Posts:**\n - Practical Guide to RAG Pipeline Evaluation: [Part 1: Retrieval](https://medium.com/relari/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893), [Part 2: Generation](https://medium.com/relari/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)\n - How important is a Golden Dataset for LLM evaluation?\n [(link)](https://medium.com/relari/how-important-is-a-golden-dataset-for-llm-pipeline-evaluation-4ef6deb14dc5)\n - How to evaluate complex GenAI Apps: a granular approach [(link)](https://medium.com/relari/how-to-evaluate-complex-genai-apps-a-granular-approach-0ab929d5b3e2)\n - How to Make the Most Out of LLM Production Data: Simulated User Feedback [(link)](https://medium.com/towards-data-science/how-to-make-the-most-out-of-llm-production-data-simulated-user-feedback-843c444febc7)\n - Generate Synthetic Data to Test LLM Applications [(link)](https://medium.com/relari/generate-synthetic-data-to-test-llm-applications-4bffeb51b80e)\n- **Discord:** Join our community of LLM developers [Discord](https://discord.gg/GJnM8SRsHr)\n- **Reach out to founders:** [Email](mailto:founders@relari.ai) or [Schedule a chat](https://cal.com/relari/demo)\n\n## License\n\nThis project is licensed under the Apache 2.0 - see the [LICENSE](LICENSE) file for details.\n\n## Open Analytics\n\nWe monitor basic anonymous usage statistics to understand our users' preferences, inform new features, and identify areas that might need improvement.\nYou can take a look at exactly what we track in the [telemetry code](continuous_eval/utils/telemetry.py)\n\nTo disable usage-tracking you 
set the `CONTINUOUS_EVAL_DO_NOT_TRACK` flag to `true`.\n",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Open-Source Evaluation for GenAI Application Pipelines.",
"version": "0.3.13",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "db4e91ab1deb757d7f69fe74e2772cf8149e8bc0297a1c92ef57ff284eb8ea76",
"md5": "5911d3de37a7f54892c8578c1952b97c",
"sha256": "b1607640d56677d988924cdd85199298255ef4a88bb7d9cea0ff0b0937d92a2d"
},
"downloads": -1,
"filename": "continuous_eval-0.3.13-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5911d3de37a7f54892c8578c1952b97c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.12,>=3.9",
"size": 58685,
"upload_time": "2024-08-04T23:00:42",
"upload_time_iso_8601": "2024-08-04T23:00:42.303474Z",
"url": "https://files.pythonhosted.org/packages/db/4e/91ab1deb757d7f69fe74e2772cf8149e8bc0297a1c92ef57ff284eb8ea76/continuous_eval-0.3.13-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "eb8ea6096693a6a19c797c309e31ab2824d3bb205a7b9eb94eff240410d260b0",
"md5": "6ef944567a526b04bfac035d1036b86c",
"sha256": "bab38b1a60a5223a14571d8a01e2368c6ea6d0a72155b29868d46e883bc0d085"
},
"downloads": -1,
"filename": "continuous_eval-0.3.13.tar.gz",
"has_sig": false,
"md5_digest": "6ef944567a526b04bfac035d1036b86c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.12,>=3.9",
"size": 47112,
"upload_time": "2024-08-04T23:00:44",
"upload_time_iso_8601": "2024-08-04T23:00:44.219561Z",
"url": "https://files.pythonhosted.org/packages/eb/8e/a6096693a6a19c797c309e31ab2824d3bb205a7b9eb94eff240410d260b0/continuous_eval-0.3.13.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-04 23:00:44",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "continuous-eval"
}