[![PyPI version](https://img.shields.io/pypi/v/pytest-evals.svg)](https://pypi.org/p/pytest-evals)
[![License](https://img.shields.io/github/license/AlmogBaku/pytest-evals.svg)](https://github.com/AlmogBaku/pytest-evals/blob/main/LICENSE)
[![Issues](https://img.shields.io/github/issues/AlmogBaku/pytest-evals.svg)](https://github.com/AlmogBaku/pytest-evals/issues)
[![Stars](https://img.shields.io/github/stars/AlmogBaku/pytest-evals.svg)](https://github.com/AlmogBaku/pytest-evals/stargazers)
# pytest-evals
A minimal pytest plugin to evaluate LLM applications easily. Run tests at scale, collect metrics, analyze results and
seamlessly integrate with your CI/CD pipeline.
- ✨ Run evaluations at scale
- 🔄 Two-phase execution: cases first, analysis second
- 📊 Built-in result collection and metrics calculation
- 🚀 Parallel execution support (with [`pytest-xdist`](https://pytest-xdist.readthedocs.io/))
- 🔀 Supports asynchronous tests with [`pytest-asyncio`](https://pytest-asyncio.readthedocs.io/en/latest/) (see the async sketch after the example below)
- 📒 Works like a charm with notebooks using [`ipytest`](https://github.com/chmp/ipytest)
```python
@pytest.mark.eval(name="my_eval")
@pytest.mark.parametrize("case", TEST_DATA)
def test_agent(case, eval_bag):
    # save whatever you need in the bag. You'll have access to it later in the analysis phase
    eval_bag.prediction = agent.predict(case["input"])


@pytest.mark.eval_analysis(name="my_eval")
def test_analysis(eval_results):
    print(f"F1 Score: {calculate_f1(eval_results):.2%}")
```
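Async cases follow the same pattern. Here's a minimal sketch, assuming `pytest-asyncio` is installed and that `agent.apredict` stands in for whatever async entry point your application exposes (both names are illustrative, not part of `pytest-evals`):

```python
import pytest


@pytest.mark.asyncio
@pytest.mark.eval(name="my_async_eval")
@pytest.mark.parametrize("case", TEST_DATA)
async def test_agent_async(case, eval_bag):
    # `agent.apredict` is a placeholder for your application's async call
    eval_bag.prediction = await agent.apredict(case["input"])
```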
Evaluations are easy - just write tests! No need to reinvent the wheel with complex DSLs or frameworks.
## Why Another Eval Tool?
**Evaluations are just tests.** No need for complex frameworks or DSLs. `pytest-evals` is minimal by design:
- Use `pytest` - the tool you already know
- Keep tests and evaluations together
- Focus on logic, not infrastructure
It just collects your results and lets you analyze them as a whole. Nothing more, nothing less.
## Install
```bash
pip install pytest-evals
```
## Quick Start
Here's a complete example evaluating a simple classifier:
```python
import pytest

TEST_DATA = [
    {"text": "I need to debug this Python code", "label": True},
    {"text": "The cat jumped over the lazy dog", "label": False},
    {"text": "My monitor keeps flickering", "label": True},
]


@pytest.fixture
def classifier():
    def classify(text: str) -> bool:
        # In real life, we would use a more sophisticated model like an LLM for this :P
        computer_keywords = {'debug', 'python', 'code', 'monitor'}
        return any(keyword in text.lower() for keyword in computer_keywords)

    return classify


@pytest.mark.eval(name="computer_classifier")
@pytest.mark.parametrize("case", TEST_DATA)
def test_classifier(case: dict, eval_bag, classifier):
    eval_bag.input_text = case["text"]
    eval_bag.label = case["label"]
    eval_bag.prediction = classifier(case["text"])
    assert eval_bag.prediction == eval_bag.label


@pytest.mark.eval_analysis(name="computer_classifier")
def test_analysis(eval_results):
    total = len(eval_results)
    correct = sum(1 for r in eval_results if r.result.prediction == r.result.label)
    accuracy = correct / total

    print(f"Accuracy: {accuracy:.2%}")
    assert accuracy >= 0.7
```
Run it:
```bash
# Run test cases
pytest --run-eval
# Analyze results
pytest --run-eval-analysis
```
## How It Works
Built on top of [pytest-harvest](https://smarie.github.io/python-pytest-harvest/), pytest-evals splits evaluation into
two phases:
1. **Evaluation Phase**: Run all test cases, collecting results and metrics in `eval_bag`. The results are saved in a
temporary file to allow the analysis phase to access them.
2. **Analysis Phase**: Process all results at once through `eval_results` to calculate final metrics (see the sketch below).
This split allows you to:
- Run evaluations in parallel (since the analysis test MUST run after all cases are done, we must run them separately)
- Make pass/fail decisions on the overall evaluation results instead of individual test failures (by passing the
`--supress-failed-exit-code --run-eval` flags)
- Collect comprehensive metrics
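For example, the Quick Start's analysis test could report precision and recall instead of accuracy. A minimal sketch, reusing the `.result` access pattern shown above and assuming `prediction` and `label` were stored in `eval_bag` during the evaluation phase:

```python
import pytest


@pytest.mark.eval_analysis(name="computer_classifier")
def test_analysis(eval_results):
    # Counts based on the fields stored in eval_bag during the evaluation phase
    tp = sum(1 for r in eval_results if r.result.prediction and r.result.label)
    fp = sum(1 for r in eval_results if r.result.prediction and not r.result.label)
    fn = sum(1 for r in eval_results if not r.result.prediction and r.result.label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    print(f"Precision: {precision:.2%}, Recall: {recall:.2%}")
    assert precision >= 0.7 and recall >= 0.7
```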
**Note**: When running evaluation tests, the rest of your test suite will not run. This is by design to keep the results
clean and focused.
### Working with a notebook
It's also possible to run evaluations from a notebook. To do that, simply install [ipytest](https://github.com/chmp/ipytest) and load the extension:
```ipython
%load_ext pytest_evals
```
Then, use the `%%ipytest_eval` cell magic to run evaluations. This runs the evaluation phase and then the analysis phase.
```ipython
%%ipytest_eval
import pytest


@pytest.mark.eval(name="my_eval")
@pytest.mark.parametrize("case", TEST_DATA)
def test_agent(case, eval_bag):
    eval_bag.prediction = agent.run(case["input"])


@pytest.mark.eval_analysis(name="my_eval")
def test_analysis(eval_results):
    print(f"F1 Score: {calculate_f1(eval_results):.2%}")
```
You can see an example of this in the [`example/example_notebook.ipynb`](example/example_notebook.ipynb) notebook, or look at the [advanced example](example/example_notebook_advanced.ipynb) for a more complex setup that tracks multiple experiments.
## Production Use
### Managing Test Data (Evaluation Set)
It's recommended to use a CSV file to store test data. This makes it easier to manage large datasets and allows you to
communicate with non-technical stakeholders.
To do this, you can use `pandas` to read the CSV file and pass the test cases as parameters to your tests using
`@pytest.mark.parametrize` 🙃:
```python
import pandas as pd
import pytest

test_data = pd.read_csv("tests/testdata.csv")


@pytest.mark.eval(name="my_eval")
@pytest.mark.parametrize("case", test_data.to_dict(orient="records"))
def test_agent(case, eval_bag, agent):
    eval_bag.prediction = agent.run(case["input"])
```
If you need to select a subset of the test data (e.g., a golden set), you can define an environment variable to indicate that and filter the data with `pandas`, as sketched below.
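A minimal sketch of such filtering; the `GOLDEN_SET` environment variable and the `is_golden` column are illustrative names, not something `pytest-evals` defines:

```python
import os

import pandas as pd
import pytest

test_data = pd.read_csv("tests/testdata.csv")

# GOLDEN_SET=1 restricts the run to rows flagged as part of the golden set
if os.environ.get("GOLDEN_SET") == "1":
    test_data = test_data[test_data["is_golden"]]


@pytest.mark.eval(name="my_eval")
@pytest.mark.parametrize("case", test_data.to_dict(orient="records"))
def test_agent(case, eval_bag, agent):
    eval_bag.prediction = agent.run(case["input"])
```

You would then run the golden set only with something like `GOLDEN_SET=1 pytest --run-eval`.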
### CI Integration
Run tests and analysis as separate steps:
```yaml
evaluate:
steps:
- run: pytest --run-eval -n auto --supress-failed-exit-code # Run cases in parallel
- run: pytest --run-eval-analysis # Analyze results
```
Use `--supress-failed-exit-code` with `--run-eval` - let the analysis phase determine success/failure. **If all your
cases pass, your evaluation set is probably too small!**
### Running in Parallel
As your evaluation set grows, you may want to run your test cases in parallel. To do this, install
[`pytest-xdist`](https://pytest-xdist.readthedocs.io/). `pytest-evals` supports it out of the box 🚀.
```bash
pytest --run-eval -n auto
```