<div id="top"></div>
# `pytest-evals` 🚀
Test your LLM outputs against examples - no more manual checking! A (minimalistic) pytest plugin that helps you
evaluate whether your LLM is giving good answers.
[PyPI](https://pypi.org/p/pytest-evals)
[License](https://github.com/AlmogBaku/pytest-evals/blob/main/LICENSE)
[Issues](https://github.com/AlmogBaku/pytest-evals/issues)
[Stargazers](https://github.com/AlmogBaku/pytest-evals/stargazers)
# 🧐 Why pytest-evals?
Building LLM applications is exciting, but how do you know they're actually working well? `pytest-evals` helps you:
- 🎯 **Test & Evaluate:** Run your LLM prompt against many cases
- 📈 **Track & Measure:** Collect metrics and analyze the overall performance
- 🔄 **Integrate Easily:** Works with pytest, Jupyter notebooks, and CI/CD pipelines
- ✨ **Scale Up:** Run tests in parallel with [`pytest-xdist`](https://pytest-xdist.readthedocs.io/) and
asynchronously with [`pytest-asyncio`](https://pytest-asyncio.readthedocs.io/).
# 🚀 Getting Started
To get started, install `pytest-evals` and write your tests:
```bash
pip install pytest-evals
```
#### ⚡️ Quick Example
For example, say you're building a support ticket classifier. You want to test cases like:
| Input Text | Expected Classification |
|--------------------------------------------------------|-------------------------|
| My login isn't working and I need to access my account | account_access |
| Can I get a refund for my last order? | billing |
| How do I change my notification settings? | settings |
`pytest-evals` helps you automatically test how your LLM performs against these cases, track accuracy, and ensure it
keeps working as expected over time.
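The example below assumes a `TEST_DATA` list and a `classifier` fixture. Neither is provided by `pytest-evals`; a
minimal, hypothetical sketch mirroring the table above could look like this:

```python
import pytest

# Hypothetical evaluation set mirroring the table above
TEST_DATA = [
    {"Input Text": "My login isn't working and I need to access my account",
     "Expected Classification": "account_access"},
    {"Input Text": "Can I get a refund for my last order?",
     "Expected Classification": "billing"},
    {"Input Text": "How do I change my notification settings?",
     "Expected Classification": "settings"},
]


@pytest.fixture
def classifier():
    def classify(text: str) -> str:
        # Stand-in for your real LLM call (prompt + completion request)
        raise NotImplementedError("plug in your LLM-backed classifier here")
    return classify
```

With those in place, the evaluation and analysis tests look like this: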
```python
# Run the classifier on each case and record the prediction
@pytest.mark.eval(name="my_classifier")
@pytest.mark.parametrize("case", TEST_DATA)
def test_classifier(case: dict, eval_bag, classifier):
    # Run the prediction and store the results in the bag
    eval_bag.prediction = classifier(case["Input Text"])
    eval_bag.expected = case["Expected Classification"]
    eval_bag.accuracy = eval_bag.prediction == eval_bag.expected


# Now let's see how our app performs across all cases...
@pytest.mark.eval_analysis(name="my_classifier")
def test_analysis(eval_results):
    accuracy = sum(result.accuracy for result in eval_results) / len(eval_results)
    print(f"Accuracy: {accuracy:.2%}")
    assert accuracy >= 0.7  # Ensure our performance is not degrading 🫢
```
Then, run your evaluation tests:
```bash
# Run test cases
pytest --run-eval
# Analyze results
pytest --run-eval-analysis
```
## 😵💫 Why Another Eval Tool?
**Evaluations are just tests.** No need for complex frameworks or DSLs. `pytest-evals` is minimalistic by design:
- Use `pytest` - the tool you already know
- Keep tests and evaluations together
- Focus on logic, not infrastructure
It just collects your results and lets you analyze them as a whole. Nothing more, nothing less.
<p align="right">(<a href="#top">back to top</a>)</p>
# 📚 User Guide
Check out our detailed guides and examples:
- [Basic evaluation](example/example_test.py)
- [Basics of LLM-as-a-judge evaluation](example/example_judge_test.py)
- [Notebook example](example/example_notebook.ipynb)
- [Advanced notebook example](example/example_notebook_advanced.ipynb)
## 🤔 How It Works
Built on top of [pytest-harvest](https://smarie.github.io/python-pytest-harvest/), `pytest-evals` splits evaluation into
two phases:
1. **Evaluation Phase**: Run all test cases, collecting results and metrics in `eval_bag`. The results are saved in a
temporary file to allow the analysis phase to access them.
2. **Analysis Phase**: Process all results at once through `eval_results` to calculate the final metrics.
This split allows you to:
- Run evaluations in parallel (the analysis test must run only after all cases are done, so the two phases run
  separately)
- Make pass/fail decisions based on the overall evaluation results instead of individual test failures (by passing the
  `--supress-failed-exit-code --run-eval` flags)
- Collect comprehensive metrics
**Note**: When running evaluation tests, the rest of your test suite will not run. This is by design to keep the results
clean and focused.
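For example, an analysis test can load everything recorded in the bags into a `pandas` DataFrame to compute richer
metrics. This is only a sketch, and it assumes each bag stores `prediction`, `expected`, and `accuracy` as in the quick
example above:

```python
import pandas as pd
import pytest


@pytest.mark.eval_analysis(name="my_classifier")
def test_per_label_accuracy(eval_results):
    # Each item in eval_results exposes the attributes stored on eval_bag
    df = pd.DataFrame(
        [{"prediction": r.prediction, "expected": r.expected, "accuracy": r.accuracy}
         for r in eval_results]
    )
    print(df.groupby("expected")["accuracy"].mean())  # per-label accuracy
    assert df["accuracy"].mean() >= 0.7
```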
## 💾 Saving case results
By default, `pytest-evals` saves the results of each case in a JSON file so the analysis phase can access them.
However, JSON is not the friendliest format for deeper analysis. To save the results in a friendlier format (a CSV
file), use the `--save-evals-csv` flag:
```bash
pytest --run-eval --save-evals-csv
```
## 📝 Working with a notebook
It's also possible to run evaluations from a notebook. To do that, simply
install [ipytest](https://github.com/chmp/ipytest), and load the extension:
```python
%load_ext pytest_evals
```
Then, use the `%%ipytest_eval` magic command in your cell to run evaluations. This runs the evaluation phase and then
the analysis phase. By default, the magic runs both `--run-eval` and `--run-eval-analysis`, but you can specify your
own flags by passing arguments right after the magic command (e.g., `%%ipytest_eval --run-eval`).
```python
%%ipytest_eval
import pytest


# `agent` and `calculate_f1` are assumed to be defined earlier in the notebook
@pytest.mark.eval(name="my_eval")
@pytest.mark.parametrize("case", [{"input": "My login isn't working"}])  # illustrative single case
def test_agent(case, eval_bag):
    eval_bag.prediction = agent.run(case["input"])


@pytest.mark.eval_analysis(name="my_eval")
def test_analysis(eval_results):
    print(f"F1 Score: {calculate_f1(eval_results):.2%}")
```
You can see an example of this in the [`example/example_notebook.ipynb`](example/example_notebook.ipynb) notebook. Or
look at the [advanced example](example/example_notebook_advanced.ipynb) for a more complex example that tracks multiple
experiments.
<p align="right">(<a href="#top">back to top</a>)</p>
## 🏗️ Production Use
### 📚 Managing Test Data (Evaluation Set)
It's recommended to store your test data in a CSV file. This makes it easier to manage large evaluation sets and to
share them with non-technical stakeholders.
To do this, you can use `pandas` to read the CSV file and pass the test cases as parameters to your tests using
`@pytest.mark.parametrize` 🙃:
```python
import pandas as pd
import pytest

test_data = pd.read_csv("tests/testdata.csv")


@pytest.mark.eval(name="my_eval")
@pytest.mark.parametrize("case", test_data.to_dict(orient="records"))
def test_agent(case, eval_bag, agent):
    eval_bag.prediction = agent.run(case["input"])
```
If you need to run only a subset of the test data (e.g., a golden set), you can define an environment variable to
indicate that and filter the data with `pandas`.
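For example, a minimal sketch (the `GOLDEN_SET` environment variable and the boolean `golden` column are hypothetical
conventions, not part of `pytest-evals`):

```python
import os

import pandas as pd

test_data = pd.read_csv("tests/testdata.csv")

# Keep only the rows flagged as part of the golden set when GOLDEN_SET=1
if os.environ.get("GOLDEN_SET") == "1":
    test_data = test_data[test_data["golden"]]
```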
### 🔀 CI Integration
Run tests and analysis as separate steps:
```yaml
evaluate:
steps:
- run: pytest --run-eval -n auto --supress-failed-exit-code # Run cases in parallel
- run: pytest --run-eval-analysis # Analyze results
```
Use `--supress-failed-exit-code` with `--run-eval` - let the analysis phase determine success/failure. **If all your
cases pass, your evaluation set is probably too small!**
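As a concrete illustration, a GitHub Actions job for these two steps might look roughly like this (the install command
and Python version are placeholders for your own setup):

```yaml
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pytest-evals pytest-xdist
      - run: pytest --run-eval -n auto --supress-failed-exit-code  # run cases in parallel
      - run: pytest --run-eval-analysis                            # pass/fail is decided here
```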
### ⚡️ Parallel Testing
As your evaluation set grows, you may want to run your test cases in parallel. To do this, install
[`pytest-xdist`](https://pytest-xdist.readthedocs.io/); `pytest-evals` supports it out of the box 🚀.
```bash
pytest --run-eval -n auto
```
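Similarly, since `pytest-evals` works alongside [`pytest-asyncio`](https://pytest-asyncio.readthedocs.io/), cases that
call your LLM asynchronously can be written as async tests. A minimal sketch, assuming a hypothetical
`async_classifier` fixture and the `TEST_DATA` list from the quick example:

```python
import pytest


@pytest.mark.asyncio
@pytest.mark.eval(name="my_classifier")
@pytest.mark.parametrize("case", TEST_DATA)
async def test_classifier_async(case: dict, eval_bag, async_classifier):
    # Await the asynchronous LLM call and record the results as usual
    eval_bag.prediction = await async_classifier(case["Input Text"])
    eval_bag.expected = case["Expected Classification"]
    eval_bag.accuracy = eval_bag.prediction == eval_bag.expected
```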
<p align="right">(<a href="#top">back to top</a>)</p>
# 👷 Contributing
Contributions make the open-source community a fantastic place to learn, inspire, and create. Any contributions you make
are **greatly appreciated** - not only code, but also documentation, blog posts, and feedback 😍.
Please fork the repo and create a pull request if you have a suggestion. You can also simply open an issue to give us
some feedback.
**Don't forget to give the project [a star](#top)! ⭐️**
For more information about contributing code to the project, read the [CONTRIBUTING.md](CONTRIBUTING.md) guide.
# 📃 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
<p align="right">(<a href="#top">back to top</a>)</p>