| Field | Value |
| --- | --- |
| Name | pydantic-evals |
| Version | 0.8.1 |
| home_page | None |
| Summary | Framework for evaluating stochastic code execution, especially code making use of LLMs |
| upload_time | 2025-08-29 14:46:28 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.9 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# Pydantic Evals
[CI](https://github.com/pydantic/pydantic-ai/actions/workflows/ci.yml?query=branch%3Amain)
[Coverage](https://coverage-badge.samuelcolvin.workers.dev/redirect/pydantic/pydantic-ai)
[PyPI](https://pypi.python.org/pypi/pydantic-evals)
[GitHub](https://github.com/pydantic/pydantic-ai)
[License](https://github.com/pydantic/pydantic-ai/blob/main/LICENSE)
This is a library for evaluating non-deterministic (or "stochastic") functions in Python. It provides a simple,
Pythonic interface for defining and running stochastic functions, and analyzing the results of running those functions.
While this library is developed as part of [Pydantic AI](https://ai.pydantic.dev), it only uses Pydantic AI for a small
subset of generative functionality internally, and it is designed to be used with arbitrary "stochastic function"
implementations. In particular, it can be used with other (non-Pydantic AI) AI libraries, agent frameworks, etc.
As with Pydantic AI, this library prioritizes type safety and common Python syntax over esoteric, domain-specific constructs.
Full documentation is available at [ai.pydantic.dev/evals](https://ai.pydantic.dev/evals).
## Example
While you'd typically use Pydantic Evals with more complex functions (such as Pydantic AI agents or graphs), here's a
quick example that evaluates a simple function against a test case using both custom and built-in evaluators:
```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, IsInstance

# Define a test case with inputs and expected output
case = Case(
    name='capital_question',
    inputs='What is the capital of France?',
    expected_output='Paris',
)

# Define a custom evaluator
class MatchAnswer(Evaluator[str, str]):
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        if ctx.output == ctx.expected_output:
            return 1.0
        elif isinstance(ctx.output, str) and ctx.expected_output.lower() in ctx.output.lower():
            return 0.8
        return 0.0

# Create a dataset with the test case and evaluators
dataset = Dataset(
    cases=[case],
    evaluators=[IsInstance(type_name='str'), MatchAnswer()],
)

# Define the function to evaluate
async def answer_question(question: str) -> str:
    return 'Paris'

# Run the evaluation
report = dataset.evaluate_sync(answer_question)
report.print(include_input=True, include_output=True)
"""
Evaluation Summary: answer_question
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Case ID ┃ Inputs ┃ Outputs ┃ Scores ┃ Assertions ┃ Duration ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩
│ capital_question │ What is the capital of France? │ Paris │ MatchAnswer: 1.00 │ ✔ │ 10ms │
├──────────────────┼────────────────────────────────┼─────────┼───────────────────┼────────────┼──────────┤
│ Averages │ │ │ MatchAnswer: 1.00 │ 100.0% ✔ │ 10ms │
└──────────────────┴────────────────────────────────┴─────────┴───────────────────┴────────────┴──────────┘
"""
```
Using the library with more complex functions, such as Pydantic AI agents, is similar — all you need to do is define a
task function wrapping the function you want to evaluate, with a signature that matches the inputs and outputs of your
test cases.
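For instance, here is a minimal sketch of wrapping a Pydantic AI agent in such a task function. The model name, the example case, and the function name are illustrative rather than part of this package, and `agent.run(...).output` reflects the Pydantic AI API at the time of writing:

```python
from pydantic_ai import Agent

from pydantic_evals import Case, Dataset

# Illustrative agent; any model supported by Pydantic AI could be used here.
agent = Agent('openai:gpt-4o')

# The task function's signature matches the dataset's input/output types (str -> str).
async def answer_with_agent(question: str) -> str:
    result = await agent.run(question)
    return result.output  # the agent's final text output

dataset = Dataset(
    cases=[Case(inputs='What is the capital of France?', expected_output='Paris')],
)
report = dataset.evaluate_sync(answer_with_agent)
report.print(include_input=True, include_output=True)
```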
## Logfire Integration
Pydantic Evals uses OpenTelemetry to record traces for each case in your evaluations.
You can send these traces to any OpenTelemetry-compatible backend. For the best experience, we recommend [Pydantic Logfire](https://logfire.pydantic.dev/docs), which includes custom views for evals:
<div style="display: flex; gap: 1rem; flex-wrap: wrap;">
<img src="https://ai.pydantic.dev/img/logfire-evals-overview.png" alt="Logfire Evals Overview" width="48%">
<img src="https://ai.pydantic.dev/img/logfire-evals-case.png" alt="Logfire Evals Case View" width="48%">
</div>
You'll see full details about the inputs, outputs, token usage, execution durations, etc. And you'll have access to the full trace for each case — ideal for debugging, writing path-aware evaluators, or running similar evaluations against production traces.
Basic setup:
```python {test="skip" lint="skip" format="skip"}
import logfire

logfire.configure(
    send_to_logfire='if-token-present',
    environment='development',
    service_name='evals',
)

...

my_dataset.evaluate_sync(my_task)
```
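If you are not sending traces to Logfire, the same evaluations can export to any other OpenTelemetry-compatible backend. Here is a minimal sketch using the standard OpenTelemetry SDK with an OTLP/HTTP exporter; the collector endpoint is an assumption, and the `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-http` packages are assumed to be installed:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans to a local OTLP collector; replace the endpoint with your backend's.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint='http://localhost:4318/v1/traces'))
)
trace.set_tracer_provider(provider)

# Evaluation runs after this point will record their per-case spans to that backend.
my_dataset.evaluate_sync(my_task)
```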
[Read more about the Logfire integration here.](https://ai.pydantic.dev/evals/#logfire-integration)
Raw data
{
"_id": null,
"home_page": null,
"name": "pydantic-evals",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": "Samuel Colvin <samuel@pydantic.dev>, Marcelo Trylesinski <marcelotryle@gmail.com>, David Montague <david@pydantic.dev>, Alex Hall <alex@pydantic.dev>, Douwe Maan <douwe@pydantic.dev>",
"download_url": "https://files.pythonhosted.org/packages/6c/9d/460a1f2c9f5f263e9d8e9661acbd654ccc81ad3373ea43048d914091a817/pydantic_evals-0.8.1.tar.gz",
"platform": null,
"description": "# Pydantic Evals\n\n[](https://github.com/pydantic/pydantic-ai/actions/workflows/ci.yml?query=branch%3Amain)\n[](https://coverage-badge.samuelcolvin.workers.dev/redirect/pydantic/pydantic-ai)\n[](https://pypi.python.org/pypi/pydantic-evals)\n[](https://github.com/pydantic/pydantic-ai)\n[](https://github.com/pydantic/pydantic-ai/blob/main/LICENSE)\n\nThis is a library for evaluating non-deterministic (or \"stochastic\") functions in Python. It provides a simple,\nPythonic interface for defining and running stochastic functions, and analyzing the results of running those functions.\n\nWhile this library is developed as part of [Pydantic AI](https://ai.pydantic.dev), it only uses Pydantic AI for a small\nsubset of generative functionality internally, and it is designed to be used with arbitrary \"stochastic function\"\nimplementations. In particular, it can be used with other (non-Pydantic AI) AI libraries, agent frameworks, etc.\n\nAs with Pydantic AI, this library prioritizes type safety and use of common Python syntax over esoteric, domain-specific\nuse of Python syntax.\n\nFull documentation is available at [ai.pydantic.dev/evals](https://ai.pydantic.dev/evals).\n\n## Example\n\nWhile you'd typically use Pydantic Evals with more complex functions (such as Pydantic AI agents or graphs), here's a\nquick example that evaluates a simple function against a test case using both custom and built-in evaluators:\n\n```python\nfrom pydantic_evals import Case, Dataset\nfrom pydantic_evals.evaluators import Evaluator, EvaluatorContext, IsInstance\n\n# Define a test case with inputs and expected output\ncase = Case(\n name='capital_question',\n inputs='What is the capital of France?',\n expected_output='Paris',\n)\n\n# Define a custom evaluator\nclass MatchAnswer(Evaluator[str, str]):\n def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:\n if ctx.output == ctx.expected_output:\n return 1.0\n elif isinstance(ctx.output, str) and ctx.expected_output.lower() in ctx.output.lower():\n return 0.8\n return 0.0\n\n# Create a dataset with the test case and evaluators\ndataset = Dataset(\n cases=[case],\n evaluators=[IsInstance(type_name='str'), MatchAnswer()],\n)\n\n# Define the function to evaluate\nasync def answer_question(question: str) -> str:\n return 'Paris'\n\n# Run the evaluation\nreport = dataset.evaluate_sync(answer_question)\nreport.print(include_input=True, include_output=True)\n\"\"\"\n Evaluation Summary: answer_question\n\u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513\n\u2503 Case ID \u2503 Inputs \u2503 Outputs \u2503 Scores \u2503 Assertions \u2503 Duration 
\u2503\n\u2521\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2529\n\u2502 capital_question \u2502 What is the capital of France? \u2502 Paris \u2502 MatchAnswer: 1.00 \u2502 \u2714 \u2502 10ms \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502 Averages \u2502 \u2502 \u2502 MatchAnswer: 1.00 \u2502 100.0% \u2714 \u2502 10ms \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\"\"\"\n```\n\nUsing the library with more complex functions, such as Pydantic AI agents, is similar \u2014 all you need to do is define a\ntask function wrapping the function you want to evaluate, with a signature that matches the inputs and outputs of your\ntest cases.\n\n## Logfire Integration\n\nPydantic Evals uses OpenTelemetry to record traces for each case in your evaluations.\n\nYou can send these traces to any OpenTelemetry-compatible backend. For the best experience, we recommend [Pydantic Logfire](https://logfire.pydantic.dev/docs), which includes custom views for evals:\n\n<div style=\"display: flex; gap: 1rem; flex-wrap: wrap;\">\n <img src=\"https://ai.pydantic.dev/img/logfire-evals-overview.png\" alt=\"Logfire Evals Overview\" width=\"48%\">\n <img src=\"https://ai.pydantic.dev/img/logfire-evals-case.png\" alt=\"Logfire Evals Case View\" width=\"48%\">\n</div>\n\nYou'll see full details about the inputs, outputs, token usage, execution durations, etc. 
And you'll have access to the full trace for each case \u2014 ideal for debugging, writing path-aware evaluators, or running the similar evaluations against production traces.\n\nBasic setup:\n\n```python {test=\"skip\" lint=\"skip\" format=\"skip\"}\nimport logfire\n\nlogfire.configure(\n send_to_logfire='if-token-present',\n environment='development',\n service_name='evals',\n)\n\n...\n\nmy_dataset.evaluate_sync(my_task)\n```\n\n[Read more about the Logfire integration here.](https://ai.pydantic.dev/evals/#logfire-integration)\n",
"bugtrack_url": null,
"license": null,
"summary": "Framework for evaluating stochastic code execution, especially code making use of LLMs",
"version": "0.8.1",
"project_urls": {
"Changelog": "https://github.com/pydantic/pydantic-ai/releases",
"Documentation": "https://ai.pydantic.dev/evals",
"Homepage": "https://ai.pydantic.dev/evals",
"Source": "https://github.com/pydantic/pydantic-ai"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "6ff91d21c4687167c4fa76fd3b1ed47f9bc2d38fd94cbacd9aa3f19e82e59830",
"md5": "08fdcdff1c49d7a7de7cf6c498d12be1",
"sha256": "6c76333b1d79632f619eb58a24ac656e9f402c47c75ad750ba0230d7f5514344"
},
"downloads": -1,
"filename": "pydantic_evals-0.8.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "08fdcdff1c49d7a7de7cf6c498d12be1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 52602,
"upload_time": "2025-08-29T14:46:16",
"upload_time_iso_8601": "2025-08-29T14:46:16.602853Z",
"url": "https://files.pythonhosted.org/packages/6f/f9/1d21c4687167c4fa76fd3b1ed47f9bc2d38fd94cbacd9aa3f19e82e59830/pydantic_evals-0.8.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "6c9d460a1f2c9f5f263e9d8e9661acbd654ccc81ad3373ea43048d914091a817",
"md5": "7b997b5813d419320f27bfad1e0fb840",
"sha256": "c398a623c31c19ce70e346ad75654fcb1517c3f6a821461f64fe5cbbe0813023"
},
"downloads": -1,
"filename": "pydantic_evals-0.8.1.tar.gz",
"has_sig": false,
"md5_digest": "7b997b5813d419320f27bfad1e0fb840",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 43933,
"upload_time": "2025-08-29T14:46:28",
"upload_time_iso_8601": "2025-08-29T14:46:28.903588Z",
"url": "https://files.pythonhosted.org/packages/6c/9d/460a1f2c9f5f263e9d8e9661acbd654ccc81ad3373ea43048d914091a817/pydantic_evals-0.8.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-29 14:46:28",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "pydantic",
"github_project": "pydantic-ai",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pydantic-evals"
}