twevals

Name: twevals
Version: 0.0.0.dev20250904233630
Summary: A lightweight, code-first evaluation framework for testing AI agents and LLM applications
Upload time: 2025-09-04 23:36:47
Author: Twevals Team
Requires Python: <4.0,>=3.10
License: MIT
Keywords: evaluation, llm, agents, testing, benchmark, cli
# Twevals

Lightweight evals for AI agents and LLM apps. Write Python functions alongside your code, return an `EvalResult`, and Twevals handles storage, scoring, and a small web UI.

## Installation

Twevals is intended as a development dependency.

```bash
pip install twevals
# or with Poetry
poetry add --group dev twevals
```

## Quick start

Look at the [examples](examples/) directory for runnable snippets.
Run the demo suite and open the UI:

```bash
twevals serve examples
```

![UI screenshot](assets/ui.png)

### UI highlights

- Expand rows to see inputs, outputs, metadata, scores, and annotations.
- Edit datasets, labels, scores, metadata, or annotations inline; changes persist to JSON.
- Actions menu: refresh, rerun the suite, export JSON/CSV.

Common `serve` flags: `--dataset`, `--label`, `-c/--concurrency`, `--dev`, `--host`, `--port`, `-q/--quiet`, `-v/--verbose`.
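
For example, a typical invocation combining the flags above (the dataset name is illustrative):

```bash
twevals serve examples --dataset customer_service -c 4 --port 8080
```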

## Authoring evals

Write evals like tests; return `EvalResult`.

```python
from twevals import eval, EvalResult

@eval(dataset="customer_service")
def test_refund_request():
    output = run_agent("I want a refund")
    return EvalResult(
        input="I want a refund",
        output=output,
        reference="refund",
        scores={"key": "keyword", "passed": "refund" in output.lower()},
    )
```

### EvalResult

The `EvalResult` object stores the result of a single eval. Your eval function returns it, and the `@eval` decorator records it.

```python
EvalResult(
    input="...",                  # required: prompt or test input (string, dict, or list)
    output="...",                 # required: model/agent output (string, dict, or list)
    reference="...",              # optional: expected output
    error=None,                   # optional: error message; assertion errors are added automatically
    latency=0.123,                # optional: execution time; calculated automatically if not provided
    metadata={"model": "gpt-4"},  # optional: metadata for filtering and tracking
    run_data={"trace": [...]},    # optional: extra JSON stored with the result, e.g. a trace for debugging
    scores={"key": "exact", "passed": True},  # a scores dict or a list of dicts
)
```

A score can be pass/fail (`passed`), numeric (`value`), or both. You can also add justification in the optional `notes` field.

The Score schema for `scores` items is:

```python
{
    "key": "metric",      # required: name of the metric
    "value": 0.42,        # optional: numeric score
    "passed": True,       # optional: boolean pass/fail
    "notes": "optional",  # optional: justification or notes
}
# Provide at least one of: value or passed
```

`scores` accepts a single dict or a list of dicts/`Score` objects; Twevals normalizes both forms.
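
For example, both of the following forms are accepted:

```python
# Single score as a dict
EvalResult(input="2+2", output="4", scores={"key": "exact", "passed": True})

# Multiple scores as a list of dicts
EvalResult(
    input="2+2",
    output="4",
    scores=[
        {"key": "exact", "passed": True},
        {"key": "confidence", "value": 0.92, "notes": "self-reported by the agent"},
    ],
)
```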

### `@eval` decorator

Wraps a function and records returned `EvalResult` objects.

Parameters:
- `dataset` (defaults to the filename)
- `labels` (tags for filtering)
- `evaluators` (callables that add scores to a result; see the sketch below)
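
A minimal sketch of using these parameters. The evaluator's signature is an assumption here: we assume it is called with the `EvalResult` and returns a score dict; check the examples directory for the exact contract.

```python
from twevals import eval, EvalResult

# Hypothetical evaluator: assumed to receive the EvalResult and return a
# score dict that Twevals appends to the result's scores.
def contains_reference(result):
    return {
        "key": "contains_reference",
        "passed": str(result.reference).lower() in str(result.output).lower(),
    }

@eval(dataset="customer_service", labels=["smoke"], evaluators=[contains_reference])
def test_refund_request():
    output = run_agent("I want a refund")  # run_agent is your own code
    return EvalResult(input="I want a refund", output=output, reference="refund")
```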

### `@parametrize`

Generate multiple evals from one function. Place `@eval` above `@parametrize`.

```python
from twevals import parametrize

@eval(dataset="customer_service")
@parametrize("prompt,expected", [
    ("I want a refund", "refund"),
    ("Can I get my money back?", "refund"),
])
def test_refund(prompt, expected):
    output = run_agent(prompt)
    return EvalResult(
        input=prompt,
        output=output,
        reference=expected,
    )
```

Common patterns:

```python
# 1) Single parameter values (with optional ids)
@eval(dataset="math")
@parametrize("n", [1, 2, 3], ids=["small", "medium", "large"])
def test_square(n):
    out = n * n
    return EvalResult(input=n, output=out, reference=n**2,
                      scores={"key": "exact", "passed": out == n**2})

# 2) Multiple parameters via tuples
@eval(dataset="auth")
@parametrize("username,password,ok", [
    ("alice", "correct", True),
    ("alice", "wrong", False),
])
def test_login(username, password, ok):
    out = fake_login(username, password)
    return EvalResult(input={"u": username}, output=out,
                      scores={"key": "ok", "passed": out is ok})

# 3) Dictionaries for named argument sets
@eval(dataset="calc")
@parametrize("op,a,b,expected", [
    {"op": "add", "a": 2, "b": 3, "expected": 5},
    {"op": "mul", "a": 4, "b": 7, "expected": 28},
])
def test_calc(op, a, b, expected):
    ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
    result = ops[op](a, b)
    return EvalResult(input={"op": op, "a": a, "b": b}, output=result,
                      reference=expected,
                      scores=[{"key": "correct", "passed": result == expected})]

# 4) Stacked parametrize (cartesian product); ids combine like "model-temp"
@eval(dataset="models")
@parametrize("model", ["gpt-4", "gpt-3.5"], ids=["g4", "g35"])
@parametrize("temperature", [0.0, 0.7])
def test_model_grid(model, temperature):
    out = run(model=model, temperature=temperature)
    return EvalResult(input={"model": model, "temperature": temperature}, output=out)

# 5) Single-name shorthand accepts single values
@eval(dataset="thresholds")
@parametrize("threshold", [0.2, 0.5, 0.8])
def test_threshold(threshold=0.5):
    out = evaluate(threshold=threshold)
    return EvalResult(input=threshold, output=out)
```

Notes:
- Accepts tuples, dicts, or single values (for one parameter).
- Works with sync or async functions.
- Put `@eval` above `@parametrize` so Twevals can attach dataset/labels.

See more patterns in `examples/demo_eval_paramatrize.py`.
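
As noted above, eval functions may be async. A minimal sketch, assuming `run_agent_async` is your own async code and that the runner awaits async evals:

```python
from twevals import eval, parametrize, EvalResult

@eval(dataset="customer_service")
@parametrize("prompt", ["I want a refund", "Can I get my money back?"])
async def test_refund_async(prompt):
    output = await run_agent_async(prompt)  # your own async agent call
    return EvalResult(
        input=prompt,
        output=output,
        scores={"key": "keyword", "passed": "refund" in output.lower()},
    )
```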

## Headless runs

Skip the UI and save results to disk:

```bash
twevals run path/to/evals
# Filtering and other common flags work here as well
```

`run`-only flags: `-o/--output` (save JSON summary), `--csv` (save CSV).
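
For example, to run a single dataset headlessly and save both output formats:

```bash
twevals run path/to/evals -d customer_service -o results.json --csv results.csv
```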

## CLI reference

```
twevals serve <path>   # run evals once and launch the web UI
twevals run <path>     # run without UI

Common flags:
  -d, --dataset TEXT      Filter by dataset(s)
  -l, --label TEXT        Filter by label(s)
  -c, --concurrency INT   Number of concurrent evals (0 = sequential)
  -q, --quiet             Reduce logs
  -v, --verbose           Verbose logs

serve-only:
  --dev                   Enable hot reload
  --host TEXT             Host interface (default 127.0.0.1)
  --port INT              Port (default 8000)

run-only:
  -o, --output FILE       Save JSON summary
  --csv FILE              Save CSV results
```

## Contributing

```bash
poetry install
poetry run pytest -q
poetry run ruff check twevals tests
poetry run black .
```

Helpful demo:

```bash
poetry run twevals serve examples
```


            
