# Twevals
Lightweight evals for AI agents and LLM apps. Write Python functions alongside your code, return an `EvalResult`, and Twevals handles storage, scoring, and a small web UI.
## Installation
Twevals is intended as a development dependency.
```bash
pip install twevals
# or with Poetry
poetry add --group dev twevals
```
## Quick start
Look at the [examples](examples/) directory for runnable snippets.
Run the demo suite and open the UI:
```bash
twevals serve examples
```

### UI highlights
- Expand rows to see inputs, outputs, metadata, scores, and annotations.
- Edit datasets, labels, scores, metadata, or annotations inline; changes persist to JSON.
- Actions menu: refresh, rerun the suite, export JSON/CSV.
Common `serve` flags: `--dataset`, `--label`, `-c/--concurrency`, `--dev`, `--host`, `--port`, `-q/--quiet`, `-v/--verbose`.
## Authoring evals
Write evals like tests; return `EvalResult`.
```python
from twevals import eval, EvalResult

@eval(dataset="customer_service")
def test_refund_request():
    output = run_agent("I want a refund")
    return EvalResult(
        input="I want a refund",
        output=output,
        reference="refund",
        scores={"key": "keyword", "passed": "refund" in output.lower()},
    )
```
### EvalResult
An `EvalResult` stores the outcome of a single eval. Your eval function returns it, and the `@eval` decorator records it.
```python
EvalResult(
    input="...",                  # required: prompt or test input; a string, dict, or list
    output="...",                 # required: model/agent output; a string, dict, or list
    reference="...",              # optional expected output
    error=None,                   # optional error message; assertion errors are added automatically
    latency=0.123,                # optional execution time; calculated automatically if not provided
    metadata={"model": "gpt-4"},  # optional metadata for filtering and tracking
    run_data={"trace": [...]},    # optional extra JSON stored with the result; a good place for a debug trace
    scores={"key": "exact", "passed": True},  # a scores dict or a list of dicts
)
```
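For example, because latency is measured automatically and assertion errors are attached to the result (per the field notes above), a minimal eval can rely on both. `run_agent` is the same placeholder used in the Quick start:

```python
from twevals import eval, EvalResult

@eval(dataset="customer_service")
def test_refund_with_trace():
    output = run_agent("I want a refund")  # placeholder agent call from the Quick start
    # If this assertion fails, the error is recorded on the result instead of aborting the run.
    assert output, "agent returned an empty response"
    return EvalResult(
        input="I want a refund",
        output=output,
        metadata={"model": "gpt-4"},            # handy for filtering in the UI
        run_data={"trace": ["ran run_agent"]},  # extra JSON kept with the result
        # latency omitted on purpose: Twevals measures execution time for us
    )
```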
A score can be pass/fail (`passed`), numeric (`value`), or both, and you can add a justification in the `notes` field.
The Score schema for `scores` items is:
```python
{
    "key": "metric",      # required: name of the metric
    "value": 0.42,        # optional numeric metric
    "passed": True,       # optional boolean metric
    "notes": "optional",  # optional notes
}
# Provide at least one of: value or passed
```
`scores` accepts a single dict or a list of dicts/`Score` objects; Twevals normalizes both forms.
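For instance, a single result can carry a pass/fail check, a numeric metric, and a combined score with notes; this sketch simply lists several items using the documented `Score` fields:

```python
from twevals import EvalResult

EvalResult(
    input="I want a refund",
    output="Sure, I can help you with a refund.",
    scores=[
        {"key": "keyword", "passed": True, "notes": "response mentions 'refund'"},
        {"key": "similarity", "value": 0.87},              # numeric-only metric
        {"key": "overall", "value": 0.9, "passed": True},  # numeric and boolean together
    ],
)
```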
### `@eval` decorator
Wraps a function and records returned `EvalResult` objects.
Parameters:
- `dataset` (defaults to filename)
- `labels` (filtering tags)
- `evaluators` (callables that add scores to a result)
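A rough sketch of the `evaluators` hook. The exact callable signature isn't documented here, so this assumes an evaluator receives the returned `EvalResult` and produces a score dict in the schema above; `run_agent` remains a placeholder:

```python
from twevals import eval, EvalResult

def contains_refund(result: EvalResult):
    # Hypothetical evaluator: assumed to take the EvalResult and return a score dict.
    return {"key": "keyword", "passed": "refund" in str(result.output).lower()}

@eval(dataset="customer_service", labels=["smoke"], evaluators=[contains_refund])
def test_refund_scored_by_evaluator():
    output = run_agent("I want a refund")  # placeholder agent call
    return EvalResult(input="I want a refund", output=output)
```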
### `@parametrize`
Generate multiple evals from one function. Place `@eval` above `@parametrize`.
```python
from twevals import parametrize

@eval(dataset="customer_service")
@parametrize("prompt,expected", [
    ("I want a refund", "refund"),
    ("Can I get my money back?", "refund"),
])
def test_refund(prompt, expected):
    output = run_agent(prompt)
    return EvalResult(
        input=prompt,
        output=output,
        reference=expected,
    )
```
Common patterns:
```python
# 1) Single parameter values (with optional ids)
@eval(dataset="math")
@parametrize("n", [1, 2, 3], ids=["small", "medium", "large"])
def test_square(n):
    out = n * n
    return EvalResult(input=n, output=out, reference=n**2,
                      scores={"key": "exact", "passed": out == n**2})

# 2) Multiple parameters via tuples
@eval(dataset="auth")
@parametrize("username,password,ok", [
    ("alice", "correct", True),
    ("alice", "wrong", False),
])
def test_login(username, password, ok):
    out = fake_login(username, password)
    return EvalResult(input={"u": username}, output=out,
                      scores={"key": "ok", "passed": out is ok})

# 3) Dictionaries for named argument sets
@eval(dataset="calc")
@parametrize("op,a,b,expected", [
    {"op": "add", "a": 2, "b": 3, "expected": 5},
    {"op": "mul", "a": 4, "b": 7, "expected": 28},
])
def test_calc(op, a, b, expected):
    ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
    result = ops[op](a, b)
    return EvalResult(input={"op": op, "a": a, "b": b}, output=result,
                      reference=expected,
                      scores=[{"key": "correct", "passed": result == expected}])

# 4) Stacked parametrize (cartesian product); ids combine like "model-temp"
@eval(dataset="models")
@parametrize("model", ["gpt-4", "gpt-3.5"], ids=["g4", "g35"])
@parametrize("temperature", [0.0, 0.7])
def test_model_grid(model, temperature):
    out = run(model=model, temperature=temperature)
    return EvalResult(input={"model": model, "temperature": temperature}, output=out)

# 5) Single-name shorthand accepts single values
@eval(dataset="thresholds")
@parametrize("threshold", [0.2, 0.5, 0.8])
def test_threshold(threshold=0.5):
    out = evaluate(threshold=threshold)
    return EvalResult(input=threshold, output=out)
```
Notes:
- Accepts tuples, dicts, or single values (for one parameter).
- Works with sync or async functions (see the async sketch below).
- Put `@eval` above `@parametrize` so Twevals can attach dataset/labels.
See more patterns in `examples/demo_eval_paramatrize.py`.
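As noted above, parametrized evals can be async; apart from `async`/`await` they look the same. `async_run_agent` is a hypothetical coroutine standing in for your agent:

```python
from twevals import eval, parametrize, EvalResult

@eval(dataset="customer_service", labels=["async"])
@parametrize("prompt", ["I want a refund", "Can I get my money back?"])
async def test_refund_async(prompt):
    output = await async_run_agent(prompt)  # hypothetical async agent call
    return EvalResult(input=prompt, output=output, reference="refund")
```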
## Headless runs
Skip the UI and save results to disk:
```bash
twevals run path/to/evals
# Filtering and other common flags work here as well
```
`run`-only flags: `-o/--output` (save JSON summary), `--csv` (save CSV).
## CLI reference
```
twevals serve <path>   # run evals once and launch the web UI
twevals run <path>     # run without UI

Common flags:
  -d, --dataset TEXT      Filter by dataset(s)
  -l, --label TEXT        Filter by label(s)
  -c, --concurrency INT   Number of concurrent evals (0 = sequential)
  -q, --quiet             Reduce logs
  -v, --verbose           Verbose logs

serve-only:
  --dev                   Enable hot reload
  --host TEXT             Host interface (default 127.0.0.1)
  --port INT              Port (default 8000)

run-only:
  -o, --output FILE       Save JSON summary
  --csv FILE              Save CSV results
```
## Contributing
```bash
poetry install
poetry run pytest -q
poetry run ruff check twevals tests
poetry run black .
```
Helpful demo:
```bash
poetry run twevals serve examples
```