| Field | Value |
| --- | --- |
| Name | zeroeval |
| Version | 0.6.8 |
| Summary | ZeroEval SDK |
| upload_time | 2025-08-02 06:18:53 |
| home_page | None |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | `<4.0,>=3.9` |
| license | None |
| keywords | llm, evaluation, observability |
| requirements | No requirements were recorded. |
# ZeroEval SDK
[ZeroEval](https://zeroeval.com) is an evaluation, A/B testing, and monitoring platform for AI products. This SDK lets you **create datasets, run AI/LLM experiments, and trace multimodal workloads**.
Issues? Email us at [founders@zeroeval.com](mailto:founders@zeroeval.com)
---
## Features
• **Dataset API** – versioned, queryable, text or multimodal (images, audio, video, URLs).
• **Experiment engine** – run tasks + custom evaluators locally or in the cloud.
• **Observability** – hierarchical _Session → Trace → Span_ tracing; live dashboard.
• **Python CLI** – `zeroeval run …`, `zeroeval setup` for frictionless onboarding.
---
## Installation
```bash
pip install zeroeval # poetry add zeroeval
```
The SDK automatically detects and instruments OpenAI, LangChain, and LangGraph if you have them installed. No additional configuration needed!
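For example, once `ze.init()` has run, a plain OpenAI call is traced with no further wiring. A minimal sketch, assuming `ZEROEVAL_API_KEY` and `OPENAI_API_KEY` are set (the model name is illustrative):

```python
import openai
import zeroeval as ze

ze.init()  # instruments the OpenAI client if the openai package is installed

client = openai.OpenAI()
# This ordinary call is captured by ZeroEval's auto-instrumentation.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```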
---
## Authentication
1. One-off interactive setup (recommended):
   ```bash
   zeroeval setup  # or: poetry run zeroeval setup
   ```
   Your API key will be automatically saved to your shell configuration file (e.g., `~/.zshrc`, `~/.bashrc`). Best practice is to also store it in a `.env` file in your project root (see the sketch after this list).
2. Or set it in code each time:
   ```python
   import zeroeval as ze
   ze.init(api_key="YOUR_API_KEY")
   ```
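If you keep the key in a `.env` file, one way to load it before `ze.init()` is python-dotenv (an assumed helper package, not something the SDK bundles):

```python
# Minimal sketch: load ZEROEVAL_API_KEY from .env, then init.
# Assumes the python-dotenv package: pip install python-dotenv
from dotenv import load_dotenv

import zeroeval as ze

load_dotenv()  # copies keys from .env into os.environ
ze.init()      # reads ZEROEVAL_API_KEY from the environment
```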
---
## Quick-start
```python
# quickstart.py
import zeroeval as ze

ze.init()  # uses ZEROEVAL_API_KEY env var

# 1. Create dataset
ds = ze.Dataset(
    name="gsm8k_sample",
    data=[
        {"question": "What is 6 times 7?", "answer": "42"},
        {"question": "What is 10 plus 7?", "answer": "17"}
    ]
)

# 2. Define task
@ze.task(outputs=["prediction"])
def solve(row):
    # Your LLM logic here
    response = llm_call(row["question"])
    return {"prediction": response}

# 3. Define evaluation
@ze.evaluation(mode="dataset", outputs=["accuracy"])
def accuracy(answer_col, prediction_col):
    correct = sum(a == p for a, p in zip(answer_col, prediction_col))
    return {"accuracy": correct / len(answer_col)}

# 4. Run and evaluate
run = ds.run(solve, workers=8).score([accuracy], answer="answer")
print(f"Accuracy: {run.metrics['accuracy']:.2%}")
```
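The quick-start leaves `llm_call` to you ("Your LLM logic here"). A minimal sketch backed by OpenAI, assuming `OPENAI_API_KEY` is set (the model and prompt are illustrative, not part of the SDK):

```python
import openai

client = openai.OpenAI()

def llm_call(question: str) -> str:
    """Hypothetical stand-in for the quick-start's llm_call placeholder."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Answer with only the number: {question}"}],
    )
    return response.choices[0].message.content.strip()
```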
For a fully-worked multimodal example, visit the docs: https://docs.zeroeval.com/multimodal-datasets (coming soon)
---
## Creating datasets
### 1. Text / tabular
```python
# Create dataset from list
cities = ze.Dataset(
    "Cities",
    data=[
        {"name": "Paris", "population": 2_165_000},
        {"name": "Berlin", "population": 3_769_000}
    ],
    description="Example tabular dataset"
)
cities.push()

# Load dataset from CSV
ds = ze.Dataset("/path/to/data.csv")
ds.push()  # Creates new version if dataset exists
```
### 2. Multimodal (images, audio, URLs)
```python
mm = ze.Dataset(
    "Medical_Xray_Dataset",
    data=[{"patient_id": "P001", "symptoms": "Cough"}],
    description="Symptoms + chest X-ray"
)
mm.add_image(row_index=0, column_name="chest_xray", image_path="sample_images/p001.jpg")
mm.add_audio(row_index=0, column_name="verbal_notes", audio_path="notes/p001.wav")
mm.add_media_url(row_index=0, column_name="external_scan", media_url="https://example.com/scan.jpg", media_type="image")
mm.push()
```
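The same calls scale to many rows with an ordinary loop. A sketch, assuming X-rays are stored as `sample_images/<patient_id>.jpg` (the file layout is hypothetical):

```python
from pathlib import Path

import zeroeval as ze

records = [
    {"patient_id": "P001", "symptoms": "Cough"},
    {"patient_id": "P002", "symptoms": "Chest pain"},
]

mm = ze.Dataset("Medical_Xray_Dataset", data=records, description="Symptoms + chest X-ray")
for i, rec in enumerate(records):
    # Attach the matching X-ray to each row by index.
    image_path = Path("sample_images") / f"{rec['patient_id'].lower()}.jpg"
    mm.add_image(row_index=i, column_name="chest_xray", image_path=str(image_path))
mm.push()
```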
---
## Working with Datasets
### Loading from CSV
```python
# Load dataset directly from CSV file
dataset = ze.Dataset("data.csv")
```
### Iteration
Datasets support Python's iteration protocol:
```python
# Basic iteration
for row in dataset:
    print(row.name, row.score)

# With enumerate
for i, row in enumerate(dataset):
    print(f"Row {i}: {row.name}")

# List comprehensions
high_scores = [row for row in dataset if row.score > 90]
```
### Slicing and Indexing
```python
# Single item access (returns DotDict)
first_row = dataset[0]
last_row = dataset[-1]

# Slicing (returns new Dataset)
top_10 = dataset[:10]
bottom_5 = dataset[-5:]
middle = dataset[10:20]

# Sliced datasets can be processed independently
subset = dataset[:100]
results = subset.run(my_task)
subset.push()  # Upload subset as new dataset
```
### Dot Notation Access
All rows support dot notation for cleaner code:
```python
# Instead of row["column_name"]
value = row.column_name

# Works in tasks too
@ze.task(outputs=["length"])
def get_length(row):
    return {"length": len(row.text)}
```
---
## Running experiments
```python
import zeroeval as ze

ze.init()

# Pull dataset
dataset = ze.Dataset.pull("Capitals")

# Define task
@ze.task(outputs=["prediction"])
def uppercase_task(row):
    return {"prediction": row["input"].upper()}

# Define evaluation
@ze.evaluation(mode="row", outputs=["exact_match"])
def exact_match(output, prediction):
    return {"exact_match": int(output.upper() == prediction)}

@ze.evaluation(mode="dataset", outputs=["accuracy"])
def accuracy(exact_match_col):
    return {"accuracy": sum(exact_match_col) / len(exact_match_col)}

# Run experiment
run = dataset.run(uppercase_task, workers=8)
run = run.score([exact_match, accuracy], output="output")

print(f"Accuracy: {run.metrics['accuracy']:.2%}")
```
Advanced options:
```python
# Multiple runs with ensemble
run = dataset.run(task, repeats=5, ensemble="majority", on="prediction")

# Pass@k evaluation
run = dataset.run(task, repeats=10, ensemble="pass@k", on="prediction", k=3)

# Custom aggregator
def best_length(values):
    return max(values, key=len) if values else ""

run = dataset.run(task, repeats=3, ensemble=best_length, on="prediction")
```
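For intuition, the built-in `"majority"` mode presumably reduces each row's repeated predictions to the most common value; the same behavior can be spelled out as a custom aggregator (a sketch of the semantics, not the SDK's internal implementation):

```python
from collections import Counter

def majority_vote(values):
    # Most common prediction wins; ties resolve to the value seen first.
    return Counter(values).most_common(1)[0][0] if values else ""

run = dataset.run(task, repeats=5, ensemble=majority_vote, on="prediction")
```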
---
## Multimodal experiment (GPT-4o)
A shortened version (full listing in the docs):
```python
import zeroeval as ze, openai, base64
from pathlib import Path

ze.init()
client = openai.OpenAI()  # assumes env var OPENAI_API_KEY

def img_to_data_uri(path):
    data = Path(path).read_bytes()
    b64 = base64.b64encode(data).decode()
    return f"data:image/jpeg;base64,{b64}"

# Pull multimodal dataset
dataset = ze.Dataset.pull("Medical_Xray_Dataset")

# Define task
@ze.task(outputs=["diagnosis"])
def diagnose(row):
    # Text and image go in a single user message as a list of content parts
    messages = [
        {"role": "user", "content": [
            {"type": "text", "text": "Patient: " + row["symptoms"]},
            {"type": "image_url", "image_url": {"url": img_to_data_uri(row["chest_xray"])}}
        ]}
    ]
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return {"diagnosis": response.choices[0].message.content}

# Define evaluation
@ze.evaluation(mode="row", outputs=["contains_keyword"])
def check_keywords(diagnosis, expected_keywords):
    keywords = expected_keywords.lower().split(',')
    diagnosis_lower = diagnosis.lower()
    found = any(kw.strip() in diagnosis_lower for kw in keywords)
    return {"contains_keyword": int(found)}

# Run and evaluate
run = dataset.run(diagnose, workers=4)
run = run.score([check_keywords], expected_keywords="expected_keywords")
```
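`img_to_data_uri` hardcodes `image/jpeg`. If a dataset mixes image formats, guessing the MIME type from the file extension is a small, standard-library-only improvement (a sketch; `to_data_uri` is a hypothetical helper, not an SDK function):

```python
import base64
import mimetypes
from pathlib import Path

def to_data_uri(path: str) -> str:
    # Infer the MIME type from the extension; fall back to a generic type.
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    b64 = base64.b64encode(Path(path).read_bytes()).decode()
    return f"data:{mime};base64,{b64}"
```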
---
## Streaming & tracing
• **Streaming responses** – streaming guide: https://docs.zeroeval.com/streaming (coming soon)
• **Deep observability** – tracing guide: https://docs.zeroeval.com/tracing (coming soon)
• **Framework integrations** – see [INTEGRATIONS.md](./INTEGRATIONS.md) for automatic OpenAI, LangChain, and LangGraph tracing
---
## CLI commands
```bash
zeroeval setup # one-time API key config (auto-saves to shell config)
zeroeval run my_script.py # run a Python script that uses ZeroEval
```
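In CI or other non-interactive environments, skip `zeroeval setup` and export the key directly; the SDK reads `ZEROEVAL_API_KEY` from the environment (the value below is a placeholder):

```bash
export ZEROEVAL_API_KEY="your-api-key"
zeroeval run my_script.py
```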
## Testing
```bash
poetry run pytest
```
## Raw data

```json
{
    "_id": null,
    "home_page": null,
    "name": "zeroeval",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": "LLM, evaluation, observability",
    "author": null,
    "author_email": "Sebastian Crossa <seb@zeroeval.com>, Jonathan Chavez <jona@zeroeval.com>",
    "download_url": "https://files.pythonhosted.org/packages/f7/c6/063d220eac7d248ab465387044b34058c6f0468c401fab453a163517fa1c/zeroeval-0.6.8.tar.gz",
    "platform": null,
"description": "# ZeroEval SDK\n\n[ZeroEval](https://zeroeval.com) is an evaluations, a/b testing and monitoring platform for AI products. This SDK lets you **create datasets, run AI/LLM experiments, and trace multimodal workloads**.\n\nIssues? Email us at [founders@zeroeval.com](mailto:founders@zeroeval.com)\n\n---\n\n## Features\n\n\u2022 **Dataset API** \u2013 versioned, queryable, text or multimodal (images, audio, video, URLs). \n\u2022 **Experiment engine** \u2013 run tasks + custom evaluators locally or in the cloud. \n\u2022 **Observability** \u2013 hierarchical _Session \u2192 Trace \u2192 Span_ tracing; live dashboard. \n\u2022 **Python CLI** \u2013 `zeroeval run \u2026`, `zeroeval setup` for friction-less onboarding.\n\n---\n\n## Installation\n\n```bash\npip install zeroeval # poetry add zeroeval\n```\n\nThe SDK automatically detects and instruments OpenAI, LangChain, and LangGraph if you have them installed. No additional configuration needed!\n\n---\n\n## Authentication\n\n1. One-off interactive setup (recommended):\n ```bash\n zeroeval setup # or: poetry run zeroeval setup\n ```\n Your API key will be automatically saved to your shell configuration file (e.g., `~/.zshrc`, `~/.bashrc`). Best practice is to also store it in a `.env` file in your project root.\n2. Or set it in code each time:\n ```python\n import zeroeval as ze\n ze.init(api_key=\"YOUR_API_KEY\")\n ```\n\n---\n\n## Quick-start\n\n```python\n# quickstart.py\nimport zeroeval as ze\n\nze.init() # uses ZEROEVAL_API_KEY env var\n\n# 1. Create dataset\nds = ze.Dataset(\n name=\"gsm8k_sample\",\n data=[\n {\"question\": \"What is 6 times 7?\", \"answer\": \"42\"},\n {\"question\": \"What is 10 plus 7?\", \"answer\": \"17\"}\n ]\n)\n\n# 2. Define task\n@ze.task(outputs=[\"prediction\"])\ndef solve(row):\n # Your LLM logic here\n response = llm_call(row[\"question\"])\n return {\"prediction\": response}\n\n# 3. Define evaluation\n@ze.evaluation(mode=\"dataset\", outputs=[\"accuracy\"])\ndef accuracy(answer_col, prediction_col):\n correct = sum(a == p for a, p in zip(answer_col, prediction_col))\n return {\"accuracy\": correct / len(answer_col)}\n\n# 4. Run and evaluate\nrun = ds.run(solve, workers=8).score([accuracy], answer=\"answer\")\nprint(f\"Accuracy: {run.metrics['accuracy']:.2%}\")\n```\n\nFor a fully-worked multimodal example, visit the docs: https://docs.zeroeval.com/multimodal-datasets (coming soon)\n\n---\n\n## Creating datasets\n\n### 1. Text / tabular\n\n```python\n# Create dataset from list\ncities = ze.Dataset(\n \"Cities\",\n data=[\n {\"name\": \"Paris\", \"population\": 2_165_000},\n {\"name\": \"Berlin\", \"population\": 3_769_000}\n ],\n description=\"Example tabular dataset\"\n)\ncities.push()\n\n# Load dataset from CSV\nds = ze.Dataset(\"/path/to/data.csv\")\nds.push() # Creates new version if dataset exists\n```\n\n### 2. 
Multimodal (images, audio, URLs)\n\n```python\nmm = ze.Dataset(\n \"Medical_Xray_Dataset\",\n data=[{\"patient_id\": \"P001\", \"symptoms\": \"Cough\"}],\n description=\"Symptoms + chest X-ray\"\n)\nmm.add_image(row_index=0, column_name=\"chest_xray\", image_path=\"sample_images/p001.jpg\")\nmm.add_audio(row_index=0, column_name=\"verbal_notes\", audio_path=\"notes/p001.wav\")\nmm.add_media_url(row_index=0, column_name=\"external_scan\", media_url=\"https://example.com/scan.jpg\", media_type=\"image\")\nmm.push()\n```\n\n---\n\n## Working with Datasets\n\n### Loading from CSV\n\n```python\n# Load dataset directly from CSV file\ndataset = ze.Dataset(\"data.csv\")\n```\n\n### Iteration\n\nDatasets support Python's iteration protocol:\n\n```python\n# Basic iteration\nfor row in dataset:\n print(row.name, row.score)\n\n# With enumerate\nfor i, row in enumerate(dataset):\n print(f\"Row {i}: {row.name}\")\n\n# List comprehensions\nhigh_scores = [row for row in dataset if row.score > 90]\n```\n\n### Slicing and Indexing\n\n```python\n# Single item access (returns DotDict)\nfirst_row = dataset[0]\nlast_row = dataset[-1]\n\n# Slicing (returns new Dataset)\ntop_10 = dataset[:10]\nbottom_5 = dataset[-5:]\nmiddle = dataset[10:20]\n\n# Sliced datasets can be processed independently\nsubset = dataset[:100]\nresults = subset.run(my_task)\nsubset.push() # Upload subset as new dataset\n```\n\n### Dot Notation Access\n\nAll rows support dot notation for cleaner code:\n\n```python\n# Instead of row[\"column_name\"]\nvalue = row.column_name\n\n# Works in tasks too\n@ze.task(outputs=[\"length\"])\ndef get_length(row):\n return {\"length\": len(row.text)}\n```\n\n---\n\n## Running experiments\n\n```python\nimport zeroeval as ze\n\nze.init()\n\n# Pull dataset\ndataset = ze.Dataset.pull(\"Capitals\")\n\n# Define task\n@ze.task(outputs=[\"prediction\"])\ndef uppercase_task(row):\n return {\"prediction\": row[\"input\"].upper()}\n\n# Define evaluation\n@ze.evaluation(mode=\"row\", outputs=[\"exact_match\"])\ndef exact_match(output, prediction):\n return {\"exact_match\": int(output.upper() == prediction)}\n\n@ze.evaluation(mode=\"dataset\", outputs=[\"accuracy\"])\ndef accuracy(exact_match_col):\n return {\"accuracy\": sum(exact_match_col) / len(exact_match_col)}\n\n# Run experiment\nrun = dataset.run(uppercase_task, workers=8)\nrun = run.score([exact_match, accuracy], output=\"output\")\n\nprint(f\"Accuracy: {run.metrics['accuracy']:.2%}\")\n```\n\nAdvanced options:\n\n```python\n# Multiple runs with ensemble\nrun = dataset.run(task, repeats=5, ensemble=\"majority\", on=\"prediction\")\n\n# Pass@k evaluation\nrun = dataset.run(task, repeats=10, ensemble=\"pass@k\", on=\"prediction\", k=3)\n\n# Custom aggregator\ndef best_length(values):\n return max(values, key=len) if values else \"\"\n\nrun = dataset.run(task, repeats=3, ensemble=best_length, on=\"prediction\")\n```\n\n---\n\n## Multimodal experiment (GPT-4o)\n\nA shortened version (full listing in the docs):\n\n```python\nimport zeroeval as ze, openai, base64\nfrom pathlib import Path\n\nze.init()\nclient = openai.OpenAI() # assumes env var OPENAI_API_KEY\n\ndef img_to_data_uri(path):\n data = Path(path).read_bytes()\n b64 = base64.b64encode(data).decode()\n return f\"data:image/jpeg;base64,{b64}\"\n\n# Pull multimodal dataset\ndataset = ze.Dataset.pull(\"Medical_Xray_Dataset\")\n\n# Define task\n@ze.task(outputs=[\"diagnosis\"])\ndef diagnose(row):\n messages = [\n {\"role\": \"user\", \"content\": \"Patient: \" + row[\"symptoms\"]},\n {\"role\": \"user\", 
\"content\": {\n \"type\": \"image_url\",\n \"image_url\": {\"url\": img_to_data_uri(row[\"chest_xray\"])}\n }}\n ]\n response = client.chat.completions.create(model=\"gpt-4o-mini\", messages=messages)\n return {\"diagnosis\": response.choices[0].message.content}\n\n# Define evaluation\n@ze.evaluation(mode=\"row\", outputs=[\"contains_keyword\"])\ndef check_keywords(diagnosis, expected_keywords):\n keywords = expected_keywords.lower().split(',')\n diagnosis_lower = diagnosis.lower()\n found = any(kw.strip() in diagnosis_lower for kw in keywords)\n return {\"contains_keyword\": int(found)}\n\n# Run and evaluate\nrun = dataset.run(diagnose, workers=4)\nrun = run.score([check_keywords], expected_keywords=\"expected_keywords\")\n```\n\n---\n\n## Streaming & tracing\n\n\u2022 **Streaming responses** \u2013 streaming guide: https://docs.zeroeval.com/streaming (coming soon)\n\u2022 **Deep observability** \u2013 tracing guide: https://docs.zeroeval.com/tracing (coming soon)\n\u2022 **Framework integrations** \u2013 see [INTEGRATIONS.md](./INTEGRATIONS.md) for automatic OpenAI, LangChain, and LangGraph tracing\n\n---\n\n## CLI commands\n\n```bash\nzeroeval setup # one-time API key config (auto-saves to shell config)\nzeroeval run my_script.py # run a Python script that uses ZeroEval\n```\n\n## Testing\n\n```bash\npoetry run pytest\n```\n",
"bugtrack_url": null,
"license": null,
"summary": "ZeroEval SDK",
"version": "0.6.8",
"project_urls": {
"Repository": "https://github.com/zeroeval"
},
"split_keywords": [
"llm",
" evaluation",
" observability"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "2b0b2817da763f844f1b378edd7606af82bb8c7d1d8fb9382b75b96852519cee",
"md5": "0aca1de657ebd0d80370a8c7457a6d33",
"sha256": "949d22b4937c7b9c7823ecd8fe6f3ff68dac653f0deb7b8d5ca4d10860d64e2e"
},
"downloads": -1,
"filename": "zeroeval-0.6.8-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0aca1de657ebd0d80370a8c7457a6d33",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 68641,
"upload_time": "2025-08-02T06:18:52",
"upload_time_iso_8601": "2025-08-02T06:18:52.344830Z",
"url": "https://files.pythonhosted.org/packages/2b/0b/2817da763f844f1b378edd7606af82bb8c7d1d8fb9382b75b96852519cee/zeroeval-0.6.8-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "f7c6063d220eac7d248ab465387044b34058c6f0468c401fab453a163517fa1c",
"md5": "05eb6e44016b663791cee955b48d2467",
"sha256": "9823553627d042e387bba859ac92eada4a6e06235e2a6ceeab860e5d8cf4a42b"
},
"downloads": -1,
"filename": "zeroeval-0.6.8.tar.gz",
"has_sig": false,
"md5_digest": "05eb6e44016b663791cee955b48d2467",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 197881,
"upload_time": "2025-08-02T06:18:53",
"upload_time_iso_8601": "2025-08-02T06:18:53.761686Z",
"url": "https://files.pythonhosted.org/packages/f7/c6/063d220eac7d248ab465387044b34058c6f0468c401fab453a163517fa1c/zeroeval-0.6.8.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-02 06:18:53",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "zeroeval"
}