zeroeval

Name: zeroeval
Version: 0.6.8
Summary: ZeroEval SDK
Authors: Sebastian Crossa <seb@zeroeval.com>, Jonathan Chavez <jona@zeroeval.com>
Repository: https://github.com/zeroeval
Requires Python: <4.0,>=3.9
Keywords: llm, evaluation, observability
Upload time: 2025-08-02 06:18:53
# ZeroEval SDK

[ZeroEval](https://zeroeval.com) is an evaluation, A/B testing, and monitoring platform for AI products. This SDK lets you **create datasets, run AI/LLM experiments, and trace multimodal workloads**.

Issues? Email us at [founders@zeroeval.com](mailto:founders@zeroeval.com)

---

## Features

• **Dataset API** – versioned, queryable, text or multimodal (images, audio, video, URLs).  
• **Experiment engine** – run tasks + custom evaluators locally or in the cloud.  
• **Observability** – hierarchical _Session → Trace → Span_ tracing; live dashboard.  
• **Python CLI** – `zeroeval run …`, `zeroeval setup` for frictionless onboarding.

---

## Installation

```bash
pip install zeroeval  # or: poetry add zeroeval
```

The SDK automatically detects and instruments OpenAI, LangChain, and LangGraph if you have them installed. No additional configuration needed!
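
As a minimal sketch of what this looks like in practice (assuming `openai` is installed and both `OPENAI_API_KEY` and `ZEROEVAL_API_KEY` are set), the call below is traced with no extra code:

```python
import zeroeval as ze
import openai

ze.init()  # instrumentation for installed integrations is patched in here

client = openai.OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello."}],
)
# The request/response above is captured as a span in the ZeroEval dashboard.
print(resp.choices[0].message.content)
```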

---

## Authentication

1. One-off interactive setup (recommended):
   ```bash
   zeroeval setup  # or: poetry run zeroeval setup
   ```
   Your API key will be automatically saved to your shell configuration file (e.g., `~/.zshrc`, `~/.bashrc`). Best practice is to also store it in a `.env` file in your project root (see the loading sketch just after this list).
2. Or set it in code each time:
   ```python
   import zeroeval as ze
   ze.init(api_key="YOUR_API_KEY")
   ```
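
If you keep the key in a `.env` file, a minimal loading sketch looks like this (it assumes the third-party `python-dotenv` package; `ZEROEVAL_API_KEY` is the variable `ze.init()` reads):

```python
import zeroeval as ze
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # copies .env entries (e.g. ZEROEVAL_API_KEY) into os.environ
ze.init()      # picks the key up from the environment
```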

---

## Quick-start

```python
# quickstart.py
import zeroeval as ze

ze.init()  # uses ZEROEVAL_API_KEY env var

# 1. Create dataset
ds = ze.Dataset(
    name="gsm8k_sample",
    data=[
        {"question": "What is 6 times 7?", "answer": "42"},
        {"question": "What is 10 plus 7?", "answer": "17"}
    ]
)

# 2. Define task
@ze.task(outputs=["prediction"])
def solve(row):
    # Your LLM logic here (a sample llm_call is sketched after this block)
    response = llm_call(row["question"])
    return {"prediction": response}

# 3. Define evaluation
@ze.evaluation(mode="dataset", outputs=["accuracy"])
def accuracy(answer_col, prediction_col):
    correct = sum(a == p for a, p in zip(answer_col, prediction_col))
    return {"accuracy": correct / len(answer_col)}

# 4. Run and evaluate
run = ds.run(solve, workers=8).score([accuracy], answer="answer")
print(f"Accuracy: {run.metrics['accuracy']:.2%}")
```
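
The quick-start leaves `llm_call` to you; any function that maps a question string to an answer string works. One possible implementation, assuming an OpenAI key in the environment:

```python
import openai

client = openai.OpenAI()

def llm_call(question: str) -> str:
    # Ask for a bare numeric answer so exact-match scoring is fair.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{question} Answer with only the number."}],
    )
    return resp.choices[0].message.content.strip()
```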

For a fully worked multimodal example, visit the docs: https://docs.zeroeval.com/multimodal-datasets (coming soon)

---

## Creating datasets

### 1. Text / tabular

```python
# Create dataset from list
cities = ze.Dataset(
    "Cities",
    data=[
        {"name": "Paris", "population": 2_165_000},
        {"name": "Berlin", "population": 3_769_000}
    ],
    description="Example tabular dataset"
)
cities.push()

# Load dataset from CSV
ds = ze.Dataset("/path/to/data.csv")
ds.push()  # Creates new version if dataset exists
```

### 2. Multimodal (images, audio, URLs)

```python
mm = ze.Dataset(
    "Medical_Xray_Dataset",
    data=[{"patient_id": "P001", "symptoms": "Cough"}],
    description="Symptoms + chest X-ray"
)
mm.add_image(row_index=0, column_name="chest_xray", image_path="sample_images/p001.jpg")
mm.add_audio(row_index=0, column_name="verbal_notes", audio_path="notes/p001.wav")
mm.add_media_url(row_index=0, column_name="external_scan", media_url="https://example.com/scan.jpg", media_type="image")
mm.push()
```
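
The same attachment calls work in a loop when you have many rows; a sketch with a hypothetical `records` list:

```python
# Hypothetical records list; each entry supplies the image path for its row.
records = [{"patient_id": "P001", "xray_path": "sample_images/p001.jpg"}]

for i, rec in enumerate(records):
    mm.add_image(row_index=i, column_name="chest_xray", image_path=rec["xray_path"])
```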

---

## Working with Datasets

### Loading from CSV

```python
# Load dataset directly from CSV file
dataset = ze.Dataset("data.csv")
```

### Iteration

Datasets support Python's iteration protocol:

```python
# Basic iteration
for row in dataset:
    print(row.name, row.score)

# With enumerate
for i, row in enumerate(dataset):
    print(f"Row {i}: {row.name}")

# List comprehensions
high_scores = [row for row in dataset if row.score > 90]
```

### Slicing and Indexing

```python
# Single item access (returns DotDict)
first_row = dataset[0]
last_row = dataset[-1]

# Slicing (returns new Dataset)
top_10 = dataset[:10]
bottom_5 = dataset[-5:]
middle = dataset[10:20]

# Sliced datasets can be processed independently
subset = dataset[:100]
results = subset.run(my_task)
subset.push()  # Upload subset as new dataset
```
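
Because slices return independent datasets, a simple holdout split is two lines (a sketch; it assumes `len(dataset)` is supported, as the slicing protocol suggests):

```python
split = int(len(dataset) * 0.8)  # assumption: Dataset supports len()
train, test = dataset[:split], dataset[split:]
```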

### Dot Notation Access

All rows support dot notation for cleaner code:

```python
# Instead of row["column_name"]
value = row.column_name

# Works in tasks too
@ze.task(outputs=["length"])
def get_length(row):
    return {"length": len(row.text)}
```
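
For intuition, dot access can be as small as a `dict` subclass. This is a hypothetical sketch, not the SDK's actual `DotDict`:

```python
class DotDict(dict):
    """Minimal dict with attribute-style access (illustrative only)."""
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc

row = DotDict({"text": "hello"})
assert row.text == row["text"] == "hello"
```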

---

## Running experiments

```python
import zeroeval as ze

ze.init()

# Pull dataset
dataset = ze.Dataset.pull("Capitals")

# Define task
@ze.task(outputs=["prediction"])
def uppercase_task(row):
    return {"prediction": row["input"].upper()}

# Define evaluation
@ze.evaluation(mode="row", outputs=["exact_match"])
def exact_match(output, prediction):
    return {"exact_match": int(output.upper() == prediction)}

@ze.evaluation(mode="dataset", outputs=["accuracy"])
def accuracy(exact_match_col):
    return {"accuracy": sum(exact_match_col) / len(exact_match_col)}

# Run experiment
run = dataset.run(uppercase_task, workers=8)
run = run.score([exact_match, accuracy], output="output")  # keyword maps the evaluators' `output` argument to the dataset's "output" column

print(f"Accuracy: {run.metrics['accuracy']:.2%}")
```

Advanced options:

```python
# Multiple runs with ensemble
run = dataset.run(task, repeats=5, ensemble="majority", on="prediction")

# Pass@k evaluation
run = dataset.run(task, repeats=10, ensemble="pass@k", on="prediction", k=3)

# Custom aggregator
def best_length(values):
    return max(values, key=len) if values else ""

run = dataset.run(task, repeats=3, ensemble=best_length, on="prediction")
```
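
For intuition, the built-in `"majority"` ensemble behaves like a most-common-value aggregator over the repeated values of the `on` column. A conceptual sketch (the SDK's internals may differ):

```python
from collections import Counter

def majority(values):
    # Most frequent prediction wins; ties go to the value seen first.
    return Counter(values).most_common(1)[0][0]

assert majority(["42", "42", "17"]) == "42"
```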

---

## Multimodal experiment (GPT-4o)

A shortened version (full listing in the docs); note that it assumes the pulled dataset also contains an `expected_keywords` column.

```python
import zeroeval as ze, openai, base64
from pathlib import Path

ze.init()
client = openai.OpenAI()  # assumes env var OPENAI_API_KEY

def img_to_data_uri(path):
    data = Path(path).read_bytes()
    b64 = base64.b64encode(data).decode()
    return f"data:image/jpeg;base64,{b64}"

# Pull multimodal dataset
dataset = ze.Dataset.pull("Medical_Xray_Dataset")

# Define task
@ze.task(outputs=["diagnosis"])
def diagnose(row):
    # OpenAI multimodal messages take a list of content parts (text + image_url)
    messages = [
        {"role": "user", "content": [
            {"type": "text", "text": "Patient: " + row["symptoms"]},
            {
                "type": "image_url",
                "image_url": {"url": img_to_data_uri(row["chest_xray"])},
            },
        ]}
    ]
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return {"diagnosis": response.choices[0].message.content}

# Define evaluation
@ze.evaluation(mode="row", outputs=["contains_keyword"])
def check_keywords(diagnosis, expected_keywords):
    keywords = expected_keywords.lower().split(',')
    diagnosis_lower = diagnosis.lower()
    found = any(kw.strip() in diagnosis_lower for kw in keywords)
    return {"contains_keyword": int(found)}

# Run and evaluate
run = dataset.run(diagnose, workers=4)
run = run.score([check_keywords], expected_keywords="expected_keywords")
```

---

## Streaming & tracing

• **Streaming responses** – streaming guide: https://docs.zeroeval.com/streaming (coming soon)  
• **Deep observability** – tracing guide: https://docs.zeroeval.com/tracing (coming soon)  
• **Framework integrations** – see [INTEGRATIONS.md](./INTEGRATIONS.md) for automatic OpenAI, LangChain, and LangGraph tracing

---

## CLI commands

```bash
zeroeval setup              # one-time API key config (auto-saves to shell config)
zeroeval run my_script.py   # run a Python script that uses ZeroEval
```

## Testing

```bash
poetry run pytest
```