dottxt-eval

- Name: dottxt-eval
- Version: 0.19.0
- Summary: Simple and robust LLM evaluations
- Author: .txt
- Requires Python: >=3.10
- Repository: https://github.com/dottxt-ai/doteval
- Uploaded: 2025-07-30 13:54:00

# doteval

LLM evaluation library that works like pytest.

## Overview

doteval lets you write LLM evaluations as pytest test functions. It handles progress tracking, resumption after crashes, and result storage automatically.

```python
from doteval import foreach, Result
from doteval.evaluators import exact_match

@foreach("question,answer", dataset)
def eval_math_questions(question, answer, model):
    response = model.generate(question)
    return Result(prompt=question, scores=[exact_match(response, answer)])
```

`pytest` will collect every function prefixed with `eval_` in `eval_*.py` files and run it as an evaluation:

```bash
pytest eval_math_questions.py --experiment my_evaluation
```

## Installation

Requires Python 3.10 or higher.

```bash
pip install dottxt-eval
```

## Basic Usage

### 1. Write an evaluation function

```python
# eval_sentiment.py
import pytest
from doteval import foreach, Result
from doteval.evaluators import exact_match

@pytest.fixture
def model():
    return load_your_model()

data = [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
    ("It's okay", "neutral")
]

@foreach("text,label", data)
def eval_sentiment(text, label, model):
    prediction = model.classify(text)
    return Result(prompt=text, scores=[exact_match(prediction, label)])
```

### 2. Run the evaluation

```bash
pytest eval_sentiment.py --experiment sentiment_test
```

### 3. View results

```bash
doteval show sentiment_test
```

## Key Features

- **Automatic resumption**: Crashed evaluations continue where they left off
- **Session management**: Named experiments with persistent storage
- **pytest integration**: Use all pytest features (fixtures, parametrization, etc.)
- **Async support**: Built-in concurrency control for large-scale evaluations

## Examples

### Math Reasoning

```python
from datasets import load_dataset
from doteval import foreach, Result
from doteval.evaluators import exact_match

def gsm8k_dataset():
    dataset = load_dataset("gsm8k", "main", split="test", streaming=True)
    for item in dataset:
        yield (item["question"], extract_answer(item["answer"]))

@foreach("question,answer", gsm8k_dataset())
def eval_math_reasoning(question, answer, model):
    response = model.solve(question)
    return Result(prompt=question, scores=[exact_match(extract_number(response), answer)])
```
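
The example above assumes `extract_answer` and `extract_number` helpers that are not part of doteval. A minimal sketch, relying on the GSM8K convention that reference answers end with a `#### <number>` line:

```python
import re

def extract_answer(answer_text: str) -> str:
    # GSM8K reference answers end with a line such as "#### 72":
    # keep only the final number and strip thousands separators.
    return answer_text.split("####")[-1].strip().replace(",", "")

def extract_number(response: str) -> str:
    # Take the last number mentioned in the model's free-form response.
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return matches[-1] if matches else ""
```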

### Async Evaluation

```python
@foreach("text,label", large_dataset)
async def eval_classification(text, label, async_model):
    prediction = await async_model.classify(text)
    return Result(prompt=text, scores=[exact_match(prediction, label)])
```

```bash
pytest eval_classification.py --experiment async_eval
```
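
The `async_model` fixture and `large_dataset` above are placeholders you provide yourself. A minimal, hypothetical fixture might look like this (the classifier class is an assumption, not part of doteval):

```python
import pytest

class DummyAsyncClassifier:
    """Stand-in for a real async client, e.g. an HTTP-based model API."""

    async def classify(self, text: str) -> str:
        # Replace this with an actual await on your model or API client.
        return "positive" if "love" in text.lower() else "negative"

@pytest.fixture
def async_model():
    return DummyAsyncClassifier()
```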

### Custom Evaluators

```python
from doteval.evaluators import evaluator
from doteval.metrics import accuracy

@evaluator(metrics=accuracy())
def semantic_similarity(response: str, expected: str) -> bool:
    embedding1 = get_embedding(response)
    embedding2 = get_embedding(expected)
    return cosine_similarity(embedding1, embedding2) > 0.8

@foreach("question,answer", dataset)
def eval_qa(question, answer, model):
    response = model.generate(question)
    return Result(prompt=question, scores=[semantic_similarity(response, answer)])
```
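
`get_embedding` and `cosine_similarity` are left for you to supply. One possible sketch, assuming a sentence-transformers model (any embedding provider would do):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def get_embedding(text: str) -> np.ndarray:
    # Encode the text into a dense vector.
    return _embedder.encode(text)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Standard cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```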

## CLI Commands

```bash
# List all evaluation experiments
doteval list

# Show results for a specific experiment
doteval show experiment_name

# Show results with error details
doteval show experiment_name --errors

# Show full results without truncation
doteval show experiment_name --full

# Delete an experiment
doteval delete experiment_name

# Rename an experiment
doteval rename old_name new_name

# Use custom storage backend for any command
doteval list --storage sqlite://results.db
doteval show experiment_name --storage sqlite://results.db
doteval delete experiment_name --storage sqlite://results.db
doteval rename old_name new_name --storage sqlite://results.db

# Run with limited samples for testing
pytest eval_test.py --experiment eval_run --samples 10
```

## Session Management

Experiments automatically track:
- Evaluation progress and completion status
- Individual sample results and errors
- Timing and performance metrics
- Resumption state for interrupted runs

```bash
# Start named experiment
pytest eval_math.py --experiment math_baseline

# Resume if interrupted
pytest eval_math.py --experiment math_baseline  # Continues from last completed sample

# Use different storage backend
pytest eval_math.py --experiment math_baseline --storage sqlite://results.db
```

## Custom Retry and Concurrency Strategies

doteval allows you to customize retry logic and concurrency strategies for your evaluations:

### Retry Configuration

Use tenacity's `AsyncRetrying` to customize retry behavior:

```python
from tenacity import AsyncRetrying, stop_after_attempt, wait_exponential
from doteval import ForEach

# Custom retry strategy with exponential backoff
custom_retries = AsyncRetrying(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=4, max=60)
)

# Apply to all evaluations in this instance
foreach = ForEach(retries=custom_retries)

@foreach("question,answer", dataset)
async def eval_with_retries(question, answer, model):
    response = await model.generate(question)  # Up to 5 attempts in total
    return Result(prompt=question, scores=[exact_match(response, answer)])
```

### Concurrency Strategies

Control how evaluations are executed with custom concurrency strategies:

```python
from doteval import ForEach
from doteval.concurrency import SlidingWindow, Batch, Adaptive

# For async functions - sliding window concurrency
sliding_window = SlidingWindow(max_concurrency=20)
foreach = ForEach(concurrency=sliding_window)

@foreach("text,label", large_dataset)
async def eval_async(text, label, model):
    response = await model.classify(text)
    return Result(prompt=text, scores=[exact_match(response, label)])

# For sync functions - batch processing
batch_strategy = Batch(batch_size=50)
foreach_batch = ForEach(concurrency=batch_strategy)

@foreach_batch("text,label", dataset)
def eval_in_batches(text, label, model):
    response = model.classify(text)
    return Result(prompt=text, scores=[exact_match(response, label)])

# For async functions - adaptive concurrency that automatically adjusts
adaptive_strategy = Adaptive(
    initial_concurrency=5,
    min_concurrency=1,
    max_concurrency=50,
    adaptation_interval=2.0
)
foreach_adaptive = ForEach(concurrency=adaptive_strategy)

@foreach_adaptive("text,label", large_dataset)
async def eval_adaptive(text, label, model):
    response = await model.classify(text)
    return Result(prompt=text, scores=[exact_match(response, label)])
```

### Adaptive Concurrency Strategy

The `Adaptive` strategy automatically adjusts concurrency levels based on throughput to maximize performance:

```python
from doteval.concurrency import Adaptive

# Create an adaptive strategy
adaptive = Adaptive(
    initial_concurrency=5,      # Starting concurrency level
    min_concurrency=1,          # Minimum concurrency limit
    max_concurrency=100,        # Maximum concurrency limit
    adaptation_interval=2.0,    # Seconds between adaptation decisions
    increase_threshold=0.98,    # Increase if throughput ratio > this
    decrease_threshold=0.90,    # Decrease if throughput ratio < this
    stability_window=3,         # Number of measurements before changing direction
    error_backoff_factor=0.7    # Multiply concurrency by this on errors
)

foreach = ForEach(concurrency=adaptive)

@foreach("prompt,expected", large_dataset)
async def eval_with_adaptive_concurrency(prompt, expected, model):
    response = await model.generate(prompt)
    return Result(prompt=prompt, scores=[exact_match(response, expected)])

# Get adaptation statistics
stats = adaptive.get_stats()
print(f"Current concurrency: {stats['current_concurrency']}")
print(f"Throughput: {stats['throughput']:.2f} requests/second")
print(f"Total completed: {stats['total_completed']}")
```

**Key Features:**
- **Progressive Increase**: Starts conservatively and increases based on observed throughput
- **Hill-Climbing**: Continuously finds the optimal concurrency level
- **Error Backoff**: Automatically reduces concurrency when errors occur
- **Stability Windows**: Prevents oscillation by waiting for consistent improvements
- **Throughput Tracking**: Measures requests per second in a sliding window

This strategy is ideal for saturating remote GPU resources where you don't have explicit usage information.
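
To make the bullets above concrete, here is a simplified, hypothetical illustration of a hill-climbing update rule using the constructor parameters shown earlier. It is only a sketch of the idea, not doteval's actual implementation:

```python
def next_concurrency(current: int, throughput_ratio: float, had_errors: bool) -> int:
    # throughput_ratio: observed throughput relative to the best seen so far
    # (one plausible reading of the increase/decrease thresholds above).
    min_c, max_c = 1, 100
    increase_threshold, decrease_threshold = 0.98, 0.90
    error_backoff_factor = 0.7

    if had_errors:
        # Back off sharply when errors occur.
        return max(min_c, int(current * error_backoff_factor))
    if throughput_ratio > increase_threshold:
        # Throughput is still scaling well: climb one step.
        return min(max_c, current + 1)
    if throughput_ratio < decrease_threshold:
        # Throughput dropped: step back down.
        return max(min_c, current - 1)
    # Within the stability band: hold the current level.
    return current
```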

### Custom Concurrency Strategy

Implement your own concurrency strategy:

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, TypeVar

T = TypeVar('T')

class RateLimitedStrategy:
    def __init__(self, max_concurrency: int, requests_per_second: float):
        self.max_concurrency = max_concurrency
        self.requests_per_second = requests_per_second
        self.min_interval = 1.0 / requests_per_second

    async def execute(
        self,
        tasks: AsyncIterator[Callable[[], Awaitable[T]]],
        progress_callback: Callable[[T], None] | None = None
    ) -> AsyncIterator[T]:
        last_request_time = 0
        pending = set()

        async for task in tasks:
            # Rate limiting
            current_time = asyncio.get_event_loop().time()
            time_since_last = current_time - last_request_time
            if time_since_last < self.min_interval:
                await asyncio.sleep(self.min_interval - time_since_last)

            # Manage concurrency
            if len(pending) >= self.max_concurrency:
                done, pending = await asyncio.wait(
                    pending, return_when=asyncio.FIRST_COMPLETED
                )
                for completed in done:
                    result = await completed
                    if progress_callback:
                        progress_callback(result)
                    yield result

            pending.add(asyncio.create_task(task()))
            last_request_time = asyncio.get_event_loop().time()

        # Process remaining tasks
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for completed in done:
                result = await completed
                if progress_callback:
                    progress_callback(result)
                yield result

# Use the custom strategy
rate_limited = RateLimitedStrategy(max_concurrency=10, requests_per_second=5)
foreach = ForEach(concurrency=rate_limited)
```

### Complete Example

```python
from tenacity import AsyncRetrying, stop_after_attempt, wait_fixed, retry_if_exception_type
from doteval import ForEach, Result
from doteval.concurrency import SlidingWindow
from doteval.evaluators import exact_match
import aiohttp

# Configure retry for specific exceptions
api_retries = AsyncRetrying(
    retry=retry_if_exception_type((aiohttp.ClientError, TimeoutError)),
    stop=stop_after_attempt(3),
    wait=wait_fixed(2)
)

# Configure concurrency
concurrency = SlidingWindow(max_concurrency=5)

# Create configured ForEach instance
foreach = ForEach(retries=api_retries, concurrency=concurrency)

@foreach("prompt,expected", test_prompts)
async def eval_api_responses(prompt, expected, api_client):
    # This will retry on API errors and run up to 5 concurrent requests
    response = await api_client.complete(prompt)
    return Result(prompt=prompt, scores=[exact_match(response, expected)])
```

## Mixing Evaluations and Regular Tests

doteval is designed to work seamlessly alongside regular pytest tests in the same codebase. You can have both evaluation functions (marked with `@foreach`) and standard test functions in the same files, and pytest will handle them appropriately.

```python
# test_mixed.py
import pytest
from doteval import foreach
from doteval.models import Result, Score
from doteval.metrics import accuracy

# Regular pytest tests
def test_basic_functionality():
    """Regular pytest test."""
    assert 1 + 1 == 2

@pytest.fixture
def model():
    """Shared fixture for both tests and evaluations."""
    return load_your_model()

def test_model_loads(model):
    """Regular test using fixture."""
    assert model is not None

# Doteval evaluation functions
@foreach("input,expected", [("hello", "HELLO"), ("world", "WORLD")])
def eval_string_processing(input, expected, model):
    """Evaluation function using the same fixture."""
    result = model.process(input)
    score = Score("accuracy", result == expected, [accuracy()])
    return Result(score, prompt=f"Input: {input}")
```

### Running Mixed Test Suites

```bash
# Run everything together
pytest test_mixed.py -v

# Run only regular tests
pytest test_mixed.py -m "not doteval" -v

# Run only doteval functions
pytest test_mixed.py -m "doteval" -v
```

The key benefits:
- **Shared fixtures**: Both test types can use the same pytest fixtures
- **Independent execution**: Regular tests run immediately, evaluations run at session end with progress tracking
- **Selective filtering**: Use pytest markers to run subsets independently
- **No interference**: Regular pytest functionality remains completely unaffected

## Storage Backends

doteval supports multiple storage backends and allows you to implement custom ones:

```python
# Built-in backends
from doteval.sessions import SessionManager

# JSON storage (default)
manager = SessionManager(storage_path="json://.doteval")

# SQLite storage with query capabilities
manager = SessionManager(storage_path="sqlite://evaluations.db")

# Custom backend
from doteval.storage import Storage, register

class MyStorage(Storage):
    # Implement abstract methods...
    pass

register("mybackend", MyStorage)
manager = SessionManager(storage_path="mybackend://config")
```

            
