| Field | Value |
| --- | --- |
| Name | dottxt-eval |
| Version | 0.19.0 |
| Summary | Simple and robust LLM evaluations |
| Author | .txt |
| Requires Python | >=3.10 |
| License | None |
| Home page | None |
| Maintainer | None |
| Keywords | None |
| Upload time | 2025-07-30 13:54:00 |
| Requirements | None recorded |
# doteval
LLM evaluation library that works like pytest.
## Overview
doteval lets you write LLM evaluations as pytest test functions. It handles progress tracking, resumption after crashes, and result storage automatically.
```python
from doteval import foreach, Result
from doteval.evaluators import exact_match

@foreach("question,answer", dataset)
def eval_math_questions(question, answer, model):
    response = model.generate(question)
    return Result(prompt=question, scores=[exact_match(response, answer)])
```
`pytest` will collect all `eval_` functions in `eval_*.py` files and run them as an evaluation:
```bash
pytest eval_math_questions.py --experiment my_evaluation
```
## Installation
Requires Python 3.10 or higher.
```bash
pip install dottxt-eval
```
## Basic Usage
### 1. Write an evaluation function
```python
# eval_sentiment.py
import pytest
from doteval import foreach, Result
from doteval.evaluators import exact_match

@pytest.fixture
def model():
    return load_your_model()

data = [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
    ("It's okay", "neutral")
]

@foreach("text,label", data)
def eval_sentiment(text, label, model):
    prediction = model.classify(text)
    return Result(prompt=text, scores=[exact_match(prediction, label)])
```
### 2. Run the evaluation
```bash
pytest eval_sentiment.py --experiment sentiment_test
```
### 3. View results
```bash
doteval show sentiment_test
```
## Key Features
- **Automatic resumption**: Crashed evaluations continue where they left off
- **Session management**: Named experiments with persistent storage
- **pytest integration**: Use all pytest features (fixtures, parametrization, etc.)
- **Async support**: Built-in concurrency control for large-scale evaluations
## Examples
### Math Reasoning
```python
from datasets import load_dataset
from doteval import foreach, Result
from doteval.evaluators import exact_match

def gsm8k_dataset():
    dataset = load_dataset("gsm8k", "main", split="test", streaming=True)
    for item in dataset:
        yield (item["question"], extract_answer(item["answer"]))

@foreach("question,answer", gsm8k_dataset())
def eval_math_reasoning(question, answer, model):
    response = model.solve(question)
    return Result(prompt=question, scores=[exact_match(extract_number(response), answer)])
```
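The helpers `extract_answer` and `extract_number` are not part of doteval; you supply them. A minimal sketch, assuming the standard GSM8K convention that gold answers end with a line like `#### 42` and that the model's last number is taken as its final answer:

```python
import re

def extract_answer(gold: str) -> str:
    # GSM8K gold answers end with "#### <number>"; keep just that number.
    return gold.split("####")[-1].strip().replace(",", "")

def extract_number(response: str) -> str:
    # Take the last number mentioned in the model response as its final answer.
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return matches[-1] if matches else ""
```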
### Async Evaluation
```python
@foreach("text,label", large_dataset)
async def eval_classification(text, label, async_model):
    prediction = await async_model.classify(text)
    return Result(prompt=text, scores=[exact_match(prediction, label)])
```
```bash
pytest eval_classification.py --experiment async_eval
```
### Custom Evaluators
```python
from doteval.evaluators import evaluator
from doteval.metrics import accuracy

@evaluator(metrics=accuracy())
def semantic_similarity(response: str, expected: str) -> bool:
    embedding1 = get_embedding(response)
    embedding2 = get_embedding(expected)
    return cosine_similarity(embedding1, embedding2) > 0.8

@foreach("question,answer", dataset)
def eval_qa(question, answer, model):
    response = model.generate(question)
    return Result(prompt=question, scores=[semantic_similarity(response, answer)])
```
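`get_embedding` and `cosine_similarity` are placeholders for whatever embedding stack you use; they are not doteval APIs. A minimal sketch using `sentence-transformers` and NumPy (both assumptions on our part):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def get_embedding(text: str) -> np.ndarray:
    # Encode the text into a dense vector.
    return _embedder.encode(text)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Standard cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```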
## CLI Commands
```bash
# List all evaluation experiments
doteval list
# Show results for a specific experiment
doteval show experiment_name
# Show results with error details
doteval show experiment_name --errors
# Show full results without truncation
doteval show experiment_name --full
# Delete an experiment
doteval delete experiment_name
# Rename an experiment
doteval rename old_name new_name
# Use custom storage backend for any command
doteval list --storage sqlite://results.db
doteval show experiment_name --storage sqlite://results.db
doteval delete experiment_name --storage sqlite://results.db
doteval rename old_name new_name --storage sqlite://results.db
# Run with limited samples for testing
pytest eval_test.py --experiment eval_run --samples 10
```
## Session Management
Experiments automatically track:
- Evaluation progress and completion status
- Individual sample results and errors
- Timing and performance metrics
- Resumption state for interrupted runs
```bash
# Start named experiment
pytest eval_math.py --experiment math_baseline
# Resume if interrupted
pytest eval_math.py --experiment math_baseline # Continues from last completed sample
# Use different storage backend
pytest eval_math.py --experiment math_baseline --storage sqlite://results.db
```
## Custom Retry and Concurrency Strategies
doteval allows you to customize retry logic and concurrency strategies for your evaluations:
### Retry Configuration
Use tenacity's `AsyncRetrying` to customize retry behavior:
```python
from tenacity import AsyncRetrying, stop_after_attempt, wait_exponential
from doteval import ForEach

# Custom retry strategy with exponential backoff
custom_retries = AsyncRetrying(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=4, max=60)
)

# Apply to all evaluations in this instance
foreach = ForEach(retries=custom_retries)

@foreach("question,answer", dataset)
async def eval_with_retries(question, answer, model):
    response = await model.generate(question)  # Will retry up to 5 times
    return Result(prompt=question, scores=[exact_match(response, answer)])
```
### Concurrency Strategies
Control how evaluations are executed with custom concurrency strategies:
```python
from doteval import ForEach
from doteval.concurrency import SlidingWindow, Batch, Adaptive

# For async functions - sliding window concurrency
sliding_window = SlidingWindow(max_concurrency=20)
foreach = ForEach(concurrency=sliding_window)

@foreach("text,label", large_dataset)
async def eval_async(text, label, model):
    response = await model.classify(text)
    return Result(prompt=text, scores=[exact_match(response, label)])

# For sync functions - batch processing
batch_strategy = Batch(batch_size=50)
foreach_batch = ForEach(concurrency=batch_strategy)

@foreach_batch("text,label", dataset)
def eval_in_batches(text, label, model):
    response = model.classify(text)
    return Result(prompt=text, scores=[exact_match(response, label)])

# For async functions - adaptive concurrency that automatically adjusts
adaptive_strategy = Adaptive(
    initial_concurrency=5,
    min_concurrency=1,
    max_concurrency=50,
    adaptation_interval=2.0
)
foreach_adaptive = ForEach(concurrency=adaptive_strategy)

@foreach_adaptive("text,label", large_dataset)
async def eval_adaptive(text, label, model):
    response = await model.classify(text)
    return Result(prompt=text, scores=[exact_match(response, label)])
```
### Adaptive Concurrency Strategy
The `Adaptive` strategy automatically adjusts concurrency levels based on throughput to maximize performance:
```python
from doteval.concurrency import Adaptive

# Create an adaptive strategy
adaptive = Adaptive(
    initial_concurrency=5,     # Starting concurrency level
    min_concurrency=1,         # Minimum concurrency limit
    max_concurrency=100,       # Maximum concurrency limit
    adaptation_interval=2.0,   # Seconds between adaptation decisions
    increase_threshold=0.98,   # Increase if throughput ratio > this
    decrease_threshold=0.90,   # Decrease if throughput ratio < this
    stability_window=3,        # Number of measurements before changing direction
    error_backoff_factor=0.7   # Multiply concurrency by this on errors
)

foreach = ForEach(concurrency=adaptive)

@foreach("prompt,expected", large_dataset)
async def eval_with_adaptive_concurrency(prompt, expected, model):
    response = await model.generate(prompt)
    return Result(prompt=prompt, scores=[exact_match(response, expected)])

# Get adaptation statistics
stats = adaptive.get_stats()
print(f"Current concurrency: {stats['current_concurrency']}")
print(f"Throughput: {stats['throughput']:.2f} requests/second")
print(f"Total completed: {stats['total_completed']}")
```
**Key Features:**
- **Progressive Increase**: Starts conservatively and increases based on observed throughput
- **Hill-Climbing**: Continuously finds the optimal concurrency level
- **Error Backoff**: Automatically reduces concurrency when errors occur
- **Stability Windows**: Prevents oscillation by waiting for consistent improvements
- **Throughput Tracking**: Measures requests per second in a sliding window
This strategy is ideal for saturating remote GPU resources where you don't have explicit usage information.
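The exact adaptation rules live inside doteval; the sketch below only illustrates the hill-climbing idea described above. Names such as `AdaptationState` and `decide_next_concurrency` are hypothetical, not the library's API:

```python
from dataclasses import dataclass

@dataclass
class AdaptationState:
    concurrency: int
    best_throughput: float = 0.0
    stable_measurements: int = 0

def decide_next_concurrency(
    state: AdaptationState,
    throughput: float,   # requests/second measured over the last window
    had_errors: bool,
    min_c: int = 1,
    max_c: int = 100,
    increase_threshold: float = 0.98,
    decrease_threshold: float = 0.90,
    stability_window: int = 3,
    error_backoff_factor: float = 0.7,
) -> AdaptationState:
    """One hill-climbing step: back off on errors, climb when throughput keeps up."""
    if had_errors:
        # Error backoff: shrink concurrency multiplicatively.
        state.concurrency = max(min_c, int(state.concurrency * error_backoff_factor))
        state.stable_measurements = 0
        return state

    ratio = throughput / state.best_throughput if state.best_throughput else 1.0
    state.best_throughput = max(state.best_throughput, throughput)

    if ratio > increase_threshold:
        # Throughput is keeping up: wait for a stable window, then increase.
        state.stable_measurements += 1
        if state.stable_measurements >= stability_window:
            state.concurrency = min(max_c, state.concurrency + 1)
            state.stable_measurements = 0
    elif ratio < decrease_threshold:
        # Throughput dropped: step back down.
        state.concurrency = max(min_c, state.concurrency - 1)
        state.stable_measurements = 0
    return state
```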
### Custom Concurrency Strategy
Implement your own concurrency strategy:
```python
import asyncio
from typing import AsyncIterator, Callable, TypeVar

from doteval import ForEach

T = TypeVar('T')

class RateLimitedStrategy:
    def __init__(self, max_concurrency: int, requests_per_second: float):
        self.max_concurrency = max_concurrency
        self.requests_per_second = requests_per_second
        self.min_interval = 1.0 / requests_per_second

    async def execute(
        self,
        tasks: AsyncIterator[Callable[[], T]],
        progress_callback: Callable[[T], None] | None = None
    ) -> AsyncIterator[T]:
        last_request_time = 0
        pending = set()

        async for task in tasks:
            # Rate limiting
            current_time = asyncio.get_event_loop().time()
            time_since_last = current_time - last_request_time
            if time_since_last < self.min_interval:
                await asyncio.sleep(self.min_interval - time_since_last)

            # Manage concurrency
            if len(pending) >= self.max_concurrency:
                done, pending = await asyncio.wait(
                    pending, return_when=asyncio.FIRST_COMPLETED
                )
                for completed in done:
                    result = await completed
                    if progress_callback:
                        progress_callback(result)
                    yield result

            pending.add(asyncio.create_task(task()))
            last_request_time = asyncio.get_event_loop().time()

        # Process remaining tasks
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for completed in done:
                result = await completed
                if progress_callback:
                    progress_callback(result)
                yield result

# Use the custom strategy
rate_limited = RateLimitedStrategy(max_concurrency=10, requests_per_second=5)
foreach = ForEach(concurrency=rate_limited)
```
### Complete Example
```python
from tenacity import AsyncRetrying, stop_after_attempt, wait_fixed, retry_if_exception_type
from doteval import ForEach, Result
from doteval.concurrency import SlidingWindow
from doteval.evaluators import exact_match
import aiohttp

# Configure retry for specific exceptions
api_retries = AsyncRetrying(
    retry=retry_if_exception_type((aiohttp.ClientError, TimeoutError)),
    stop=stop_after_attempt(3),
    wait=wait_fixed(2)
)

# Configure concurrency
concurrency = SlidingWindow(max_concurrency=5)

# Create configured ForEach instance
foreach = ForEach(retries=api_retries, concurrency=concurrency)

@foreach("prompt,expected", test_prompts)
async def eval_api_responses(prompt, expected, api_client):
    # This will retry on API errors and run up to 5 concurrent requests
    response = await api_client.complete(prompt)
    return Result(prompt=prompt, scores=[exact_match(response, expected)])
```
## Mixing Evaluations and Regular Tests
doteval is designed to work seamlessly alongside regular pytest tests in the same codebase. You can have both evaluation functions (marked with `@foreach`) and standard test functions in the same files, and pytest will handle them appropriately.
```python
# test_mixed.py
import pytest
from doteval import foreach
from doteval.models import Result, Score
from doteval.metrics import accuracy

# Regular pytest tests
def test_basic_functionality():
    """Regular pytest test."""
    assert 1 + 1 == 2

@pytest.fixture
def model():
    """Shared fixture for both tests and evaluations."""
    return load_your_model()

def test_model_loads(model):
    """Regular test using fixture."""
    assert model is not None

# Doteval evaluation functions
@foreach("input,expected", [("hello", "HELLO"), ("world", "WORLD")])
def eval_string_processing(input, expected, model):
    """Evaluation function using the same fixture."""
    result = model.process(input)
    score = Score("accuracy", result == expected, [accuracy()])
    return Result(score, prompt=f"Input: {input}")
```
### Running Mixed Test Suites
```bash
# Run everything together
pytest test_mixed.py -v
# Run only regular tests
pytest test_mixed.py -m "not doteval" -v
# Run only doteval functions
pytest test_mixed.py -m "doteval" -v
```
The key benefits:
- **Shared fixtures**: Both test types can use the same pytest fixtures
- **Independent execution**: Regular tests run immediately, evaluations run at session end with progress tracking
- **Selective filtering**: Use pytest markers to run subsets independently
- **No interference**: Regular pytest functionality remains completely unaffected
## Storage Backends
doteval supports multiple storage backends and allows you to implement custom ones:
```python
# Built-in backends
from doteval.sessions import SessionManager

# JSON storage (default)
manager = SessionManager(storage_path="json://.doteval")

# SQLite storage with query capabilities
manager = SessionManager(storage_path="sqlite://evaluations.db")

# Custom backend
from doteval.storage import Storage, register

class MyStorage(Storage):
    # Implement abstract methods...
    pass

register("mybackend", MyStorage)
manager = SessionManager(storage_path="mybackend://config")
```
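Once registered, the `mybackend://` scheme should work anywhere a storage URL is accepted, mirroring the `--storage` examples earlier (this usage is an assumption based on those examples, not documented separately):

```bash
pytest eval_math.py --experiment math_baseline --storage mybackend://config
doteval show math_baseline --storage mybackend://config
```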
Raw data
{
"_id": null,
"home_page": null,
"name": "dottxt-eval",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": null,
"author": ".txt",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/c3/02/1489ea5069e8cb41ab3cfd780a1f6450fc8b2cc2e7b8784d8f68a1719312/dottxt_eval-0.19.0.tar.gz",
"platform": null,
"description": "# doteval\n\nLLM evaluation library that works like pytest.\n\n## Overview\n\ndoteval lets you write LLM evaluations as pytest test functions. It handles progress tracking, resumption after crashes, and result storage automatically.\n\n```python\nfrom doteval import foreach, Result\nfrom doteval.evaluators import exact_match\n\n@foreach(\"question,answer\", dataset)\ndef eval_math_questions(question, answer, model):\n response = model.generate(question)\n return Result(prompt=question, scores=[exact_match(response, answer)])\n```\n\n`pytest` will collect all `eval_` functions in `eval_*.py` files and run them as an evaluation:\n\n```bash\npytest eval_math_questions.py --experiment my_evaluation\n```\n\n## Installation\n\nRequires Python 3.10 or higher.\n\n```bash\npip install dottxt-eval\n```\n\n## Basic Usage\n\n### 1. Write an evaluation function\n\n```python\n# eval_sentiment.py\nimport pytest\nfrom doteval import foreach, Result\nfrom doteval.evaluators import exact_match\n\n@pytest.fixture\ndef model():\n return load_your_model()\n\ndata = [\n (\"I love this!\", \"positive\"),\n (\"This is terrible\", \"negative\"),\n (\"It's okay\", \"neutral\")\n]\n\n@foreach(\"text,label\", data)\ndef eval_sentiment(text, label, model):\n prediction = model.classify(text)\n return Result(prompt=text, scores=[exact_match(prediction, label)])\n```\n\n### 2. Run the evaluation\n\n```bash\npytest eval_sentiment.py --experiment sentiment_test\n```\n\n### 3. View results\n\n```bash\ndoteval show sentiment_test\n```\n\n## Key Features\n\n- **Automatic resumption**: Crashed evaluations continue where they left off\n- **Session management**: Named experiments with persistent storage\n- **pytest integration**: Use all pytest features (fixtures, parametrization, etc.)\n- **Async support**: Built-in concurrency control for large-scale evaluations\n\n## Examples\n\n### Math Reasoning\n\n```python\nfrom datasets import load_dataset\nfrom doteval import foreach, Result\nfrom doteval.evaluators import exact_match\n\ndef gsm8k_dataset():\n dataset = load_dataset(\"gsm8k\", \"main\", split=\"test\", streaming=True)\n for item in dataset:\n yield (item[\"question\"], extract_answer(item[\"answer\"]))\n\n@foreach(\"question,answer\", gsm8k_dataset())\ndef eval_math_reasoning(question, answer, model):\n response = model.solve(question)\n return Result(prompt=question, scores=[exact_match(extract_number(response), answer)])\n```\n\n### Async Evaluation\n\n```python\n@foreach(\"text,label\", large_dataset)\nasync def eval_classification(text, label, async_model):\n prediction = await async_model.classify(text)\n return Result(prompt=text, scores=[exact_match(prediction, label)])\n```\n\n```bash\npytest eval_classification.py --experiment async_eval\n```\n\n### Custom Evaluators\n\n```python\nfrom doteval.evaluators import evaluator\nfrom doteval.metrics import accuracy\n\n@evaluator(metrics=accuracy())\ndef semantic_similarity(response: str, expected: str) -> bool:\n embedding1 = get_embedding(response)\n embedding2 = get_embedding(expected)\n return cosine_similarity(embedding1, embedding2) > 0.8\n\n@foreach(\"question,answer\", dataset)\ndef eval_qa(question, answer, model):\n response = model.generate(question)\n return Result(prompt=question, scores=[semantic_similarity(response, answer)])\n```\n\n## CLI Commands\n\n```bash\n# List all evaluation experiments\ndoteval list\n\n# Show results for a specific experiment\ndoteval show experiment_name\n\n# Show results with error details\ndoteval show 
experiment_name --errors\n\n# Show full results without truncation\ndoteval show experiment_name --full\n\n# Delete an experiment\ndoteval delete experiment_name\n\n# Rename an experiment\ndoteval rename old_name new_name\n\n# Use custom storage backend for any command\ndoteval list --storage sqlite://results.db\ndoteval show experiment_name --storage sqlite://results.db\ndoteval delete experiment_name --storage sqlite://results.db\ndoteval rename old_name new_name --storage sqlite://results.db\n\n# Run with limited samples for testing\npytest eval_test.py --experiment eval_run --samples 10\n```\n\n## Session Management\n\nExperiments automatically track:\n- Evaluation progress and completion status\n- Individual sample results and errors\n- Timing and performance metrics\n- Resumption state for interrupted runs\n\n```bash\n# Start named experiment\npytest eval_math.py --experiment math_baseline\n\n# Resume if interrupted\npytest eval_math.py --experiment math_baseline # Continues from last completed sample\n\n# Use different storage backend\npytest eval_math.py --experiment math_baseline --storage sqlite://results.db\n```\n\n## Custom Retry and Concurrency Strategies\n\ndoteval allows you to customize retry logic and concurrency strategies for your evaluations:\n\n### Retry Configuration\n\nUse tenacity's `AsyncRetrying` to customize retry behavior:\n\n```python\nfrom tenacity import AsyncRetrying, stop_after_attempt, wait_exponential\nfrom doteval import ForEach\n\n# Custom retry strategy with exponential backoff\ncustom_retries = AsyncRetrying(\n stop=stop_after_attempt(5),\n wait=wait_exponential(multiplier=1, min=4, max=60)\n)\n\n# Apply to all evaluations in this instance\nforeach = ForEach(retries=custom_retries)\n\n@foreach(\"question,answer\", dataset)\nasync def eval_with_retries(question, answer, model):\n response = await model.generate(question) # Will retry up to 5 times\n return Result(prompt=question, scores=[exact_match(response, answer)])\n```\n\n### Concurrency Strategies\n\nControl how evaluations are executed with custom concurrency strategies:\n\n```python\nfrom doteval import ForEach\nfrom doteval.concurrency import SlidingWindow, Batch, Adaptive\n\n# For async functions - sliding window concurrency\nsliding_window = SlidingWindow(max_concurrency=20)\nforeach = ForEach(concurrency=sliding_window)\n\n@foreach(\"text,label\", large_dataset)\nasync def eval_async(text, label, model):\n response = await model.classify(text)\n return Result(prompt=text, scores=[exact_match(response, label)])\n\n# For sync functions - batch processing\nbatch_strategy = Batch(batch_size=50)\nforeach_batch = ForEach(concurrency=batch_strategy)\n\n@foreach_batch(\"text,label\", dataset)\ndef eval_in_batches(text, label, model):\n response = model.classify(text)\n return Result(prompt=text, scores=[exact_match(response, label)])\n\n# For async functions - adaptive concurrency that automatically adjusts\nadaptive_strategy = Adaptive(\n initial_concurrency=5,\n min_concurrency=1,\n max_concurrency=50,\n adaptation_interval=2.0\n)\nforeach_adaptive = ForEach(concurrency=adaptive_strategy)\n\n@foreach_adaptive(\"text,label\", large_dataset)\nasync def eval_adaptive(text, label, model):\n response = await model.classify(text)\n return Result(prompt=text, scores=[exact_match(response, label)])\n```\n\n### Adaptive Concurrency Strategy\n\nThe `Adaptive` strategy automatically adjusts concurrency levels based on throughput to maximize performance:\n\n```python\nfrom doteval.concurrency import 
Adaptive\n\n# Create an adaptive strategy\nadaptive = Adaptive(\n initial_concurrency=5, # Starting concurrency level\n min_concurrency=1, # Minimum concurrency limit\n max_concurrency=100, # Maximum concurrency limit\n adaptation_interval=2.0, # Seconds between adaptation decisions\n increase_threshold=0.98, # Increase if throughput ratio > this\n decrease_threshold=0.90, # Decrease if throughput ratio < this\n stability_window=3, # Number of measurements before changing direction\n error_backoff_factor=0.7 # Multiply concurrency by this on errors\n)\n\nforeach = ForEach(concurrency=adaptive)\n\n@foreach(\"prompt,expected\", large_dataset)\nasync def eval_with_adaptive_concurrency(prompt, expected, model):\n response = await model.generate(prompt)\n return Result(prompt=prompt, scores=[exact_match(response, expected)])\n\n# Get adaptation statistics\nstats = adaptive.get_stats()\nprint(f\"Current concurrency: {stats['current_concurrency']}\")\nprint(f\"Throughput: {stats['throughput']:.2f} requests/second\")\nprint(f\"Total completed: {stats['total_completed']}\")\n```\n\n**Key Features:**\n- **Progressive Increase**: Starts conservatively and increases based on observed throughput\n- **Hill-Climbing**: Continuously finds the optimal concurrency level\n- **Error Backoff**: Automatically reduces concurrency when errors occur\n- **Stability Windows**: Prevents oscillation by waiting for consistent improvements\n- **Throughput Tracking**: Measures requests per second in a sliding window\n\nThis strategy is ideal for saturating remote GPU resources where you don't have explicit usage information.\n\n### Custom Concurrency Strategy\n\nImplement your own concurrency strategy:\n\n```python\nimport asyncio\nfrom typing import AsyncIterator, Callable, TypeVar\n\nT = TypeVar('T')\n\nclass RateLimitedStrategy:\n def __init__(self, max_concurrency: int, requests_per_second: float):\n self.max_concurrency = max_concurrency\n self.requests_per_second = requests_per_second\n self.min_interval = 1.0 / requests_per_second\n\n async def execute(\n self,\n tasks: AsyncIterator[Callable[[], T]],\n progress_callback: Callable[[T], None] | None = None\n ) -> AsyncIterator[T]:\n last_request_time = 0\n pending = set()\n\n async for task in tasks:\n # Rate limiting\n current_time = asyncio.get_event_loop().time()\n time_since_last = current_time - last_request_time\n if time_since_last < self.min_interval:\n await asyncio.sleep(self.min_interval - time_since_last)\n\n # Manage concurrency\n if len(pending) >= self.max_concurrency:\n done, pending = await asyncio.wait(\n pending, return_when=asyncio.FIRST_COMPLETED\n )\n for completed in done:\n result = await completed\n if progress_callback:\n progress_callback(result)\n yield result\n\n pending.add(asyncio.create_task(task()))\n last_request_time = asyncio.get_event_loop().time()\n\n # Process remaining tasks\n while pending:\n done, pending = await asyncio.wait(\n pending, return_when=asyncio.FIRST_COMPLETED\n )\n for completed in done:\n result = await completed\n if progress_callback:\n progress_callback(result)\n yield result\n\n# Use the custom strategy\nrate_limited = RateLimitedStrategy(max_concurrency=10, requests_per_second=5)\nforeach = ForEach(concurrency=rate_limited)\n```\n\n### Complete Example\n\n```python\nfrom tenacity import AsyncRetrying, stop_after_attempt, wait_fixed, retry_if_exception_type\nfrom doteval import ForEach, Result\nfrom doteval.concurrency import SlidingWindow\nfrom doteval.evaluators import exact_match\nimport aiohttp\n\n# 
Configure retry for specific exceptions\napi_retries = AsyncRetrying(\n retry=retry_if_exception_type((aiohttp.ClientError, TimeoutError)),\n stop=stop_after_attempt(3),\n wait=wait_fixed(2)\n)\n\n# Configure concurrency\nconcurrency = SlidingWindow(max_concurrency=5)\n\n# Create configured ForEach instance\nforeach = ForEach(retries=api_retries, concurrency=concurrency)\n\n@foreach(\"prompt,expected\", test_prompts)\nasync def eval_api_responses(prompt, expected, api_client):\n # This will retry on API errors and run up to 5 concurrent requests\n response = await api_client.complete(prompt)\n return Result(prompt=response, scores=[exact_match(response, expected)])\n```\n\n## Mixing Evaluations and Regular Tests\n\ndoteval is designed to work seamlessly alongside regular pytest tests in the same codebase. You can have both evaluation functions (marked with `@foreach`) and standard test functions in the same files, and pytest will handle them appropriately.\n\n```python\n# test_mixed.py\nimport pytest\nfrom doteval import foreach\nfrom doteval.models import Result, Score\nfrom doteval.metrics import accuracy\n\n# Regular pytest tests\ndef test_basic_functionality():\n \"\"\"Regular pytest test.\"\"\"\n assert 1 + 1 == 2\n\n@pytest.fixture\ndef model():\n \"\"\"Shared fixture for both tests and evaluations.\"\"\"\n return load_your_model()\n\ndef test_model_loads(model):\n \"\"\"Regular test using fixture.\"\"\"\n assert model is not None\n\n# Doteval evaluation functions\n@foreach(\"input,expected\", [(\"hello\", \"HELLO\"), (\"world\", \"WORLD\")])\ndef eval_string_processing(input, expected, model):\n \"\"\"Evaluation function using the same fixture.\"\"\"\n result = model.process(input)\n score = Score(\"accuracy\", result == expected, [accuracy()])\n return Result(score, prompt=f\"Input: {input}\")\n```\n\n### Running Mixed Test Suites\n\n```bash\n# Run everything together\npytest test_mixed.py -v\n\n# Run only regular tests\npytest test_mixed.py -m \"not doteval\" -v\n\n# Run only doteval functions\npytest test_mixed.py -m \"doteval\" -v\n```\n\nThe key benefits:\n- **Shared fixtures**: Both test types can use the same pytest fixtures\n- **Independent execution**: Regular tests run immediately, evaluations run at session end with progress tracking\n- **Selective filtering**: Use pytest markers to run subsets independently\n- **No interference**: Regular pytest functionality remains completely unaffected\n\n## Storage Backends\n\ndoteval supports multiple storage backends and allows you to implement custom ones:\n\n```python\n# Built-in backends\nfrom doteval.sessions import SessionManager\n\n# JSON storage (default)\nmanager = SessionManager(storage_path=\"json://.doteval\")\n\n# SQLite storage with query capabilities\nmanager = SessionManager(storage_path=\"sqlite://evaluations.db\")\n\n# Custom backend\nfrom doteval.storage import Storage, register\n\nclass MyStorage(Storage):\n # Implement abstract methods...\n pass\n\nregister(\"mybackend\", MyStorage)\nmanager = SessionManager(storage_path=\"mybackend://config\")\n```\n",
"bugtrack_url": null,
"license": null,
"summary": "Simple and robust LLM evaluations",
"version": "0.19.0",
"project_urls": {
"repository": "https://github.com/dottxt-ai/doteval"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "005d1992aefa0683100673373d12eb86925748311bc0aaacd94eee7c8f09970d",
"md5": "187a9221a3dcb6f263decd04b54edb79",
"sha256": "735b41ca10385aafad200566ef67b96ec5961f5355ef623203f7e11f92fe8664"
},
"downloads": -1,
"filename": "dottxt_eval-0.19.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "187a9221a3dcb6f263decd04b54edb79",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 50198,
"upload_time": "2025-07-30T13:53:58",
"upload_time_iso_8601": "2025-07-30T13:53:58.977058Z",
"url": "https://files.pythonhosted.org/packages/00/5d/1992aefa0683100673373d12eb86925748311bc0aaacd94eee7c8f09970d/dottxt_eval-0.19.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "c3021489ea5069e8cb41ab3cfd780a1f6450fc8b2cc2e7b8784d8f68a1719312",
"md5": "f9dd1e5412df632c76283fe24435872f",
"sha256": "e98e84fa9c5eafde353e64c386d72caf41a209359ef4b825a5b216f2c4616632"
},
"downloads": -1,
"filename": "dottxt_eval-0.19.0.tar.gz",
"has_sig": false,
"md5_digest": "f9dd1e5412df632c76283fe24435872f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 230286,
"upload_time": "2025-07-30T13:54:00",
"upload_time_iso_8601": "2025-07-30T13:54:00.280233Z",
"url": "https://files.pythonhosted.org/packages/c3/02/1489ea5069e8cb41ab3cfd780a1f6450fc8b2cc2e7b8784d8f68a1719312/dottxt_eval-0.19.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-30 13:54:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "dottxt-ai",
"github_project": "doteval",
"github_not_found": true,
"lcname": "dottxt-eval"
}