tool-scorer

Name: tool-scorer
Version: 1.2.0
Summary: Evaluate LLM tool usage and function calling accuracy with comprehensive metrics, support for OpenAI/Anthropic/LangChain, pytest integration, and beautiful reports
Upload time: 2025-10-18 10:59:19
Requires Python: >=3.10
License: Apache-2.0
Keywords: accuracy, agent-evaluation, ai-agents, anthropic, benchmarking, claude, evaluation, function-calling, gpt, langchain, llm, llm-testing, metrics, openai, testing, tool-use
Requirements: none recorded

            <p align="center">
  <img src="assets/logo.png" alt="Toolscore Logo" width="200"/>
</p>

<h1 align="center">Toolscore</h1>

<p align="center">
  <em>A Python package for evaluating LLM tool usage against gold standard specifications</em>
</p>

[![PyPI version](https://badge.fury.io/py/tool-scorer.svg)](https://badge.fury.io/py/tool-scorer)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![Downloads](https://static.pepy.tech/badge/tool-scorer)](https://pepy.tech/project/tool-scorer)
[![Python Versions](https://img.shields.io/pypi/pyversions/tool-scorer.svg)](https://pypi.org/project/tool-scorer/)
[![CI](https://github.com/yotambraun/toolscore/workflows/CI/badge.svg)](https://github.com/yotambraun/toolscore/actions)
[![codecov](https://codecov.io/gh/yotambraun/toolscore/branch/main/graph/badge.svg)](https://codecov.io/gh/yotambraun/toolscore)

Toolscore helps developers evaluate the tool-using behavior of LLM-based agents by comparing recorded tool usage traces against gold-standard specifications, producing detailed metrics and reports.

## What is Toolscore?

**Toolscore evaluates LLM tool usage** - it doesn't call LLM APIs directly. Think of it as a testing framework for function-calling agents:

- ✅ **Evaluates** existing tool usage traces from OpenAI, Anthropic, or custom sources
- ✅ **Compares** actual behavior against expected gold standards
- ✅ **Reports** detailed metrics on accuracy, efficiency, and correctness
- ❌ **Does NOT** call LLM APIs or execute tools (you capture traces separately)

**Use Toolscore to:**
- Benchmark different LLM models on tool usage tasks
- Validate that your agent calls the right tools with the right arguments
- Track improvements in function calling accuracy over time
- Compare agent performance across different prompting strategies

## Features

- **Trace vs. Spec Comparison**: Load agent tool-use traces (OpenAI, Anthropic, LangChain, or custom) and compare against gold standard specifications
- **Comprehensive Metrics Suite**:
  - Tool Invocation Accuracy
  - Tool Selection Accuracy
  - **NEW**: Tool Correctness (were all expected tools called?)
  - Tool Call Sequence Edit Distance
  - Argument Match F1 Score
  - **NEW**: Parameter Schema Validation (types, ranges, patterns)
  - Redundant Call Rate
  - Side-Effect Success Rate
  - Latency/Cost Attribution
  - **NEW**: Integrated LLM-as-a-judge semantic evaluation
- **Multiple Trace Adapters**: Built-in support for OpenAI, Anthropic Claude, LangChain, and custom JSON formats
- **CLI and API**: Command-line interface and Python API for programmatic use
- **Beautiful Console Output**: Color-coded metrics, tables, and progress indicators with Rich
- **Rich Output Reports**: Interactive HTML and machine-readable JSON reports
- **Pytest Integration**: Seamless test integration with pytest plugin and assertion helpers
- **Interactive Tutorials**: Jupyter notebooks for hands-on learning
- **Example Datasets**: 5 realistic gold standards for common agent types (weather, ecommerce, code, RAG, multi-tool)
- **Extensible Checks**: Validate side-effects like HTTP calls, file creation, database queries
- **Automated Releases**: Semantic versioning with conventional commits

## Why Toolscore?

| Feature | Toolscore | Manual Testing | Basic Assertions |
|---------|-----------|----------------|------------------|
| **Multiple LLM Support** | ✅ OpenAI, Anthropic, LangChain, Custom | ❌ | ❌ |
| **Comprehensive Metrics** | ✅ 10+ metrics | ❌ | ⚠️ Basic |
| **Schema Validation** | ✅ Types, ranges, patterns | ❌ | ❌ |
| **Tool Correctness** | ✅ Deterministic coverage check | ❌ | ❌ |
| **Semantic Evaluation** | ✅ Integrated LLM-as-a-judge | ❌ | ❌ |
| **Example Datasets** | ✅ 5 realistic templates | ❌ | ❌ |
| **Pytest Integration** | ✅ Native plugin | ❌ | ⚠️ Manual |
| **Beautiful Reports** | ✅ HTML + JSON | ❌ | ❌ |
| **Side-effect Validation** | ✅ HTTP, FS, DB | ❌ | ❌ |
| **Sequence Analysis** | ✅ Edit distance | ❌ | ❌ |
| **Interactive Tutorials** | ✅ Jupyter notebooks | ❌ | ❌ |
| **CI/CD Ready** | ✅ GitHub Actions | ⚠️ Custom | ⚠️ Custom |
| **Type Safety** | ✅ Fully typed | ❌ | ❌ |

## Installation

```bash
# Install from PyPI
pip install tool-scorer

# Or install from source
git clone https://github.com/yotambraun/toolscore.git
cd toolscore
pip install -e .
```

### Optional Dependencies

```bash
# Install with HTTP validation support
pip install tool-scorer[http]

# Install with LLM-as-a-judge metrics (requires OpenAI API key)
pip install tool-scorer[llm]

# Install with LangChain support
pip install tool-scorer[langchain]

# Install all optional features
pip install tool-scorer[all]
```

### Development Installation

```bash
# Install with development dependencies
pip install -e ".[dev]"

# Install with dev + docs dependencies
pip install -e ".[dev,docs]"

# Or using uv (faster)
uv pip install -e ".[dev]"
```

## ⚡ What's New in v1.1

Toolscore v1.1 focuses on **making evaluation incredibly easy and intuitive** with powerful new features:

### 🚀 Zero-Friction Onboarding (`toolscore init`)
Interactive project setup in 30 seconds. Choose your agent type (Weather, E-commerce, Code, RAG, Multi-tool) and get pre-built templates plus ready-to-use examples.

```bash
toolscore init
# Follow prompts → Start evaluating in 30 seconds
```

### ⚡ Synthetic Test Generator (`toolscore generate`)
Create comprehensive test suites automatically from OpenAI function schemas. Generates varied test cases with edge cases and boundary values - no manual test writing needed.

```bash
toolscore generate --from-openai functions.json --count 20
# Creates 20 test cases with normal + edge + boundary variations
```

### 📊 Quick Compare (`toolscore compare`)
Compare multiple models side-by-side in one command. See which model (GPT-4, Claude, Gemini, etc.) performs best on each metric with beautiful comparison tables.

```bash
toolscore compare gold.json gpt4.json claude.json gemini.json \
  -n gpt-4 -n claude-3 -n gemini-1.5
# Shows color-coded comparison table with overall winner
```

### 🔍 Interactive Debug Mode (`--debug`)
Step through failures one-by-one with guided troubleshooting. See exactly what went wrong and get actionable fix suggestions for each mismatch.

```bash
toolscore eval gold.json trace.json --debug
# Navigate mismatches interactively with context-specific suggestions
```

### 💡 Actionable Error Messages
Automatic failure detection with specific fix suggestions. No more guessing: the output tells you exactly what to try next (use `--llm-judge`, check schemas, review arguments, etc.).

### 🎯 Tool Correctness Metric
Deterministic evaluation of whether all expected tools were called - goes beyond just checking individual call matches.

### 🧠 Integrated LLM-as-a-Judge
Semantic evaluation is now built into the core - just add the `--llm-judge` flag to catch equivalent but differently-named tools (e.g., `search` vs `web_search`).

### 🔒 Parameter Schema Validation
Validate argument types, ranges, patterns, and constraints - catch type errors, out-of-range values, and missing required fields.

### 📦 Example Datasets
5 realistic gold standards for common agent types (weather, ecommerce, code assistant, RAG, multi-tool) - start evaluating in 30 seconds!

## Quick Start

### 🚀 30-Second Start

The fastest way to start evaluating:

```bash
# Install
pip install tool-scorer

# Initialize project (interactive)
toolscore init

# Evaluate (included templates)
toolscore eval gold_calls.json example_trace.json
```

Done! You now have evaluation results with detailed metrics.

### 5-Minute Complete Workflow

1. **Install Toolscore:**
   ```bash
   pip install tool-scorer
   ```

2. **Initialize a project** (choose from 5 agent types):
   ```bash
   toolscore init
   # Select agent type → Get templates + examples
   ```

3. **Generate test cases** (if you have OpenAI function schemas):
   ```bash
   toolscore generate --from-openai functions.json --count 20
   ```

4. **Run evaluation** with your agent's trace:
   ```bash
   # Basic evaluation
   toolscore eval gold_calls.json my_trace.json --html report.html

   # With semantic matching (catches similar tool names)
   toolscore eval gold_calls.json my_trace.json --llm-judge

   # With interactive debugging
   toolscore eval gold_calls.json my_trace.json --debug
   ```

5. **Compare multiple models:**
   ```bash
   toolscore compare gold.json gpt4.json claude.json \
     -n gpt-4 -n claude-3
   ```

6. **View results:**
   - Console shows color-coded metrics
   - Open `report.html` for interactive analysis
   - Check `toolscore.json` for machine-readable results

**Want to test with your own LLM?** See the [Complete Tutorial](TUTORIAL.md) for step-by-step instructions on capturing traces from OpenAI/Anthropic APIs.
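
If you want a self-contained starting point before reading the tutorial, the sketch below captures one round of tool calls with the OpenAI Python SDK (v1) and writes them in the OpenAI-style trace format shown under Trace Formats. It is an illustrative assumption, not part of Toolscore: the model name, tool schema, and file names are placeholders to adapt to your setup.

```python
# capture_trace_openai.py -- illustrative sketch, not shipped with Toolscore.
# Assumes `pip install openai` (SDK v1) and OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    tools=tools,
)

# Re-shape the returned tool calls into the OpenAI-style trace format
# documented under "Trace Formats" below.
trace = []
for call in response.choices[0].message.tool_calls or []:
    trace.append({
        "role": "assistant",
        "function_call": {
            "name": call.function.name,
            "arguments": call.function.arguments,  # JSON-encoded string
        },
    })

with open("trace.json", "w") as f:
    json.dump(trace, f, indent=2)
```

Then evaluate it against a matching gold file with `toolscore eval gold_calls.json trace.json`.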

### Command Line Usage

```bash
# ===== GETTING STARTED =====

# Initialize new project (interactive)
toolscore init

# Generate test cases from OpenAI function schemas
toolscore generate --from-openai functions.json --count 20 -o gold.json

# Validate trace file format
toolscore validate trace.json

# ===== EVALUATION =====

# Basic evaluation
toolscore eval gold_calls.json trace.json

# With HTML report
toolscore eval gold_calls.json trace.json --html report.html

# With semantic matching (LLM-as-a-judge)
toolscore eval gold_calls.json trace.json --llm-judge

# With interactive debugging
toolscore eval gold_calls.json trace.json --debug

# Verbose output (shows missing/extra tools)
toolscore eval gold_calls.json trace.json --verbose

# Specify trace format explicitly
toolscore eval gold_calls.json trace.json --format openai

# Use realistic example dataset
toolscore eval examples/datasets/ecommerce_agent.json trace.json

# ===== MULTI-MODEL COMPARISON =====

# Compare multiple models side-by-side
toolscore compare gold.json gpt4.json claude.json gemini.json

# With custom model names
toolscore compare gold.json model1.json model2.json \
  -n "GPT-4" -n "Claude-3-Opus"

# Save comparison report
toolscore compare gold.json *.json -o comparison.json
```

### Python API

```python
from toolscore import evaluate_trace

# Run evaluation
result = evaluate_trace(
    gold_file="gold_calls.json",
    trace_file="trace.json",
    format="auto"  # auto-detect format
)

# Access metrics
print(f"Invocation Accuracy: {result.metrics['invocation_accuracy']:.2%}")
print(f"Selection Accuracy: {result.metrics['selection_accuracy']:.2%}")

sequence = result.metrics['sequence_metrics']
print(f"Sequence Accuracy: {sequence['sequence_accuracy']:.2%}")

arguments = result.metrics['argument_metrics']
print(f"Argument F1: {arguments['f1']:.2%}")
```

### Pytest Integration

Toolscore includes a pytest plugin for seamless test integration:

```python
# test_my_agent.py
def test_agent_accuracy(toolscore_eval, toolscore_assertions):
    """Test that agent achieves high accuracy."""
    result = toolscore_eval("gold_calls.json", "trace.json")

    # Use built-in assertions
    toolscore_assertions.assert_invocation_accuracy(result, min_accuracy=0.9)
    toolscore_assertions.assert_selection_accuracy(result, min_accuracy=0.9)
    toolscore_assertions.assert_argument_f1(result, min_f1=0.8)
```

The plugin is automatically loaded when you install Toolscore. See the [examples](examples/test_example_with_pytest.py) for more patterns.

### Interactive Tutorials

Try Toolscore in your browser with our Jupyter notebooks:

- [Quickstart Tutorial](examples/notebooks/01_quickstart.ipynb) - 5-minute introduction
- [Custom Formats](examples/notebooks/02_custom_formats.ipynb) - Working with custom traces
- [Advanced Metrics](examples/notebooks/03_advanced_metrics.ipynb) - Deep dive into metrics

Open them in [Google Colab](https://colab.research.google.com/) for instant experimentation.

## Gold Standard Format

Create a `gold_calls.json` file defining the expected tool calls:

```json
[
  {
    "tool": "make_file",
    "args": {
      "filename": "poem.txt",
      "lines_of_text": ["Roses are red,", "Violets are blue."]
    },
    "side_effects": {
      "file_exists": "poem.txt"
    },
    "description": "Create a file with a poem"
  }
]
```

## Trace Formats

Toolscore supports multiple trace formats:

### OpenAI Format

```json
[
  {
    "role": "assistant",
    "function_call": {
      "name": "get_weather",
      "arguments": "{\"location\": \"Boston\"}"
    }
  }
]
```

### Anthropic Format

```json
[
  {
    "role": "assistant",
    "content": [
      {
        "type": "tool_use",
        "id": "toolu_123",
        "name": "search",
        "input": {"query": "Python"}
      }
    ]
  }
]
```

### LangChain Format

```json
[
  {
    "tool": "search",
    "tool_input": {"query": "Python tutorials"},
    "log": "Invoking search..."
  }
]
```

Or modern format:

```json
[
  {
    "name": "search",
    "args": {"query": "Python"},
    "id": "call_123"
  }
]
```

### Custom Format

```json
{
  "calls": [
    {
      "tool": "read_file",
      "args": {"path": "data.txt"},
      "result": "file contents"
    }
  ]
}
```

## Metrics Explained

### Tool Invocation Accuracy
Measures whether the agent invoked tools when needed and refrained when not needed.

### Tool Selection Accuracy
Proportion of tool calls that match expected tool names.

### Tool Correctness (NEW)
Checks if all expected tools were called at least once - complements selection accuracy by measuring coverage rather than per-call matching.
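
The coverage idea reduces to a set comparison over tool names. A minimal sketch for intuition (illustrative values, not Toolscore's implementation):

```python
# Illustrative only: does the trace cover every tool named in the gold standard?
expected = {"search", "get_weather"}           # tools listed in the gold standard
called = ["search", "search", "summarize"]     # tool names observed in the trace

correctness = len(expected & set(called)) / len(expected)  # 0.5
missing = expected - set(called)                           # {"get_weather"} was never called
```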

### Sequence Edit Distance
Levenshtein distance between expected and actual tool call sequences.
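
The distance is taken over the ordered lists of tool names, so lower is better. A standard Levenshtein sketch over tool-name lists for intuition (illustrative, not Toolscore's exact code):

```python
# Illustrative Levenshtein distance over tool-name sequences.
def edit_distance(expected: list[str], actual: list[str]) -> int:
    prev = list(range(len(actual) + 1))
    for i, e in enumerate(expected, start=1):
        curr = [i]
        for j, a in enumerate(actual, start=1):
            cost = 0 if e == a else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance(["search", "summarize"], ["search", "fetch", "summarize"]))  # 1
```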

### Argument Match F1
Precision and recall of argument correctness across all tool calls.
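
Roughly: precision is the share of provided argument key/value pairs that match the gold arguments, recall is the share of gold pairs that were actually provided, and F1 is their harmonic mean. A single-call sketch for intuition (illustrative; Toolscore's matching rules may be more nuanced):

```python
# Illustrative precision/recall/F1 over argument key-value pairs for one call.
gold = {"location": "Boston", "units": "celsius"}
actual = {"location": "Boston", "units": "fahrenheit", "verbose": True}

matched = sum(1 for k, v in actual.items() if gold.get(k) == v)  # 1 (location)
precision = matched / len(actual)                                # 1/3
recall = matched / len(gold)                                     # 1/2
f1 = 2 * precision * recall / (precision + recall)               # 0.4
```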

### Schema Validation (NEW)
Validates argument types, numeric ranges, string patterns, enums, and required fields. Define schemas in your gold standard:

```json
{
  "tool": "search",
  "args": {"query": "test", "limit": 10},
  "metadata": {
    "schema": {
      "query": {"type": "string", "minLength": 1},
      "limit": {"type": "integer", "minimum": 1, "maximum": 100}
    }
  }
}
```

### Redundant Call Rate
Percentage of unnecessary or duplicate tool calls.
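
For example, an exact repeat of an earlier `(tool, arguments)` pair adds nothing new. An illustrative way to count such duplicates (Toolscore's own heuristics may differ):

```python
# Illustrative: rate of exact-duplicate (tool, arguments) calls in a trace.
import json

calls = [
    {"tool": "search", "args": {"query": "python"}},
    {"tool": "search", "args": {"query": "python"}},   # duplicate
    {"tool": "summarize", "args": {"text": "..."}},
]

seen, redundant = set(), 0
for call in calls:
    key = (call["tool"], json.dumps(call["args"], sort_keys=True))
    if key in seen:
        redundant += 1
    seen.add(key)

redundant_rate = redundant / len(calls)  # 1/3
```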

### Side-Effect Success Rate
Proportion of validated side-effects (HTTP, filesystem, database) that succeeded.

### LLM-as-a-judge Semantic Evaluation (Integrated)
Now built into core evaluation! Use the `--llm-judge` flag to evaluate semantic equivalence beyond exact string matching. Perfect for catching cases where tool names differ but the intent matches (e.g., `search_web` vs `web_search`).

```bash
# CLI usage - easiest way
toolscore eval gold.json trace.json --llm-judge
```

```python
# Python API
from toolscore import evaluate_trace

result = evaluate_trace("gold.json", "trace.json", use_llm_judge=True)
print(f"Semantic Score: {result.metrics['semantic_metrics']['semantic_score']:.2%}")
```

## Project Structure

```
toolscore/
├── adapters/          # Trace format adapters
│   ├── openai.py
│   ├── anthropic.py
│   └── custom.py
├── metrics/           # Metric calculators
│   ├── accuracy.py
│   ├── sequence.py
│   ├── arguments.py
│   └── ...
├── validators/        # Side-effect validators
│   ├── http.py
│   ├── filesystem.py
│   └── database.py
├── reports/           # Report generators
├── cli.py            # CLI interface
└── core.py           # Core evaluation logic
```

## Development

```bash
# Install dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=toolscore

# Type checking
mypy toolscore

# Linting and formatting
ruff check toolscore
ruff format toolscore
```

## Real-World Use Cases

### 1. Model Evaluation & Selection
Compare GPT-4 vs Claude vs Gemini on your specific tool-calling tasks:

```python
models = ["gpt-4", "claude-3-5-sonnet", "gemini-pro"]
results = {}

for model in models:
    trace = capture_trace(model, task="customer_support")
    result = evaluate_trace("gold_standard.json", trace)
    results[model] = result.metrics['selection_accuracy']

best_model = max(results, key=results.get)
print(f"Best model: {best_model} ({results[best_model]:.1%} accuracy)")
```

### 2. CI/CD Integration
Catch regressions in agent behavior before deployment:

```python
# test_agent_quality.py
def test_agent_meets_sla(toolscore_eval, toolscore_assertions):
    """Ensure agent meets 95% accuracy SLA."""
    result = toolscore_eval("gold_standard.json", "production_trace.json")
    toolscore_assertions.assert_selection_accuracy(result, min_accuracy=0.95)
    toolscore_assertions.assert_redundancy_rate(result, max_rate=0.1)
```

### 3. Prompt Engineering Optimization
A/B test different prompts and measure impact:

```python
prompts = ["prompt_v1.txt", "prompt_v2.txt", "prompt_v3.txt"]

for prompt_file in prompts:
    trace = run_agent_with_prompt(prompt_file)
    result = evaluate_trace("gold_standard.json", trace)

    print(f"{prompt_file}:")
    print(f"  Selection: {result.metrics['selection_accuracy']:.1%}")
    print(f"  Arguments: {result.metrics['argument_metrics']['f1']:.1%}")
    print(f"  Efficiency: {result.metrics['efficiency_metrics']['redundant_rate']:.1%}")
```

### 4. Production Monitoring
Track agent performance over time in production:

```python
# Run daily
today_traces = collect_production_traces(date=today)
result = evaluate_trace("gold_standard.json", today_traces)

# Alert if degradation
if result.metrics['selection_accuracy'] < 0.90:
    send_alert("Agent performance degraded!")

# Log metrics to dashboard
log_to_datadog({
    "accuracy": result.metrics['selection_accuracy'],
    "redundancy": result.metrics['efficiency_metrics']['redundant_rate'],
})
```

## Documentation

- **[ReadTheDocs](https://toolscore.readthedocs.io/)** - Complete API documentation
- **[Complete Tutorial](TUTORIAL.md)** - In-depth guide with end-to-end workflow
- **[Example Datasets](examples/datasets/)** - 5 realistic gold standards (weather, ecommerce, code, RAG, multi-tool)
- **[Examples Directory](examples/)** - Sample traces and capture scripts
- **[Jupyter Notebooks](examples/notebooks/)** - Interactive tutorials
- **[Contributing Guide](CONTRIBUTING.md)** - How to contribute to Toolscore

## What's New

### v1.1.0 (Latest - October 2025)

**Major Product Improvements:**
- **🧠 Integrated LLM-as-a-Judge**: Semantic evaluation now built into core with `--llm-judge` flag
- **🎯 Tool Correctness Metric**: Deterministic check for complete tool coverage
- **🔒 Parameter Schema Validation**: Validate types, ranges, patterns, enums in arguments
- **📦 Example Datasets**: 5 realistic gold standards (weather, ecommerce, code, RAG, multi-tool)
- **Enhanced Console Output**: Beautiful tables showing tool coverage and schema validation

### v1.0.x

- **LLM-as-a-judge metrics**: Semantic correctness evaluation using OpenAI API
- **LangChain adapter**: Support for LangChain agent traces (legacy and modern formats)
- **Beautiful console output**: Color-coded metrics with Rich library
- **Pytest plugin**: Seamless test integration with fixtures and assertions
- **Interactive tutorials**: Jupyter notebooks for hands-on learning
- **Comprehensive documentation**: Sphinx docs on ReadTheDocs
- **Test coverage**: Increased to 80%+ with 123 passing tests
- **Automated releases**: Semantic versioning with conventional commits
- **Enhanced PyPI presence**: 16 searchable keywords, Beta status, comprehensive classifiers

See [CHANGELOG.md](CHANGELOG.md) for full release history.

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

Apache License 2.0 - see [LICENSE](LICENSE) for details.

## Citation

If you use Toolscore in your research, please cite:

```bibtex
@software{toolscore,
  title = {Toolscore: LLM Tool Usage Evaluation Package},
  author = {Yotam Braun},
  year = {2025},
  url = {https://github.com/yotambraun/toolscore}
}
```

            
