<p align="center">
<img src="assets/logo.png" alt="Toolscore Logo" width="200"/>
</p>
<h1 align="center">Toolscore</h1>
<p align="center">
<em>A Python package for evaluating LLM tool usage against gold standard specifications</em>
</p>
Toolscore helps developers evaluate the tool-using behavior of LLM-based agents by comparing recorded tool usage traces against gold-standard specifications, producing detailed metrics and reports.
## What is Toolscore?
**Toolscore evaluates LLM tool usage** - it doesn't call LLM APIs directly. Think of it as a testing framework for function-calling agents:
- ✅ **Evaluates** existing tool usage traces from OpenAI, Anthropic, or custom sources
- ✅ **Compares** actual behavior against expected gold standards
- ✅ **Reports** detailed metrics on accuracy, efficiency, and correctness
- ❌ **Does NOT** call LLM APIs or execute tools (you capture traces separately)
**Use Toolscore to:**
- Benchmark different LLM models on tool usage tasks
- Validate that your agent calls the right tools with the right arguments
- Track improvements in function calling accuracy over time
- Compare agent performance across different prompting strategies
## Features
- **Trace vs. Spec Comparison**: Load agent tool-use traces (OpenAI, Anthropic, LangChain, or custom) and compare against gold standard specifications
- **Comprehensive Metrics Suite**:
- Tool Invocation Accuracy
- Tool Selection Accuracy
- **NEW**: Tool Correctness (were all expected tools called?)
- Tool Call Sequence Edit Distance
- Argument Match F1 Score
- **NEW**: Parameter Schema Validation (types, ranges, patterns)
- Redundant Call Rate
- Side-Effect Success Rate
- Latency/Cost Attribution
- **NEW**: Integrated LLM-as-a-judge semantic evaluation
- **Multiple Trace Adapters**: Built-in support for OpenAI, Anthropic Claude, LangChain, and custom JSON formats
- **CLI and API**: Command-line interface and Python API for programmatic use
- **Beautiful Console Output**: Color-coded metrics, tables, and progress indicators with Rich
- **Rich Output Reports**: Interactive HTML and machine-readable JSON reports
- **Pytest Integration**: Seamless test integration with pytest plugin and assertion helpers
- **Interactive Tutorials**: Jupyter notebooks for hands-on learning
- **Example Datasets**: 5 realistic gold standards for common agent types (weather, ecommerce, code, RAG, multi-tool)
- **Extensible Checks**: Validate side-effects like HTTP calls, file creation, database queries
- **Automated Releases**: Semantic versioning with conventional commits
## Why Toolscore?
| Feature | Toolscore | Manual Testing | Basic Assertions |
|---------|-----------|----------------|------------------|
| **Multiple LLM Support** | ✅ OpenAI, Anthropic, LangChain, Custom | ❌ | ❌ |
| **Comprehensive Metrics** | ✅ 10+ metrics | ❌ | ⚠️ Basic |
| **Schema Validation** | ✅ Types, ranges, patterns | ❌ | ❌ |
| **Tool Correctness** | ✅ Deterministic coverage check | ❌ | ❌ |
| **Semantic Evaluation** | ✅ Integrated LLM-as-a-judge | ❌ | ❌ |
| **Example Datasets** | ✅ 5 realistic templates | ❌ | ❌ |
| **Pytest Integration** | ✅ Native plugin | ❌ | ⚠️ Manual |
| **Beautiful Reports** | ✅ HTML + JSON | ❌ | ❌ |
| **Side-effect Validation** | ✅ HTTP, FS, DB | ❌ | ❌ |
| **Sequence Analysis** | ✅ Edit distance | ❌ | ❌ |
| **Interactive Tutorials** | ✅ Jupyter notebooks | ❌ | ❌ |
| **CI/CD Ready** | ✅ GitHub Actions | ⚠️ Custom | ⚠️ Custom |
| **Type Safety** | ✅ Fully typed | ❌ | ❌ |
## Installation
```bash
# Install from PyPI
pip install tool-scorer
# Or install from source
git clone https://github.com/yotambraun/toolscore.git
cd toolscore
pip install -e .
```
### Optional Dependencies
```bash
# Install with HTTP validation support
pip install tool-scorer[http]
# Install with LLM-as-a-judge metrics (requires OpenAI API key)
pip install tool-scorer[llm]
# Install with LangChain support
pip install tool-scorer[langchain]
# Install all optional features
pip install tool-scorer[all]
```
### Development Installation
```bash
# Install with development dependencies
pip install -e ".[dev]"
# Install with dev + docs dependencies
pip install -e ".[dev,docs]"
# Or using uv (faster)
uv pip install -e ".[dev]"
```
## ⚡ What's New in v1.1
Toolscore v1.1 focuses on **making evaluation incredibly easy and intuitive** with powerful new features:
### 🚀 Zero-Friction Onboarding (`toolscore init`)
Interactive project setup in 30 seconds. Choose your agent type (Weather, E-commerce, Code, RAG, Multi-tool) and get pre-built templates and ready-to-use examples.
```bash
toolscore init
# Follow prompts → Start evaluating in 30 seconds
```
### ⚡ Synthetic Test Generator (`toolscore generate`)
Create comprehensive test suites automatically from OpenAI function schemas. Generates varied test cases with edge cases and boundary values - no manual test writing needed.
```bash
toolscore generate --from-openai functions.json --count 20
# Creates 20 test cases with normal + edge + boundary variations
```
### 📊 Quick Compare (`toolscore compare`)
Compare multiple models side-by-side in one command. See which model (GPT-4, Claude, Gemini, etc.) performs best on each metric with beautiful comparison tables.
```bash
toolscore compare gold.json gpt4.json claude.json gemini.json \
-n gpt-4 -n claude-3 -n gemini-1.5
# Shows color-coded comparison table with overall winner
```
### 🔍 Interactive Debug Mode (`--debug`)
Step through failures one-by-one with guided troubleshooting. See exactly what went wrong and get actionable fix suggestions for each mismatch.
```bash
toolscore eval gold.json trace.json --debug
# Navigate mismatches interactively with context-specific suggestions
```
### 💡 Actionable Error Messages
Automatic failure detection with specific fix suggestions. No more guessing: the output tells you exactly what to try next (use `--llm-judge`, check schemas, review arguments, etc.).
### 🎯 Tool Correctness Metric
Deterministic evaluation of whether all expected tools were called - goes beyond just checking individual call matches.
### 🧠 Integrated LLM-as-a-Judge
Semantic evaluation is now built into the core - just add the `--llm-judge` flag to catch equivalent but differently named tools (e.g., `search` vs `web_search`).
### 🔒 Parameter Schema Validation
Validate argument types, ranges, patterns, and constraints - catch type errors, out-of-range values, and missing required fields.
### 📦 Example Datasets
5 realistic gold standards for common agent types (weather, ecommerce, code assistant, RAG, multi-tool) - start evaluating in 30 seconds!
## Quick Start
### 🚀 30-Second Start
The fastest way to start evaluating:
```bash
# Install
pip install tool-scorer
# Initialize project (interactive)
toolscore init
# Evaluate (included templates)
toolscore eval gold_calls.json example_trace.json
```
Done! You now have evaluation results with detailed metrics.
### 5-Minute Complete Workflow
1. **Install Toolscore:**
```bash
pip install tool-scorer
```
2. **Initialize a project** (choose from 5 agent types):
```bash
toolscore init
# Select agent type → Get templates + examples
```
3. **Generate test cases** (if you have OpenAI function schemas):
```bash
toolscore generate --from-openai functions.json --count 20
```
4. **Run evaluation** with your agent's trace:
```bash
# Basic evaluation
toolscore eval gold_calls.json my_trace.json --html report.html
# With semantic matching (catches similar tool names)
toolscore eval gold_calls.json my_trace.json --llm-judge
# With interactive debugging
toolscore eval gold_calls.json my_trace.json --debug
```
5. **Compare multiple models:**
```bash
toolscore compare gold.json gpt4.json claude.json \
-n gpt-4 -n claude-3
```
6. **View results:**
- Console shows color-coded metrics
- Open `report.html` for interactive analysis
- Check `toolscore.json` for machine-readable results
**Want to test with your own LLM?** See the [Complete Tutorial](TUTORIAL.md) for step-by-step instructions on capturing traces from OpenAI/Anthropic APIs.
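For a rough picture of what trace capture can look like, here is a minimal sketch using the `openai>=1.0` Python SDK. The model name, tool schema, and file names are placeholders, and the output mirrors the OpenAI-style trace format documented later in this README; TUTORIAL.md remains the reference for the supported workflow.

```python
# Hypothetical sketch: capture tool calls from one OpenAI completion into a trace file.
# Model name, tool schema, and file names are placeholders.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }],
)

# Serialize assistant tool calls into the OpenAI-style trace format shown below.
trace = [
    {
        "role": "assistant",
        "function_call": {
            "name": call.function.name,
            "arguments": call.function.arguments,  # JSON-encoded string
        },
    }
    for call in (response.choices[0].message.tool_calls or [])
]

with open("my_trace.json", "w") as f:
    json.dump(trace, f, indent=2)
```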
### Command Line Usage
```bash
# ===== GETTING STARTED =====
# Initialize new project (interactive)
toolscore init
# Generate test cases from OpenAI function schemas
toolscore generate --from-openai functions.json --count 20 -o gold.json
# Validate trace file format
toolscore validate trace.json
# ===== EVALUATION =====
# Basic evaluation
toolscore eval gold_calls.json trace.json
# With HTML report
toolscore eval gold_calls.json trace.json --html report.html
# With semantic matching (LLM-as-a-judge)
toolscore eval gold_calls.json trace.json --llm-judge
# With interactive debugging
toolscore eval gold_calls.json trace.json --debug
# Verbose output (shows missing/extra tools)
toolscore eval gold_calls.json trace.json --verbose
# Specify trace format explicitly
toolscore eval gold_calls.json trace.json --format openai
# Use realistic example dataset
toolscore eval examples/datasets/ecommerce_agent.json trace.json
# ===== MULTI-MODEL COMPARISON =====
# Compare multiple models side-by-side
toolscore compare gold.json gpt4.json claude.json gemini.json
# With custom model names
toolscore compare gold.json model1.json model2.json \
-n "GPT-4" -n "Claude-3-Opus"
# Save comparison report
toolscore compare gold.json *.json -o comparison.json
```
### Python API
```python
from toolscore import evaluate_trace
# Run evaluation
result = evaluate_trace(
gold_file="gold_calls.json",
trace_file="trace.json",
format="auto" # auto-detect format
)
# Access metrics
print(f"Invocation Accuracy: {result.metrics['invocation_accuracy']:.2%}")
print(f"Selection Accuracy: {result.metrics['selection_accuracy']:.2%}")
sequence = result.metrics['sequence_metrics']
print(f"Sequence Accuracy: {sequence['sequence_accuracy']:.2%}")
arguments = result.metrics['argument_metrics']
print(f"Argument F1: {arguments['f1']:.2%}")
```
### Pytest Integration
Toolscore includes a pytest plugin for seamless test integration:
```python
# test_my_agent.py
def test_agent_accuracy(toolscore_eval, toolscore_assertions):
"""Test that agent achieves high accuracy."""
result = toolscore_eval("gold_calls.json", "trace.json")
# Use built-in assertions
toolscore_assertions.assert_invocation_accuracy(result, min_accuracy=0.9)
toolscore_assertions.assert_selection_accuracy(result, min_accuracy=0.9)
toolscore_assertions.assert_argument_f1(result, min_f1=0.8)
```
The plugin is automatically loaded when you install Toolscore. See the [examples](examples/test_example_with_pytest.py) for more patterns.
### Interactive Tutorials
Try Toolscore in your browser with our Jupyter notebooks:
- [Quickstart Tutorial](examples/notebooks/01_quickstart.ipynb) - 5-minute introduction
- [Custom Formats](examples/notebooks/02_custom_formats.ipynb) - Working with custom traces
- [Advanced Metrics](examples/notebooks/03_advanced_metrics.ipynb) - Deep dive into metrics
Open them in [Google Colab](https://colab.research.google.com/) for instant experimentation.
## Gold Standard Format
Create a `gold_calls.json` file defining the expected tool calls:
```json
[
{
"tool": "make_file",
"args": {
"filename": "poem.txt",
"lines_of_text": ["Roses are red,", "Violets are blue."]
},
"side_effects": {
"file_exists": "poem.txt"
},
"description": "Create a file with a poem"
}
]
```
## Trace Formats
Toolscore supports multiple trace formats:
### OpenAI Format
```json
[
{
"role": "assistant",
"function_call": {
"name": "get_weather",
"arguments": "{\"location\": \"Boston\"}"
}
}
]
```
### Anthropic Format
```json
[
{
"role": "assistant",
"content": [
{
"type": "tool_use",
"id": "toolu_123",
"name": "search",
"input": {"query": "Python"}
}
]
}
]
```
### LangChain Format
```json
[
{
"tool": "search",
"tool_input": {"query": "Python tutorials"},
"log": "Invoking search..."
}
]
```
Or modern format:
```json
[
{
"name": "search",
"args": {"query": "Python"},
"id": "call_123"
}
]
```
### Custom Format
```json
{
"calls": [
{
"tool": "read_file",
"args": {"path": "data.txt"},
"result": "file contents"
}
]
}
```
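If your framework isn't covered by the built-in adapters, you can emit the custom format yourself. The sketch below assumes a hypothetical `run_agent` generator that yields tool name, arguments, and result; only the output structure follows the `calls` schema shown above.

```python
# Hypothetical sketch: write a custom-format trace from your own agent loop.
# `run_agent` and its return shape are stand-ins for your framework.
import json

def run_agent(task: str):
    """Yields (tool_name, args_dict, result) tuples as the agent works."""
    yield ("read_file", {"path": "data.txt"}, "file contents")

calls = [
    {"tool": tool, "args": args, "result": result}
    for tool, args, result in run_agent("summarize data.txt")
]

with open("trace.json", "w") as f:
    json.dump({"calls": calls}, f, indent=2)
```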
## Metrics Explained
### Tool Invocation Accuracy
Measures whether the agent invoked tools when needed and refrained when not needed.
### Tool Selection Accuracy
Proportion of tool calls that match expected tool names.
### Tool Correctness (NEW)
Checks if all expected tools were called at least once - complements selection accuracy by measuring coverage rather than per-call matching.
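For intuition only (not the package's internal implementation), coverage boils down to the fraction of expected tools that appear anywhere in the trace:

```python
# Illustrative only: tool correctness as set coverage of expected tool names.
expected = {"search", "get_weather", "send_email"}
called = ["search", "get_weather", "search"]  # actual calls, duplicates allowed

coverage = len(expected & set(called)) / len(expected)
print(f"Tool correctness: {coverage:.0%}")  # 67% - send_email was never called
```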
### Sequence Edit Distance
Levenshtein distance between expected and actual tool call sequences.
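As a worked example of the idea (Toolscore's implementation may normalize or weight this differently), the standard dynamic-programming recurrence over tool-name sequences looks like this:

```python
# Illustrative only: Levenshtein distance between expected and actual tool sequences.
def edit_distance(expected: list[str], actual: list[str]) -> int:
    m, n = len(expected), len(actual)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete everything
    for j in range(n + 1):
        dp[0][j] = j  # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if expected[i - 1] == actual[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[m][n]

# One extra "translate" call -> distance 1
print(edit_distance(["search", "summarize"], ["search", "translate", "summarize"]))
```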
### Argument Match F1
Precision and recall of argument correctness across all tool calls.
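As a back-of-the-envelope illustration (not Toolscore's exact matching rules), treating each key/value pair as an item gives the familiar precision/recall/F1:

```python
# Illustrative only: precision/recall/F1 over argument key/value pairs for one call.
expected = {"location": "Boston", "units": "celsius"}
actual = {"location": "Boston", "units": "fahrenheit"}

matched = sum(1 for k, v in actual.items() if expected.get(k) == v)
precision = matched / len(actual)   # 1/2: one of the provided args is correct
recall = matched / len(expected)    # 1/2: one of the expected args was produced
f1 = 2 * precision * recall / (precision + recall)
print(f"Argument F1: {f1:.2f}")     # 0.50
```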
### Schema Validation (NEW)
Validates argument types, numeric ranges, string patterns, enums, and required fields. Define schemas in your gold standard:
```json
{
"tool": "search",
"args": {"query": "test", "limit": 10},
"metadata": {
"schema": {
"query": {"type": "string", "minLength": 1},
"limit": {"type": "integer", "minimum": 1, "maximum": 100}
}
}
}
```
### Redundant Call Rate
Percentage of unnecessary or duplicate tool calls.
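Again for intuition only (the library's definition of redundancy may be broader), a duplicate-based version of the rate can be counted like this:

```python
# Illustrative only: fraction of calls that repeat an earlier (tool, args) pair.
import json

calls = [
    {"tool": "search", "args": {"query": "Python"}},
    {"tool": "search", "args": {"query": "Python"}},  # exact duplicate
    {"tool": "summarize", "args": {"text": "..."}},
]

seen: set[tuple[str, str]] = set()
redundant = 0
for call in calls:
    key = (call["tool"], json.dumps(call["args"], sort_keys=True))
    if key in seen:
        redundant += 1
    seen.add(key)

print(f"Redundant call rate: {redundant / len(calls):.0%}")  # 33%
```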
### Side-Effect Success Rate
Proportion of validated side-effects (HTTP, filesystem, database) that succeeded.
### LLM-as-a-judge Semantic Evaluation (Integrated)
Now built into core evaluation! Use the `--llm-judge` flag to evaluate semantic equivalence beyond exact string matching - perfect for catching cases where tool names differ but the intent matches (e.g., `search_web` vs `web_search`).
```bash
# CLI usage - easiest way
toolscore eval gold.json trace.json --llm-judge
```

```python
# Python API
result = evaluate_trace("gold.json", "trace.json", use_llm_judge=True)
print(f"Semantic Score: {result.metrics['semantic_metrics']['semantic_score']:.2%}")
```
## Project Structure
```
toolscore/
├── adapters/ # Trace format adapters
│ ├── openai.py
│ ├── anthropic.py
│ └── custom.py
├── metrics/ # Metric calculators
│ ├── accuracy.py
│ ├── sequence.py
│ ├── arguments.py
│ └── ...
├── validators/ # Side-effect validators
│ ├── http.py
│ ├── filesystem.py
│ └── database.py
├── reports/ # Report generators
├── cli.py # CLI interface
└── core.py # Core evaluation logic
```
## Development
```bash
# Install dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=toolscore
# Type checking
mypy toolscore
# Linting and formatting
ruff check toolscore
ruff format toolscore
```
## Real-World Use Cases
### 1. Model Evaluation & Selection
Compare GPT-4 vs Claude vs Gemini on your specific tool-calling tasks:
```python
models = ["gpt-4", "claude-3-5-sonnet", "gemini-pro"]
results = {}
for model in models:
trace = capture_trace(model, task="customer_support")
result = evaluate_trace("gold_standard.json", trace)
results[model] = result.metrics['selection_accuracy']
best_model = max(results, key=results.get)
print(f"Best model: {best_model} ({results[best_model]:.1%} accuracy)")
```
### 2. CI/CD Integration
Catch regressions in agent behavior before deployment:
```python
# test_agent_quality.py
def test_agent_meets_sla(toolscore_eval, toolscore_assertions):
"""Ensure agent meets 95% accuracy SLA."""
result = toolscore_eval("gold_standard.json", "production_trace.json")
toolscore_assertions.assert_selection_accuracy(result, min_accuracy=0.95)
toolscore_assertions.assert_redundancy_rate(result, max_rate=0.1)
```
### 3. Prompt Engineering Optimization
A/B test different prompts and measure impact:
```python
prompts = ["prompt_v1.txt", "prompt_v2.txt", "prompt_v3.txt"]
for prompt_file in prompts:
trace = run_agent_with_prompt(prompt_file)
result = evaluate_trace("gold_standard.json", trace)
print(f"{prompt_file}:")
print(f" Selection: {result.metrics['selection_accuracy']:.1%}")
print(f" Arguments: {result.metrics['argument_metrics']['f1']:.1%}")
print(f" Efficiency: {result.metrics['efficiency_metrics']['redundant_rate']:.1%}")
```
### 4. Production Monitoring
Track agent performance over time in production:
```python
# Run daily
today_traces = collect_production_traces(date=today)
result = evaluate_trace("gold_standard.json", today_traces)
# Alert if degradation
if result.metrics['selection_accuracy'] < 0.90:
send_alert("Agent performance degraded!")
# Log metrics to dashboard
log_to_datadog({
"accuracy": result.metrics['selection_accuracy'],
"redundancy": result.metrics['efficiency_metrics']['redundant_rate'],
})
```
## Documentation
- **[ReadTheDocs](https://toolscore.readthedocs.io/)** - Complete API documentation
- **[Complete Tutorial](TUTORIAL.md)** - In-depth guide with end-to-end workflow
- **[Example Datasets](examples/datasets/)** - 5 realistic gold standards (weather, ecommerce, code, RAG, multi-tool)
- **[Examples Directory](examples/)** - Sample traces and capture scripts
- **[Jupyter Notebooks](examples/notebooks/)** - Interactive tutorials
- **[Contributing Guide](CONTRIBUTING.md)** - How to contribute to Toolscore
## What's New
### v1.1.0 (Latest - October 2025)
**Major Product Improvements:**
- **🧠 Integrated LLM-as-a-Judge**: Semantic evaluation now built into core with `--llm-judge` flag
- **🎯 Tool Correctness Metric**: Deterministic check for complete tool coverage
- **🔒 Parameter Schema Validation**: Validate types, ranges, patterns, enums in arguments
- **📦 Example Datasets**: 5 realistic gold standards (weather, ecommerce, code, RAG, multi-tool)
- **Enhanced Console Output**: Beautiful tables showing tool coverage and schema validation
### v1.0.x
- **LLM-as-a-judge metrics**: Semantic correctness evaluation using OpenAI API
- **LangChain adapter**: Support for LangChain agent traces (legacy and modern formats)
- **Beautiful console output**: Color-coded metrics with Rich library
- **Pytest plugin**: Seamless test integration with fixtures and assertions
- **Interactive tutorials**: Jupyter notebooks for hands-on learning
- **Comprehensive documentation**: Sphinx docs on ReadTheDocs
- **Test coverage**: Increased to 80%+ with 123 passing tests
- **Automated releases**: Semantic versioning with conventional commits
- **Enhanced PyPI presence**: 16 searchable keywords, Beta status, comprehensive classifiers
See [CHANGELOG.md](CHANGELOG.md) for full release history.
## Contributing
Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
## License
Apache License 2.0 - see [LICENSE](LICENSE) for details.
## Citation
If you use Toolscore in your research, please cite:
```bibtex
@software{toolscore,
title = {Toolscore: LLM Tool Usage Evaluation Package},
author = {Yotam Braun},
year = {2025},
url = {https://github.com/yotambraun/toolscore}
}
```