pymcpevals

Name: pymcpevals
Version: 0.1.1
Summary: Python package for evaluating MCP (Model Context Protocol) server implementations using LLM-based scoring
Homepage: https://github.com/akshay5995/pymcpevals
Author: Akshay Ram Vignesh <akshay5995@gmail.com>
Requires-Python: >=3.10
License: MIT
Keywords: mcp, evaluation, testing, llm, ai
Upload time: 2025-07-27 07:17:20

# PyMCPEvals

> **⚠️ Still Under Development** - APIs may change. Use with caution in production.

**Server-focused evaluation framework for MCP (Model Context Protocol) servers.**

🚀 **Test your MCP server's capabilities, not LLM conversation patterns.**

PyMCPEvals answers one question: **"Are my MCP server's tools working correctly and being used as expected?"**

It does this by separating what you **can control** (your server) from what you **cannot** (LLM behavior):

### ✅ What You Control (We Test This)
- Tool implementation correctness
- Tool parameter validation  
- Error handling and recovery
- Tool result formatting
- Multi-turn state management

### ❌ What You Cannot Control (We Ignore This)
- LLM conversation patterns
- How LLMs choose to use tools
- LLM response formatting
- Whether LLMs provide intermediate responses

## Key Pain Points Solved

- **🚫 Manual Tool Testing**: Automated assertions verify exact tool calls
- **❓ Multi-step Failures**: Track tool chaining across conversation turns
- **πŸ› Silent Tool Errors**: Instant feedback when expected tools aren't called
- **πŸ“Š CI/CD Integration**: JUnit XML output for automated testing pipelines

## Quick Start

```bash
pip install pymcpevals
pymcpevals init            # Create template config
pymcpevals run evals.yaml  # Run evaluations
```

## Example Configuration

```yaml
model:
  provider: openai
  name: gpt-4

server:
  command: ["python", "my_server.py"]

evaluations:
  - name: "weather_check"
    prompt: "What's the weather in Boston?"
    expected_tools: ["get_weather"]  # ✅ Validates tool usage
    expected_result: "Should call weather API and return conditions"
    threshold: 3.5
    
  - name: "multi_step"
    turns:
      - role: "user"
        content: "What's the weather in London?"
        expected_tools: ["get_weather"]
      - role: "user"  
        content: "And in Paris?"
        expected_tools: ["get_weather"]
    expected_result: "Should provide weather for both cities"
    threshold: 4.0
```
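
The multi-turn `multi_step` case maps directly onto the `ConversationTurn` model used by the pytest integration below. For reference, here is a rough Python equivalent of those turns (field names taken from the pytest examples in this README):

```python
# Rough Python equivalent of the multi_step YAML case above.
from pymcpevals import ConversationTurn

multi_step_turns = [
    ConversationTurn(role="user", content="What's the weather in London?",
                     expected_tools=["get_weather"]),
    ConversationTurn(role="user", content="And in Paris?",
                     expected_tools=["get_weather"]),
]
```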

**Output**: Pass/fail status, tool validation, execution metrics, and server-focused scoring.

## How It Works

1. **Connect** to your MCP server via FastMCP
2. **Execute** prompts and track tool calls
3. **Validate** expected tools are called (instant feedback)
4. **Evaluate** server performance (ignores LLM style)
5. **Report** results with actionable insights
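
To make the pipeline concrete, here is a minimal sketch of steps 1–3 written directly against the FastMCP client. This is illustrative only, not pymcpevals' internal code: the server path and tool name are placeholders from the examples above, and the expected-tools check is done by hand.

```python
# Illustrative only: steps 1-3 of the pipeline, by hand, via FastMCP.
import asyncio

from fastmcp import Client

async def check_weather_tool() -> None:
    called: list[str] = []
    async with Client("my_server.py") as client:  # 1. Connect to the server
        # 2. Execute a tool call and track it
        result = await client.call_tool("get_weather", {"location": "Boston"})
        called.append("get_weather")
        print(result)
    # 3. Validate the expected tools were called (instant, no LLM involved)
    assert set(called) >= {"get_weather"}, "get_weather was not called"

asyncio.run(check_weather_tool())
```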

## What Makes This Different

**Precise Tool Assertions**: Unlike traditional evaluations that judge LLM responses, PyMCPEvals validates:

- ✅ **Exact tool calls**: `assert_tools_called(result, ["add", "multiply"])`
- ✅ **Tool execution success**: `assert_no_tool_errors(result)`
- ✅ **Multi-turn trajectories**: Test tool chaining across conversation steps
- ✅ **Instant failure detection**: No expensive LLM evaluation for obvious failures
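
Conceptually, each assertion is a thin check over the evaluation result. A simplified sketch of what a tool-call assertion boils down to (the `tools_called` attribute is hypothetical; the actual result object may expose called tools differently):

```python
# Simplified sketch of a tool-call assertion. The `tools_called`
# attribute is hypothetical -- pymcpevals' real result object may
# expose called tools under a different name.
def assert_tools_called(result, expected: list[str]) -> None:
    missing = [tool for tool in expected if tool not in result.tools_called]
    assert not missing, f"Expected tools were not called: {missing}"
```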

## Usage

### CLI

```bash
# Basic usage
pymcpevals run evals.yaml

# Override server/model
pymcpevals run evals.yaml --server "node server.js" --model gpt-4

# Different outputs
pymcpevals run evals.yaml --output table     # Simple table (default)
pymcpevals run evals.yaml --output detailed  # Per-test breakdown
pymcpevals run evals.yaml --output json      # Full JSON
pymcpevals run evals.yaml --output junit     # CI/CD format
```
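
In CI, one pattern is to capture the JUnit output and hand it to your test-report publisher. The helper below is illustrative and assumes the XML is printed to stdout, which may not match the CLI's actual behavior; verify against your version:

```python
# Illustrative CI helper: run the CLI and save its JUnit output.
# Assumes the XML is printed to stdout -- verify before relying on this.
import pathlib
import subprocess

proc = subprocess.run(
    ["pymcpevals", "run", "evals.yaml", "--output", "junit"],
    capture_output=True, text=True, check=False,
)
pathlib.Path("results.xml").write_text(proc.stdout)
raise SystemExit(proc.returncode)
```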

### Pytest Integration

```python
import pytest

from pymcpevals import (
    assert_tools_called,
    assert_evaluation_passed,
    assert_min_score,
    assert_no_tool_errors,
    ConversationTurn,
)

# Simple marker-based test
@pytest.mark.mcp_eval(
    prompt="What is 15 + 27?",
    expected_tools=["add"],
    min_score=4.0
)
async def test_basic_addition(mcp_result):
    assert_evaluation_passed(mcp_result)
    assert_tools_called(mcp_result, ["add"])
    assert "42" in mcp_result.server_response

# Multi-turn trajectory testing
async def test_math_sequence(mcp_evaluator):
    turns = [
        ConversationTurn(role="user", content="What is 10 + 5?", expected_tools=["add"]),
        ConversationTurn(role="user", content="Now multiply by 2", expected_tools=["multiply"])
    ]
    result = await mcp_evaluator.evaluate_trajectory(turns, min_score=4.0)
    
    # Rich assertions
    assert_evaluation_passed(result)
    assert_tools_called(result, ["add", "multiply"])
    assert_no_tool_errors(result)
    assert_min_score(result, 4.0, dimension="accuracy")
    assert "30" in str(result.conversation_history)

# Run with: pytest -m mcp_eval
```

## Examples

Check out the `examples/` directory for:
- `calculator_server.py` - Simple MCP server for testing
- `local_server_basic.yaml` - Basic evaluation configuration examples
- `trajectory_evaluation.yaml` - Multi-turn conversation examples
- `test_simple_plugin_example.py` - Pytest integration examples

Run the examples:
```bash
# Test with the example calculator server
pymcpevals run examples/local_server_basic.yaml

# Run pytest examples
cd examples && pytest test_simple_plugin_example.py
```

## Installation

```bash
pip install pymcpevals
```

## Environment Setup

```bash
export OPENAI_API_KEY="sk-..."        # or ANTHROPIC_API_KEY
export GEMINI_API_KEY="..."           # for Gemini models
```
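
A key for at least one provider must be set before running evals. A small, purely illustrative pre-flight check that fails fast when none is configured:

```python
# Illustrative pre-flight check: fail fast if no provider key is set.
import os
import sys

PROVIDER_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY")

if not any(os.environ.get(key) for key in PROVIDER_KEYS):
    sys.exit(f"Set one of {', '.join(PROVIDER_KEYS)} before running evals.")
```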

## Output Formats

### Table View (default)
```
┌──────────────────────────────────────────┬────────┬─────┬──────┬─────┬──────┬──────┬──────┬───────┐
│ Name                                     │ Status │ Acc │ Comp │ Rel │ Clar │ Reas │ Avg  │ Tools │
├──────────────────────────────────────────┼────────┼─────┼──────┼─────┼──────┼──────┼──────┼───────┤
│ What is 15 + 27?                         │ PASS   │ 4.5 │ 4.2  │ 5.0 │ 4.8  │ 4.1  │ 4.52 │ ✓     │
│ What happens if I divide 10 by 0?        │ PASS   │ 4.0 │ 4.1  │ 4.5 │ 4.2  │ 3.8  │ 4.12 │ ✓     │
│ Multi-turn test                          │ PASS   │ 4.2 │ 4.5  │ 4.8 │ 4.1  │ 4.3  │ 4.38 │ ✓     │
└──────────────────────────────────────────┴────────┴─────┴──────┴─────┴──────┴──────┴──────┴───────┘

Summary: 3/3 passed (100.0%) - Average: 4.34/5.0
```

### Detailed View (--output detailed)
```
┌─────────────────────────┬────────┬──────┬────────────────────┬────────────────────┬────────┬────────┬──────────────────────────────┐
│ Test                    │ Status │ Score│ Expected Tools     │ Tools Used         │ Time   │ Errors │ Notes                        │
├─────────────────────────┼────────┼──────┼────────────────────┼────────────────────┼────────┼────────┼──────────────────────────────┤
│ What is 15 + 27?        │ PASS   │ 4.5  │ add                │ add                │ 12ms   │ 0      │ OK                           │
│ What happens if I div...│ PASS   │ 4.1  │ divide             │ divide             │ 8ms    │ 1      │ Handled error correctly      │
│ Multi-turn test         │ PASS   │ 4.4  │ add, multiply      │ add, multiply      │ 23ms   │ 0      │ Tool chaining successful     │
└─────────────────────────┴────────┴──────┴────────────────────┴────────────────────┴────────┴────────┴──────────────────────────────┘

🔧 Tool Execution Details:
• add: Called 2 times, avg 10ms, 100% success rate
• divide: Called 1 time, 8ms, handled error gracefully
• multiply: Called 1 time, 13ms, 100% success rate

Summary: 3/3 passed (100.0%) - Average: 4.33/5.0
```

## Key Benefits

### For MCP Server Developers
- **🎯 Server-Focused Testing**: Test your server capabilities, not LLM behavior
- **✅ Instant Tool Validation**: Get immediate feedback if wrong tools are called (no LLM needed)
- **🔧 Tool Execution Insights**: See success rates, timing, and error handling
- **🔄 Multi-turn Validation**: Test tool chaining and state management
- **📊 Capability Scoring**: LLM judges server tool performance, ignoring conversation style
- **🛠️ Easy Integration**: Works with any MCP server via FastMCP

### For Development Teams  
- **🚀 CI/CD Integration**: JUnit XML output for automated testing pipelines
- **📈 Progress Tracking**: Monitor improvement over time with consistent scoring
- **🔄 Regression Testing**: Ensure new changes don't break existing functionality
- **⚖️ Model Comparison**: Test across different LLM providers

## Acknowledgments

🙏 **Huge kudos to [mcp-evals](https://github.com/mclenhard/mcp-evals)** - This Python package was heavily inspired by the excellent Node.js implementation by [@mclenhard](https://github.com/mclenhard).

If you're working in a Node.js environment, definitely check out the original [mcp-evals](https://github.com/mclenhard/mcp-evals) project, which also includes GitHub Action integration and monitoring capabilities.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality  
4. Ensure all tests pass
5. Submit a pull request

## License

MIT - see LICENSE file.

            
