pymcpevals

Name: pymcpevals
Version: 0.1.1
Summary: Python package for evaluating MCP (Model Context Protocol) server implementations using LLM-based scoring
Homepage: https://github.com/akshay5995/pymcpevals
Author: Akshay Ram Vignesh <akshay5995@gmail.com>
Requires-Python: >=3.10
License: MIT
Keywords: mcp, evaluation, testing, llm, ai
Upload time: 2025-07-27 07:17:20

# PyMCPEvals

> **⚠️ Still Under Development** - APIs may change. Use with caution in production.

**Server-focused evaluation framework for MCP (Model Context Protocol) servers.**

🚀 **Test your MCP server's capabilities, not LLM conversation patterns.**

PyMCPEvals answers one question: **"Are my MCP server's tools working correctly and being used as expected?"**

It does this by separating what you **can control** (your server) from what you **cannot** (LLM behavior):

### ✅ What You Control (We Test This)
- Tool implementation correctness
- Tool parameter validation  
- Error handling and recovery
- Tool result formatting
- Multi-turn state management

### ❌ What You Cannot Control (We Ignore This)
- LLM conversation patterns
- How LLMs choose to use tools
- LLM response formatting
- Whether LLMs provide intermediate responses

## Key Pain Points Solved

- **🚫 Manual Tool Testing**: Automated assertions verify exact tool calls
- **❓ Multi-step Failures**: Track tool chaining across conversation turns
- **πŸ› Silent Tool Errors**: Instant feedback when expected tools aren't called
- **πŸ“Š CI/CD Integration**: JUnit XML output for automated testing pipelines

## Quick Start

```bash
pip install pymcpevals
pymcpevals init            # Create template config
pymcpevals run evals.yaml  # Run evaluations
```

## Example Configuration

```yaml
model:
  provider: openai
  name: gpt-4

server:
  command: ["python", "my_server.py"]

evaluations:
  - name: "weather_check"
    prompt: "What's the weather in Boston?"
    expected_tools: ["get_weather"]  # ✅ Validates tool usage
    expected_result: "Should call weather API and return conditions"
    threshold: 3.5
    
  - name: "multi_step"
    turns:
      - role: "user"
        content: "What's the weather in London?"
        expected_tools: ["get_weather"]
      - role: "user"  
        content: "And in Paris?"
        expected_tools: ["get_weather"]
    expected_result: "Should provide weather for both cities"
    threshold: 4.0
```
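
The multi-turn `multi_step` case maps directly onto the `ConversationTurn` model used by the pytest integration below. For reference, here is a rough Python equivalent of those turns (field names taken from the pytest examples in this README):

```python
# Rough Python equivalent of the multi_step YAML case above.
from pymcpevals import ConversationTurn

multi_step_turns = [
    ConversationTurn(role="user", content="What's the weather in London?",
                     expected_tools=["get_weather"]),
    ConversationTurn(role="user", content="And in Paris?",
                     expected_tools=["get_weather"]),
]
```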

**Output**: Pass/fail status, tool validation, execution metrics, and server-focused scoring.

## How It Works

1. **Connect** to your MCP server via FastMCP
2. **Execute** prompts and track tool calls
3. **Validate** expected tools are called (instant feedback)
4. **Evaluate** server performance (ignores LLM style)
5. **Report** results with actionable insights
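
To make the pipeline concrete, here is a minimal sketch of steps 1–3 written directly against the FastMCP client. This is illustrative only, not pymcpevals' internal code: the server path and tool name are placeholders from the examples above, and the expected-tools check is done by hand.

```python
# Illustrative only: steps 1-3 of the pipeline, by hand, via FastMCP.
import asyncio

from fastmcp import Client

async def check_weather_tool() -> None:
    called: list[str] = []
    async with Client("my_server.py") as client:  # 1. Connect to the server
        # 2. Execute a tool call and track it
        result = await client.call_tool("get_weather", {"location": "Boston"})
        called.append("get_weather")
        print(result)
    # 3. Validate the expected tools were called (instant, no LLM involved)
    assert set(called) >= {"get_weather"}, "get_weather was not called"

asyncio.run(check_weather_tool())
```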

## What Makes This Different

**Precise Tool Assertions**: Unlike traditional evaluations that judge LLM responses, PyMCPEvals validates:

- ✅ **Exact tool calls**: `assert_tools_called(result, ["add", "multiply"])`
- ✅ **Tool execution success**: `assert_no_tool_errors(result)`
- ✅ **Multi-turn trajectories**: Test tool chaining across conversation steps
- ✅ **Instant failure detection**: No expensive LLM evaluation for obvious failures
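
Conceptually, each assertion is a thin check over the evaluation result. A simplified sketch of what a tool-call assertion boils down to (the `tools_called` attribute is hypothetical; the actual result object may expose called tools differently):

```python
# Simplified sketch of a tool-call assertion. The `tools_called`
# attribute is hypothetical -- pymcpevals' real result object may
# expose called tools under a different name.
def assert_tools_called(result, expected: list[str]) -> None:
    missing = [tool for tool in expected if tool not in result.tools_called]
    assert not missing, f"Expected tools were not called: {missing}"
```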

## Usage

### CLI

```bash
# Basic usage
pymcpevals run evals.yaml

# Override server/model
pymcpevals run evals.yaml --server "node server.js" --model gpt-4

# Different outputs
pymcpevals run evals.yaml --output table     # Simple table (default)
pymcpevals run evals.yaml --output detailed  # Per-test breakdown
pymcpevals run evals.yaml --output json      # Full JSON
pymcpevals run evals.yaml --output junit     # CI/CD format
```
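
In CI, one pattern is to capture the JUnit output and hand it to your test-report publisher. The helper below is illustrative and assumes the XML is printed to stdout, which may not match the CLI's actual behavior; verify against your version:

```python
# Illustrative CI helper: run the CLI and save its JUnit output.
# Assumes the XML is printed to stdout -- verify before relying on this.
import pathlib
import subprocess

proc = subprocess.run(
    ["pymcpevals", "run", "evals.yaml", "--output", "junit"],
    capture_output=True, text=True, check=False,
)
pathlib.Path("results.xml").write_text(proc.stdout)
raise SystemExit(proc.returncode)
```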

### Pytest Integration

```python
import pytest

from pymcpevals import (
    assert_tools_called,
    assert_evaluation_passed,
    assert_min_score,
    assert_no_tool_errors,
    ConversationTurn,
)

# Simple marker-based test
@pytest.mark.mcp_eval(
    prompt="What is 15 + 27?",
    expected_tools=["add"],
    min_score=4.0
)
async def test_basic_addition(mcp_result):
    assert_evaluation_passed(mcp_result)
    assert_tools_called(mcp_result, ["add"])
    assert "42" in mcp_result.server_response

# Multi-turn trajectory testing
async def test_math_sequence(mcp_evaluator):
    turns = [
        ConversationTurn(role="user", content="What is 10 + 5?", expected_tools=["add"]),
        ConversationTurn(role="user", content="Now multiply by 2", expected_tools=["multiply"])
    ]
    result = await mcp_evaluator.evaluate_trajectory(turns, min_score=4.0)
    
    # Rich assertions
    assert_evaluation_passed(result)
    assert_tools_called(result, ["add", "multiply"])
    assert_no_tool_errors(result)
    assert_min_score(result, 4.0, dimension="accuracy")
    assert "30" in str(result.conversation_history)

# Run with: pytest -m mcp_eval
```

## Examples

Check out the `examples/` directory for:
- `calculator_server.py` - Simple MCP server for testing
- `local_server_basic.yaml` - Basic evaluation configuration examples
- `trajectory_evaluation.yaml` - Multi-turn conversation examples
- `test_simple_plugin_example.py` - Pytest integration examples

Run the examples:
```bash
# Test with the example calculator server
pymcpevals run examples/local_server_basic.yaml

# Run pytest examples
cd examples && pytest test_simple_plugin_example.py
```

## Installation

```bash
pip install pymcpevals
```

## Environment Setup

```bash
export OPENAI_API_KEY="sk-..."        # or ANTHROPIC_API_KEY
export GEMINI_API_KEY="..."           # for Gemini models
```
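
A key for at least one provider must be set before running evals. A small, purely illustrative pre-flight check that fails fast when none is configured:

```python
# Illustrative pre-flight check: fail fast if no provider key is set.
import os
import sys

PROVIDER_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY")

if not any(os.environ.get(key) for key in PROVIDER_KEYS):
    sys.exit(f"Set one of {', '.join(PROVIDER_KEYS)} before running evals.")
```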

## Output Formats

### Table View (default)
```
┌──────────────────────────────────────────┬────────┬─────┬──────┬─────┬──────┬──────┬──────┬───────┐
│ Name                                     │ Status │ Acc │ Comp │ Rel │ Clar │ Reas │ Avg  │ Tools │
├──────────────────────────────────────────┼────────┼─────┼──────┼─────┼──────┼──────┼──────┼───────┤
│ What is 15 + 27?                         │ PASS   │ 4.5 │ 4.2  │ 5.0 │ 4.8  │ 4.1  │ 4.52 │ ✓     │
│ What happens if I divide 10 by 0?        │ PASS   │ 4.0 │ 4.1  │ 4.5 │ 4.2  │ 3.8  │ 4.12 │ ✓     │
│ Multi-turn test                          │ PASS   │ 4.2 │ 4.5  │ 4.8 │ 4.1  │ 4.3  │ 4.38 │ ✓     │
└──────────────────────────────────────────┴────────┴─────┴──────┴─────┴──────┴──────┴──────┴───────┘

Summary: 3/3 passed (100.0%) - Average: 4.34/5.0
```

### Detailed View (--output detailed)
```
┌─────────────────────────┬────────┬──────┬────────────────────┬────────────────────┬────────┬────────┬──────────────────────────────┐
│ Test                    │ Status │ Score│ Expected Tools     │ Tools Used         │ Time   │ Errors │ Notes                        │
├─────────────────────────┼────────┼──────┼────────────────────┼────────────────────┼────────┼────────┼──────────────────────────────┤
│ What is 15 + 27?        │ PASS   │ 4.5  │ add                │ add                │ 12ms   │ 0      │ OK                           │
│ What happens if I div...│ PASS   │ 4.1  │ divide             │ divide             │ 8ms    │ 1      │ Handled error correctly      │
│ Multi-turn test         │ PASS   │ 4.4  │ add, multiply      │ add, multiply      │ 23ms   │ 0      │ Tool chaining successful     │
└─────────────────────────┴────────┴──────┴────────────────────┴────────────────────┴────────┴────────┴──────────────────────────────┘

🔧 Tool Execution Details:
• add: Called 2 times, avg 10ms, 100% success rate
• divide: Called 1 time, 8ms, handled error gracefully
• multiply: Called 1 time, 13ms, 100% success rate

Summary: 3/3 passed (100.0%) - Average: 4.33/5.0
```

## Key Benefits

### For MCP Server Developers
- **🎯 Server-Focused Testing**: Test your server capabilities, not LLM behavior
- **✅ Instant Tool Validation**: Get immediate feedback if wrong tools are called (no LLM needed)
- **🔧 Tool Execution Insights**: See success rates, timing, and error handling
- **🔄 Multi-turn Validation**: Test tool chaining and state management
- **📊 Capability Scoring**: LLM judges server tool performance, ignoring conversation style
- **🛠️ Easy Integration**: Works with any MCP server via FastMCP

### For Development Teams  
- **🚀 CI/CD Integration**: JUnit XML output for automated testing pipelines
- **📈 Progress Tracking**: Monitor improvement over time with consistent scoring
- **🔄 Regression Testing**: Ensure new changes don't break existing functionality
- **⚖️ Model Comparison**: Test across different LLM providers

## Acknowledgments

🙏 **Huge kudos to [mcp-evals](https://github.com/mclenhard/mcp-evals)** - This Python package was heavily inspired by the excellent Node.js implementation by [@mclenhard](https://github.com/mclenhard).

If you're working in a Node.js environment, definitely check out the original [mcp-evals](https://github.com/mclenhard/mcp-evals) project, which also includes GitHub Action integration and monitoring capabilities.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality  
4. Ensure all tests pass
5. Submit a pull request

## License

MIT - see LICENSE file.

            
