| Field | Value |
|---|---|
| Name | pymcpevals |
| Version | 0.1.1 |
| Summary | Python package for evaluating MCP (Model Context Protocol) server implementations using LLM-based scoring |
| home_page | None |
| upload_time | 2025-07-27 07:17:20 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.10 |
| license | MIT |
| keywords | mcp, evaluation, testing, llm, ai |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# PyMCPEvals
> **⚠️ Still Under Development** - APIs may change. Use with caution in production.
**Server-focused evaluation framework for MCP (Model Context Protocol) servers.**
🚀 **Test your MCP server capabilities, not LLM conversation patterns.**
**"Are my MCP server's tools working correctly and being used as expected?"**
PyMCPEvals separates what you **can control** (server) from what you **cannot** (LLM behavior):
### ✅ What You Control (We Test This)
- Tool implementation correctness
- Tool parameter validation
- Error handling and recovery
- Tool result formatting
- Multi-turn state management
### ❌ What You Cannot Control (We Ignore This)
- LLM conversation patterns
- How LLMs choose to use tools
- LLM response formatting
- Whether LLMs provide intermediate responses
## Key Pain Points Solved
- **🚫 Manual Tool Testing**: Automated assertions verify exact tool calls
- **❓ Multi-step Failures**: Track tool chaining across conversation turns
- **🐛 Silent Tool Errors**: Instant feedback when expected tools aren't called
- **📊 CI/CD Integration**: JUnit XML output for automated testing pipelines
## Quick Start
```bash
pip install pymcpevals
pymcpevals init # Create template config
pymcpevals run evals.yaml # Run evaluations
```
## Example Configuration
```yaml
model:
  provider: openai
  name: gpt-4

server:
  command: ["python", "my_server.py"]

evaluations:
  - name: "weather_check"
    prompt: "What's the weather in Boston?"
    expected_tools: ["get_weather"]  # ✅ Validates tool usage
    expected_result: "Should call weather API and return conditions"
    threshold: 3.5

  - name: "multi_step"
    turns:
      - role: "user"
        content: "What's the weather in London?"
        expected_tools: ["get_weather"]
      - role: "user"
        content: "And in Paris?"
        expected_tools: ["get_weather"]
    expected_result: "Should provide weather for both cities"
    threshold: 4.0
```
**Output**: Pass/fail status, tool validation, execution metrics, and server-focused scoring.
## How It Works
1. **Connect** to your MCP server via FastMCP
2. **Execute** prompts and track tool calls
3. **Validate** expected tools are called (instant feedback)
4. **Evaluate** server performance (ignores LLM style)
5. **Report** results with actionable insights
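
Step 3 is a plain comparison between the tools you listed in `expected_tools` and the tools actually called, which is why mismatches fail immediately without invoking the LLM judge. A minimal illustration of that idea (not pymcpevals internals):

```python
# Illustration only: the general shape of the expected-vs-actual tool check.
def missing_tools(expected: list[str], called: list[str]) -> list[str]:
    """Return the expected tool names that were never called."""
    return [name for name in expected if name not in called]

# A run that never called "multiply" fails fast, before any LLM scoring happens.
assert missing_tools(["add", "multiply"], ["add"]) == ["multiply"]
```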
## What Makes This Different
**Precise Tool Assertions**: Unlike traditional evaluations that judge LLM responses, PyMCPEvals validates:
- ✅ **Exact tool calls**: `assert_tools_called(result, ["add", "multiply"])`
- ✅ **Tool execution success**: `assert_no_tool_errors(result)`
- ✅ **Multi-turn trajectories**: Test tool chaining across conversation steps
- ✅ **Instant failure detection**: No expensive LLM evaluation for obvious failures
## Usage
### CLI
```bash
# Basic usage
pymcpevals run evals.yaml
# Override server/model
pymcpevals run evals.yaml --server "node server.js" --model gpt-4
# Different outputs
pymcpevals run evals.yaml --output table # Simple table
pymcpevals run evals.yaml --output json # Full JSON
pymcpevals run evals.yaml --output junit # CI/CD format
```
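
The JUnit format is what you would wire into CI. A minimal GitHub Actions sketch, assuming the workflow file name, Python version, and secret name (only the `pymcpevals run ... --output junit` invocation comes from the CLI above; adjust the report handling to wherever pymcpevals writes it in your setup):

```yaml
# .github/workflows/mcp-evals.yml (hypothetical example)
name: MCP evals
on: [push]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pymcpevals
      - name: Run evaluations (fails the job if evaluations fail)
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: pymcpevals run evals.yaml --output junit
```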
### Pytest Integration
```python
import pytest

from pymcpevals import (
    assert_tools_called,
    assert_evaluation_passed,
    assert_min_score,
    assert_no_tool_errors,
    ConversationTurn,
)

# Simple marker-based test
@pytest.mark.mcp_eval(
    prompt="What is 15 + 27?",
    expected_tools=["add"],
    min_score=4.0
)
async def test_basic_addition(mcp_result):
    assert_evaluation_passed(mcp_result)
    assert_tools_called(mcp_result, ["add"])
    assert "42" in mcp_result.server_response

# Multi-turn trajectory testing
async def test_math_sequence(mcp_evaluator):
    turns = [
        ConversationTurn(role="user", content="What is 10 + 5?", expected_tools=["add"]),
        ConversationTurn(role="user", content="Now multiply by 2", expected_tools=["multiply"]),
    ]
    result = await mcp_evaluator.evaluate_trajectory(turns, min_score=4.0)

    # Rich assertions
    assert_evaluation_passed(result)
    assert_tools_called(result, ["add", "multiply"])
    assert_no_tool_errors(result)
    assert_min_score(result, 4.0, dimension="accuracy")
    assert "30" in str(result.conversation_history)

# Run with: pytest -m mcp_eval
```
## Examples
Check out the `examples/` directory for:
- `calculator_server.py` - Simple MCP server for testing
- `local_server_basic.yaml` - Basic evaluation configuration examples
- `trajectory_evaluation.yaml` - Multi-turn conversation examples
- `test_simple_plugin_example.py` - Pytest integration examples
Run the examples:
```bash
# Test with the example calculator server
pymcpevals run examples/local_server_basic.yaml
# Run pytest examples
cd examples && pytest test_simple_plugin_example.py
```
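
For orientation, a server along the lines of `examples/calculator_server.py` might look roughly like this sketch built on FastMCP (the shipped example may differ in names and details):

```python
# calculator_server.py (sketch; the bundled example may differ)
from fastmcp import FastMCP

mcp = FastMCP("Calculator")

@mcp.tool()
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

@mcp.tool()
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

@mcp.tool()
def divide(a: float, b: float) -> float:
    """Divide a by b, raising a clear error on division by zero."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```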
## Installation
```bash
pip install pymcpevals
```
## Environment Setup
```bash
export OPENAI_API_KEY="sk-..." # or ANTHROPIC_API_KEY
export GEMINI_API_KEY="..." # for Gemini models
```
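
The key you export should match `model.provider` in your evaluation config. For example, switching the earlier config to Anthropic might look like this (the model name here is illustrative):

```yaml
model:
  provider: anthropic               # pairs with ANTHROPIC_API_KEY
  name: claude-3-5-sonnet-20241022  # illustrative; use any model your key supports
```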
## Output Formats
### Table View (default)
```
┌───────────────────────────────────┬────────┬─────┬──────┬─────┬──────┬──────┬──────┬───────┐
│ Name                              │ Status │ Acc │ Comp │ Rel │ Clar │ Reas │ Avg  │ Tools │
├───────────────────────────────────┼────────┼─────┼──────┼─────┼──────┼──────┼──────┼───────┤
│ What is 15 + 27?                  │ PASS   │ 4.5 │ 4.2  │ 5.0 │ 4.8  │ 4.1  │ 4.52 │ ✓     │
│ What happens if I divide 10 by 0? │ PASS   │ 4.0 │ 4.1  │ 4.5 │ 4.2  │ 3.8  │ 4.12 │ ✓     │
│ Multi-turn test                   │ PASS   │ 4.2 │ 4.5  │ 4.8 │ 4.1  │ 4.3  │ 4.38 │ ✓     │
└───────────────────────────────────┴────────┴─────┴──────┴─────┴──────┴──────┴──────┴───────┘

Summary: 3/3 passed (100.0%) - Average: 4.34/5.0
```
### Detailed View (--output detailed)
```
┌──────────────────────────┬────────┬───────┬────────────────┬───────────────┬──────┬────────┬──────────────────────────┐
│ Test                     │ Status │ Score │ Expected Tools │ Tools Used    │ Time │ Errors │ Notes                    │
├──────────────────────────┼────────┼───────┼────────────────┼───────────────┼──────┼────────┼──────────────────────────┤
│ What is 15 + 27?         │ PASS   │ 4.5   │ add            │ add           │ 12ms │ 0      │ OK                       │
│ What happens if I div... │ PASS   │ 4.1   │ divide         │ divide        │ 8ms  │ 1      │ Handled error correctly  │
│ Multi-turn test          │ PASS   │ 4.4   │ add, multiply  │ add, multiply │ 23ms │ 0      │ Tool chaining successful │
└──────────────────────────┴────────┴───────┴────────────────┴───────────────┴──────┴────────┴──────────────────────────┘

🔧 Tool Execution Details:
• add: Called 2 times, avg 10ms, 100% success rate
• divide: Called 1 time, 8ms, handled error gracefully
• multiply: Called 1 time, 13ms, 100% success rate

Summary: 3/3 passed (100.0%) - Average: 4.33/5.0
```
## Key Benefits
### For MCP Server Developers
- **🎯 Server-Focused Testing**: Test your server capabilities, not LLM behavior
- **✅ Instant Tool Validation**: Get immediate feedback if wrong tools are called (no LLM needed)
- **🔧 Tool Execution Insights**: See success rates, timing, and error handling
- **🔄 Multi-turn Validation**: Test tool chaining and state management
- **📊 Capability Scoring**: LLM judges server tool performance, ignoring conversation style
- **🛠️ Easy Integration**: Works with any MCP server via FastMCP
### For Development Teams
- **🚀 CI/CD Integration**: JUnit XML output for automated testing pipelines
- **📈 Progress Tracking**: Monitor improvement over time with consistent scoring
- **🔄 Regression Testing**: Ensure new changes don't break existing functionality
- **⚖️ Model Comparison**: Test across different LLM providers
## Acknowledgments
🙏 **Huge kudos to [mcp-evals](https://github.com/mclenhard/mcp-evals)** - This Python package was heavily inspired by the excellent Node.js implementation by [@mclenhard](https://github.com/mclenhard).
If you're working in a Node.js environment, definitely check out the original [mcp-evals](https://github.com/mclenhard/mcp-evals) project, which also includes GitHub Action integration and monitoring capabilities.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request
## License
MIT - see LICENSE file.