mcpuniverse

Name: mcpuniverse
Version: 1.0.0
Summary: A framework for developing and benchmarking AI agents using Model Context Protocol
Author: Salesforce Research
Homepage: https://github.com/SalesforceAIResearch/MCP-Universe
Upload time: 2025-09-04 03:21:51
Requires Python: <4,>=3.10
License: 3-Clause BSD
Keywords: ai, agents, mcp, benchmarking, llm, machine-learning
Requirements: requests, pydantic, schema, mcp, httpx, click, jinja2, python-dotenv, anyio, openai, anthropic, mistralai, pyyaml, google-genai, redis, psycopg, sqlalchemy, fastapi, bcrypt, pyseto, celery, pytz, xai-sdk, claude-code-sdk, wikipedia-api, mcp_server_fetch, google-auth, google-auth-oauthlib, google-api-python-client, mcp_server_calculator, yfinance, blender-mcp, playwright, mathutils
<div align="center">

# <img src="assets/icon.png" alt="MCP-Universe" width="23" height="23"> MCP-Universe

[![Paper](https://img.shields.io/badge/Paper-arXiv:2508.14704-B31B1B?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2508.14704)
[![Website](https://img.shields.io/badge/Website-Live-4285F4?style=for-the-badge&logo=googlechrome&logoColor=white)](https://mcp-universe.github.io/)
[![Leaderboard](https://img.shields.io/badge/Leaderboard-Results-FF6B35?style=for-the-badge&logo=chartdotjs&logoColor=white)](https://mcp-universe.github.io/#results)
[![Discord](https://img.shields.io/badge/Discord-Join_Community-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/t9tU77GF)

</div>

---

## What is MCP-Universe?

MCP-Universe is a comprehensive framework for developing, testing, and benchmarking AI agents. It provides a robust platform for evaluating both AI agents and LLMs across a wide range of task environments, supports seamless integration with external MCP servers, and facilitates sophisticated agent orchestration workflows.

<div align="center">

![MCP-Universe Introduction](assets/intro-mcp-universe.png)

</div>

Unlike existing benchmarks that rely on overly simplistic tasks, MCP-Universe addresses critical gaps by evaluating LLMs in **real-world scenarios** through interaction with actual MCP servers, capturing real application challenges such as:

- 🎯 **Long-horizon reasoning** across multi-step tasks
- 🔧 **Large, unfamiliar tool spaces** with diverse MCP servers
- 🌍 **Real-world data sources** and live environments
- ⚑ **Dynamic evaluation** with time-sensitive ground truth

## Performance Highlights

Even state-of-the-art models show significant limitations in real-world MCP interactions:

- 🥇 **GPT-5**: 43.72% success rate
- 🥈 **Grok-4**: 33.33% success rate
- 🥉 **Claude-4.0-Sonnet**: 29.44% success rate

*This highlights the challenging nature of real-world MCP server interactions and substantial room for improvement in current LLM agents.*

## Table of Contents

- [Architecture Overview](#architecture-overview)
- [Getting Started](#getting-started)
    - [Prerequisites](#prerequisites)
    - [Installation](#installation)
    - [Quick Test](#quick-test)
- [Evaluating LLMs and Agents](#evaluating-llms-and-agents)
    - [Prerequisites](#prerequisites-1)
    - [Environment Configuration](#environment-configuration)
    - [Benchmark Configuration](#benchmark-configuration)
    - [Execution](#execution)
    - [Save the running log](#save-the-running-log)
    - [Save the benchmark result to a report](#save-the-benchmark-result-to-a-report)
    - [Visualize the agent running information](#visualize-the-agent-running-information)
- [Creating Custom Benchmarks](#creating-custom-benchmarks)
    - [Task definition](#task-definition)
    - [Benchmark definition](#benchmark-definition)
- [Citation](#citation)

## Architecture Overview

The MCP-Universe architecture consists of the following key components:

- **Agents** (`mcpuniverse/agent/`): Base implementations for different agent types
- **Workflows** (`mcpuniverse/workflows/`): Orchestration and coordination layer
- **MCP Servers** (`mcpuniverse/mcp/`): Protocol management and external service integration
- **LLM Integration** (`mcpuniverse/llm/`): Multi-provider language model support
- **Benchmarking** (`mcpuniverse/benchmark/`): Evaluation and testing framework
- **Dashboard** (`mcpuniverse/dashboard/`): Visualization and monitoring interface
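These components map directly onto import paths; the ones exercised later in this README are shown below (a sketch, with the rest of the package assumed to follow the same layout):

```python
# Imports confirmed by the examples in this README; other modules are
# assumed to live under the corresponding mcpuniverse.<component> packages.
from mcpuniverse.benchmark.runner import BenchmarkRunner        # Benchmarking
from mcpuniverse.benchmark.report import BenchmarkReport        # Benchmarking
from mcpuniverse.tracer.collectors import MemoryCollector       # Tracers
from mcpuniverse.callbacks.handlers.vprint import get_vprint_callbacks
```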

The diagram below illustrates the high-level view:

```
┌─────────────────────────────────────────────────────────────────┐
│                      Application Layer                          │
├─────────────────────────────────────────────────────────────────┤
│  Dashboard  │    Web API      │   Python Lib   │   Benchmarks   │
│   (Gradio)  │   (FastAPI)     │                │                │
└─────────────┬─────────────────┬────────────────┬────────────────┘
              │                 │                │
┌─────────────▼─────────────────▼────────────────▼────────────────┐
│                      Orchestration Layer                        │
├─────────────────────────────────────────────────────────────────┤
│           Workflows           │        Benchmark Runner         │
│    (Chain, Router, etc.)      │      (Evaluation Engine)        │
└─────────────┬─────────────────┬────────────────┬────────────────┘
              │                 │                │
┌─────────────▼─────────────────▼────────────────▼────────────────┐
│                        Agent Layer                              │
├─────────────────────────────────────────────────────────────────┤
│  BasicAgent │   ReActAgent    │  FunctionCall  │     Other      │
│             │                 │     Agent      │     Agents     │
└─────────────┬─────────────────┬────────────────┬────────────────┘
              │                 │                │
┌─────────────▼─────────────────▼────────────────▼────────────────┐
│                      Foundation Layer                           │
├─────────────────────────────────────────────────────────────────┤
│   MCP Manager   │   LLM Manager   │  Memory Systems │  Tracers  │
│   (Servers &    │   (Multi-Model  │   (RAM, Redis)  │ (Logging) │
│    Clients)     │    Support)     │                 │           │
└─────────────────┴─────────────────┴─────────────────┴───────────┘
```

More information can be found [here](https://github.com/SalesforceAIResearch/MCP-Universe/blob/main/docs).

## Getting Started

We follow
the [feature branch workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/feature-branch-workflow)
in this repo for its simplicity. To ensure code quality, [PyLint](https://pylint.readthedocs.io/en/latest/)
is integrated into our CI to enforce Python coding standards.

### Prerequisites

* **Python**: Requires version 3.10 or higher.
* **Docker**: Used for running Dockerized MCP servers.
* **PostgreSQL** (optional): Used for database storage and persistence.
* **Redis** (optional): Used for caching and memory management.

### Installation

1. **Clone the repository**
   ```bash
   git clone https://github.com/SalesforceAIResearch/MCP-Universe.git
   cd MCP-Universe
   ```

2. **Create and activate virtual environment**
   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   pip install -r dev-requirements.txt
   ```

4. **Platform-specific requirements**

   **Linux:**
   ```bash
   sudo apt-get install libpq-dev
   ```

   **macOS:**
   ```bash
   brew install postgresql
   ```

5. **Configure pre-commit hooks**
   ```bash
   pre-commit install
   ```

6. **Environment configuration**
   ```bash
   cp .env.example .env
   # Edit .env with your API keys and configuration
   ```

### Quick Test

To run benchmarks, you first need to set environment variables:

1. Copy the `.env.example` file to a new file named `.env`.
2. In the `.env` file, set the required API keys for various services used by the agents,
   such as `OPENAI_API_KEY` and `GOOGLE_MAPS_API_KEY`.

To execute a benchmark programmatically:

```python
from mcpuniverse.tracer.collectors import MemoryCollector  # You can also use SQLiteCollector
from mcpuniverse.benchmark.runner import BenchmarkRunner

async def test():
    trace_collector = MemoryCollector()
    # Choose a benchmark config file under the folder "mcpuniverse/benchmark/configs"
    benchmark = BenchmarkRunner("dummy/benchmark_1.yaml")
    # Run the specified benchmark
    results = await benchmark.run(trace_collector=trace_collector)
    # Get traces
    trace_id = results[0].task_trace_ids["dummy/tasks/weather.json"]
    trace_records = trace_collector.get(trace_id)
```
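The `test` coroutine above needs an event loop to execute; for example:

```python
import asyncio

asyncio.run(test())  # executes the benchmark defined above
```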

## Evaluating LLMs and Agents

This section provides comprehensive instructions for evaluating LLMs and AI agents using the MCP-Universe benchmark suite. The framework supports evaluation across multiple domains including web search, location navigation, browser automation, financial analysis, repository management, and 3D design.

### Prerequisites

Before running benchmark evaluations, ensure you have completed the [Getting Started](#getting-started) section and have the following:

- Python: Version 3.10 or higher
- Docker: Installed and available in your environment
- All required dependencies installed via `pip install -r requirements.txt`
- Active virtual environment
- Appropriate API access for the services you intend to evaluate

### Environment Configuration

#### 1. Initial Setup

Copy the environment template and configure your API credentials:

```bash
cp .env.example .env
```

#### 2. API Keys and Configuration

Configure the following environment variables in your `.env` file. The required keys depend on which benchmark domains you plan to evaluate:

##### Core LLM Providers

| Environment Variable | Provider | Description | Required For |
|---------------------|----------|-------------|--------------|
| `OPENAI_API_KEY` | OpenAI | API key for GPT models (gpt-5, etc.) | All domains |
| `ANTHROPIC_API_KEY` | Anthropic | API key for Claude models | All domains |
| `GEMINI_API_KEY` | Google | API key for Gemini models | All domains |

> **Note**: You only need to configure the API key for the LLM provider you intend to use in your evaluation.
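As a quick sanity check that the keys you need are actually set, you can load the `.env` file with `python-dotenv` (already in `requirements.txt`); a minimal sketch, with the key list adjusted to the providers and domains you plan to evaluate:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

# Adjust this list to the providers/domains you plan to evaluate.
required_keys = ["OPENAI_API_KEY", "GOOGLE_MAPS_API_KEY"]
missing = [key for key in required_keys if not os.getenv(key)]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")
```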

##### Domain-Specific Services

| Environment Variable | Service | Description | Setup Instructions |
|---------------------|---------|-------------|-------------------|
| `SERP_API_KEY` | SerpAPI | Web search API for search benchmark evaluation | [Get API key](https://serpapi.com/) |
| `GOOGLE_MAPS_API_KEY` | Google Maps | Geolocation and mapping services | [Setup Guide](https://console.cloud.google.com/google/maps-apis/credentials) |
| `GITHUB_PERSONAL_ACCESS_TOKEN` | GitHub | Personal access token for repository operations | [Token Setup](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) |
| `GITHUB_PERSONAL_ACCOUNT_NAME` | GitHub | Your GitHub username | N/A |
| `NOTION_API_KEY` | Notion | Integration token for Notion workspace access | [Integration Setup](https://developers.notion.com/docs/authorization#obtaining-a-token) |
| `NOTION_ROOT_PAGE` | Notion | Root page ID for your Notion workspace | See configuration example below |

##### System Paths

| Environment Variable | Description | Example |
|---------------------|-------------|---------|
| `BLENDER_APP_PATH` | Full path to Blender executable (we used v4.4.0) | `/Applications/Blender.app/Contents/MacOS/Blender` |
| `MCPUniverse_DIR` | Absolute path to your MCP-Universe repository | `/Users/username/MCP-Universe` |

##### Configuration Examples

**Notion Root Page ID:**
If your Notion page URL is:
```
https://www.notion.so/your_workspace/MCP-Evaluation-1dd6d96e12345678901234567eaf9eff
```
Set `NOTION_ROOT_PAGE=MCP-Evaluation-1dd6d96e12345678901234567eaf9eff`

**Blender Installation:**
1. Download Blender v4.4.0 from [blender.org](https://www.blender.org/)
2. Install our modified Blender MCP server following the [installation guide](docs/blender-setup.md)
3. Set the path to the Blender executable

##### ⚠️ Security Recommendations

> **🔒 IMPORTANT SECURITY NOTICE**
> 
> Please read and follow these security guidelines carefully before running benchmarks:

- **🚨 GitHub Integration**: **CRITICAL**: We strongly recommend using a dedicated test GitHub account for benchmark evaluation. The AI agent will perform real operations on GitHub repositories, which could potentially modify or damage your personal repositories.

- **🔐 API Key Management**:
  - Store API keys securely and never commit them to version control
  - Use environment variables or secure key management systems
  - Regularly rotate your API keys for enhanced security

- **🛡️ Access Permissions**:
  - Grant minimal necessary permissions for each service integration
  - Review and limit API key scopes to only required operations
  - Monitor API usage and set appropriate rate limits

- **⚑ Blender Operations**: The 3D design benchmarks will execute Blender commands that may modify or create files on your system. Ensure you have adequate backups and run in an isolated environment if necessary.

### Benchmark Configuration

#### Domain-Specific Configuration Files

Each benchmark domain has a dedicated YAML configuration file located in `mcpuniverse/benchmark/configs/test/`. To evaluate your LLM/agent, modify the appropriate configuration file:

| Domain | Configuration File | Description |
|--------|-------------------|-------------|
| Web Search | `web_search.yaml` | Search engine and information retrieval tasks |
| Location Navigation | `location_navigation.yaml` | Geographic and mapping-related queries |
| Browser Automation | `browser_automation.yaml` | Web interaction and automation scenarios |
| Financial Analysis | `financial_analysis.yaml` | Market data analysis and financial computations |
| Repository Management | `repository_management.yaml` | Git operations and code repository tasks |
| 3D Design | `3d_design.yaml` | Blender-based 3D modeling and design tasks |

#### LLM Model Configuration

In each configuration file, update the LLM specification to match your target model:

```yaml
kind: llm
spec:
  name: llm-1
  type: openai  # or anthropic, google, etc.
  config:
    model_name: gpt-4o  # Replace with your target model
```

### Execution

#### Running Individual Benchmarks

Execute specific domain benchmarks using the following commands:

```bash
# Set Python path and run individual benchmarks
export PYTHONPATH=.

# Location Navigation
python tests/benchmark/test_benchmark_location_navigation.py

# Browser Automation  
python tests/benchmark/test_benchmark_browser_automation.py

# Financial Analysis
python tests/benchmark/test_benchmark_financial_analysis.py

# Repository Management
python tests/benchmark/test_benchmark_repository_management.py

# Web Search
python tests/benchmark/test_benchmark_web_search.py

# 3D Design
python tests/benchmark/test_benchmark_3d_design.py
```

#### Batch Execution

For comprehensive evaluation across all domains:

```bash
#!/bin/bash
export PYTHONPATH=.

domains=("location_navigation" "browser_automation" "financial_analysis" 
         "repository_management" "web_search" "3d_design")

for domain in "${domains[@]}"; do
    echo "Running benchmark: $domain"
    python "tests/benchmark/test_benchmark_${domain}.py"
    echo "Completed: $domain"
done
```

### Save the running log

To save the running log to a file, pass a `FileCollector` as the `trace_collector` to the benchmark run function:

```python
from mcpuniverse.tracer.collectors import FileCollector

trace_collector = FileCollector(log_file="log/location_navigation.log")
benchmark_results = await benchmark.run(trace_collector=trace_collector)
```

### Save the benchmark result to a report 

If you want to save a report of the benchmark result, you can use `BenchmarkReport` to dump a report:

```python
from mcpuniverse.benchmark.report import BenchmarkReport

report = BenchmarkReport(benchmark, trace_collector=trace_collector)
report.dump()
```

### Visualize the agent running information

To run the benchmark with intermediate results and see real-time progress, pass `callbacks=get_vprint_callbacks()` to the run function:

```python
from mcpuniverse.callbacks.handlers.vprint import get_vprint_callbacks

benchmark_results = await benchmark.run(
    trace_collector=trace_collector, 
    callbacks=get_vprint_callbacks()
)
```

This will print out the intermediate results as the benchmark runs.
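Putting the pieces together, a single run can combine file-based tracing, verbose callbacks, and report dumping; a sketch, where the config path is an assumption based on the table in [Benchmark Configuration](#benchmark-configuration):

```python
import asyncio

from mcpuniverse.benchmark.report import BenchmarkReport
from mcpuniverse.benchmark.runner import BenchmarkRunner
from mcpuniverse.callbacks.handlers.vprint import get_vprint_callbacks
from mcpuniverse.tracer.collectors import FileCollector

async def main():
    # "test/location_navigation.yaml" is assumed to resolve under
    # mcpuniverse/benchmark/configs, following the path rules in this README.
    benchmark = BenchmarkRunner("test/location_navigation.yaml")
    trace_collector = FileCollector(log_file="log/location_navigation.log")
    await benchmark.run(
        trace_collector=trace_collector,
        callbacks=get_vprint_callbacks(),
    )
    # Dump a report of the benchmark results.
    BenchmarkReport(benchmark, trace_collector=trace_collector).dump()

asyncio.run(main())
```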


For further details, refer to the in-code documentation or existing configuration samples in the repository.

## Creating Custom Benchmarks

A benchmark is defined by three main configuration elements: the task definition,
agent/workflow definition, and the benchmark configuration itself. Below is an example
using a simple "weather forecasting" task.

### Task definition

The task definition is provided in JSON format, for example:

```json
{
  "category": "general",
  "question": "What's the weather in San Francisco now?",
  "mcp_servers": [
    {
      "name": "weather"
    }
  ],
  "output_format": {
    "city": "<City>",
    "weather": "<Weather forecast results>"
  },
  "evaluators": [
    {
      "func": "json -> get(city)",
      "op": "=",
      "value": "San Francisco"
    }
  ]
}
```

Field descriptions:

1. **category**: The task category, e.g., "general", "google-maps", etc. You can set any value for this property.
2. **question**: The main question you want to ask in this task. This is treated as a user message.
3. **mcp_servers**: A list of MCP servers the task uses; each must be one of the servers supported by this framework.
4. **output_format**: The desired output format of agent responses.
5. **evaluators**: A list of tests to run. Each evaluator has three attributes: "func" specifies how to
   extract values from the agent response, "op" is the comparison operator, and "value" is the ground-truth
   value. The framework evaluates **op(func(...), value, op_args...)**. "op" can be "=", "<", ">", or other
   customized operators.

In "evaluators", you need to write a rule ("func" attribute) showing how to extract values for testing. In the example
above, "json -> get(city)" will first do JSON decoding and then extract the value of key "city". There are several
predefined funcs in this repo:

1. **json**: Perform JSON decoding.
2. **get**: Get the value of a key.
3. **len**: Get the length of a list.
4. **foreach**: Apply the remaining steps to each element of a list (a FOR-EACH loop).

For example, let's define

```python
data = {"x": [{"y": [1]}, {"y": [1, 1]}, {"y": [1, 2, 3, 4]}]}
```

Then `get(x) -> foreach -> get(y) -> len` will do the following:

1. Get the value of "x": `[{"y": [1]}, {"y": [1, 1]}, {"y": [1, 2, 3, 4]}]`.
2. Do a foreach loop and get the value of "y": `[[1], [1, 1], [1, 2, 3, 4]]`.
3. Get the length of each list: `[1, 2, 4]`.
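For intuition, here is the same extraction written as plain Python (a sketch of the semantics, not the framework's evaluator implementation):

```python
data = {"x": [{"y": [1]}, {"y": [1, 1]}, {"y": [1, 2, 3, 4]}]}

step1 = data["x"]                          # get(x)
step2 = [item["y"] for item in step1]      # foreach -> get(y)
step3 = [len(values) for values in step2]  # len, applied per element
assert step3 == [1, 2, 4]
```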

If these predefined functions are not enough, you can implement custom ones.
For more details, please check
this [doc](https://github.com/SalesforceAIResearch/MCP-Universe/blob/main/docs/custom-evaluators-guide.md).

### Benchmark definition

Define agent(s) and benchmark in a YAML file. Here's a simple weather forecast benchmark:

```yaml
kind: llm
spec:
  name: llm-1
  type: openai
  config:
    model_name: gpt-4o

---
kind: agent
spec:
  name: ReAct-agent
  type: react
  config:
    llm: llm-1
    instruction: You are an agent for weather forecasting.
    servers:
      - name: weather

---
kind: benchmark
spec:
  description: Test the agent for weather forecasting
  agent: ReAct-agent
  tasks:
    - dummy/tasks/weather.json
```

The benchmark definition mainly contains two parts: the agent definition and the benchmark configuration. The benchmark configuration is simple: specify the agent to use (by its defined name) and a list of tasks to evaluate. Each task entry is a path to a task config file, either a full path or a partial one. A partial path (like "dummy/tasks/weather.json") is resolved relative to the
folder [mcpuniverse/benchmark/configs](https://github.com/SalesforceAIResearch/MCP-Universe/tree/main/mcpuniverse/benchmark/configs)
in this repo, so the task file should be placed there.

This framework offers a flexible way to define both simple agents (such as ReAct) and more complex, multi-step agent
workflows.

1. **Specify LLMs:** Begin by declaring the large language models (LLMs) you want the agents to use. Each LLM component
   must be assigned a unique name (e.g., `"llm-1"`). These names serve as identifiers that the framework uses to connect
   the different components together.
2. **Define an agent:** Next, define an agent by providing its name and selecting an agent class. Agent classes are
   available in
   the [mcpuniverse.agent](https://github.com/SalesforceAIResearch/MCP-Universe/tree/main/mcpuniverse/agent) package.
   Commonly used classes include `"basic"`, `"function-call"`, and `"react"`. Within the agent specification (
   `spec.config`), you must also indicate which LLM instance the agent should use by setting the `"llm"` field.
3. **Create complex workflows:** Beyond simple agents, the framework supports the definition of sophisticated,
   orchestrated workflows where multiple agents interact or collaborate to solve more complex tasks.

For example:

```yaml
kind: llm
spec:
  name: llm-1
  type: openai
  config:
    model_name: gpt-4o

---
kind: agent
spec:
  name: basic-agent
  type: basic
  config:
    llm: llm-1
    instruction: Return the latitude and the longitude of a place.

---
kind: agent
spec:
  name: function-call-agent
  type: function-call
  config:
    llm: llm-1
    instruction: You are an agent for weather forecast. Please return the weather today at the given latitude and longitude.
    servers:
      - name: weather

---
kind: workflow
spec:
  name: orchestrator-workflow
  type: orchestrator
  config:
    llm: llm-1
    agents:
      - basic-agent
      - function-call-agent

---
kind: benchmark
spec:
  description: Test the agent for weather forecasting
  agent: orchestrator-workflow
  tasks:
    - dummy/tasks/weather.json
```
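Once the task JSON and this YAML are saved under the configs folder (following the path rules above), the benchmark runs exactly like the Quick Test example; a sketch, where the YAML file name is hypothetical:

```python
import asyncio

from mcpuniverse.benchmark.runner import BenchmarkRunner
from mcpuniverse.tracer.collectors import MemoryCollector

async def main():
    # "dummy/benchmark_weather.yaml" is a hypothetical name for the
    # benchmark definition above, placed under mcpuniverse/benchmark/configs.
    benchmark = BenchmarkRunner("dummy/benchmark_weather.yaml")
    results = await benchmark.run(trace_collector=MemoryCollector())
    print(results)

asyncio.run(main())
```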

## Citation

If you use MCP-Universe in your research, please cite our paper:

```bibtex
@misc{mcpuniverse,
  title={MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers},
  author={Ziyang Luo and Zhiqi Shen and Wenzhuo Yang and Zirui Zhao and Prathyusha Jwalapuram and Amrita Saha and Doyen Sahoo and Silvio Savarese and Caiming Xiong and Junnan Li},
  year={2025},
  eprint={2508.14704},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.14704}, 
}
```
