<div align="center">
# <img src="assets/icon.png" alt="MCP-Universe" width="23" height="23"> MCP-Universe
[arXiv](https://arxiv.org/abs/2508.14704)
[Website](https://mcp-universe.github.io/)
[Leaderboard](https://mcp-universe.github.io/#results)
[Discord](https://discord.gg/t9tU77GF)
</div>
---
## What is MCP-Universe?
MCP-Universe is a comprehensive framework designed for developing, testing, and benchmarking AI agents. It offers a robust platform for building and evaluating both AI agents and LLMs across a wide range of task environments. The framework also supports seamless integration with external MCP servers and facilitates sophisticated agent orchestration workflows.
Unlike existing benchmarks that rely on overly simplistic tasks, MCP-Universe addresses critical gaps by evaluating LLMs in **real-world scenarios** through interaction with actual MCP servers, capturing real application challenges such as:
- 🎯 **Long-horizon reasoning** across multi-step tasks
- 🔧 **Large, unfamiliar tool spaces** with diverse MCP servers
- 🌍 **Real-world data sources** and live environments
- ⚡ **Dynamic evaluation** with time-sensitive ground truth
## Performance Highlights
Even state-of-the-art models show significant limitations in real-world MCP interactions:
- 🥇 **GPT-5**: 43.72% success rate
- 🥈 **Grok-4**: 33.33% success rate
- 🥉 **Claude-4.0-Sonnet**: 29.44% success rate
*This highlights the challenging nature of real-world MCP server interactions and substantial room for improvement in current LLM agents.*
## Table of Contents
- [Architecture Overview](#architecture-overview)
- [Getting Started](#getting-started)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Quick Test](#quick-test)
- [Evaluating LLMs and Agents](#evaluating-llms-and-agents)
- [Prerequisites](#prerequisites-1)
- [Environment Configuration](#environment-configuration)
- [Benchmark Configuration](#benchmark-configuration)
- [Execution](#execution)
- [Save the running log](#save-the-running-log)
- [Save the benchmark result to a report](#save-the-benchmark-result-to-a-report)
- [Visualize the agent running information](#visualize-the-agent-running-information)
- [Creating Custom Benchmarks](#creating-custom-benchmarks)
- [Task definition](#task-definition)
- [Benchmark definition](#benchmark-definition)
- [Citation](#citation)
## Architecture Overview
The MCP-Universe architecture consists of the following key components:
- **Agents** (`mcpuniverse/agent/`): Base implementations for different agent types
- **Workflows** (`mcpuniverse/workflows/`): Orchestration and coordination layer
- **MCP Servers** (`mcpuniverse/mcp/`): Protocol management and external service integration
- **LLM Integration** (`mcpuniverse/llm/`): Multi-provider language model support
- **Benchmarking** (`mcpuniverse/benchmark/`): Evaluation and testing framework
- **Dashboard** (`mcpuniverse/dashboard/`): Visualization and monitoring interface
The diagram below illustrates the high-level view:
```
┌───────────────────────────────────────────────────────────────────┐
│                         Application Layer                         │
├────────────────┬────────────────┬────────────────┬────────────────┤
│   Dashboard    │    Web API     │   Python Lib   │   Benchmarks   │
│    (Gradio)    │   (FastAPI)    │                │                │
└────────────────┬────────────────┬────────────────┬────────────────┘
                 │                │                │
┌────────────────▼────────────────▼────────────────▼────────────────┐
│                        Orchestration Layer                        │
├─────────────────────────────────┬─────────────────────────────────┤
│            Workflows            │        Benchmark Runner         │
│      (Chain, Router, etc.)      │       (Evaluation Engine)       │
└────────────────┬────────────────┬────────────────┬────────────────┘
                 │                │                │
┌────────────────▼────────────────▼────────────────▼────────────────┐
│                            Agent Layer                            │
├────────────────┬────────────────┬────────────────┬────────────────┤
│   BasicAgent   │   ReActAgent   │  FunctionCall  │     Other      │
│                │                │     Agent      │     Agents     │
└────────────────┬────────────────┬────────────────┬────────────────┘
                 │                │                │
┌────────────────▼────────────────▼────────────────▼────────────────┐
│                          Foundation Layer                         │
├────────────────┬────────────────┬────────────────┬────────────────┤
│  MCP Manager   │  LLM Manager   │ Memory Systems │    Tracers     │
│  (Servers &    │  (Multi-Model  │  (RAM, Redis)  │   (Logging)    │
│   Clients)     │   Support)     │                │                │
└────────────────┴────────────────┴────────────────┴────────────────┘
```
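The package paths above map directly to the importable entry points used later in this README. Below is a minimal, import-only sketch of the classes this guide relies on (nothing is started or executed here):
```python
# Import-only sketch: where the classes used in this README live
from mcpuniverse.benchmark.runner import BenchmarkRunner                   # runs benchmark configs
from mcpuniverse.benchmark.report import BenchmarkReport                   # dumps evaluation reports
from mcpuniverse.tracer.collectors import MemoryCollector, FileCollector   # collect run traces
from mcpuniverse.callbacks.handlers.vprint import get_vprint_callbacks     # print intermediate progress
```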
More information can be found [here](https://github.com/SalesforceAIResearch/MCP-Universe/blob/main/docs).
## Getting Started
We follow
the [feature branch workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/feature-branch-workflow)
in this repo for its simplicity. To ensure code quality, [PyLint](https://pylint.readthedocs.io/en/latest/)
is integrated into our CI to enforce Python coding standards.
### Prerequisites
* **Python**: Requires version 3.10 or higher.
* **Docker**: Used for running Dockerized MCP servers.
* **PostgreSQL** (optional): Used for database storage and persistence.
* **Redis** (optional): Used for caching and memory management.
### Installation
1. **Clone the repository**
   ```bash
   git clone https://github.com/SalesforceAIResearch/MCP-Universe.git
   cd MCP-Universe
   ```

2. **Create and activate virtual environment**
   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   pip install -r dev-requirements.txt
   ```

4. **Platform-specific requirements**

   **Linux:**
   ```bash
   sudo apt-get install libpq-dev
   ```

   **macOS:**
   ```bash
   brew install postgresql
   ```

5. **Configure pre-commit hooks**
   ```bash
   pre-commit install
   ```

6. **Environment configuration**
   ```bash
   cp .env.example .env
   # Edit .env with your API keys and configuration
   ```
### Quick Test
To run benchmarks, you first need to set environment variables:
1. Copy the `.env.example` file to a new file named `.env`.
2. In the `.env` file, set the required API keys for various services used by the agents,
such as `OPENAI_API_KEY` and `GOOGLE_MAPS_API_KEY`.
To execute a benchmark programmatically:
```python
from mcpuniverse.tracer.collectors import MemoryCollector # You can also use SQLiteCollector
from mcpuniverse.benchmark.runner import BenchmarkRunner

async def test():
    trace_collector = MemoryCollector()
    # Choose a benchmark config file under the folder "mcpuniverse/benchmark/configs"
    benchmark = BenchmarkRunner("dummy/benchmark_1.yaml")
    # Run the specified benchmark
    results = await benchmark.run(trace_collector=trace_collector)
    # Get traces
    trace_id = results[0].task_trace_ids["dummy/tasks/weather.json"]
    trace_records = trace_collector.get(trace_id)
```
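Since `test()` is a coroutine, it needs an event loop when called from a plain script. A minimal sketch, assuming no event loop is already running:
```python
import asyncio

# Drive the coroutine defined above to completion
asyncio.run(test())
```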
## Evaluating LLMs and Agents
This section provides comprehensive instructions for evaluating LLMs and AI agents using the MCP-Universe benchmark suite. The framework supports evaluation across multiple domains including web search, location navigation, browser automation, financial analysis, repository management, and 3D design.
### Prerequisites
Before running benchmark evaluations, ensure you have completed the [Getting Started](#getting-started) section and have the following:
- Python: Version 3.10 or higher
- Docker: Installed and available in your environment
- All required dependencies installed via `pip install -r requirements.txt`
- Active virtual environment
- Appropriate API access for the services you intend to evaluate
### Environment Configuration
#### 1. Initial Setup
Copy the environment template and configure your API credentials:
```bash
cp .env.example .env
```
#### 2. API Keys and Configuration
Configure the following environment variables in your `.env` file. The required keys depend on which benchmark domains you plan to evaluate:
##### Core LLM Providers
| Environment Variable | Provider | Description | Required For |
|---------------------|----------|-------------|--------------|
| `OPENAI_API_KEY` | OpenAI | API key for GPT models (gpt-5, etc.) | All domains |
| `ANTHROPIC_API_KEY` | Anthropic | API key for Claude models | All domains |
| `GEMINI_API_KEY` | Google | API key for Gemini models | All domains |
> **Note**: You only need to configure the API key for the LLM provider you intend to use in your evaluation.
##### Domain-Specific Services
| Environment Variable | Service | Description | Setup Instructions |
|---------------------|---------|-------------|-------------------|
| `SERP_API_KEY` | SerpAPI | Web search API for search benchmark evaluation | [Get API key](https://serpapi.com/) |
| `GOOGLE_MAPS_API_KEY` | Google Maps | Geolocation and mapping services | [Setup Guide](https://console.cloud.google.com/google/maps-apis/credentials) |
| `GITHUB_PERSONAL_ACCESS_TOKEN` | GitHub | Personal access token for repository operations | [Token Setup](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) |
| `GITHUB_PERSONAL_ACCOUNT_NAME` | GitHub | Your GitHub username | N/A |
| `NOTION_API_KEY` | Notion | Integration token for Notion workspace access | [Integration Setup](https://developers.notion.com/docs/authorization#obtaining-a-token) |
| `NOTION_ROOT_PAGE` | Notion | Root page ID for your Notion workspace | See configuration example below |
##### System Paths
| Environment Variable | Description | Example |
|---------------------|-------------|---------|
| `BLENDER_APP_PATH` | Full path to Blender executable (we used v4.4.0) | `/Applications/Blender.app/Contents/MacOS/Blender` |
| `MCPUniverse_DIR` | Absolute path to your MCP-Universe repository | `/Users/username/MCP-Universe` |
##### Configuration Examples
**Notion Root Page ID:**
If your Notion page URL is:
```
https://www.notion.so/your_workspace/MCP-Evaluation-1dd6d96e12345678901234567eaf9eff
```
Set `NOTION_ROOT_PAGE=MCP-Evaluation-1dd6d96e12345678901234567eaf9eff`
**Blender Installation:**
1. Download Blender v4.4.0 from [blender.org](https://www.blender.org/)
2. Install our modified Blender MCP server following the [installation guide](docs/blender-setup.md)
3. Set the path to the Blender executable
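For reference, here is a `.env` sketch that combines the variables above. All values are placeholders; set only the entries required by the domains you plan to evaluate:
```bash
# Core LLM provider (configure only the provider you evaluate)
OPENAI_API_KEY=<your-openai-key>

# Domain-specific services
SERP_API_KEY=<your-serpapi-key>
GOOGLE_MAPS_API_KEY=<your-google-maps-key>
GITHUB_PERSONAL_ACCESS_TOKEN=<your-github-token>
GITHUB_PERSONAL_ACCOUNT_NAME=<your-test-account>
NOTION_API_KEY=<your-notion-integration-token>
NOTION_ROOT_PAGE=MCP-Evaluation-1dd6d96e12345678901234567eaf9eff

# System paths
BLENDER_APP_PATH=/Applications/Blender.app/Contents/MacOS/Blender
MCPUniverse_DIR=/Users/username/MCP-Universe
```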
##### ⚠️ Security Recommendations
> **🔒 IMPORTANT SECURITY NOTICE**
>
> Please read and follow these security guidelines carefully before running benchmarks:

- **🚨 GitHub Integration**: **CRITICAL** - We strongly recommend using a dedicated test GitHub account for benchmark evaluation. The AI agent will perform real operations on GitHub repositories, which could potentially modify or damage your personal repositories.

- **🔐 API Key Management**:
  - Store API keys securely and never commit them to version control
  - Use environment variables or secure key management systems
  - Regularly rotate your API keys for enhanced security

- **🛡️ Access Permissions**:
  - Grant minimal necessary permissions for each service integration
  - Review and limit API key scopes to only required operations
  - Monitor API usage and set appropriate rate limits

- **⚡ Blender Operations**: The 3D design benchmarks will execute Blender commands that may modify or create files on your system. Ensure you have adequate backups and run in an isolated environment if necessary.
### Benchmark Configuration
#### Domain-Specific Configuration Files
Each benchmark domain has a dedicated YAML configuration file located in `mcpuniverse/benchmark/configs/test/`. To evaluate your LLM/agent, modify the appropriate configuration file:
| Domain | Configuration File | Description |
|--------|-------------------|-------------|
| Web Search | `web_search.yaml` | Search engine and information retrieval tasks |
| Location Navigation | `location_navigation.yaml` | Geographic and mapping-related queries |
| Browser Automation | `browser_automation.yaml` | Web interaction and automation scenarios |
| Financial Analysis | `financial_analysis.yaml` | Market data analysis and financial computations |
| Repository Management | `repository_management.yaml` | Git operations and code repository tasks |
| 3D Design | `3d_design.yaml` | Blender-based 3D modeling and design tasks |
#### LLM Model Configuration
In each configuration file, update the LLM specification to match your target model:
```yaml
kind: llm
spec:
  name: llm-1
  type: openai  # or anthropic, google, etc.
  config:
    model_name: gpt-4o  # Replace with your target model
```
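For instance, pointing the same block at an Anthropic model could look like the sketch below; the model name is illustrative, so substitute whichever identifier your provider account exposes:
```yaml
kind: llm
spec:
  name: llm-1
  type: anthropic
  config:
    model_name: claude-sonnet-4  # illustrative model identifier
```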
### Execution
#### Running Individual Benchmarks
Execute specific domain benchmarks using the following commands:
```bash
# Set Python path and run individual benchmarks
export PYTHONPATH=.
# Location Navigation
python tests/benchmark/test_benchmark_location_navigation.py
# Browser Automation
python tests/benchmark/test_benchmark_browser_automation.py
# Financial Analysis
python tests/benchmark/test_benchmark_financial_analysis.py
# Repository Management
python tests/benchmark/test_benchmark_repository_management.py
# Web Search
python tests/benchmark/test_benchmark_web_search.py
# 3D Design
python tests/benchmark/test_benchmark_3d_design.py
```
#### Batch Execution
For comprehensive evaluation across all domains:
```bash
#!/bin/bash
export PYTHONPATH=.
domains=("location_navigation" "browser_automation" "financial_analysis"
         "repository_management" "web_search" "3d_design")

for domain in "${domains[@]}"; do
    echo "Running benchmark: $domain"
    python "tests/benchmark/test_benchmark_${domain}.py"
    echo "Completed: $domain"
done
```
### Save the running log
To save the running log, pass a `trace_collector` (for example, a `FileCollector`) to the benchmark `run` function:
```python
from mcpuniverse.tracer.collectors import FileCollector
trace_collector = FileCollector(log_file="log/location_navigation.log")
benchmark_results = await benchmark.run(trace_collector=trace_collector)
```
### Save the benchmark result to a report
To save the benchmark result as a report, use `BenchmarkReport`:
```python
from mcpuniverse.benchmark.report import BenchmarkReport
report = BenchmarkReport(benchmark, trace_collector=trace_collector)
report.dump()
```
### Visualize the agent running information
To run the benchmark with intermediate results and see real-time progress, pass `callbacks=get_vprint_callbacks()` to the run function:
```python
from mcpuniverse.callbacks.handlers.vprint import get_vprint_callbacks
benchmark_results = await benchmark.run(
    trace_collector=trace_collector,
    callbacks=get_vprint_callbacks()
)
```
This will print out the intermediate results as the benchmark runs.
For further details, refer to the in-code documentation or existing configuration samples in the repository.
## Creating Custom Benchmarks
A benchmark is defined by three main configuration elements: the task definition,
agent/workflow definition, and the benchmark configuration itself. Below is an example
using a simple "weather forecasting" task.
### Task definition
The task definition is provided in JSON format, for example:
```json
{
  "category": "general",
  "question": "What's the weather in San Francisco now?",
  "mcp_servers": [
    {
      "name": "weather"
    }
  ],
  "output_format": {
    "city": "<City>",
    "weather": "<Weather forecast results>"
  },
  "evaluators": [
    {
      "func": "json -> get(city)",
      "op": "=",
      "value": "San Francisco"
    }
  ]
}
```
Field descriptions:
1. **category**: The task category, e.g., "general", "google-maps", etc. You can set any value for this property.
2. **question**: The main question you want to ask in this task. This is treated as a user message.
3. **mcp_servers**: A list of MCP servers that are supported in this framework.
4. **output_format**: The desired output format of agent responses.
5. **evaluators**: A list of tests to run. Each test/evaluator has three attributes: "func" indicates
   how to extract values from the agent response, "op" is the comparison operator, and "value" is the ground-truth
   value. The evaluator computes **op(func(...), value, op_args...)**, where "op" can be "=", "<", ">" or other customized operators.
In "evaluators", you need to write a rule ("func" attribute) showing how to extract values for testing. In the example
above, "json -> get(city)" will first do JSON decoding and then extract the value of key "city". There are several
predefined funcs in this repo:
1. **json**: Perform JSON decoding.
2. **get**: Get the value of a key.
3. **len**: Get the length of a list.
4. **foreach**: Do a FOR-EACH loop.
For example, let's define
```python
data = {"x": [{"y": [1]}, {"y": [1, 1]}, {"y": [1, 2, 3, 4]}]}
```
Then `get(x) -> foreach -> get(y) -> len` will do the following:
1. Get the value of "x": `[{"y": [1]}, {"y": [1, 1]}, {"y": [1, 2, 3, 4]}]`.
2. Do a foreach loop and get the value of "y": `[[1], [1, 1], [1, 2, 3, 4]]`.
3. Get the length of each list: `[1, 2, 4]`.
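Such chains are written directly in a task's "evaluators" list. A hypothetical entry using the chain above, with the leading `json` decoding the agent response first as in the earlier example (the "op" and "value" shown here are illustrative):
```json
{
  "func": "json -> get(x) -> foreach -> get(y) -> len",
  "op": "=",
  "value": [1, 2, 4]
}
```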
If these predefined functions are not enough, you can implement custom ones.
For more details, please check
this [doc](https://github.com/SalesforceAIResearch/MCP-Universe/blob/main/docs/custom-evaluators-guide.md).
### Benchmark definition
Define agent(s) and benchmark in a YAML file. Here's a simple weather forecast benchmark:
```yaml
kind: llm
spec:
  name: llm-1
  type: openai
  config:
    model_name: gpt-4o

---
kind: agent
spec:
  name: ReAct-agent
  type: react
  config:
    llm: llm-1
    instruction: You are an agent for weather forecasting.
    servers:
      - name: weather

---
kind: benchmark
spec:
  description: Test the agent for weather forecasting
  agent: ReAct-agent
  tasks:
    - dummy/tasks/weather.json
```
The benchmark definition mainly contains two parts: the agent definition and the benchmark configuration. The benchmark configuration is simple: you just need to specify the agent to use (by the defined agent name) and a list of tasks to evaluate. Each task entry is the task config file
path, either a full file path or a partial file path. If it is a partial file path (like "dummy/tasks/weather.json"),
it should be placed in the
folder [mcpuniverse/benchmark/configs](https://github.com/SalesforceAIResearch/MCP-Universe/tree/main/mcpuniverse/benchmark/configs)
in this repo.
This framework offers a flexible way to define both simple agents (such as ReAct) and more complex, multi-step agent
workflows.
1. **Specify LLMs:** Begin by declaring the large language models (LLMs) you want the agents to use. Each LLM component
must be assigned a unique name (e.g., `"llm-1"`). These names serve as identifiers that the framework uses to connect
the different components together.
2. **Define an agent:** Next, define an agent by providing its name and selecting an agent class. Agent classes are
available in
the [mcpuniverse.agent](https://github.com/SalesforceAIResearch/MCP-Universe/tree/main/mcpuniverse/agent) package.
Commonly used classes include `"basic"`, `"function-call"`, and `"react"`. Within the agent specification (
`spec.config`), you must also indicate which LLM instance the agent should use by setting the `"llm"` field.
3. **Create complex workflows:** Beyond simple agents, the framework supports the definition of sophisticated,
orchestrated workflows where multiple agents interact or collaborate to solve more complex tasks.
For example:
```yaml
kind: llm
spec:
  name: llm-1
  type: openai
  config:
    model_name: gpt-4o

---
kind: agent
spec:
  name: basic-agent
  type: basic
  config:
    llm: llm-1
    instruction: Return the latitude and the longitude of a place.

---
kind: agent
spec:
  name: function-call-agent
  type: function-call
  config:
    llm: llm-1
    instruction: You are an agent for weather forecast. Please return the weather today at the given latitude and longitude.
    servers:
      - name: weather

---
kind: workflow
spec:
  name: orchestrator-workflow
  type: orchestrator
  config:
    llm: llm-1
    agents:
      - basic-agent
      - function-call-agent

---
kind: benchmark
spec:
  description: Test the agent for weather forecasting
  agent: orchestrator-workflow
  tasks:
    - dummy/tasks/weather.json
```
## Citation
If you use MCP-Universe in your research, please cite our paper:
```bibtex
@misc{mcpuniverse,
  title={MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers},
  author={Ziyang Luo and Zhiqi Shen and Wenzhuo Yang and Zirui Zhao and Prathyusha Jwalapuram and Amrita Saha and Doyen Sahoo and Silvio Savarese and Caiming Xiong and Junnan Li},
  year={2025},
  eprint={2508.14704},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.14704},
}
```