# CloudWatch Application Signals MCP Server
An MCP (Model Context Protocol) server that provides comprehensive tools for monitoring and analyzing AWS services using [AWS Application Signals](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Application-Signals.html).
This server enables AI assistants like Claude, GitHub Copilot, and Amazon Q to help you monitor service health, analyze performance metrics, track SLO compliance, and investigate issues using distributed tracing with advanced audit capabilities and root cause analysis.
## Key Features
1. **Comprehensive Service Auditing** - Monitor overall service health, diagnose root causes, and recommend actionable fixes with built-in APM expertise
2. **Advanced SLO Compliance Monitoring** - Track Service Level Objectives with breach detection and root cause analysis
3. **Operation-Level Performance Analysis** - Deep dive into specific API endpoints and operations
4. **100% Trace Visibility** - Query OpenTelemetry spans data via Transaction Search for complete observability
5. **Multi-Service Analysis** - Audit multiple services simultaneously with automatic batching
6. **Natural Language Insights** - Generate business insights from telemetry data through natural language queries
## Prerequisites
1. [Sign up for an AWS account](https://aws.amazon.com/free/)
2. [Enable Application Signals](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Application-Monitoring-Sections.html) for your applications
3. Install `uv` from [Astral](https://docs.astral.sh/uv/getting-started/installation/) or the [GitHub README](https://github.com/astral-sh/uv#installation)
4. Install Python using `uv python install 3.10`
## Available Tools
### 🥇 Primary Audit Tools (Use These First)
#### 1. **`audit_services`** ⭐ **PRIMARY SERVICE AUDIT TOOL**
**The #1 tool for comprehensive AWS service health auditing and monitoring**
- **USE THIS FIRST** for all service-level auditing tasks
- Comprehensive health assessment with actionable insights and recommendations
- Multi-service analysis with automatic batching (audit 1-100+ services simultaneously)
- SLO compliance monitoring with automatic breach detection
- Root cause analysis with traces, logs, and metrics correlation
- Issue prioritization by severity (critical, warning, info findings)
- **Wildcard Pattern Support**: Use `*payment*` for automatic service discovery
- Performance optimized for fast execution across multiple targets
**Key Use Cases:**
- `audit_services(service_targets='[{"Type":"service","Data":{"Service":{"Type":"Service","Name":"*"}}}]')` - Audit all services
- `audit_services(service_targets='[{"Type":"service","Data":{"Service":{"Type":"Service","Name":"*payment*"}}}]')` - Audit payment services
- `audit_services(..., auditors="all")` - Comprehensive root cause analysis with all auditors
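The `service_targets` argument is just a serialized JSON list, so a small helper keeps the nesting straight. This is an illustrative sketch (`build_service_targets` is hypothetical, not part of the server):

```python
import json


def build_service_targets(*names: str) -> str:
    """Build the JSON string passed as audit_services' service_targets.

    Wildcards such as "*payment*" are passed through unchanged; the
    server expands them during discovery.
    """
    targets = [
        {"Type": "service", "Data": {"Service": {"Type": "Service", "Name": name}}}
        for name in names
    ]
    return json.dumps(targets)


# audit_services(service_targets=build_service_targets("*payment*"))
```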
#### 2. **`audit_slos`** ⭐ **PRIMARY SLO AUDIT TOOL**
**The #1 tool for comprehensive SLO compliance monitoring and breach analysis**
- **PREFERRED TOOL** for SLO root cause analysis after using `get_slo()`
- Much more comprehensive than individual trace tools - provides integrated analysis
- Combines traces, logs, metrics, and dependencies in a single audit
- Automatic SLO breach detection with prioritized findings
- **Wildcard Pattern Support**: Use `*payment*` for automatic SLO discovery
- Actionable recommendations based on multi-dimensional analysis
**Key Use Cases:**
- `audit_slos(slo_targets='[{"Type":"slo","Data":{"Slo":{"SloName":"*"}}}]')` - Audit all SLOs
- `audit_slos(..., auditors="all")` - Comprehensive root cause analysis for SLO breaches
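SLO targets follow the same pattern with a shallower shape; a hypothetical helper for comparison:

```python
import json


def build_slo_targets(*slo_names: str) -> str:
    """Build the JSON string passed as audit_slos' slo_targets.
    "*" or "*payment*" patterns trigger automatic SLO discovery."""
    return json.dumps(
        [{"Type": "slo", "Data": {"Slo": {"SloName": name}}} for name in slo_names]
    )


# audit_slos(slo_targets=build_slo_targets("*"), auditors="all")
```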
#### 3. **`audit_service_operations`** 🥇 **PRIMARY OPERATION AUDIT TOOL**
**The #1 RECOMMENDED tool for operation-specific analysis and performance investigation**
- **PREFERRED OVER `audit_services()`** for operation-level auditing
- Precision targeting of exact operation behavior vs. service-wide averages
- Actionable insights with specific error traces and dependency failures
- Code-level detail with exact stack traces and timeout locations
- **Wildcard Pattern Support**: Use `*GET*` for specific operation types
- Focused analysis that eliminates noise from other operations
**Key Use Cases:**
- `audit_service_operations(operation_targets='[{"Type":"service_operation","Data":{"ServiceOperation":{"Service":{"Type":"Service","Name":"*payment*"},"Operation":"*GET*","MetricType":"Latency"}}}]')` - Audit GET operations in payment services
- `audit_service_operations(..., auditors="all")` - Root cause analysis for specific operations
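Operation targets add a service, an operation pattern, and a metric type to the same JSON scheme. A sketch of the shape (the helper itself is illustrative, not part of the server):

```python
import json


def build_operation_target(service: str, operation: str,
                           metric_type: str = "Latency") -> str:
    """Build a single-entry operation_targets JSON string for
    audit_service_operations. Both service and operation accept
    wildcard patterns such as "*payment*" and "*GET*"."""
    target = {
        "Type": "service_operation",
        "Data": {
            "ServiceOperation": {
                "Service": {"Type": "Service", "Name": service},
                "Operation": operation,
                "MetricType": metric_type,
            }
        },
    }
    return json.dumps([target])


# audit_service_operations(operation_targets=build_operation_target("*payment*", "*GET*"))
```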
### 📊 Service Discovery & Information Tools
#### 4. **`list_monitored_services`** - Service Discovery Tool
**OPTIONAL TOOL** - `audit_services()` can automatically discover services using wildcard patterns
- Get detailed overview of all monitored services in your environment
- Discover specific service names and environments for manual audit target construction
- **RECOMMENDED**: Use `audit_services()` with wildcard patterns instead for comprehensive discovery AND analysis
#### 5. **`get_service_detail`** - Service Metadata Tool
**For basic service metadata and configuration details**
- Service metadata and configuration (platform information, key attributes)
- Service-level metrics (Latency, Error, Fault aggregates)
- Log groups associated with the service
- **IMPORTANT**: This tool does NOT provide operation names - use `audit_services()` for operation discovery
#### 6. **`list_service_operations`** - Operation Discovery Tool
**CRITICAL LIMITATION**: Only discovers operations that have been ACTIVELY INVOKED in the specified time window
- Basic operation inventory for RECENTLY ACTIVE operations only (max 24 hours)
- Empty results don't mean no operations exist; they only mean none were invoked in the time window
- **RECOMMENDED**: Use `audit_services()` FIRST for comprehensive operation discovery and analysis
### 🎯 SLO Management Tools
#### 7. **`get_slo`** - SLO Configuration Details
**Essential for understanding SLO configuration before deep investigation**
- Comprehensive SLO configuration details (metrics, thresholds, goals)
- Operation names and key attributes for further investigation
- Metric type (LATENCY or AVAILABILITY) and comparison operators
- **NEXT STEP**: Use `audit_slos()` with `auditors="all"` for root cause analysis
#### 8. **`list_slos`** - SLO Discovery
**List all Service Level Objectives in Application Signals**
- Complete list of all SLOs in your account with names and ARNs
- Filter SLOs by service attributes
- Basic SLO information including creation time and operation names
- Useful for SLO discovery and finding SLO names for use with other tools
### 📈 Metrics & Performance Tools
#### 9. **`query_service_metrics`** - CloudWatch Metrics Analysis
**Get CloudWatch metrics for specific Application Signals services**
- Analyze service performance (latency, throughput, error rates)
- View trends over time with both standard statistics and percentiles
- Automatic granularity adjustment based on time range
- Summary statistics with recent data points and timestamps
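The automatic granularity adjustment can be thought of as choosing a CloudWatch metric period from the size of the query window. The thresholds below are an illustrative sketch, not the server's actual logic:

```python
def choose_period(window_seconds: int) -> int:
    """Pick a CloudWatch metric period (seconds) that keeps the number
    of returned data points manageable. CloudWatch requires periods
    that are multiples of 60 for standard-resolution metrics; the
    cutoffs here are illustrative only."""
    if window_seconds <= 3 * 3600:        # up to 3 hours: 1-minute buckets
        return 60
    if window_seconds <= 24 * 3600:       # up to 1 day: 5-minute buckets
        return 300
    if window_seconds <= 7 * 24 * 3600:   # up to 1 week: hourly buckets
        return 3600
    return 86400                          # longer ranges: daily buckets
```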
### 🔍 Advanced Trace & Log Analysis Tools
#### 10. **`search_transaction_spans`** - 100% Trace Visibility
**Query OpenTelemetry Spans data via Transaction Search (100% sampled data)**
- **100% sampled data** vs X-Ray's 5% sampling for more accurate results
- Query "aws/spans" log group with CloudWatch Logs Insights
- Generate business performance insights and summaries
- **IMPORTANT**: Always include a limit in queries to prevent overwhelming context
**Example Query:**
```
FILTER attributes.aws.local.service = "payment-service" and attributes.aws.local.environment = "eks:production"
| STATS avg(duration) as avg_latency by attributes.aws.local.operation
| LIMIT 50
```
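One way to honor the "always include a limit" rule is a small guard that appends a `LIMIT` clause when the query lacks one. This is a hypothetical client-side helper, not part of the server:

```python
def ensure_limit(query: str, max_rows: int = 50) -> str:
    """Append a LIMIT clause to a CloudWatch Logs Insights query if the
    caller forgot one, so results can't overwhelm the model's context.
    Naive check: any occurrence of "limit" in the query counts."""
    if "limit" in query.lower():
        return query
    return f"{query} | LIMIT {max_rows}"
```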
#### 11. **`query_sampled_traces`** - X-Ray Trace Analysis (Secondary Tool)
**Query AWS X-Ray traces (5% sampled data) for trace investigation**
- **⚠️ IMPORTANT**: Consider using `audit_slos()` with `auditors="all"` instead for comprehensive root cause analysis
- Uses X-Ray's 5% sampled trace data - may miss critical errors
- Limited context compared to comprehensive audit tools
- **RECOMMENDATION**: Use `get_service_detail()` for operation discovery and `audit_slos()` for root cause analysis
**Common Filter Expressions:**
- `service("service-name"){fault = true}` - Find traces with faults (5xx errors)
- `duration > 5` - Find slow requests (over 5 seconds)
- `annotation[aws.local.operation]="GET /api/orders"` - Filter by specific operation
#### 12. **`analyze_canary_failures`** - Comprehensive Canary Failure Analysis
**Deep dive into CloudWatch Synthetics canary failures with root cause identification**
- Comprehensive canary failure analysis with deep dive into issues
- Analyze historical patterns and specific incident details
- Get comprehensive artifact analysis including logs, screenshots, and HAR files
- Receive actionable recommendations based on AWS debugging methodology
- Correlate canary failures with Application Signals telemetry data
- Identify performance degradation and availability issues across service dependencies
**Key Features:**
- **Failure Pattern Analysis**: Identifies recurring failure modes and temporal patterns
- **Artifact Deep Dive**: Analyzes canary logs, screenshots, and network traces for root causes
- **Service Correlation**: Links canary failures to upstream/downstream service issues using Application Signals
- **Performance Insights**: Detects latency spikes, fault rates, and connection issues
- **Actionable Remediation**: Provides specific steps based on AWS operational best practices
- **IAM Analysis**: Validates IAM roles and permissions for common canary access issues
- **Backend Service Integration**: Correlates canary failures with backend service errors and exceptions
**Common Use Cases:**
- Incident Response: Rapid diagnosis of canary failures during outages
- Performance Investigation: Understanding latency and availability degradation
- Dependency Analysis: Identifying which services are causing canary failures
- Historical Trending: Analyzing failure patterns over time for proactive improvements
- Root Cause Analysis: Deep dive into specific failure scenarios with full context
- Infrastructure Issues: Diagnose S3 access, VPC connectivity, and browser target problems
- Backend Service Debugging: Identify application code issues affecting canary success
#### 13. **`list_slis`** - Legacy SLI Status Report (Specialized Tool)
**Use `audit_services()` as the PRIMARY tool for service auditing**
- Basic report showing summary counts (total, healthy, breached, insufficient data)
- Simple list of breached services with SLO names
- **IMPORTANT**: `audit_services()` is the PRIMARY and PREFERRED tool for all service auditing tasks
- Only use this tool for legacy SLI status report format specifically
## Installation
### One-Click Installation
| Cursor | VS Code |
|:------:|:-------:|
| [](https://cursor.com/en/install-mcp?name=awslabs.cloudwatch-appsignals-mcp-server&config=eyJhdXRvQXBwcm92ZSI6W10sImRpc2FibGVkIjpmYWxzZSwidGltZW91dCI6NjAsImNvbW1hbmQiOiJ1dnggYXdzbGFicy5jbG91ZHdhdGNoLWFwcHNpZ25hbHMtbWNwLXNlcnZlckBsYXRlc3QiLCJlbnYiOnsiQVdTX1BST0ZJTEUiOiJbVGhlIEFXUyBQcm9maWxlIE5hbWUgdG8gdXNlIGZvciBBV1MgYWNjZXNzXSIsIkFXU19SRUdJT04iOiJbVGhlIEFXUyByZWdpb24gdG8gcnVuIGluXSIsIkZBU1RNQ1BfTE9HX0xFVkVMIjoiRVJST1IifSwidHJhbnNwb3J0VHlwZSI6InN0ZGlvIn0%3D) | [](https://insiders.vscode.dev/redirect/mcp/install?name=CloudWatch%20Application%20Signals%20MCP%20Server&config=%7B%22autoApprove%22%3A%5B%5D%2C%22disabled%22%3Afalse%2C%22timeout%22%3A60%2C%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22awslabs.cloudwatch-appsignals-mcp-server%40latest%22%5D%2C%22env%22%3A%7B%22AWS_PROFILE%22%3A%22%5BThe%20AWS%20Profile%20Name%20to%20use%20for%20AWS%20access%5D%22%2C%22AWS_REGION%22%3A%22%5BThe%20AWS%20region%20to%20run%20in%5D%22%2C%22FASTMCP_LOG_LEVEL%22%3A%22ERROR%22%7D%2C%22transportType%22%3A%22stdio%22%7D) |
### Installing via `uv`
When using [`uv`](https://docs.astral.sh/uv/), no specific installation is needed. We use
[`uvx`](https://docs.astral.sh/uv/guides/tools/) to run *awslabs.cloudwatch-appsignals-mcp-server* directly.
### Installing for Amazon Q (Preview)
- Install the Amazon Q Developer CLI from [here](https://github.com/aws/amazon-q-developer-cli).
- Add the following configuration to the `~/.aws/amazonq/mcp.json` file.
```json
{
"mcpServers": {
"awslabs.cloudwatch-appsignals-mcp": {
"autoApprove": [],
"disabled": false,
"command": "uvx",
"args": [
"awslabs.cloudwatch-appsignals-mcp-server@latest"
],
"env": {
"AWS_PROFILE": "[The AWS Profile Name to use for AWS access]",
"AWS_REGION": "[AWS Region]",
"FASTMCP_LOG_LEVEL": "ERROR"
},
"transportType": "stdio"
}
}
}
```
### Installing via Claude Desktop
On macOS: `~/Library/Application\ Support/Claude/claude_desktop_config.json`
On Windows: `%APPDATA%/Claude/claude_desktop_config.json`
<details>
<summary>Development/Unpublished Servers Configuration</summary>
When installing a development or unpublished server, add the `--directory` flag:
```json
{
"mcpServers": {
"awslabs.cloudwatch-appsignals-mcp-server": {
"command": "uvx",
"args": ["--from", "/absolute/path/to/cloudwatch-appsignals-mcp-server", "awslabs.cloudwatch-appsignals-mcp-server"],
"env": {
"AWS_PROFILE": "[The AWS Profile Name to use for AWS access]",
"AWS_REGION": "[AWS Region]"
}
}
}
}
```
</details>
<details>
<summary>Published Servers Configuration</summary>
```json
{
"mcpServers": {
"awslabs.cloudwatch-appsignals-mcp-server": {
"command": "uvx",
"args": ["awslabs.cloudwatch-appsignals-mcp-server@latest"],
"env": {
"AWS_PROFILE": "[The AWS Profile Name to use for AWS access]",
"AWS_REGION": "[AWS Region]"
}
}
}
}
```
</details>
### Windows Installation
For Windows users, the MCP server configuration format is slightly different:
```json
{
"mcpServers": {
"awslabs.cloudwatch-appsignals-mcp-server": {
"disabled": false,
"timeout": 60,
"type": "stdio",
"command": "uv",
"args": [
"tool",
"run",
"--from",
"awslabs.cloudwatch-appsignals-mcp-server@latest",
"awslabs.cloudwatch-appsignals-mcp-server.exe"
],
"env": {
"FASTMCP_LOG_LEVEL": "ERROR",
"AWS_PROFILE": "your-aws-profile",
"AWS_REGION": "us-east-1"
}
}
}
}
```
### Build the Docker image locally on the same host as your LLM client
1. `git clone https://github.com/awslabs/mcp.git`
2. Change to the `src/cloudwatch-appsignals-mcp-server/` subdirectory
3. Run `docker build -t awslabs/cloudwatch-appsignals-mcp-server:latest .`
### Add or update your LLM client's configuration with the following:
```json
{
"mcpServers": {
"awslabs.cloudwatch-appsignals-mcp-server": {
"command": "docker",
"args": [
"run",
"-i",
"--rm",
"-v", "${HOME}/.aws:/root/.aws:ro",
"-e", "AWS_PROFILE=[The AWS Profile Name to use for AWS access]",
"-e", "AWS_REGION=[AWS Region]",
"awslabs/cloudwatch-appsignals-mcp-server:latest"
]
}
}
}
```
### Debugging
You can use the MCP inspector to debug the server. For `uvx` installations:
```bash
npx @modelcontextprotocol/inspector uvx awslabs.cloudwatch-appsignals-mcp-server
```
Or if you've installed the package in a specific directory:
```bash
cd /path/to/cloudwatch-appsignals-mcp-server
npx @modelcontextprotocol/inspector uv run awslabs.cloudwatch-appsignals-mcp-server
```
## Usage Examples
### Example 1: Comprehensive Service Health Audit (Recommended Workflow)
```
User: "Audit the health of all my payment services"
Assistant: I'll perform a comprehensive health audit of all your payment services using the primary audit tool.
[Uses audit_services with wildcard pattern]
audit_services(
service_targets='[{"Type":"service","Data":{"Service":{"Type":"Service","Name":"*payment*"}}}]',
auditors="slo,operation_metric"
)
Results: Found 3 payment services with the following findings:
🔴 CRITICAL FINDINGS:
• payment-api-service: SLO breach detected - Latency SLO violated (p99: 2.1s > 1.5s threshold)
• payment-processor: High error rate detected (8.2% > 5% threshold)
⚠️ WARNING FINDINGS:
• payment-gateway: Dependency timeout issues with external payment provider
✅ HEALTHY:
• payment-validator: All metrics within normal ranges
Recommendations:
1. Investigate payment-api-service latency issues - check database connection pool
2. Review payment-processor error logs for root cause analysis
3. Consider circuit breaker pattern for payment-gateway external dependencies
```
### Example 2: SLO Breach Root Cause Analysis
```
User: "I have an SLO breach for my checkout service. Can you help me investigate?"
Assistant: I'll help you investigate the SLO breach. Let me first get the SLO configuration and then perform a comprehensive root cause analysis.
[Uses get_slo to understand configuration]
[Then uses audit_slos with auditors="all" for comprehensive analysis]
audit_slos(
slo_targets='[{"Type":"slo","Data":{"Slo":{"SloName":"checkout-service-latency-slo"}}}]',
auditors="all"
)
Root Cause Analysis Results:
🔴 SLO BREACH CONFIRMED:
• checkout-service-latency-slo: 87.3% attainment (< 95% goal)
• Breach duration: 2.5 hours
• Impact: 1,247 requests affected
🔍 ROOT CAUSE IDENTIFIED:
• Database connection pool exhaustion detected
• Trace analysis shows 95% of slow requests waiting for DB connections
• Dependency on payment-service also showing elevated latency
📊 SUPPORTING EVIDENCE:
• P99 latency: 3.2s (threshold: 2.0s)
• Error rate spike: 12% during breach period
• 15 error traces with "connection timeout" messages
🛠️ RECOMMENDED ACTIONS:
1. Increase database connection pool size from 10 to 25
2. Implement connection pool monitoring alerts
3. Review payment-service dependency performance
4. Consider implementing request queuing for high load periods
```
### Example 3: Operation-Level Performance Investigation
```
User: "The GET operations in my payment services are slow. Can you investigate?"
Assistant: I'll investigate the performance of GET operations specifically in your payment services using operation-level analysis.
[Uses audit_service_operations for precise operation targeting]
audit_service_operations(
operation_targets='[{"Type":"service_operation","Data":{"ServiceOperation":{"Service":{"Type":"Service","Name":"*payment*"},"Operation":"*GET*","MetricType":"Latency"}}}]',
auditors="operation_metric,trace"
)
Operation Performance Analysis Results:
🎯 GET OPERATIONS ANALYZED:
• payment-api: GET /api/payments/{id} - P99: 1.8s
• payment-api: GET /api/payments/history - P99: 3.2s ⚠️
• payment-processor: GET /api/process/status - P99: 0.8s ✅
🔴 PERFORMANCE ISSUES IDENTIFIED:
• GET /api/payments/history: Consistently slow (3.2s avg)
- Root cause: Full table scan on payments table
- 89% of requests show database query time > 2.5s
- Missing index on user_id + created_date columns
📈 PERFORMANCE RECOMMENDATIONS:
1. Add composite index: CREATE INDEX idx_payments_user_date ON payments(user_id, created_date)
2. Implement pagination for history endpoint (limit 50 records)
3. Consider caching for frequently accessed payment histories
4. Add query timeout protection (max 5s)
🚀 EXPECTED IMPROVEMENT:
• Estimated latency reduction: 3.2s → 0.4s (87% improvement)
• Database load reduction: ~75%
```
### Example 4: Transaction Search for 100% Trace Visibility
```
User: "I want to analyze all the errors in my checkout service from the last hour"
Assistant: I'll use Transaction Search to analyze all errors with 100% trace visibility (no sampling).
[Uses search_transaction_spans for complete trace data]
search_transaction_spans(
log_group_name="aws/spans",
query_string='FILTER attributes.aws.local.service = "checkout-service" and attributes.http.status_code >= 400 | STATS count() as error_count by attributes.http.status_code, attributes.aws.local.operation | SORT error_count DESC | LIMIT 20',
start_time="2024-01-15T10:00:00Z",
end_time="2024-01-15T11:00:00Z"
)
✅ Transaction Search Results (100% sampled data):
ERROR BREAKDOWN (Last Hour):
• HTTP 500 errors: 47 occurrences
- POST /api/checkout/complete: 31 errors
- POST /api/checkout/validate: 16 errors
• HTTP 404 errors: 23 occurrences
- GET /api/checkout/{id}: 23 errors
• HTTP 503 errors: 12 occurrences
- POST /api/checkout/payment: 12 errors
🔍 KEY INSIGHTS:
• 82 total errors out of 1,247 requests (6.6% error rate)
• Most errors (66%) in checkout completion flow
• Service unavailable errors correlate with payment service dependency
🛠️ IMMEDIATE ACTIONS NEEDED:
1. Investigate checkout completion logic for 500 errors
2. Add validation for checkout ID existence (404s)
3. Implement circuit breaker for payment service calls (503s)
```
### Example 5: Service Discovery and Overview
```
User: "What services are being monitored and what's their current health status?"
Assistant: I'll get an overview of all monitored services and their health status.
[Uses list_monitored_services for discovery, then audit_services for health analysis]
Found 8 services being monitored:
📊 SERVICE INVENTORY:
• api-gateway-service (AWS::ApiGateway::RestApi) - Production
• checkout-service (AWS::ECS::Service) - Production
• payment-api-service (AWS::ECS::Service) - Production
• payment-processor (AWS::Lambda::Function) - Production
• user-service (AWS::EKS::Service) - Production
• inventory-service (AWS::EKS::Service) - Production
• notification-lambda (AWS::Lambda::Function) - Production
• analytics-service (AWS::ECS::Service) - Production
[Then performs health audit]
🏥 HEALTH STATUS SUMMARY:
✅ HEALTHY (5 services):
• api-gateway-service: All SLOs met, low latency
• user-service: Excellent performance metrics
• inventory-service: All operations within thresholds
• notification-lambda: Fast execution, no errors
• analytics-service: Stable performance
⚠️ NEEDS ATTENTION (2 services):
• checkout-service: Elevated latency on completion operations
• payment-api-service: Intermittent timeout issues
🔴 CRITICAL (1 service):
• payment-processor: SLO breach - error rate 8.2% (threshold: 5%)
💡 RECOMMENDATIONS:
1. Focus immediate attention on payment-processor error investigation
2. Monitor checkout-service latency trends
3. Review payment-api-service timeout configurations
```
### Example 6: Canary Failure Analysis and Root Cause Investigation
```
User: "My Pet Clinic canaries are failing. Can you help me diagnose the issues?"
Assistant: I'll analyze your Pet Clinic canaries to identify the root causes of failures.
[Uses analyze_canary_failures for comprehensive canary analysis]
analyze_canary_failures(canary_name="pc-visit-vet")
analyze_canary_failures(canary_name="pc-add-visit")
analyze_canary_failures(canary_name="webapp-erorrpagecanary")
🔍 CANARY FAILURE ANALYSIS RESULTS:
🔴 CRITICAL ISSUES IDENTIFIED:
**pc-visit-vet canary:**
• Root Cause: S3 bucket access issue
• Error Pattern: Exit status 127, "No such file or directory"
• Failure Count: 5 consecutive failures
• IAM Analysis: ✅ Role exists but S3 bucket ARN patterns incorrect in policies
**pc-add-visit canary:**
• Root Cause: Selector timeout + backend service errors
• Error Pattern: 30000ms timeout waiting for UI element + MissingFormatArgumentException
• Backend Issue: Format specifier '% o' error in BedrockRuntimeV1Service.invokeTitanModel()
• Performance: 34 second average response time, 0% success rate
**webapp-erorrpagecanary:**
• Root Cause: Browser target close during selector wait
• Error Pattern: "Target closed" waiting for `#jsError` selector
• Failure Count: 5 consecutive failures with 60000ms connection timeouts
🔍 BACKEND SERVICE CORRELATION:
• MissingFormatArgumentException detected in Pet Clinic backend
• Location: org.springframework.samples.petclinic.customers.aws.BedrockRuntimeV1Service.invokeTitanModel (line 75)
• Impact: Affects multiple canaries testing Pet Clinic functionality
• 20% fault rate on GET /api/customer/diagnose/owners/{ownerId}/pets/{petId}
🛠️ RECOMMENDED ACTIONS:
**Immediate (Critical):**
1. Fix S3 bucket ARN patterns in pc-visit-vet IAM policy
2. Fix format string bug in BedrockRuntimeV1Service: change '% o' to '%s' or correct format
3. Add VPC permissions to canary IAM roles if Lambda runs in VPC
**Infrastructure (High Priority):**
4. Investigate browser target stability issues (webapp-erorrpagecanary)
5. Review canary timeout configurations - consider increasing from 30s to 60s
6. Implement circuit breaker pattern for external service dependencies
**Monitoring (Medium Priority):**
7. Add Application Signals monitoring for canary success rates
8. Set up alerts for consecutive canary failures (>3 failures)
9. Implement canary health dashboard with real-time status
🎯 EXPECTED OUTCOMES:
• S3 access fix: Immediate resolution of pc-visit-vet failures
• Backend service fix: 80%+ improvement in Pet Clinic canary success rates
• Infrastructure improvements: Reduced browser target close errors
• Enhanced monitoring: Proactive failure detection and faster resolution
```
## Recommended Workflows
### 🎯 Primary Audit Workflow (Most Common)
1. **Start with `audit_services()`** - Use wildcard patterns for automatic service discovery
2. **Review findings summary** - Let user choose which issues to investigate further
3. **Deep dive with `auditors="all"`** - For selected services needing root cause analysis
### 🔍 SLO Investigation Workflow
1. **Use `get_slo()`** - Understand SLO configuration and thresholds
2. **Use `audit_slos()` with `auditors="all"`** - Comprehensive root cause analysis
3. **Follow actionable recommendations** - Implement suggested fixes
### ⚡ Operation Performance Workflow
1. **Use `audit_service_operations()`** - Target specific operations with precision
2. **Apply wildcard patterns** - e.g., `*GET*` for all GET operations
3. **Root cause analysis** - Use `auditors="all"` for detailed investigation
### 📊 Complete Observability Workflow
1. **Service Discovery** - `audit_services()` with wildcard patterns
2. **SLO Compliance** - `audit_slos()` for breach detection
3. **Operation Analysis** - `audit_service_operations()` for endpoint-specific issues
4. **Trace Investigation** - `search_transaction_spans()` for 100% trace visibility
## Configuration
### Required AWS Permissions
The server requires the following AWS IAM permissions:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"application-signals:ListServices",
"application-signals:GetService",
"application-signals:ListServiceOperations",
"application-signals:ListServiceLevelObjectives",
"application-signals:GetServiceLevelObjective",
"application-signals:BatchGetServiceLevelObjectiveBudgetReport",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"logs:GetQueryResults",
"logs:StartQuery",
"logs:StopQuery",
"xray:GetTraceSummaries",
"xray:BatchGetTraces",
"xray:GetTraceSegmentDestination"
],
"Resource": "*"
}
]
}
```
### Environment Variables
- `AWS_PROFILE` - AWS profile name to use for authentication (defaults to `default` profile)
- `AWS_REGION` - AWS region (defaults to `us-east-1`)
- `MCP_CLOUDWATCH_APPSIGNALS_LOG_LEVEL` - Logging level (defaults to `INFO`)
- `AUDITOR_LOG_PATH` - Path for audit log files (defaults to `/tmp`)
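At startup, these defaults could be resolved roughly like so (an illustrative sketch; only the variable names and documented defaults come from the list above):

```python
import os

# Resolve configuration from the environment, falling back to the
# documented defaults when a variable is unset.
profile = os.environ.get("AWS_PROFILE", "default")
region = os.environ.get("AWS_REGION", "us-east-1")
log_level = os.environ.get("MCP_CLOUDWATCH_APPSIGNALS_LOG_LEVEL", "INFO")
audit_log_path = os.environ.get("AUDITOR_LOG_PATH", "/tmp")
```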
### AWS Credentials
This server uses AWS profiles for authentication. Set the `AWS_PROFILE` environment variable to use a specific profile from your `~/.aws/credentials` file.
The server will use the standard AWS credential chain via boto3, which includes:
- AWS Profile specified by `AWS_PROFILE` environment variable
- Default profile from AWS credentials file
- IAM roles when running on EC2, ECS, Lambda, etc.
### Transaction Search Configuration
For 100% trace visibility, enable AWS X-Ray Transaction Search:
1. Configure X-Ray to send traces to CloudWatch Logs
2. Set destination to 'CloudWatchLogs' with status 'ACTIVE'
3. This enables the `search_transaction_spans()` tool for complete observability
Without Transaction Search, you'll only have access to 5% sampled trace data through X-Ray.
## Development
This server is part of the AWS Labs MCP collection. For development and contribution guidelines, please see the main repository documentation.
### Running Tests
To run the comprehensive test suite that validates all use case examples and tool functionality:
```bash
cd src/cloudwatch-appsignals-mcp-server
python -m pytest tests/test_use_case_examples.py -v
```
This test file verifies that all use case examples in the tool documentation call the correct tools with the right parameters and target formats. It includes tests for:
- All documented use cases for `audit_services()`, `audit_slos()`, and `audit_service_operations()`
- Target format validation (service, SLO, and operation targets)
- Wildcard pattern expansion functionality
- Auditor selection for different scenarios
- JSON format validation for all documentation examples
The tests use mocked AWS clients to prevent real API calls while validating the tool logic and parameter handling.
## License
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.
**`list_slos`** - SLO Discovery\n**List all Service Level Objectives in Application Signals**\n\n- Complete list of all SLOs in your account with names and ARNs\n- Filter SLOs by service attributes\n- Basic SLO information including creation time and operation names\n- Useful for SLO discovery and finding SLO names for use with other tools\n\n### \ud83d\udcc8 Metrics & Performance Tools\n\n#### 9. **`query_service_metrics`** - CloudWatch Metrics Analysis\n**Get CloudWatch metrics for specific Application Signals services**\n\n- Analyze service performance (latency, throughput, error rates)\n- View trends over time with both standard statistics and percentiles\n- Automatic granularity adjustment based on time range\n- Summary statistics with recent data points and timestamps\n\n### \ud83d\udd0d Advanced Trace & Log Analysis Tools\n\n#### 10. **`search_transaction_spans`** - 100% Trace Visibility\n**Query OpenTelemetry Spans data via Transaction Search (100% sampled data)**\n\n- **100% sampled data** vs X-Ray's 5% sampling for more accurate results\n- Query \"aws/spans\" log group with CloudWatch Logs Insights\n- Generate business performance insights and summaries\n- **IMPORTANT**: Always include a limit in queries to prevent overwhelming context\n\n**Example Query:**\n```\nFILTER attributes.aws.local.service = \"payment-service\" and attributes.aws.local.environment = \"eks:production\"\n| STATS avg(duration) as avg_latency by attributes.aws.local.operation\n| LIMIT 50\n```\n\n#### 11. 
**`query_sampled_traces`** - X-Ray Trace Analysis (Secondary Tool)\n**Query AWS X-Ray traces (5% sampled data) for trace investigation**\n\n- **\u26a0\ufe0f IMPORTANT**: Consider using `audit_slos()` with `auditors=\"all\"` instead for comprehensive root cause analysis\n- Uses X-Ray's 5% sampled trace data - may miss critical errors\n- Limited context compared to comprehensive audit tools\n- **RECOMMENDATION**: Use `get_service_detail()` for operation discovery and `audit_slos()` for root cause analysis\n\n**Common Filter Expressions:**\n- `service(\"service-name\"){fault = true}` - Find traces with faults (5xx errors)\n- `duration > 5` - Find slow requests (over 5 seconds)\n- `annotation[aws.local.operation]=\"GET /api/orders\"` - Filter by specific operation\n\n#### 12. **`analyze_canary_failures`** - Comprehensive Canary Failure Analysis\n**Deep dive into CloudWatch Synthetics canary failures with root cause identification**\n\n- Comprehensive canary failure analysis with deep dive into issues\n- Analyze historical patterns and specific incident details\n- Get comprehensive artifact analysis including logs, screenshots, and HAR files\n- Receive actionable recommendations based on AWS debugging methodology\n- Correlate canary failures with Application Signals telemetry data\n- Identify performance degradation and availability issues across service dependencies\n\n**Key Features:**\n- **Failure Pattern Analysis**: Identifies recurring failure modes and temporal patterns\n- **Artifact Deep Dive**: Analyzes canary logs, screenshots, and network traces for root causes\n- **Service Correlation**: Links canary failures to upstream/downstream service issues using Application Signals\n- **Performance Insights**: Detects latency spikes, fault rates, and connection issues\n- **Actionable Remediation**: Provides specific steps based on AWS operational best practices\n- **IAM Analysis**: Validates IAM roles and permissions for common canary access issues\n- **Backend Service 
Integration**: Correlates canary failures with backend service errors and exceptions\n\n**Common Use Cases:**\n- Incident Response: Rapid diagnosis of canary failures during outages\n- Performance Investigation: Understanding latency and availability degradation\n- Dependency Analysis: Identifying which services are causing canary failures\n- Historical Trending: Analyzing failure patterns over time for proactive improvements\n- Root Cause Analysis: Deep dive into specific failure scenarios with full context\n- Infrastructure Issues: Diagnose S3 access, VPC connectivity, and browser target problems\n- Backend Service Debugging: Identify application code issues affecting canary success\n\n#### 13. **`list_slis`** - Legacy SLI Status Report (Specialized Tool)\n**Use `audit_services()` as the PRIMARY tool for service auditing**\n\n- Basic report showing summary counts (total, healthy, breached, insufficient data)\n- Simple list of breached services with SLO names\n- **IMPORTANT**: `audit_services()` is the PRIMARY and PREFERRED tool for all service auditing tasks\n- Use this tool only when you specifically need the legacy SLI status report format\n\n## Installation\n\n### One-Click Installation\n\n| Cursor | VS Code |\n|:------:|:-------:|\n| [Install in Cursor](https://cursor.com/en/install-mcp?name=awslabs.cloudwatch-appsignals-mcp-server&config=eyJhdXRvQXBwcm92ZSI6W10sImRpc2FibGVkIjpmYWxzZSwidGltZW91dCI6NjAsImNvbW1hbmQiOiJ1dnggYXdzbGFicy5jbG91ZHdhdGNoLWFwcHNpZ25hbHMtbWNwLXNlcnZlckBsYXRlc3QiLCJlbnYiOnsiQVdTX1BST0ZJTEUiOiJbVGhlIEFXUyBQcm9maWxlIE5hbWUgdG8gdXNlIGZvciBBV1MgYWNjZXNzXSIsIkFXU19SRUdJT04iOiJbVGhlIEFXUyByZWdpb24gdG8gcnVuIGluXSIsIkZBU1RNQ1BfTE9HX0xFVkVMIjoiRVJST1IifSwidHJhbnNwb3J0VHlwZSI6InN0ZGlvIn0%3D) | 
[Install in VS Code](https://insiders.vscode.dev/redirect/mcp/install?name=CloudWatch%20Application%20Signals%20MCP%20Server&config=%7B%22autoApprove%22%3A%5B%5D%2C%22disabled%22%3Afalse%2C%22timeout%22%3A60%2C%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22awslabs.cloudwatch-appsignals-mcp-server%40latest%22%5D%2C%22env%22%3A%7B%22AWS_PROFILE%22%3A%22%5BThe%20AWS%20Profile%20Name%20to%20use%20for%20AWS%20access%5D%22%2C%22AWS_REGION%22%3A%22%5BThe%20AWS%20region%20to%20run%20in%5D%22%2C%22FASTMCP_LOG_LEVEL%22%3A%22ERROR%22%7D%2C%22transportType%22%3A%22stdio%22%7D) |\n\n### Installing via `uv`\n\nWhen using [`uv`](https://docs.astral.sh/uv/), no separate installation step is needed. We will use [`uvx`](https://docs.astral.sh/uv/guides/tools/) to run *awslabs.cloudwatch-appsignals-mcp-server* directly.\n\n### Installing for Amazon Q (Preview)\n\n- Install the Amazon Q Developer CLI from [here](https://github.com/aws/amazon-q-developer-cli).\n- Add the following configuration to the `~/.aws/amazonq/mcp.json` file:\n```json\n{\n \"mcpServers\": {\n \"awslabs.cloudwatch-appsignals-mcp\": {\n \"autoApprove\": [],\n \"disabled\": false,\n \"command\": \"uvx\",\n \"args\": [\n \"awslabs.cloudwatch-appsignals-mcp-server@latest\"\n ],\n \"env\": {\n \"AWS_PROFILE\": \"[The AWS Profile Name to use for AWS access]\",\n \"AWS_REGION\": \"[AWS Region]\",\n \"FASTMCP_LOG_LEVEL\": \"ERROR\"\n },\n \"transportType\": \"stdio\"\n }\n }\n}\n```\n\n### Installing via Claude Desktop\n\nOn macOS: `~/Library/Application\\ Support/Claude/claude_desktop_config.json`\nOn Windows: `%APPDATA%/Claude/claude_desktop_config.json`\n\n<details>\n <summary>Development/Unpublished Servers Configuration</summary>\n When installing a development or unpublished server, use the `--from` flag with the local path:\n\n ```json\n {\n \"mcpServers\": {\n \"awslabs.cloudwatch-appsignals-mcp-server\": {\n \"command\": \"uvx\",\n \"args\": [\"--from\", \"/absolute/path/to/cloudwatch-appsignals-mcp-server\", \"awslabs.cloudwatch-appsignals-mcp-server\"],\n 
\"env\": {\n \"AWS_PROFILE\": \"[The AWS Profile Name to use for AWS access]\",\n \"AWS_REGION\": \"[AWS Region]\"\n }\n }\n }\n }\n ```\n</details>\n\n<details>\n <summary>Published Servers Configuration</summary>\n\n ```json\n {\n \"mcpServers\": {\n \"awslabs.cloudwatch-appsignals-mcp-server\": {\n \"command\": \"uvx\",\n \"args\": [\"awslabs.cloudwatch-appsignals-mcp-server@latest\"],\n \"env\": {\n \"AWS_PROFILE\": \"[The AWS Profile Name to use for AWS access]\",\n \"AWS_REGION\": \"[AWS Region]\"\n }\n }\n }\n }\n ```\n</details>\n\n### Windows Installation\n\nFor Windows users, the MCP server configuration format is slightly different:\n\n```json\n{\n \"mcpServers\": {\n \"awslabs.cloudwatch-appsignals-mcp-server\": {\n \"disabled\": false,\n \"timeout\": 60,\n \"type\": \"stdio\",\n \"command\": \"uv\",\n \"args\": [\n \"tool\",\n \"run\",\n \"--from\",\n \"awslabs.cloudwatch-appsignals-mcp-server@latest\",\n \"awslabs.cloudwatch-appsignals-mcp-server.exe\"\n ],\n \"env\": {\n \"FASTMCP_LOG_LEVEL\": \"ERROR\",\n \"AWS_PROFILE\": \"your-aws-profile\",\n \"AWS_REGION\": \"us-east-1\"\n }\n }\n }\n}\n```\n\n### Build and install the Docker image locally on the same host as your LLM client\n\n1. Run `git clone https://github.com/awslabs/mcp.git`\n2. Go to the subdirectory `src/cloudwatch-appsignals-mcp-server/`\n3. Run `docker build -t awslabs/cloudwatch-appsignals-mcp-server:latest .`\n\n### Add or update your LLM client's config with the following:\n```json\n{\n \"mcpServers\": {\n \"awslabs.cloudwatch-appsignals-mcp-server\": {\n \"command\": \"docker\",\n \"args\": [\n \"run\",\n \"-i\",\n \"--rm\",\n \"-v\", \"${HOME}/.aws:/root/.aws:ro\",\n \"-e\", \"AWS_PROFILE=[The AWS Profile Name to use for AWS access]\",\n \"-e\", \"AWS_REGION=[AWS Region]\",\n \"awslabs/cloudwatch-appsignals-mcp-server:latest\"\n ]\n }\n }\n}\n```\n\n### Debugging\n\nYou can use the MCP inspector to debug the server. 
For `uvx` installations:\n\n```bash\nnpx @modelcontextprotocol/inspector uvx awslabs.cloudwatch-appsignals-mcp-server\n```\n\nOr if you've installed the package in a specific directory:\n\n```bash\ncd /path/to/cloudwatch-appsignals-mcp-server\nnpx @modelcontextprotocol/inspector uv run awslabs.cloudwatch-appsignals-mcp-server\n```\n\n## Usage Examples\n\n### Example 1: Comprehensive Service Health Audit (Recommended Workflow)\n```\nUser: \"Audit the health of all my payment services\"\nAssistant: I'll perform a comprehensive health audit of all your payment services using the primary audit tool.\n\n[Uses audit_services with wildcard pattern]\n\naudit_services(\n service_targets='[{\"Type\":\"service\",\"Data\":{\"Service\":{\"Type\":\"Service\",\"Name\":\"*payment*\"}}}]',\n auditors=\"slo,operation_metric\"\n)\n\nResults: Found 3 payment services with the following findings:\n\n\ud83d\udd34 CRITICAL FINDINGS:\n\u2022 payment-api-service: SLO breach detected - Latency SLO violated (p99: 2.1s > 1.5s threshold)\n\u2022 payment-processor: High error rate detected (8.2% > 5% threshold)\n\n\u26a0\ufe0f WARNING FINDINGS:\n\u2022 payment-gateway: Dependency timeout issues with external payment provider\n\n\u2705 HEALTHY:\n\u2022 payment-validator: All metrics within normal ranges\n\nRecommendations:\n1. Investigate payment-api-service latency issues - check database connection pool\n2. Review payment-processor error logs for root cause analysis\n3. Consider circuit breaker pattern for payment-gateway external dependencies\n```\n\n### Example 2: SLO Breach Root Cause Analysis\n```\nUser: \"I have an SLO breach for my checkout service. Can you help me investigate?\"\nAssistant: I'll help you investigate the SLO breach. 
Let me first get the SLO configuration and then perform a comprehensive root cause analysis.\n\n[Uses get_slo to understand configuration]\n[Then uses audit_slos with auditors=\"all\" for comprehensive analysis]\n\naudit_slos(\n slo_targets='[{\"Type\":\"slo\",\"Data\":{\"Slo\":{\"SloName\":\"checkout-service-latency-slo\"}}}]',\n auditors=\"all\"\n)\n\nRoot Cause Analysis Results:\n\n\ud83d\udd34 SLO BREACH CONFIRMED:\n\u2022 checkout-service-latency-slo: 87.3% attainment (< 95% goal)\n\u2022 Breach duration: 2.5 hours\n\u2022 Impact: 1,247 requests affected\n\n\ud83d\udd0d ROOT CAUSE IDENTIFIED:\n\u2022 Database connection pool exhaustion detected\n\u2022 Trace analysis shows 95% of slow requests waiting for DB connections\n\u2022 Dependency on payment-service also showing elevated latency\n\n\ud83d\udcca SUPPORTING EVIDENCE:\n\u2022 P99 latency: 3.2s (threshold: 2.0s)\n\u2022 Error rate spike: 12% during breach period\n\u2022 15 error traces with \"connection timeout\" messages\n\n\ud83d\udee0\ufe0f RECOMMENDED ACTIONS:\n1. Increase database connection pool size from 10 to 25\n2. Implement connection pool monitoring alerts\n3. Review payment-service dependency performance\n4. Consider implementing request queuing for high load periods\n```\n\n### Example 3: Operation-Level Performance Investigation\n```\nUser: \"The GET operations in my payment services are slow. 
Can you investigate?\"\nAssistant: I'll investigate the performance of GET operations specifically in your payment services using operation-level analysis.\n\n[Uses audit_service_operations for precise operation targeting]\n\naudit_service_operations(\n operation_targets='[{\"Type\":\"service_operation\",\"Data\":{\"ServiceOperation\":{\"Service\":{\"Type\":\"Service\",\"Name\":\"*payment*\"},\"Operation\":\"*GET*\",\"MetricType\":\"Latency\"}}}]',\n auditors=\"operation_metric,trace\"\n)\n\nOperation Performance Analysis Results:\n\n\ud83c\udfaf GET OPERATIONS ANALYZED:\n\u2022 payment-api: GET /api/payments/{id} - P99: 1.8s\n\u2022 payment-api: GET /api/payments/history - P99: 3.2s \u26a0\ufe0f\n\u2022 payment-processor: GET /api/process/status - P99: 0.8s \u2705\n\n\ud83d\udd34 PERFORMANCE ISSUES IDENTIFIED:\n\u2022 GET /api/payments/history: Consistently slow (3.2s avg)\n - Root cause: Full table scan on payments table\n - 89% of requests show database query time > 2.5s\n - Missing index on user_id + created_date columns\n\n\ud83d\udcc8 PERFORMANCE RECOMMENDATIONS:\n1. Add composite index: CREATE INDEX idx_payments_user_date ON payments(user_id, created_date)\n2. Implement pagination for history endpoint (limit 50 records)\n3. Consider caching for frequently accessed payment histories\n4. 
Add query timeout protection (max 5s)\n\n\ud83d\ude80 EXPECTED IMPROVEMENT:\n\u2022 Estimated latency reduction: 3.2s \u2192 0.4s (87% improvement)\n\u2022 Database load reduction: ~75%\n```\n\n### Example 4: Transaction Search for 100% Trace Visibility\n```\nUser: \"I want to analyze all the errors in my checkout service from the last hour\"\nAssistant: I'll use Transaction Search to analyze all errors with 100% trace visibility (no sampling).\n\n[Uses search_transaction_spans for complete trace data]\n\nsearch_transaction_spans(\n log_group_name=\"aws/spans\",\n query_string='FILTER attributes.aws.local.service = \"checkout-service\" and attributes.http.status_code >= 400 | STATS count() as error_count by attributes.http.status_code, attributes.aws.local.operation | SORT error_count DESC | LIMIT 20',\n start_time=\"2024-01-15T10:00:00Z\",\n end_time=\"2024-01-15T11:00:00Z\"\n)\n\n\u2705 Transaction Search Results (100% sampled data):\n\nERROR BREAKDOWN (Last Hour):\n\u2022 HTTP 500 errors: 47 occurrences\n - POST /api/checkout/complete: 31 errors\n - POST /api/checkout/validate: 16 errors\n\n\u2022 HTTP 404 errors: 23 occurrences\n - GET /api/checkout/{id}: 23 errors\n\n\u2022 HTTP 503 errors: 12 occurrences\n - POST /api/checkout/payment: 12 errors\n\n\ud83d\udd0d KEY INSIGHTS:\n\u2022 82 total errors out of 1,247 requests (6.6% error rate)\n\u2022 Most errors (66%) in checkout completion flow\n\u2022 Service unavailable errors correlate with payment service dependency\n\n\ud83d\udee0\ufe0f IMMEDIATE ACTIONS NEEDED:\n1. Investigate checkout completion logic for 500 errors\n2. Add validation for checkout ID existence (404s)\n3. 
Implement circuit breaker for payment service calls (503s)\n```\n\n### Example 5: Service Discovery and Overview\n```\nUser: \"What services are being monitored and what's their current health status?\"\nAssistant: I'll get an overview of all monitored services and their health status.\n\n[Uses list_monitored_services for discovery, then audit_services for health analysis]\n\nFound 8 services being monitored:\n\n\ud83d\udcca SERVICE INVENTORY:\n\u2022 api-gateway-service (AWS::ApiGateway::RestApi) - Production\n\u2022 checkout-service (AWS::ECS::Service) - Production\n\u2022 payment-api-service (AWS::ECS::Service) - Production\n\u2022 payment-processor (AWS::Lambda::Function) - Production\n\u2022 user-service (AWS::EKS::Service) - Production\n\u2022 inventory-service (AWS::EKS::Service) - Production\n\u2022 notification-lambda (AWS::Lambda::Function) - Production\n\u2022 analytics-service (AWS::ECS::Service) - Production\n\n[Then performs health audit]\n\n\ud83c\udfe5 HEALTH STATUS SUMMARY:\n\u2705 HEALTHY (5 services):\n\u2022 api-gateway-service: All SLOs met, low latency\n\u2022 user-service: Excellent performance metrics\n\u2022 inventory-service: All operations within thresholds\n\u2022 notification-lambda: Fast execution, no errors\n\u2022 analytics-service: Stable performance\n\n\u26a0\ufe0f NEEDS ATTENTION (2 services):\n\u2022 checkout-service: Elevated latency on completion operations\n\u2022 payment-api-service: Intermittent timeout issues\n\n\ud83d\udd34 CRITICAL (1 service):\n\u2022 payment-processor: SLO breach - error rate 8.2% (threshold: 5%)\n\n\ud83d\udca1 RECOMMENDATIONS:\n1. Focus immediate attention on payment-processor error investigation\n2. Monitor checkout-service latency trends\n3. Review payment-api-service timeout configurations\n```\n\n### Example 6: Canary Failure Analysis and Root Cause Investigation\n```\nUser: \"My Pet Clinic canaries are failing. 
Can you help me diagnose the issues?\"\nAssistant: I'll analyze your Pet Clinic canaries to identify the root causes of failures.\n\n[Uses analyze_canary_failures for comprehensive canary analysis]\n\nanalyze_canary_failures(canary_name=\"pc-visit-vet\")\nanalyze_canary_failures(canary_name=\"pc-add-visit\")\nanalyze_canary_failures(canary_name=\"webapp-erorrpagecanary\")\n\n\ud83d\udd0d CANARY FAILURE ANALYSIS RESULTS:\n\n\ud83d\udd34 CRITICAL ISSUES IDENTIFIED:\n\n**pc-visit-vet canary:**\n\u2022 Root Cause: S3 bucket access issue\n\u2022 Error Pattern: Exit status 127, \"No such file or directory\"\n\u2022 Failure Count: 5 consecutive failures\n\u2022 IAM Analysis: \u2705 Role exists but S3 bucket ARN patterns incorrect in policies\n\n**pc-add-visit canary:**\n\u2022 Root Cause: Selector timeout + backend service errors\n\u2022 Error Pattern: 30000ms timeout waiting for UI element + MissingFormatArgumentException\n\u2022 Backend Issue: Format specifier '% o' error in BedrockRuntimeV1Service.invokeTitanModel()\n\u2022 Performance: 34 second average response time, 0% success rate\n\n**webapp-erorrpagecanary:**\n\u2022 Root Cause: Browser target close during selector wait\n\u2022 Error Pattern: \"Target closed\" waiting for `#jsError` selector\n\u2022 Failure Count: 5 consecutive failures with 60000ms connection timeouts\n\n\ud83d\udd0d BACKEND SERVICE CORRELATION:\n\u2022 MissingFormatArgumentException detected in Pet Clinic backend\n\u2022 Location: org.springframework.samples.petclinic.customers.aws.BedrockRuntimeV1Service.invokeTitanModel (line 75)\n\u2022 Impact: Affects multiple canaries testing Pet Clinic functionality\n\u2022 20% fault rate on GET /api/customer/diagnose/owners/{ownerId}/pets/{petId}\n\n\ud83d\udee0\ufe0f RECOMMENDED ACTIONS:\n\n**Immediate (Critical):**\n1. Fix S3 bucket ARN patterns in pc-visit-vet IAM policy\n2. Fix format string bug in BedrockRuntimeV1Service: change '% o' to '%s' or correct format\n3. 
Add VPC permissions to canary IAM roles if Lambda runs in VPC\n\n**Infrastructure (High Priority):**\n4. Investigate browser target stability issues (webapp-erorrpagecanary)\n5. Review canary timeout configurations - consider increasing from 30s to 60s\n6. Implement circuit breaker pattern for external service dependencies\n\n**Monitoring (Medium Priority):**\n7. Add Application Signals monitoring for canary success rates\n8. Set up alerts for consecutive canary failures (>3 failures)\n9. Implement canary health dashboard with real-time status\n\n\ud83c\udfaf EXPECTED OUTCOMES:\n\u2022 S3 access fix: Immediate resolution of pc-visit-vet failures\n\u2022 Backend service fix: 80%+ improvement in Pet Clinic canary success rates\n\u2022 Infrastructure improvements: Reduced browser target close errors\n\u2022 Enhanced monitoring: Proactive failure detection and faster resolution\n```\n\n## Recommended Workflows\n\n### \ud83c\udfaf Primary Audit Workflow (Most Common)\n1. **Start with `audit_services()`** - Use wildcard patterns for automatic service discovery\n2. **Review findings summary** - Let user choose which issues to investigate further\n3. **Deep dive with `auditors=\"all\"`** - For selected services needing root cause analysis\n\n### \ud83d\udd0d SLO Investigation Workflow\n1. **Use `get_slo()`** - Understand SLO configuration and thresholds\n2. **Use `audit_slos()` with `auditors=\"all\"`** - Comprehensive root cause analysis\n3. **Follow actionable recommendations** - Implement suggested fixes\n\n### \u26a1 Operation Performance Workflow\n1. **Use `audit_service_operations()`** - Target specific operations with precision\n2. **Apply wildcard patterns** - e.g., `*GET*` for all GET operations\n3. **Root cause analysis** - Use `auditors=\"all\"` for detailed investigation\n\n### \ud83d\udcca Complete Observability Workflow\n1. **Service Discovery** - `audit_services()` with wildcard patterns\n2. **SLO Compliance** - `audit_slos()` for breach detection\n3. 
**Operation Analysis** - `audit_service_operations()` for endpoint-specific issues\n4. **Trace Investigation** - `search_transaction_spans()` for 100% trace visibility\n\n## Configuration\n\n### Required AWS Permissions\n\nThe server requires the following AWS IAM permissions:\n\n```json\n{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Effect\": \"Allow\",\n \"Action\": [\n \"application-signals:ListServices\",\n \"application-signals:GetService\",\n \"application-signals:ListServiceOperations\",\n \"application-signals:ListServiceLevelObjectives\",\n \"application-signals:GetServiceLevelObjective\",\n \"application-signals:BatchGetServiceLevelObjectiveBudgetReport\",\n \"cloudwatch:GetMetricData\",\n \"cloudwatch:GetMetricStatistics\",\n \"logs:GetQueryResults\",\n \"logs:StartQuery\",\n \"logs:StopQuery\",\n \"xray:GetTraceSummaries\",\n \"xray:BatchGetTraces\",\n \"xray:GetTraceSegmentDestination\"\n ],\n \"Resource\": \"*\"\n }\n ]\n}\n```\n\n### Environment Variables\n\n- `AWS_PROFILE` - AWS profile name to use for authentication (defaults to `default` profile)\n- `AWS_REGION` - AWS region (defaults to us-east-1)\n- `MCP_CLOUDWATCH_APPSIGNALS_LOG_LEVEL` - Logging level (defaults to INFO)\n- `AUDITOR_LOG_PATH` - Path for audit log files (defaults to /tmp)\n\n### AWS Credentials\n\nThis server uses AWS profiles for authentication. Set the `AWS_PROFILE` environment variable to use a specific profile from your `~/.aws/credentials` file.\n\nThe server will use the standard AWS credential chain via boto3, which includes:\n- AWS Profile specified by `AWS_PROFILE` environment variable\n- Default profile from AWS credentials file\n- IAM roles when running on EC2, ECS, Lambda, etc.\n\n### Transaction Search Configuration\n\nFor 100% trace visibility, enable AWS X-Ray Transaction Search:\n1. Configure X-Ray to send traces to CloudWatch Logs\n2. Set destination to 'CloudWatchLogs' with status 'ACTIVE'\n3. 
This enables the `search_transaction_spans()` tool for complete observability\n\nWithout Transaction Search, you'll only have access to 5% sampled trace data through X-Ray.\n\n## Development\n\nThis server is part of the AWS Labs MCP collection. For development and contribution guidelines, please see the main repository documentation.\n\n### Running Tests\n\nTo run the comprehensive test suite that validates all use case examples and tool functionality:\n\n```bash\ncd src/cloudwatch-appsignals-mcp-server\npython -m pytest tests/test_use_case_examples.py -v\n```\n\nThis test file verifies that all use case examples in the tool documentation call the correct tools with the right parameters and target formats. It includes tests for:\n\n- All documented use cases for `audit_services()`, `audit_slos()`, and `audit_service_operations()`\n- Target format validation (service, SLO, and operation targets)\n- Wildcard pattern expansion functionality\n- Auditor selection for different scenarios\n- JSON format validation for all documentation examples\n\nThe tests use mocked AWS clients to prevent real API calls while validating the tool logic and parameter handling.\n\n## License\n\nThis project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.\n",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "An AWS Labs Model Context Protocol (MCP) server for AWS Application Signals",
"version": "0.1.11",
"project_urls": {
"Bug Tracker": "https://github.com/awslabs/mcp/issues",
"Changelog": "https://github.com/awslabs/mcp/blob/main/src/cloudwatch-appsignals-mcp-server/CHANGELOG.md",
"Documentation": "https://awslabs.github.io/mcp/servers/cloudwatch-appsignals-mcp-server/",
"Homepage": "https://awslabs.github.io/mcp/",
"Source": "https://github.com/awslabs/mcp.git"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "6a3d0049a0543bd5e6e672be3deef0b23d4edb2af61b741b4448766b1e39eaf1",
"md5": "2eff6a6f64fe2cb91c538abd463e325f",
"sha256": "3fd124d1ebc19e6b0f6350cd9e33e875882185d6af8a27a79591fc31a25baa95"
},
"downloads": -1,
"filename": "awslabs_cloudwatch_appsignals_mcp_server-0.1.11-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2eff6a6f64fe2cb91c538abd463e325f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 80998,
"upload_time": "2025-10-22T17:35:34",
"upload_time_iso_8601": "2025-10-22T17:35:34.696201Z",
"url": "https://files.pythonhosted.org/packages/6a/3d/0049a0543bd5e6e672be3deef0b23d4edb2af61b741b4448766b1e39eaf1/awslabs_cloudwatch_appsignals_mcp_server-0.1.11-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "3d8101923e44955b7548f383a6875a031c06a362a1e0645c96cb2a900b913eeb",
"md5": "dcc0aceef03ca8030ecdb4362be02540",
"sha256": "aa2ef003b5e695bc69e0315116346b5fe2bda140d719c87341be86faf9cae7c2"
},
"downloads": -1,
"filename": "awslabs_cloudwatch_appsignals_mcp_server-0.1.11.tar.gz",
"has_sig": false,
"md5_digest": "dcc0aceef03ca8030ecdb4362be02540",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 201191,
"upload_time": "2025-10-22T17:35:36",
"upload_time_iso_8601": "2025-10-22T17:35:36.466157Z",
"url": "https://files.pythonhosted.org/packages/3d/81/01923e44955b7548f383a6875a031c06a362a1e0645c96cb2a900b913eeb/awslabs_cloudwatch_appsignals_mcp_server-0.1.11.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-22 17:35:36",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "awslabs",
"github_project": "mcp",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "awslabs.cloudwatch-appsignals-mcp-server"
}