# Failure Invoker MCP Server
A chaos engineering tool that runs failure injection experiments across multiple AWS services using AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) and AWS Systems Manager (SSM).
## Features
- **Multi-Service Support**: Target EC2, RDS, ECS, Lambda, ASG, ELB, EKS, and MSK
- **Tag-Based Targeting**: Flexible resource selection using AWS tags
- **Configurable Duration**: Control experiment duration with human-readable formats
- **Auto-Recovery**: Built-in recovery mechanisms for most services
- **Comprehensive Logging**: Detailed experiment tracking and status monitoring
## Supported AWS Services
| Service | Action | Recovery |
|---------|--------|----------|
| EC2 | Stop instances | Auto-restart after duration |
| RDS | Reboot/Failover | Automatic |
| ECS | Stop tasks | Service auto-recovery |
| Lambda | Error injection | Duration-based |
| ASG | Capacity errors | Duration-based |
| ELB | Unavailable state | Duration-based |
| EKS | Terminate nodes | Auto Scaling recovery |
| MSK | Restart brokers | Automatic |
## Installation
### MCP Configuration
```json
{
  "mcpServers": {
    "failure-invoker": {
      "command": "uvx",
      "args": ["failure-invoker-mcp@latest"],
      "env": {
        "AWS_REGION": "us-west-2",
        "AWS_ACCESS_KEY_ID": "your-access-key",
        "AWS_SECRET_ACCESS_KEY": "your-secret-key"
      }
    }
  }
}
```
### Strands Agent SDK
```python
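from mcp import StdioServerParameters
from mcp.client.stdio import stdio_client
from strands import Agent
from strands.tools.mcp import MCPClient

# Launch the server over stdio via uvx and pass AWS settings through its environment
failure_invoker_client = MCPClient(
    lambda: stdio_client(
        StdioServerParameters(
            command="uvx",
            args=["failure-invoker-mcp@latest"],
            env={
                "AWS_REGION": "us-west-2",
                "AWS_ACCESS_KEY_ID": "your-access-key",
                "AWS_SECRET_ACCESS_KEY": "your-secret-key",
            },
        )
    )
)

# The client must stay running for as long as the agent uses its tools
failure_invoker_client.start()

# `model` and `system_prompt` are placeholders for your own model and prompt
agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=failure_invoker_client.list_tools_sync(),
)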
```
## Available Tools
### 1. `db_failure`
Execute database failure experiments on RDS instances or Aurora clusters.
**Parameters:**
- `db_identifier` (required): RDS instance or cluster identifier
- `failure_type` (optional): "reboot" or "failover" (default: "reboot")
- `region` (optional): AWS region (uses AWS_REGION env var if not specified)
**Examples:**
```python
# Reboot RDS instance
db_failure(db_identifier="my-database", failure_type="reboot")
# Failover Aurora cluster
db_failure(db_identifier="my-cluster", failure_type="failover", region="us-east-1")
```
### 2. `az_failure`
Execute availability zone failure experiments affecting all resources in a specific AZ.
**Parameters:**
- `availability_zone` (required): Target availability zone (e.g., "us-west-2a")
- `region` (optional): AWS region (uses AWS_REGION env var if not specified)
**Examples:**
```python
# Simulate AZ failure
az_failure(availability_zone="us-west-2a")
# Target specific region
az_failure(availability_zone="eu-west-1b", region="eu-west-1")
```
### 3. `msk_failure`
Execute failure experiments on Amazon MSK (Managed Streaming for Apache Kafka) clusters.
**Parameters:**
- `cluster_name` (required): MSK cluster name
- `region` (optional): AWS region (uses AWS_REGION env var if not specified)
**Examples:**
```python
# Restart MSK brokers
msk_failure(cluster_name="my-kafka-cluster")
# Target specific region
msk_failure(cluster_name="prod-kafka", region="us-east-1")
```
### 4. `tag_based_failure`
Execute failure experiments on all resources matching specified tags across multiple AWS services.
**Parameters:**
- `tag_key` (required): Tag key to search for
- `tag_value` (required): Tag value to match
- `duration` (optional): Duration of the failure (e.g., "60s", "10m", "2h", default: "10m")
- `region` (optional): AWS region (uses AWS_REGION env var if not specified)
**Examples:**
```python
# Target all resources with Environment=test tag
tag_based_failure(tag_key="Environment", tag_value="test", duration="5m")
# Target specific team's resources
tag_based_failure(tag_key="Team", tag_value="backend", duration="30s")
# Target EKS cluster nodes
tag_based_failure(
    tag_key="eks:cluster-name",
    tag_value="my-cluster",
    duration="2m"
)

# Target auto-scaling enabled resources
tag_based_failure(
    tag_key="k8s.io/cluster-autoscaler/enabled",
    tag_value="true",
    duration="1h"
)
```
### 5. `get_experiment_status`
Check the status of running or completed FIS experiments.
**Parameters:**
- `experiment_id` (optional): Specific experiment ID to check
- `region` (optional): AWS region (uses AWS_REGION env var if not specified)
**Examples:**
```python
# Get all recent experiments
get_experiment_status()
# Check specific experiment
get_experiment_status(experiment_id="EXP123456789")
# Check experiments in specific region
get_experiment_status(region="eu-west-1")
```
## Duration Format
The `duration` parameter accepts human-readable formats:
- `"30s"` - 30 seconds
- `"5m"` - 5 minutes
- `"2h"` - 2 hours
- `"1h30m"` - 1 hour 30 minutes
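AWS FIS action parameters take ISO 8601 durations such as `PT5M`, so the formats above have to be translated before an experiment is created. The following is a minimal sketch of one way to do that translation; the function name `to_iso8601_duration` and the strict hours-minutes-seconds ordering are assumptions for illustration, not the server's actual parser.

```python
import re

# Optional hours, minutes, and seconds, in that order (e.g. "1h30m", "45s").
_DURATION = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def to_iso8601_duration(text: str) -> str:
    """Convert a string such as "1h30m" into an ISO 8601 duration such as "PT1H30M"."""
    match = _DURATION.fullmatch(text.strip())
    # An empty string also matches the pattern, so require at least one component.
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = match.groups()
    parts = [
        f"{int(value)}{designator}"
        for value, designator in ((hours, "H"), (minutes, "M"), (seconds, "S"))
        if value is not None
    ]
    return "PT" + "".join(parts)


print(to_iso8601_duration("30s"))
print(to_iso8601_duration("1h30m"))
```

Rejecting unknown input with `ValueError` is a design choice of this sketch; it keeps a typo such as `"10 min"` from silently becoming a different duration.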
## Resource Targeting Logic
### Tag-Based Targeting
The `tag_based_failure` tool searches across all supported AWS services:
1. **EC2 Instances**: Uses describe-instances with tag filters
2. **RDS**: Queries all instances/clusters, then checks tags individually
3. **ECS**: Searches services across all clusters for matching tags
4. **Lambda**: Iterates through functions checking tags
5. **ASG**: Examines Auto Scaling Group tags
6. **ELB**: Checks Load Balancer tags
7. **EKS**: Searches Node Groups across all clusters
8. **MSK**: Not included in tag-based targeting (use `msk_failure` instead)
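The two lookup styles in the list above (server-side tag filters for EC2, per-resource tag checks for services such as RDS) might look roughly like the sketch below. The helper names are hypothetical, and the EC2 client is passed in so the logic can be exercised without AWS access; in real use it would be a `boto3.client("ec2")`.

```python
from typing import Any, Iterable


def ec2_tag_filter(tag_key: str, tag_value: str) -> list[dict[str, Any]]:
    """Build the Filters argument for EC2 DescribeInstances (server-side match)."""
    return [{"Name": f"tag:{tag_key}", "Values": [tag_value]}]


def has_exact_tag(tags: Iterable[dict[str, str]], tag_key: str, tag_value: str) -> bool:
    """Client-side check against a Key/Value tag list, as returned in an RDS TagList."""
    return any(t.get("Key") == tag_key and t.get("Value") == tag_value for t in tags)


def find_instance_ids(ec2_client: Any, tag_key: str, tag_value: str) -> list[str]:
    """Collect instance IDs from every page of describe_instances results."""
    paginator = ec2_client.get_paginator("describe_instances")
    instance_ids: list[str] = []
    for page in paginator.paginate(Filters=ec2_tag_filter(tag_key, tag_value)):
        for reservation in page["Reservations"]:
            instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])
    return instance_ids
```

The comparison in `has_exact_tag` is case-sensitive, so `Team=Backend` does not match `Team=backend`.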
### Failure Actions by Service
- **EC2**: Stop instances → Auto-restart after duration
- **RDS Instances**: Reboot → Automatic recovery
- **RDS Clusters**: Failover → Automatic recovery
- **ECS**: Stop tasks → Service maintains desired count
- **Lambda**: Inject errors → Duration-based
- **ASG**: Insufficient capacity errors → Duration-based
- **ELB**: Mark unavailable → Duration-based
- **EKS**: Terminate 100% of nodes → Auto Scaling recovery
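As an illustration of the EC2 entry, the sketch below builds the arguments for an FIS experiment template that stops tagged instances and restarts them after a duration. It uses the FIS `aws:ec2:stop-instances` action and its `startInstancesAfterDuration` parameter; the function name, the target and action labels, and the role ARN are placeholders, and the templates this server actually generates may differ.

```python
def ec2_stop_template(tag_key: str, tag_value: str, duration_iso: str, role_arn: str) -> dict:
    """Build keyword arguments for the FIS CreateExperimentTemplate API."""
    return {
        "description": f"Stop EC2 instances tagged {tag_key}={tag_value}",
        "roleArn": role_arn,
        # No CloudWatch alarm guards this experiment; "none" is the explicit opt-out.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "tagged-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                # FIS restarts the instances once this ISO 8601 duration elapses.
                "parameters": {"startInstancesAfterDuration": duration_iso},
                "targets": {"Instances": "tagged-instances"},
            }
        },
    }


template = ec2_stop_template(
    "Environment",
    "test",
    "PT10M",
    "arn:aws:iam::123456789012:role/example-fis-role",  # placeholder ARN
)
print(template["actions"]["stop-instances"]["parameters"])
```

With boto3, a dictionary like this could be passed as `boto3.client("fis").create_experiment_template(**template)`.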
## Prerequisites
1. **AWS Credentials**: Configure via environment variables or AWS profiles
2. **IAM Permissions**: Ensure the following permissions:
   - `fis:*` - For Fault Injection Simulator
   - `ssm:*` - For Systems Manager (MSK experiments)
   - `ec2:*`, `rds:*`, `ecs:*`, `lambda:*`, `autoscaling:*`, `elasticloadbalancing:*`, `eks:*`, `kafka:*` - For resource discovery and targeting
3. **FIS Service Role**: Create an IAM role for FIS experiments with appropriate permissions
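For the FIS service role in step 3, the role's trust policy has to allow the FIS service principal, `fis.amazonaws.com`, to assume it. The sketch below only builds that trust policy document; the variable names are illustrative, and the permissions attached to the role depend on which experiments you plan to run.

```python
import json

# Trust policy that lets the FIS service assume the experiment role.
FIS_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "fis.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Serialized form, suitable for the AssumeRolePolicyDocument of IAM CreateRole.
trust_policy_json = json.dumps(FIS_TRUST_POLICY, indent=2)
print(trust_policy_json)
```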
## Error Handling
- **Resource Not Found**: Experiments skip missing resources
- **Permission Denied**: Clear error messages with required permissions
- **Invalid Duration**: Durations are converted automatically to the ISO 8601 `PT` format that AWS FIS expects (for example, `PT10M`)
- **Network Issues**: Configurable timeouts and retries (300s read, 60s connect, 3 retries)
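The timeout and retry values listed above map naturally onto botocore's client configuration. The sketch below shows one way to express them; the names `CLIENT_CONFIG_KWARGS` and `make_fis_client` are hypothetical, and whether the server configures its clients this way is an assumption.

```python
# Values taken from the list above: 300s read, 60s connect, 3 retries.
CLIENT_CONFIG_KWARGS = {
    "read_timeout": 300,
    "connect_timeout": 60,
    "retries": {"max_attempts": 3},
}


def make_fis_client(region: str):
    """Create an FIS client with the timeouts and retry limit applied."""
    # Imported lazily so the settings above can be inspected without boto3 installed.
    import boto3
    from botocore.config import Config

    return boto3.client("fis", region_name=region, config=Config(**CLIENT_CONFIG_KWARGS))
```

Calling `make_fis_client("us-west-2")` requires boto3 and resolvable AWS credentials.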
## Safety Features
- **Dry Run Mode**: Preview targets before execution
- **Auto Recovery**: Most experiments include automatic recovery
- **Resource Validation**: Verify resources exist before targeting
- **Region Isolation**: Experiments are region-specific
- **Tag Validation**: Ensure exact tag matches to prevent accidental targeting
## Examples
### Chaos Engineering Scenarios
```python
# Test EKS cluster resilience
tag_based_failure(
    tag_key="eks:cluster-name",
    tag_value="production-cluster",
    duration="5m"
)

# Simulate database failover
db_failure(
    db_identifier="prod-aurora-cluster",
    failure_type="failover"
)

# Test multi-AZ application resilience
az_failure(availability_zone="us-west-2a")

# Validate auto-scaling behavior
tag_based_failure(
    tag_key="Environment",
    tag_value="staging",
    duration="10m"
)

# Test Kafka cluster resilience
msk_failure(cluster_name="event-streaming-cluster")
```
## Monitoring
Use `get_experiment_status()` to monitor experiment progress:
```python
# Start experiment
result = tag_based_failure(tag_key="Team", tag_value="platform")
experiment_id = result.content[0].text # Extract experiment ID
# Monitor progress
status = get_experiment_status(experiment_id=experiment_id)
```
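A single status check only shows a snapshot, so longer experiments usually need polling. The sketch below is one way to wait for a terminal state. `fetch_state` is a hypothetical callable that you would implement by calling `get_experiment_status` and extracting the state from its response (the response format is not documented here), and the terminal-state names follow those reported by AWS FIS.

```python
import time
from typing import Callable

# States after which an FIS experiment no longer changes.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(
    fetch_state: Callable[[], str],
    poll_seconds: float = 15.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_state() until it returns a terminal state or the timeout passes."""
    waited = 0.0
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"experiment still {state!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter exists so the loop can be tested without real delays; in normal use the default `time.sleep` applies.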
## Contributing
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request
## License
MIT License - see LICENSE file for details.
## Package Metadata
- **Name**: failure-invoker-mcp
- **Version**: 1.1.0
- **Summary**: Invoke mock AZ, DB, and MSK failures. Internally uses AWS FIS and AWS SSM.
- **Author**: Hyeonggeun Oh (kandy@plaintexting.com)
- **License**: MIT
- **Requires Python**: >=3.10
- **Keywords**: aws, fis, chaos-engineering, mcp, fault-injection
- **Homepage / Repository**: https://github.com/Geun-Oh/failure-invoker-mcp
- **Issues**: https://github.com/Geun-Oh/failure-invoker-mcp/issues
- **Uploaded**: 2025-09-07
- **Files**: `failure_invoker_mcp-1.1.0-py3-none-any.whl` (15013 bytes), `failure_invoker_mcp-1.1.0.tar.gz` (14824 bytes)