failure-invoker-mcp


- **Name**: failure-invoker-mcp
- **Version**: 1.1.0
- **Summary**: Invoke mock AZ, DB, and MSK Failure. Internally use AWS FIS, AWS SSM.
- **Upload time**: 2025-09-07 06:25:20
- **Requires Python**: >=3.10
- **License**: MIT
- **Keywords**: aws, fis, chaos-engineering, mcp, fault-injection
- **Requirements**: No requirements were recorded.

# Failure Invoker MCP Server

A chaos engineering MCP server that runs failure injection experiments across multiple AWS services using AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) and AWS Systems Manager (SSM).

## Features

- **Multi-Service Support**: Target EC2, RDS, ECS, Lambda, ASG, ELB, EKS, and MSK
- **Tag-Based Targeting**: Flexible resource selection using AWS tags
- **Configurable Duration**: Control experiment duration with human-readable formats
- **Auto-Recovery**: Built-in recovery mechanisms for most services
- **Comprehensive Logging**: Detailed experiment tracking and status monitoring

## Supported AWS Services

| Service | Action | Recovery |
|---------|--------|----------|
| EC2 | Stop instances | Auto-restart after duration |
| RDS | Reboot/Failover | Automatic |
| ECS | Stop tasks | Service auto-recovery |
| Lambda | Error injection | Duration-based |
| ASG | Capacity errors | Duration-based |
| ELB | Unavailable state | Duration-based |
| EKS | Terminate nodes | Auto Scaling recovery |
| MSK | Restart brokers | Automatic |

## Installation

### MCP Configuration

```json
{
  "mcpServers": {
    "failure-invoker": {
      "command": "uvx",
      "args": ["failure-invoker-mcp@latest"],
      "env": {
        "AWS_REGION": "us-west-2",
        "AWS_ACCESS_KEY_ID": "your-access-key",
        "AWS_SECRET_ACCESS_KEY": "your-secret-key"
      }
    }
  }
}
```

### Strands Agent SDK

```python
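from mcp import StdioServerParameters
from mcp.client.stdio import stdio_client
from strands import Agent
from strands.tools.mcp import MCPClient

# Launch the server over stdio with uvx and pass the AWS settings through env.
failure_invoker_client = MCPClient(
    lambda: stdio_client(
        StdioServerParameters(
            command="uvx",
            args=["failure-invoker-mcp@latest"],
            env={
                "AWS_REGION": "us-west-2",
                "AWS_ACCESS_KEY_ID": "your-access-key",
                "AWS_SECRET_ACCESS_KEY": "your-secret-key"
            }
        )
    )
)

# Keep the client running for as long as the agent uses its tools.
failure_invoker_client.start()

# `model` and `system_prompt` are placeholders; define them for your setup.
agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=failure_invoker_client.list_tools_sync(),
)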
```

## Available Tools

### 1. `db_failure`

Execute database failure experiments on RDS instances or Aurora clusters.

**Parameters:**
- `db_identifier` (required): RDS instance or cluster identifier
- `failure_type` (optional): "reboot" or "failover" (default: "reboot")
- `region` (optional): AWS region (uses AWS_REGION env var if not specified)

**Examples:**
```python
# Reboot RDS instance
db_failure(db_identifier="my-database", failure_type="reboot")

# Failover Aurora cluster
db_failure(db_identifier="my-cluster", failure_type="failover", region="us-east-1")
```

### 2. `az_failure`

Execute availability zone failure experiments affecting all resources in a specific AZ.

**Parameters:**
- `availability_zone` (required): Target availability zone (e.g., "us-west-2a")
- `region` (optional): AWS region (uses AWS_REGION env var if not specified)

**Examples:**
```python
# Simulate AZ failure
az_failure(availability_zone="us-west-2a")

# Target specific region
az_failure(availability_zone="eu-west-1b", region="eu-west-1")
```

### 3. `msk_failure`

Execute failure experiments against Amazon MSK (Managed Streaming for Apache Kafka) clusters.

**Parameters:**
- `cluster_name` (required): MSK cluster name
- `region` (optional): AWS region (uses AWS_REGION env var if not specified)

**Examples:**
```python
# Restart MSK brokers
msk_failure(cluster_name="my-kafka-cluster")

# Target specific region
msk_failure(cluster_name="prod-kafka", region="us-east-1")
```

### 4. `tag_based_failure`

Execute failure experiments on all resources matching specified tags across multiple AWS services.

**Parameters:**
- `tag_key` (required): Tag key to search for
- `tag_value` (required): Tag value to match
- `duration` (optional): Duration of the failure (e.g., "60s", "10m", "2h", default: "10m")
- `region` (optional): AWS region (uses AWS_REGION env var if not specified)

**Examples:**
```python
# Target all resources with Environment=test tag
tag_based_failure(tag_key="Environment", tag_value="test", duration="5m")

# Target specific team's resources
tag_based_failure(tag_key="Team", tag_value="backend", duration="30s")

# Target EKS cluster nodes
tag_based_failure(
    tag_key="eks:cluster-name", 
    tag_value="my-cluster", 
    duration="2m"
)

# Target auto-scaling enabled resources
tag_based_failure(
    tag_key="k8s.io/cluster-autoscaler/enabled", 
    tag_value="true", 
    duration="1h"
)
```

### 5. `get_experiment_status`

Check the status of running or completed FIS experiments.

**Parameters:**
- `experiment_id` (optional): Specific experiment ID to check
- `region` (optional): AWS region (uses AWS_REGION env var if not specified)

**Examples:**
```python
# Get all recent experiments
get_experiment_status()

# Check specific experiment
get_experiment_status(experiment_id="EXP123456789")

# Check experiments in specific region
get_experiment_status(region="eu-west-1")
```

## Duration Format

The `duration` parameter accepts human-readable formats:
- `"30s"` - 30 seconds
- `"5m"` - 5 minutes  
- `"2h"` - 2 hours
- `"1h30m"` - 1 hour 30 minutes
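
AWS FIS expresses durations in ISO 8601 form (for example `PT10M`), so the human-readable strings above have to be translated before an experiment is created. The following is a minimal sketch of one way to do that. `to_iso8601_duration` is a hypothetical name, and the package's actual parser may accept more or fewer formats.

```python
import re

# Optional hours, minutes, and seconds, in that order, e.g. "1h30m" or "45s".
_DURATION = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def to_iso8601_duration(text: str) -> str:
    """Convert a string such as "1h30m" into an ISO 8601 duration such as "PT1H30M"."""
    match = _DURATION.fullmatch(text.strip().lower())
    if match is None or not any(match.groups()):
        raise ValueError(f"Unsupported duration: {text!r}")
    hours, minutes, seconds = match.groups()
    parts = []
    if hours:
        parts.append(f"{int(hours)}H")
    if minutes:
        parts.append(f"{int(minutes)}M")
    if seconds:
        parts.append(f"{int(seconds)}S")
    return "PT" + "".join(parts)
```

With this sketch, `"5m"` maps to `PT5M` and `"1h30m"` maps to `PT1H30M`, while strings without a recognized unit raise `ValueError`.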

## Resource Targeting Logic

### Tag-Based Targeting

The `tag_based_failure` tool searches every supported AWS service except MSK:

1. **EC2 Instances**: Uses describe-instances with tag filters
2. **RDS**: Queries all instances/clusters, then checks tags individually
3. **ECS**: Searches services across all clusters for matching tags
4. **Lambda**: Iterates through functions checking tags
5. **ASG**: Examines Auto Scaling Group tags
6. **ELB**: Checks Load Balancer tags
7. **EKS**: Searches Node Groups across all clusters
8. **MSK**: Not included in tag-based targeting (use `msk_failure` instead)
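
The two lookup styles in the list above (server-side filtering for EC2, client-side tag checks for services such as RDS) can be illustrated with two small helpers. Both names are hypothetical. The filter shape follows the `tag:<key>` convention of the EC2 `describe-instances` API, and the tag list shape (`Key`/`Value` pairs) is the one RDS returns; other services may return tags in a different structure.

```python
def ec2_tag_filter(tag_key: str, tag_value: str) -> list[dict]:
    """Build a Filters argument for EC2 describe-instances that matches one tag."""
    return [{"Name": f"tag:{tag_key}", "Values": [tag_value]}]


def has_exact_tag(tag_list: list[dict], tag_key: str, tag_value: str) -> bool:
    """Return True only if the list contains the exact key and value (case-sensitive)."""
    return any(
        tag.get("Key") == tag_key and tag.get("Value") == tag_value
        for tag in tag_list
    )


# Example: tags as they might be returned for one RDS instance.
sample_tags = [
    {"Key": "Environment", "Value": "test"},
    {"Key": "Team", "Value": "backend"},
]
matches = has_exact_tag(sample_tags, "Environment", "test")
```

Exact, case-sensitive comparison is a deliberate choice in this sketch: a resource tagged `Environment=Test` would not match `test`, which reduces the chance of targeting something unintended.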

### Failure Actions by Service

- **EC2**: Stop instances → Auto-restart after duration
- **RDS Instances**: Reboot → Automatic recovery
- **RDS Clusters**: Failover → Automatic recovery  
- **ECS**: Stop tasks → Service maintains desired count
- **Lambda**: Inject errors → Duration-based
- **ASG**: Insufficient capacity errors → Duration-based
- **ELB**: Mark unavailable → Duration-based
- **EKS**: Terminate 100% of nodes → Auto Scaling recovery
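
To make two of the mappings above concrete, the sketch below builds action entries in the shape that an AWS FIS experiment template expects. The action IDs and parameter names are standard FIS ones; the helper names and target labels are hypothetical, and whether this package builds its templates exactly this way is an assumption.

```python
def ec2_stop_action(duration_iso: str, target_name: str = "ec2-targets") -> dict:
    """Stop EC2 instances and let FIS start them again after the duration."""
    return {
        "actionId": "aws:ec2:stop-instances",
        "parameters": {"startInstancesAfterDuration": duration_iso},
        "targets": {"Instances": target_name},
    }


def eks_terminate_action(percentage: int = 100, target_name: str = "eks-targets") -> dict:
    """Terminate a percentage of the instances in the targeted EKS node groups."""
    if not 1 <= percentage <= 100:
        raise ValueError("percentage must be between 1 and 100")
    return {
        "actionId": "aws:eks:terminate-nodegroup-instances",
        # FIS action parameters are passed as strings.
        "parameters": {"instanceTerminationPercentage": str(percentage)},
        "targets": {"Nodegroups": target_name},
    }
```

The EC2 action carries its own recovery through `startInstancesAfterDuration`, whereas the EKS action relies on the node group's Auto Scaling group to replace terminated nodes.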

## Prerequisites

1. **AWS Credentials**: Configure via environment variables or AWS profiles
2. **IAM Permissions**: Ensure the following permissions:
   - `fis:*` - For AWS Fault Injection Service (FIS)
   - `ssm:*` - For Systems Manager (MSK experiments)
   - `ec2:*`, `rds:*`, `ecs:*`, `lambda:*`, `autoscaling:*`, `elasticloadbalancing:*`, `eks:*`, `kafka:*` - For resource discovery and targeting
3. **FIS Service Role**: Create an IAM role for FIS experiments with appropriate permissions
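
For the FIS service role in step 3, the role must at least trust the FIS service principal. The sketch below generates such a trust policy as JSON. `fis_trust_policy` is a hypothetical helper, and the permissions policy attached to the role (what FIS may do to your resources) is not shown because it depends on which services you target.

```python
import json


def fis_trust_policy() -> str:
    """Return a trust policy that lets AWS FIS assume the experiment role."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

The resulting JSON can be supplied as the trust relationship when creating the role in IAM.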

## Error Handling

- **Resource Not Found**: Experiments skip missing resources
- **Permission Denied**: Clear error messages with required permissions
- **Invalid Duration**: Human-readable durations are converted automatically to the ISO 8601 `PT` format that AWS FIS expects (for example, `"10m"` becomes `PT10M`)
- **Network Issues**: Configurable timeouts and retries (300s read, 60s connect, 3 retries)
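
The timeout and retry values listed above correspond to options that botocore's `Config` object accepts. The following sketch shows how a client could be configured with them. `NETWORK_SETTINGS` and `make_fis_client` are hypothetical names, and how the package actually exposes these settings is not documented here.

```python
# Values taken from the list above: 300s read, 60s connect, 3 retries.
NETWORK_SETTINGS = {
    "read_timeout": 300,
    "connect_timeout": 60,
    "retries": {"max_attempts": 3},
}


def make_fis_client(region: str):
    """Create an FIS client that uses the timeout and retry settings above."""
    # Imported lazily so the settings can be inspected without boto3 installed.
    import boto3
    from botocore.config import Config

    return boto3.client("fis", region_name=region, config=Config(**NETWORK_SETTINGS))
```

Keeping the settings in one dictionary makes it straightforward to reuse the same values for the other service clients the server needs.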

## Safety Features

- **Dry Run Mode**: Preview targets before execution
- **Auto Recovery**: Most experiments include automatic recovery
- **Resource Validation**: Verify resources exist before targeting
- **Region Isolation**: Experiments are region-specific
- **Tag Validation**: Ensure exact tag matches to prevent accidental targeting

## Examples

### Chaos Engineering Scenarios

```python
# Test EKS cluster resilience
tag_based_failure(
    tag_key="eks:cluster-name",
    tag_value="production-cluster",
    duration="5m"
)

# Simulate database failover
db_failure(
    db_identifier="prod-aurora-cluster",
    failure_type="failover"
)

# Test multi-AZ application resilience  
az_failure(availability_zone="us-west-2a")

# Validate auto-scaling behavior
tag_based_failure(
    tag_key="Environment",
    tag_value="staging", 
    duration="10m"
)

# Test Kafka cluster resilience
msk_failure(cluster_name="event-streaming-cluster")
```

## Monitoring

Use `get_experiment_status()` to monitor experiment progress:

```python
# Start experiment
result = tag_based_failure(tag_key="Team", tag_value="platform")
experiment_id = result.content[0].text  # Extract experiment ID

# Monitor progress
status = get_experiment_status(experiment_id=experiment_id)
```
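
A single status check only shows a snapshot, so monitoring usually means polling until the experiment reaches a terminal state. The sketch below assumes a caller-supplied `fetch_state` function that returns the current state as a lowercase string; that function, the helper name, and the default intervals are all hypothetical. The terminal state names are assumed from the AWS FIS experiment states and should be checked against the API you are using.

```python
import time
from typing import Callable

# Assumed terminal states for an FIS experiment.
TERMINAL_STATES = {"completed", "stopped", "failed"}


def wait_for_experiment(
    fetch_state: Callable[[], str],
    interval_s: float = 15.0,
    timeout_s: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_state until a terminal state is returned or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Experiment still in state {state!r} after {timeout_s}s")
        sleep(interval_s)
```

In practice `fetch_state` would wrap a call to `get_experiment_status` and extract the state from its response; the `sleep` parameter exists so the loop can be exercised in tests without waiting.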

## Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request

## License

MIT License - see LICENSE file for details.

            

## Project Links

- **Author**: Hyeonggeun Oh <kandy@plaintexting.com>
- **Homepage**: https://github.com/Geun-Oh/failure-invoker-mcp
- **Repository**: https://github.com/Geun-Oh/failure-invoker-mcp
- **Issues**: https://github.com/Geun-Oh/failure-invoker-mcp/issues

## Release Files

| Filename | Type | Size (bytes) | Uploaded (UTC) | SHA-256 |
|----------|------|--------------|----------------|---------|
| [failure_invoker_mcp-1.1.0-py3-none-any.whl](https://files.pythonhosted.org/packages/6b/1e/3a8a3e65b05f832c445348d0848245066c5e9644d6cd226aa411aeb42e96/failure_invoker_mcp-1.1.0-py3-none-any.whl) | bdist_wheel | 15013 | 2025-09-07T06:25:19 | d145afdf0092b153b0048aba1150a2024eb6a3d14b2fcdb955dda97e1355303e |
| [failure_invoker_mcp-1.1.0.tar.gz](https://files.pythonhosted.org/packages/20/1b/7707fb88479ff811391dd152ed212d03fe5dbe10673d3856d38051ae10f8/failure_invoker_mcp-1.1.0.tar.gz) | sdist | 14824 | 2025-09-07T06:25:20 | 07f89129221d30ca3e892eb840fed9530b710432d60453534c539c63997711f6 |