agentprobe

Name: agentprobe
Version: 0.1.3
Summary: Test how well AI agents interact with CLI tools
Upload time: 2025-07-31 16:36:37
Requires Python: >=3.10
License: MIT
Keywords: agents, ai, claude, cli, testing
# AgentProbe

[![Python](https://img.shields.io/badge/Python-3.10+-blue.svg?logo=python&logoColor=white)](https://python.org)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![GitHub Stars](https://img.shields.io/github/stars/nibzard/agentprobe.svg)](https://github.com/nibzard/agentprobe)
[![GitHub Issues](https://img.shields.io/github/issues/nibzard/agentprobe.svg)](https://github.com/nibzard/agentprobe/issues)

Test how well AI agents interact with your CLI tools. AgentProbe runs Claude Code against any command-line tool and provides actionable insights to improve Agent Experience (AX), helping CLI developers make their tools more AI-friendly.

<p align="center">
  <img src="assets/agentprobe.jpeg" alt="AgentProbe" width="100%">
</p>

## Quick Start

```bash
# No installation needed - run directly with uvx
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy

# Or install locally for development
uv sync
uv run agentprobe test vercel --scenario deploy
```

## Authentication

AgentProbe supports multiple authentication methods to avoid environment pollution:

### Get an OAuth Token

First, obtain your OAuth token using Claude Code:

```bash
claude setup-token
```

This will guide you through the OAuth flow and provide a token for authentication.

### Method 1: Token File (Recommended)
```bash
# Store token in a file (replace with your actual token from claude setup-token)
echo "your-oauth-token-here" > ~/.agentprobe-token

# Use with commands
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy --oauth-token-file ~/.agentprobe-token
```

### Method 2: Config Files
Create a config file in one of these locations (checked in priority order):

```bash
# Global user config (replace with your actual token from claude setup-token)
mkdir -p ~/.agentprobe
echo "your-oauth-token-here" > ~/.agentprobe/config

# Project-specific config (add to .gitignore)
echo "your-oauth-token-here" > .agentprobe
echo ".agentprobe" >> .gitignore

# Then run normally without additional flags
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy
```

### Method 3: Environment Variables (Legacy)
```bash
# Replace with your actual token from claude setup-token
export CLAUDE_CODE_OAUTH_TOKEN="your-oauth-token-here"
# Note: This may affect other Claude CLI processes
```

**Recommendation**: Use token files or config files for better process isolation.

## What It Does

AgentProbe launches Claude Code to test CLI tools and provides **Agent Experience (AX)** insights on:
- **AX Score** (A-F) based on turn count and success rate
- **CLI Friction Points** - specific issues that confuse agents
- **Actionable Improvements** - concrete changes to reduce agent friction
- **Real-time Progress** - see agent progress with live turn counts
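The actual A-F grade is produced by Claude-based analysis (see Architecture below), but as a rough mental model, here is a hedged sketch of how turn count and success rate could map to a grade. The thresholds are illustrative assumptions, not AgentProbe's real cutoffs:

```python
# Illustrative AX grading sketch. AgentProbe's real score comes from
# Claude-based trace analysis; these thresholds are assumptions only.
def ax_score(turns: int, success_rate: float, expected_max_turns: int = 8) -> str:
    """Grade a run: efficient, reliable runs score higher."""
    if success_rate >= 0.9 and turns <= expected_max_turns:
        return "A"
    if success_rate >= 0.8:
        return "B"
    if success_rate >= 0.6:
        return "C"
    if success_rate >= 0.4:
        return "D"
    return "F"
```

Under these assumed thresholds, 12 turns at an 80% success rate grades as a B, matching the example output further down.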

## Community Benchmark

Help us build a comprehensive benchmark of CLI tools! The table below shows how well Claude Code handles various CLIs.

| Tool | Scenarios | Passing | Failing | Success Rate | Last Updated |
|------|-----------|---------|---------|--------------|--------------|
| vercel | 9 | 7 | 2 | 77.8% | 2025-01-20 |
| gh | 1 | 1 | 0 | 100% | 2025-01-20 |
| docker | 1 | 1 | 0 | 100% | 2025-01-20 |

[View detailed results →](scenarios/RESULTS.md)

## Commands

### Test Individual Scenarios

```bash
# Test a specific scenario (with uvx)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr

# With authentication token file
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --oauth-token-file ~/.agentprobe-token

# Test multiple runs for consistency analysis
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy --runs 5

# With custom working directory
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --work-dir /path/to/project

# Show detailed trace with message debugging (disables progress indicators)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose
```

### Benchmark Tools

```bash
# Test all scenarios for one tool
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel

# Test all scenarios with authentication
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel --oauth-token-file ~/.agentprobe-token

# Test all available tools and scenarios
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all
```

### Reports

```bash
# Generate reports (future feature)
uv run agentprobe report --format markdown --output results.md
```

### Debugging and Verbose Output

The `--verbose` flag provides detailed insights into how Claude Code interacts with your CLI:

```bash
# Show full message trace with object types and attributes
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose
```

Verbose output includes:
- Message object types (SystemMessage, AssistantMessage, UserMessage, ResultMessage)
- Message content and tool usage
- SDK object attributes and debugging information
- Full conversation trace between Claude and your CLI

## Example Output

### Single Run (Default)
```
⠋ Agent running... (Turn 3, 12s)
╭───────────────────────────── AgentProbe Results ─────────────────────────────╮
│ Tool: vercel | Scenario: deploy                                               │
│ AX Score: B (12 turns, 80% success rate)                                      │
│                                                                               │
│ Agent Experience Summary:                                                     │
│ Agent completed deployment but needed extra turns due to unclear progress     │
│ feedback and ambiguous success indicators.                                    │
│                                                                               │
│ CLI Friction Points:                                                          │
│ • No progress feedback during build process                                   │
│ • Deployment URL returned before actual completion                            │
│ • Success status ambiguous ("building" vs "deployed")                        │
│                                                                               │
│ Top Improvements for CLI:                                                     │
│ 1. Add --status flag to check deployment progress                             │
│ 2. Include completion status in deployment output                             │
│ 3. Provide structured --json output for programmatic usage                    │
│                                                                               │
│ Expected turns: 5-8 | Duration: 23.4s | Cost: $0.012                         │
│                                                                               │
│ Use --verbose for full trace analysis                                         │
╰───────────────────────────────────────────────────────────────────────────────╯
```

### Multiple Runs (Aggregate)
```
╭──────────────────────── AgentProbe Aggregate Results ────────────────────────╮
│ Tool: vercel | Scenario: deploy                                               │
│ AX Score: C (14.2 avg turns, 60% success rate) | Runs: 5                      │
│                                                                               │
│ Consistency Analysis:                                                         │
│ • Turn variance: 8-22 turns                                                   │
│ • Success consistency: 60% of runs succeeded                                  │
│ • Agent confusion points: 18 total friction events                            │
│                                                                               │
│ Consistent CLI Friction Points:                                               │
│ • Permission errors lack clear remediation steps                              │
│ • No progress feedback during deployment                                      │
│ • Build failures don't suggest next steps                                     │
│                                                                               │
│ Priority Improvements for CLI:                                                │
│ 1. Add deployment status polling with vercel status                           │
│ 2. Include troubleshooting hints in error messages                            │
│ 3. Provide progress indicators during long operations                          │
│                                                                               │
│ Avg duration: 45.2s | Total cost: $0.156                                      │
╰───────────────────────────────────────────────────────────────────────────────╯
```
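The aggregate numbers above are simple to derive from per-run results. A minimal sketch, assuming a per-run result dict with `turns` and `success` fields (the field names are assumptions, not AgentProbe's actual result schema):

```python
# Sketch of the consistency statistics shown above, computed from per-run
# results. The {"turns", "success"} schema is an assumption for illustration.
from statistics import mean


def aggregate(runs: list[dict]) -> dict:
    """Summarize per-run results into avg turns, turn range, success rate."""
    turns = [r["turns"] for r in runs]
    return {
        "avg_turns": round(mean(turns), 1),
        "turn_range": (min(turns), max(turns)),
        "success_rate": sum(r["success"] for r in runs) / len(runs),
    }


# Five hypothetical runs reproducing the example: 14.2 avg turns,
# 8-22 turn range, 60% success.
runs = [
    {"turns": 8, "success": True},
    {"turns": 12, "success": True},
    {"turns": 14, "success": False},
    {"turns": 15, "success": True},
    {"turns": 22, "success": False},
]
stats = aggregate(runs)
```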

## Contributing Scenarios

We welcome scenario contributions! Help us test more CLI tools:

1. Fork this repository
2. Add your scenarios under `scenarios/<tool-name>/`
3. Run the tests and update the benchmark table
4. Submit a PR with your results

### Scenario Format

#### Simple Text Format
Create simple text files with clear prompts:

```
# scenarios/stripe/create-customer.txt
Create a new Stripe customer with email test@example.com and
add a test credit card. Return the customer ID.
```

#### Enhanced YAML Format (Recommended)
Use YAML frontmatter for better control and metadata:

```yaml
# scenarios/vercel/deploy-complex.txt
---
version: 2
created: 2025-01-22
tool: vercel
permission_mode: acceptEdits
allowed_tools: [Read, Write, Bash]
model: opus
max_turns: 15
complexity: complex
expected_turns: 8-12
description: "Production deployment with environment setup"
---
Deploy this Next.js application to production using Vercel CLI.
Configure production environment variables and ensure the deployment
is successful with proper domain configuration.
```

**YAML Frontmatter Options:**
- `model`: Override default model (`sonnet`, `opus`)
- `max_turns`: Limit agent interactions
- `permission_mode`: Set permissions (`acceptEdits`, `default`, `plan`, `bypassPermissions`)
- `allowed_tools`: Specify tools (`[Read, Write, Bash]`)
- `expected_turns`: Range for AX scoring comparison
- `complexity`: Scenario difficulty (`simple`, `medium`, `complex`)
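Splitting the frontmatter from the prompt body is straightforward. A hedged sketch of such a parser (AgentProbe's real loader parses full YAML, including lists like `allowed_tools`; this flat `key: value` version is for illustration only):

```python
# Illustrative frontmatter splitter. Real YAML (lists, nested keys) needs a
# YAML library such as PyYAML; this sketch handles flat `key: value` lines.
def parse_scenario(text: str) -> tuple[dict, str]:
    """Split a scenario file into (frontmatter dict, prompt body)."""
    if not text.startswith("---"):
        return {}, text.strip()  # plain-text scenario, no frontmatter
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()


meta, prompt = parse_scenario(
    "---\nmodel: opus\nmax_turns: 15\n---\nDeploy this Next.js application."
)
```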

### Running Benchmark Tests

```bash
# Test all scenarios for a tool
uv run agentprobe benchmark vercel

# Test all tools
uv run agentprobe benchmark --all

# Generate report (placeholder)
uv run agentprobe report --format markdown
```

## Architecture

AgentProbe follows a simple 4-component architecture:

1. **CLI Layer** (`cli.py`) - Typer-based command interface with progress indicators
2. **Runner** (`runner.py`) - Executes scenarios via Claude Code SDK with YAML frontmatter support
3. **Analyzer** (`analyzer.py`) - AI-powered analysis using Claude to identify friction points
4. **Reporter** (`reporter.py`) - AX-focused output for CLI developers

### Agent Experience (AX) Analysis

AgentProbe uses Claude itself to analyze agent interactions, providing:

- **Intelligent Analysis**: Claude analyzes execution traces to identify specific friction points
- **AX Scoring**: Automatic scoring based on turn efficiency and success patterns
- **Contextual Recommendations**: Actionable improvements tailored to each CLI tool
- **Consistency Tracking**: Multi-run analysis to identify systematic issues

This approach avoids hardcoded patterns and provides nuanced, tool-specific insights that help CLI developers understand exactly where their tools create friction for AI agents.

### Prompt Management & Versioning

AgentProbe uses externalized Jinja2 templates for analysis prompts:

- **Template-based Prompts**: Analysis prompts are stored in `prompts/analysis.jinja2` for easy editing and iteration
- **Version Tracking**: Each analysis includes prompt version metadata for reproducible results
- **Dynamic Variables**: Templates support contextual variables (scenario, tool, trace data)
- **Historical Comparison**: Version tracking enables comparing results across prompt iterations

```bash
# Prompt templates are automatically loaded from prompts/ directory
# Version information is tracked in prompts/metadata.json
# Analysis results include prompt_version field for tracking
```
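Rendering such a template with contextual variables looks roughly like this. The template text and variable names below are assumptions for illustration; the real template lives in `prompts/analysis.jinja2`:

```python
# Illustrative Jinja2 rendering of an analysis prompt. The template string
# and variables (tool, scenario, trace) are assumptions, not the contents
# of prompts/analysis.jinja2.
from jinja2 import Template

template = Template(
    "Analyze this {{ tool }} run for scenario '{{ scenario }}'. "
    "The trace has {{ trace | length }} messages."
)
prompt = template.render(
    tool="vercel",
    scenario="deploy",
    trace=["$ vercel deploy", "Error: not logged in"],
)
```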

## Requirements

- Python 3.10+
- uv package manager
- Claude Code SDK (automatically installed)

## Key Features

### 🎯 Agent Experience (AX) Focus
- **AX Scores** (A-F) based on turn efficiency and success rate
- **Friction Point Analysis** identifies specific CLI pain points
- **Actionable Recommendations** for CLI developers

### 📊 Progress & Feedback
- **Real-time Progress** with live turn count and elapsed time
- **Consistency Analysis** across multiple runs
- **Expected vs Actual** turn comparison using YAML metadata

### 🔧 Advanced Scenario Control
- **YAML Frontmatter** for model selection, permissions, turn limits
- **Multiple Authentication** methods with process isolation
- **Flexible Tool Configuration** per scenario

## Available Scenarios

Current test scenarios included:

- **GitHub CLI** (`gh/`)
  - `create-pr.txt` - Create pull requests
- **Vercel** (`vercel/`)
  - `deploy.txt` - Deploy applications to production
  - `preview-deploy.txt` - Deploy to preview environment
  - `init-project.txt` - Initialize new project with template
  - `env-setup.txt` - Configure environment variables
  - `list-deployments.txt` - List recent deployments
  - `domain-setup.txt` - Add custom domain configuration
  - `rollback.txt` - Rollback to previous deployment
  - `logs.txt` - View deployment logs
  - `build-local.txt` - Build project locally
  - `ax-test.txt` - Simple version check (AX demo)
  - `yaml-options-test.txt` - YAML frontmatter demo
- **Docker** (`docker/`)
  - `run-nginx.txt` - Run nginx containers
- **Wrangler (Cloudflare)** (`wrangler/`)
  - Multiple deployment and development scenarios

[Browse all scenarios →](scenarios/)

## Development

```bash
# Install with dev dependencies
uv sync --extra dev

# Format code
uv run black src/

# Lint code
uv run ruff check src/

# Run tests (when implemented)
uv run pytest
```

See [TASKS.md](TASKS.md) for the development roadmap and task tracking.

## Programmatic Usage

```python
import asyncio
from agentprobe import test_cli

async def main():
    result = await test_cli("gh", "create-pr")
    print(f"Success: {result['success']}")
    print(f"Duration: {result['duration_seconds']}s")
    print(f"Cost: ${result['cost_usd']:.3f}")

asyncio.run(main())
```

## License

MIT
            
