| Field | Value |
|-------|-------|
| Name | agentprobe |
| Version | 0.1.3 |
| Summary | Test how well AI agents interact with CLI tools |
| Upload time | 2025-07-31 16:36:37 |
| Requires Python | >=3.10 |
| License | MIT |
| Keywords | agents, ai, claude, cli, testing |
# AgentProbe
[Python 3.10+](https://python.org)
[License: MIT](https://opensource.org/licenses/MIT)
[GitHub](https://github.com/nibzard/agentprobe)
[Issues](https://github.com/nibzard/agentprobe/issues)
Test how well AI agents interact with your CLI tools. AgentProbe runs Claude Code against any command-line tool and provides actionable insights to improve Agent Experience (AX) - helping CLI developers make their tools more AI-friendly.
<p align="center">
<img src="assets/agentprobe.jpeg" alt="AgentProbe" width="100%">
</p>
## Quick Start
```bash
# No installation needed - run directly with uvx
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy
# Or install locally for development
uv sync
uv run agentprobe test vercel --scenario deploy
```
## Authentication
AgentProbe supports multiple authentication methods to avoid environment pollution:
### Get an OAuth Token
First, obtain your OAuth token using Claude Code:
```bash
claude setup-token
```
This will guide you through the OAuth flow and provide a token for authentication.
### Method 1: Token File (Recommended)
```bash
# Store token in a file (replace with your actual token from claude setup-token)
echo "your-oauth-token-here" > ~/.agentprobe-token
# Use with commands
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy --oauth-token-file ~/.agentprobe-token
```
### Method 2: Config Files
Create a config file in one of these locations (checked in priority order):
```bash
# Global user config (replace with your actual token from claude setup-token)
mkdir -p ~/.agentprobe
echo "your-oauth-token-here" > ~/.agentprobe/config
# Project-specific config (add to .gitignore)
echo "your-oauth-token-here" > .agentprobe
echo ".agentprobe" >> .gitignore
# Then run normally without additional flags
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy
```
### Method 3: Environment Variables (Legacy)
```bash
# Replace with your actual token from claude setup-token
export CLAUDE_CODE_OAUTH_TOKEN="your-oauth-token-here"
# Note: This may affect other Claude CLI processes
```
**Recommendation**: Use token files or config files for better process isolation.
## What It Does
AgentProbe launches Claude Code to test CLI tools and provides **Agent Experience (AX)** insights on:
- **AX Score** (A-F) based on turn count and success rate
- **CLI Friction Points** - specific issues that confuse agents
- **Actionable Improvements** - concrete changes to reduce agent friction
- **Real-time Progress** - see agent progress with live turn counts
## Community Benchmark
Help us build a comprehensive benchmark of CLI tools! The table below shows how well Claude Code handles various CLIs.
| Tool | Scenarios | Passing | Failing | Success Rate | Last Updated |
|------|-----------|---------|---------|--------------|--------------|
| vercel | 9 | 7 | 2 | 77.8% | 2025-01-20 |
| gh | 1 | 1 | 0 | 100% | 2025-01-20 |
| docker | 1 | 1 | 0 | 100% | 2025-01-20 |
[View detailed results →](scenarios/RESULTS.md)
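The success-rate column is simply passing scenarios over total scenarios; a one-line helper (hypothetical, for illustration) reproduces the table's figures:

```python
def success_rate(passing: int, total: int) -> float:
    """Percent of scenarios passing, rounded to one decimal as in the table."""
    return round(100 * passing / total, 1)
```

For example, `success_rate(7, 9)` gives the 77.8% shown for vercel.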
## Commands
### Test Individual Scenarios
```bash
# Test a specific scenario (with uvx)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr
# With authentication token file
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --oauth-token-file ~/.agentprobe-token
# Test multiple runs for consistency analysis
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy --runs 5
# With custom working directory
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --work-dir /path/to/project
# Show detailed trace with message debugging (disables progress indicators)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose
```
### Benchmark Tools
```bash
# Test all scenarios for one tool
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel
# Test all scenarios with authentication
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel --oauth-token-file ~/.agentprobe-token
# Test all available tools and scenarios
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all
```
### Reports
```bash
# Generate reports (future feature)
uv run agentprobe report --format markdown --output results.md
```
### Debugging and Verbose Output
The `--verbose` flag provides detailed insights into how Claude Code interacts with your CLI:
```bash
# Show full message trace with object types and attributes
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose
```
Verbose output includes:
- Message object types (SystemMessage, AssistantMessage, UserMessage, ResultMessage)
- Message content and tool usage
- SDK object attributes and debugging information
- Full conversation trace between Claude and your CLI
## Example Output
### Single Run (Default)
```
⠋ Agent running... (Turn 3, 12s)
╭──────────────────────────── AgentProbe Results ─────────────────────────────╮
│ Tool: vercel | Scenario: deploy                                             │
│ AX Score: B (12 turns, 80% success rate)                                    │
│                                                                             │
│ Agent Experience Summary:                                                   │
│ Agent completed deployment but needed extra turns due to unclear progress   │
│ feedback and ambiguous success indicators.                                  │
│                                                                             │
│ CLI Friction Points:                                                        │
│ • No progress feedback during build process                                 │
│ • Deployment URL returned before actual completion                          │
│ • Success status ambiguous ("building" vs "deployed")                       │
│                                                                             │
│ Top Improvements for CLI:                                                   │
│ 1. Add --status flag to check deployment progress                           │
│ 2. Include completion status in deployment output                           │
│ 3. Provide structured --json output for programmatic usage                  │
│                                                                             │
│ Expected turns: 5-8 | Duration: 23.4s | Cost: $0.012                        │
│                                                                             │
│ Use --verbose for full trace analysis                                       │
╰─────────────────────────────────────────────────────────────────────────────╯
```
### Multiple Runs (Aggregate)
```
╭─────────────────────── AgentProbe Aggregate Results ────────────────────────╮
│ Tool: vercel | Scenario: deploy                                             │
│ AX Score: C (14.2 avg turns, 60% success rate) | Runs: 5                    │
│                                                                             │
│ Consistency Analysis:                                                       │
│ • Turn variance: 8-22 turns                                                 │
│ • Success consistency: 60% of runs succeeded                                │
│ • Agent confusion points: 18 total friction events                          │
│                                                                             │
│ Consistent CLI Friction Points:                                             │
│ • Permission errors lack clear remediation steps                            │
│ • No progress feedback during deployment                                    │
│ • Build failures don't suggest next steps                                   │
│                                                                             │
│ Priority Improvements for CLI:                                              │
│ 1. Add deployment status polling with vercel status                         │
│ 2. Include troubleshooting hints in error messages                          │
│ 3. Provide progress indicators during long operations                       │
│                                                                             │
│ Avg duration: 45.2s | Total cost: $0.156                                    │
╰─────────────────────────────────────────────────────────────────────────────╯
```
## Contributing Scenarios
We welcome scenario contributions! Help us test more CLI tools:
1. Fork this repository
2. Add your scenarios under `scenarios/<tool-name>/`
3. Run the tests and update the benchmark table
4. Submit a PR with your results
### Scenario Format
#### Simple Text Format
Create simple text files with clear prompts:
```
# scenarios/stripe/create-customer.txt
Create a new Stripe customer with email test@example.com and
add a test credit card. Return the customer ID.
```
#### Enhanced YAML Format (Recommended)
Use YAML frontmatter for better control and metadata:
```yaml
# scenarios/vercel/deploy-complex.txt
---
version: 2
created: 2025-01-22
tool: vercel
permission_mode: acceptEdits
allowed_tools: [Read, Write, Bash]
model: opus
max_turns: 15
complexity: complex
expected_turns: 8-12
description: "Production deployment with environment setup"
---
Deploy this Next.js application to production using Vercel CLI.
Configure production environment variables and ensure the deployment
is successful with proper domain configuration.
```
**YAML Frontmatter Options:**
- `model`: Override default model (`sonnet`, `opus`)
- `max_turns`: Limit agent interactions
- `permission_mode`: Set permissions (`acceptEdits`, `default`, `plan`, `bypassPermissions`)
- `allowed_tools`: Specify tools (`[Read, Write, Bash]`)
- `expected_turns`: Range for AX scoring comparison
- `complexity`: Scenario difficulty (`simple`, `medium`, `complex`)
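A minimal sketch of how such a frontmatter header can be split from the prompt body. AgentProbe's actual loader presumably uses a real YAML parser; this simplified version handles only flat `key: value` options, and the name `load_scenario` is chosen for illustration:

```python
def load_scenario(text: str) -> tuple[dict, str]:
    """Split an optional '---'-delimited frontmatter header from the prompt body."""
    if text.startswith("---"):
        header, _, body = text[3:].partition("\n---")
        meta = {}
        for line in header.strip().splitlines():
            # Flat "key: value" pairs only; a real YAML parser handles nesting, lists, etc.
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
        return meta, body.strip()
    # No frontmatter: the whole file is the prompt (simple text format)
    return {}, text.strip()
```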
### Running Benchmark Tests
```bash
# Test all scenarios for a tool
uv run agentprobe benchmark vercel
# Test all tools
uv run agentprobe benchmark --all
# Generate report (placeholder)
uv run agentprobe report --format markdown
```
## Architecture
AgentProbe follows a simple 4-component architecture:
1. **CLI Layer** (`cli.py`) - Typer-based command interface with progress indicators
2. **Runner** (`runner.py`) - Executes scenarios via Claude Code SDK with YAML frontmatter support
3. **Analyzer** (`analyzer.py`) - AI-powered analysis using Claude to identify friction points
4. **Reporter** (`reporter.py`) - AX-focused output for CLI developers
### Agent Experience (AX) Analysis
AgentProbe uses Claude itself to analyze agent interactions, providing:
- **Intelligent Analysis**: Claude analyzes execution traces to identify specific friction points
- **AX Scoring**: Automatic scoring based on turn efficiency and success patterns
- **Contextual Recommendations**: Actionable improvements tailored to each CLI tool
- **Consistency Tracking**: Multi-run analysis to identify systematic issues
This approach avoids hardcoded patterns and provides nuanced, tool-specific insights that help CLI developers understand exactly where their tools create friction for AI agents.
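As an illustration of the scoring idea, turn efficiency (expected vs. actual turns) and success rate can be blended into a letter grade. The equal weighting and cutoffs below are assumptions chosen to reproduce the example outputs in this README, not AgentProbe's actual formula:

```python
def ax_score(turns: float, success_rate: float, expected_max: int) -> str:
    """Blend turn efficiency and success rate into an A-F grade (illustrative)."""
    # Efficiency: 1.0 when the agent stays within the expected turn budget
    efficiency = min(expected_max / turns, 1.0) if turns else 0.0
    combined = 0.5 * efficiency + 0.5 * success_rate  # equal weighting (assumed)
    for grade, cutoff in [("A", 0.85), ("B", 0.7), ("C", 0.55), ("D", 0.4)]:
        if combined >= cutoff:
            return grade
    return "F"
```

With the single-run example above (12 turns, 80% success, expected max 8 turns) this yields B, and the aggregate example (14.2 avg turns, 60% success) yields C.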
### Prompt Management & Versioning
AgentProbe uses externalized Jinja2 templates for analysis prompts:
- **Template-based Prompts**: Analysis prompts are stored in `prompts/analysis.jinja2` for easy editing and iteration
- **Version Tracking**: Each analysis includes prompt version metadata for reproducible results
- **Dynamic Variables**: Templates support contextual variables (scenario, tool, trace data)
- **Historical Comparison**: Version tracking enables comparing results across prompt iterations
```bash
# Prompt templates are automatically loaded from prompts/ directory
# Version information is tracked in prompts/metadata.json
# Analysis results include prompt_version field for tracking
```
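The versioning idea can be sketched with the standard library. The real implementation renders `prompts/analysis.jinja2` with Jinja2 and reads version metadata from `prompts/metadata.json`; here `str.format` placeholders and a content hash stand in, and `render_analysis_prompt` is a hypothetical name:

```python
import hashlib


def render_analysis_prompt(template: str, variables: dict) -> dict:
    """Render a prompt template and attach a version tag derived from its content."""
    # Stand-in for the Jinja2 rendering step (scenario, tool, trace data, ...)
    prompt = template.format(**variables)
    # Hash the template text so results from different prompt iterations are comparable
    version = hashlib.sha256(template.encode()).hexdigest()[:8]
    return {"prompt": prompt, "prompt_version": version}
```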
## Requirements
- Python 3.10+
- uv package manager
- Claude Code SDK (automatically installed)
## Key Features
### 🎯 Agent Experience (AX) Focus
- **AX Scores** (A-F) based on turn efficiency and success rate
- **Friction Point Analysis** identifies specific CLI pain points
- **Actionable Recommendations** for CLI developers
### 📊 Progress & Feedback
- **Real-time Progress** with live turn count and elapsed time
- **Consistency Analysis** across multiple runs
- **Expected vs Actual** turn comparison using YAML metadata
### 🔧 Advanced Scenario Control
- **YAML Frontmatter** for model selection, permissions, turn limits
- **Multiple Authentication** methods with process isolation
- **Flexible Tool Configuration** per scenario
## Available Scenarios
Current test scenarios included:
- **GitHub CLI** (`gh/`)
  - `create-pr.txt` - Create pull requests
- **Vercel** (`vercel/`)
  - `deploy.txt` - Deploy applications to production
  - `preview-deploy.txt` - Deploy to preview environment
  - `init-project.txt` - Initialize new project with template
  - `env-setup.txt` - Configure environment variables
  - `list-deployments.txt` - List recent deployments
  - `domain-setup.txt` - Add custom domain configuration
  - `rollback.txt` - Rollback to previous deployment
  - `logs.txt` - View deployment logs
  - `build-local.txt` - Build project locally
  - `ax-test.txt` - Simple version check (AX demo)
  - `yaml-options-test.txt` - YAML frontmatter demo
- **Docker** (`docker/`)
  - `run-nginx.txt` - Run nginx containers
- **Wrangler (Cloudflare)** (`wrangler/`)
  - Multiple deployment and development scenarios
[Browse all scenarios →](scenarios/)
## Development
```bash
# Install with dev dependencies
uv sync --extra dev
# Format code
uv run black src/
# Lint code
uv run ruff check src/
# Run tests (when implemented)
uv run pytest
```
See [TASKS.md](TASKS.md) for the development roadmap and task tracking.
## Programmatic Usage
```python
import asyncio
from agentprobe import test_cli


async def main():
    result = await test_cli("gh", "create-pr")
    print(f"Success: {result['success']}")
    print(f"Duration: {result['duration_seconds']}s")
    print(f"Cost: ${result['cost_usd']:.3f}")


asyncio.run(main())
```
## License
MIT