```
              __  __                    __
  ____ ______/ /_/ /_  ___  ____  _____/ /_
 / __ `/ ___/ __/ __ \/ _ \/ __ \/ ___/ __ \
/ /_/ / /__/ /_/ /_/ /  __/ / / / /__/ / / /
\__,_/\___/\__/_.___/\___/_/ /_/\___/_/ /_/
```
[PyPI](https://pypi.org/project/actbench/)
[License: MIT](https://opensource.org/licenses/MIT)
[Python 3.12+](https://www.python.org/downloads/release/python-3120/)
## Overview
**actbench** is an extensible framework for evaluating the performance and capabilities of web automation agents and LAM (Large Action Model) systems.
## Installing the actbench CLI
**actbench** requires Python 3.12 or higher. We recommend using `pipx` for a clean, isolated installation:
```bash
pipx install actbench
```
## Usage
### 1. Setting API Keys
Before running benchmarks, you need to set API keys for the agents you want to use.
```bash
actbench set-key --agent raccoonai
```
You can list the supported agents and check which API keys are stored:
```bash
actbench agents list
```
### 2. Listing Available Tasks
**actbench** provides a [built-in dataset](https://github.com/raccoonaihq/actbench/blob/master/dataset.jsonl) of web automation tasks, created by merging and refining tasks from the [WebArena](https://github.com/web-arena-x/webarena/blob/main/config_files/test.raw.json) and [WebVoyager](https://github.com/MinorJerry/WebVoyager/blob/main/data/WebVoyager_data.jsonl) datasets. Duplicate tasks have been removed, and queries have been updated to reflect current information. To see how a task was modified, trace its ID back to the original dataset for a side-by-side comparison.
To list all available tasks, run:
```bash
actbench tasks list
```
### 3. Running Benchmarks
The `run` command is the heart of **actbench**: it executes selected tasks against one or more agents.
#### Basic Usage
```bash
actbench run --agent raccoonai --task 256 --task 424
```
This command runs tasks with IDs `256` and `424` using the `raccoonai` agent.
#### Running All Tasks
```bash
actbench run --agent raccoonai --all-tasks
```
This runs all available tasks using the `raccoonai` agent.
#### Running Random Tasks
```bash
actbench run --agent raccoonai --random 5
```
This runs a random sample of 5 tasks using the `raccoonai` agent.
#### Running with All Agents
```bash
actbench run --all-agents --all-tasks
```
This runs all tasks with all configured agents (for which API keys are stored).
#### Controlling Parallelism
```bash
actbench run --agent raccoonai --all-tasks --parallel 4
```
This runs all tasks using the `raccoonai` agent, executing up to 4 tasks concurrently.
#### Setting a Rate Limit
```bash
actbench run --agent raccoonai --all-tasks --rate-limit 0.5
```
This adds a 0.5-second delay between task submissions.
#### Disabling Scoring
```bash
actbench run --agent raccoonai --all-tasks --no-scoring
```
This disables LLM-powered scoring; all tasks receive a score of -1.
#### Combined Options
You can combine these options for more complex benchmark configurations:
```bash
actbench run --agent raccoonai --agent anotheragent --task 1 --task 2 --random 3 --parallel 2 --rate-limit 0.2
```
This command runs tasks 1 and 2, plus 3 random tasks, using both `raccoonai` and `anotheragent` (assuming API keys are set), with a parallelism of 2 and a rate limit of 0.2 seconds.
### 4. Viewing Results
The `results` command group allows you to manage and view benchmark results.
#### Listing Results
```bash
actbench results list
```
You can filter results by agent or run ID:
```bash
actbench results list --agent raccoonai
actbench results list --run-id <run_id>
```
#### Exporting Results
You can export results to JSON or CSV files:
```bash
actbench results export --format json --output results.json
actbench results export --format csv --output results.csv --agent raccoonai
```
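Exported files are plain JSON or CSV, so results can be post-processed with standard tooling. As a rough sketch, the snippet below loads a JSON export and reports a mean score per agent; note that the `agent` and `score` field names are assumptions about the export schema, not documented fields:

```python
import json
from collections import defaultdict

# Load a JSON export produced by `actbench results export --format json`.
with open("results.json") as f:
    results = json.load(f)

# NOTE: the 'agent' and 'score' field names are assumptions about the
# export schema -- adjust them to match the actual file.
scores = defaultdict(list)
for entry in results:
    if entry["score"] != -1:  # -1 means the run used --no-scoring
        scores[entry["agent"]].append(entry["score"])

for agent, agent_scores in sorted(scores.items()):
    mean = sum(agent_scores) / len(agent_scores)
    print(f"{agent}: mean score {mean:.2f} across {len(agent_scores)} scored tasks")
```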
#### Command Reference
Here's a complete table of the `actbench` CLI commands, their flags (options), and what each does:
| Command | Flag(s) / Option(s) | Explanation |
|:-------------------------------|:-----------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------|
| `actbench run` | `--task` / `-t` | Specifies one or more task IDs to run. Can be used multiple times. If omitted, other task selection flags (`--random`, `--all-tasks`) must be used. |
| | `--agent` / `-a` | Specifies one or more agents to use. Can be used multiple times. If omitted, `--all-agents` must be used. |
| | `--random` / `-r` | Runs a specified number of random tasks. Takes an integer argument (e.g., `--random 5`). |
| | `--all-tasks` | Runs all available tasks. |
| | `--all-agents` | Runs with all configured agents (for which API keys have been set). |
| | `--parallel` / `-p` | Sets the number of tasks to run concurrently. Takes an integer argument (e.g., `--parallel 4`). Defaults to 1 (no parallelism). |
| | `--rate-limit` / `-l` | Sets the delay (in seconds) between task submissions. Takes a float argument (e.g., `--rate-limit 0.5`). Defaults to 0.1. |
| | `--no-scoring` / `-ns` | Disables LLM-based scoring. Results will have a score of -1. |
| `actbench tasks list` | *None* | Lists all available tasks in the dataset, showing their ID, query, URL, complexity, and whether they require login. |
| `actbench set-key` | `--agent` / `-a` | Sets the API key for a specified agent. Prompts the user to enter the key securely. Example: `actbench set-key --agent raccoonai` |
| `actbench agents list` | *None* | Lists all supported agents and shows which have API keys stored. |
| `actbench results list` | `--agent` / `-a` | Filters the results to show only those for a specific agent. |
| | `--run-id` / `-r` | Filters the results to show only those for a specific run ID. |
| `actbench results export` | `--agent` / `-a` | Filters the exported results to a specific agent. |
| | `--run-id` / `-r` | Filters the exported results to a specific run ID. |
| | `--format` / `-f` | Specifies the export format. Must be one of `json` or `csv`. Defaults to `json`. |
| | `--output` / `-o` | Specifies the output file path. Required. |
| `actbench` | *None* | Prints the help message for the CLI. |
| `actbench --version` | *None* | Prints the actbench version number. |
## Extending actbench
### Adding New Agents
1. **Create a new client class:** Create a new Python file in the `actbench/clients/` directory (e.g., `my_agent.py`).
2. **Implement the `BaseClient` interface:** Your class should inherit from `actbench.clients.BaseClient` and implement the `set_api_key()` and `run()` methods (a sketch follows this list).
3. **Register your client:** Add your client class to the `_CLIENT_REGISTRY` in `actbench/clients/__init__.py`.
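Putting these steps together, a hypothetical `my_agent.py` might look like the following. The method names come from step 2, but the exact signatures, return types, and registry shape are assumptions, so treat this as a sketch rather than a drop-in implementation:

```python
# actbench/clients/my_agent.py -- illustrative sketch; the actual BaseClient
# signatures and the run() return type may differ.
from actbench.clients import BaseClient


class MyAgentClient(BaseClient):
    """Hypothetical client for a web automation agent called 'my_agent'."""

    def set_api_key(self, api_key):
        # Keep the key around for authenticated requests.
        self._api_key = api_key

    def run(self, task):
        # Forward the task to your agent's API and return its output.
        # Replace this stub with a real HTTP call.
        raise NotImplementedError("wire this up to your agent's API")


# Step 3 (in actbench/clients/__init__.py) -- registry key assumed to be the
# agent name used on the CLI:
#   _CLIENT_REGISTRY["my_agent"] = MyAgentClient
```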
### Adding New Datasets
1. **Create a new dataset class:** Create a new Python file in the `actbench/datasets/` directory (e.g., `my_dataset.py`).
2. **Implement the `BaseDataset` interface:** Your class should inherit from `actbench.datasets.BaseDataset` and implement the `load_task_data()`, `get_all_task_ids()`, and `get_all_tasks()` methods (see the sketch after this list).
3. **Provide your dataset file:** Place your dataset file (e.g., `my_dataset.jsonl`) in the `src/actbench/dataset/` directory.
4. **Update `_DATASET_INSTANCE`**: If you want to use this dataset by default, update the `_DATASET_INSTANCE` variable in `src/actbench/datasets/__init__.py`.
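For example, a hypothetical `my_dataset.py` could back the three required methods with a JSONL file. The per-record `id` field, the file location, and the method signatures are assumptions about the actual schema:

```python
# actbench/datasets/my_dataset.py -- illustrative sketch; actual BaseDataset
# signatures may differ, and the 'id' field name is an assumption.
import json
from pathlib import Path

from actbench.datasets import BaseDataset

# Assumed location, mirroring step 3 above.
_DATASET_FILE = Path(__file__).resolve().parent.parent / "dataset" / "my_dataset.jsonl"


class MyDataset(BaseDataset):
    """Hypothetical dataset that loads tasks from a JSONL file."""

    def get_all_tasks(self):
        with _DATASET_FILE.open() as f:
            return [json.loads(line) for line in f if line.strip()]

    def get_all_task_ids(self):
        return [task["id"] for task in self.get_all_tasks()]

    def load_task_data(self, task_id):
        for task in self.get_all_tasks():
            if task["id"] == task_id:
                return task
        raise KeyError(f"unknown task id: {task_id}")
```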
### Adding New Evaluation Metrics
You can customize the evaluation process by modifying the `Evaluator` class in `actbench/executor/evaluator.py` or by creating a new evaluator and integrating it into the `TaskExecutor`.
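As a rough illustration of the second option, a custom evaluator might expose a single scoring method for the `TaskExecutor` to call per task. The interface below is assumed, not taken from the actbench source:

```python
# Hypothetical drop-in evaluator -- the real Evaluator in
# actbench/executor/evaluator.py may expose a different interface.
class ExactMatchEvaluator:
    """Scores 1.0 when the agent's output exactly matches the expected answer."""

    def evaluate(self, task, agent_output):
        # 'expected_answer' is an assumed task field, not a documented one.
        expected = str(task.get("expected_answer", "")).strip()
        return 1.0 if str(agent_output).strip() == expected else 0.0
```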
## Contributing
Contributions are welcome! Please follow these simple guidelines:
1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Write clear and concise code with appropriate comments.
4. Submit a pull request.