actbench

Name: actbench
Version: 0.0.1a5
Summary: A framework for evaluating web automation agents and LAM systems.
Author: Raccoon AI <team@flyingraccoon.tech>
Homepage: https://github.com/raccoonaihq/actbench
Requires Python: >=3.12
Keywords: AI, LAM systems, agent evaluation, benchmarking, web automation
Upload time: 2025-02-27 23:49:24
```
              __  __                    __  
  ____ ______/ /_/ /_  ___  ____  _____/ /_ 
 / __ `/ ___/ __/ __ \/ _ \/ __ \/ ___/ __ \
/ /_/ / /__/ /_/ /_/ /  __/ / / / /__/ / / /
\__,_/\___/\__/_.___/\___/_/ /_/\___/_/ /_/ 
                                  
```         
[![PyPI version](https://img.shields.io/pypi/v/actbench.svg?logo=pypi&logoColor=white&color=5d5fef&cacheSeconds=10)](https://pypi.org/project/actbench/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Requires: Python 3.12+](https://img.shields.io/badge/Python-3.12+-blue.svg)](https://www.python.org/downloads/release/python-3120/)

## Overview

**actbench** is an extensible framework for evaluating the performance and capabilities of web automation agents and LAM (Large Action Model) systems.


## Installing actbench CLI

**actbench** requires Python 3.12 or higher. We recommend using `pipx` for a clean, isolated installation:

```bash
pipx install actbench
```

## Usage

### 1. Setting API Keys

Before running benchmarks, you need to set API keys for the agents you want to use.

```bash
actbench set-key --agent raccoonai
```

You can list the supported agents and check which API keys are stored:

```bash
actbench agents list
```
### 2. Listing Available Tasks

**actbench** provides a [built-in dataset](https://github.com/raccoonaihq/actbench/blob/master/dataset.jsonl) of web automation tasks, created by merging and refining tasks from the [webarena](https://github.com/web-arena-x/webarena/blob/main/config_files/test.raw.json) and [webvoyager](https://github.com/MinorJerry/WebVoyager/blob/main/data/WebVoyager_data.jsonl) datasets. Duplicate tasks have been removed, and the queries have been updated to reflect current information. To see how a task was modified, trace its ID back to the original dataset for a side-by-side comparison.
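Because the dataset is plain JSONL (one JSON object per line), it can be inspected with a few lines of Python. The field names below are illustrative assumptions based on the columns that `actbench tasks list` displays, not a documented schema:

```python
import json

# Parse JSONL task data: one JSON object per line.
# Field names ("id", "query", "url") are assumptions for illustration only.
sample = "\n".join([
    '{"id": 256, "query": "Find the latest release notes", "url": "https://example.com"}',
    '{"id": 424, "query": "Log in and check notifications", "url": "https://example.org"}',
])

tasks = [json.loads(line) for line in sample.splitlines() if line.strip()]
task_ids = [t["id"] for t in tasks]
print(task_ids)  # [256, 424]
```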


To see all the tasks currently available, just run this command:

```bash
actbench tasks list
```

### 3. Running Benchmarks

The `run` command is the heart of **actbench**. It executes tasks against the agents you specify.

#### Basic Usage

```bash
actbench run --agent raccoonai --task 256 --task 424
```

This command runs tasks with IDs `256` and `424` using the `raccoonai` agent.

#### Running All Tasks

```bash
actbench run --agent raccoonai --all-tasks
```

This runs all available tasks using the `raccoonai` agent.

#### Running Random Tasks

```bash
actbench run --agent raccoonai --random 5
```

This runs a random sample of 5 tasks using the `raccoonai` agent.

#### Running with All Agents

```bash
actbench run --all-agents --all-tasks
```

This runs all tasks with all configured agents (for which API keys are stored).

#### Controlling Parallelism

```bash
actbench run --agent raccoonai --all-tasks --parallel 4
```

This runs all tasks using the `raccoonai` agent, executing up to 4 tasks concurrently.
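Conceptually, `--parallel` behaves like a bounded worker pool. A minimal Python sketch of that pattern (an illustration of the concept, not actbench's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id):
    # Placeholder for submitting one task to an agent.
    return f"task {task_id} done"

task_ids = [1, 2, 3, 4, 5, 6, 7, 8]

# At most 4 tasks are in flight at any moment, mirroring --parallel 4.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_task, task_ids))

print(results[0])  # task 1 done
```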

#### Setting a Rate Limit

```bash
actbench run --agent raccoonai --all-tasks --rate-limit 0.5
```
This adds a 0.5-second delay between task submissions.
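The effect of `--rate-limit` can be sketched as a fixed pause between submissions (again a conceptual illustration, not actbench's internals; `submit` is a placeholder):

```python
import time

def submit(task_id):
    # Placeholder for a single task submission.
    return task_id

rate_limit = 0.05  # seconds between submissions (0.5 in the example above)
submitted = []
start = time.monotonic()
for task_id in [1, 2, 3]:
    submitted.append(submit(task_id))
    time.sleep(rate_limit)  # pause before the next submission
elapsed = time.monotonic() - start

print(len(submitted))  # 3
```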

#### Disabling Scoring
```bash
actbench run --agent raccoonai --all-tasks --no-scoring
```
This disables the LLM-powered scoring; all tasks receive a score of -1.

#### Combined Options

You can combine these options for more complex benchmark configurations:

```bash
actbench run --agent raccoonai --agent anotheragent --task 1 --task 2 --random 3 --parallel 2 --rate-limit 0.2
```

This command runs tasks 1 and 2, plus 3 random tasks, using both `raccoonai` and `anotheragent` (assuming API keys are set), with a parallelism of 2 and a rate limit of 0.2 seconds.


### 4. Viewing Results

The `results` command group allows you to manage and view benchmark results.

#### Listing Results

```bash
actbench results list
```

You can filter results by agent or run ID:

```bash
actbench results list --agent raccoonai
actbench results list --run-id <run_id>
```

#### Exporting Results

You can export results to JSON or CSV files:

```bash
actbench results export --format json --output results.json
actbench results export --format csv --output results.csv --agent raccoonai
```



#### CLI Command Reference

The table below details each `actbench` CLI command, its flags (options), and what they do:

| Command                        | Flag(s) / Option(s)    | Explanation                                                                                                                                           |
|:-------------------------------|:-----------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------|
| `actbench run`                 | `--task` / `-t`        | Specifies one or more task IDs to run.  Can be used multiple times.  If omitted, other task selection flags (`--random`, `--all-tasks`) must be used. |
|                                | `--agent` / `-a`       | Specifies one or more agents to use. Can be used multiple times. If omitted, `--all-agents` must be used.                                             |
|                                | `--random` / `-r`      | Runs a specified number of random tasks.  Takes an integer argument (e.g., `--random 5`).                                                             |
|                                | `--all-tasks`          | Runs all available tasks.                                                                                                                             |
|                                | `--all-agents`         | Runs with all configured agents (for which API keys have been set).                                                                                   |
|                                | `--parallel` / `-p`    | Sets the number of tasks to run concurrently. Takes an integer argument (e.g., `--parallel 4`).  Defaults to 1 (no parallelism).                      |
|                                | `--rate-limit` / `-l`  | Sets the delay (in seconds) between task submissions.  Takes a float argument (e.g., `--rate-limit 0.5`). Defaults to 0.1.                            |
|                                | `--no-scoring` / `-ns` | Disables LLM-based scoring. Results will have a score of -1.                                                                                          |
| `actbench tasks list`          | *None*                 | Lists all available tasks in the dataset, showing their ID, query, URL, complexity, and whether they require login.                                   |
| `actbench set-key`             | `--agent` / `-a`       | Sets the API key for a specified agent.  Prompts the user to enter the key securely.  Example: `actbench set-key --agent raccoonai`                   |
| `actbench agents list`         | *None*                 | Lists all supported agents, and shows which agents have API Keys stored.                                                                              |
| `actbench results list`        | `--agent` / `-a`       | Filters the results to show only those for a specific agent.                                                                                          |
|                                | `--run-id` / `-r`      | Filters the results to show only those for a specific run ID.                                                                                         |
| `actbench results export`      | `--agent` / `-a`       | Filters the exported results to a specific agent.                                                                                                     |
|                                | `--run-id` / `-r`      | Filters the exported results to a specific run ID.                                                                                                    |
|                                | `--format` / `-f`      | Specifies the export format.  Must be one of `json` or `csv`. Defaults to `json`.                                                                     |
|                                | `--output` / `-o`      | Specifies the output file path.  Required.                                                                                                            |
| `actbench`                     | *None*                 | Prints the help message for the CLI.                                                                                                                  |
| `actbench --version`           | *None*                 | Prints the actbench version number.                                                                                                                   |


## Extending actbench

### Adding New Agents

1.  **Create a new client class:**  Create a new Python file in the `actbench/clients/` directory (e.g., `my_agent.py`).
2.  **Implement the `BaseClient` interface:**  Your class should inherit from `actbench.clients.BaseClient` and implement the `set_api_key()` and `run()` methods.
3.  **Register your client:**  Add your client class to the `_CLIENT_REGISTRY` in `actbench/clients/__init__.py`.
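The steps above can be sketched as follows. The `BaseClient` stand-in is an assumption about the interface shape (only the `set_api_key()` and `run()` method names come from the real `actbench.clients.BaseClient`), and `MyAgentClient` is hypothetical:

```python
from abc import ABC, abstractmethod

class BaseClient(ABC):
    """Stand-in for actbench.clients.BaseClient (interface assumed)."""

    @abstractmethod
    def set_api_key(self, api_key: str) -> None: ...

    @abstractmethod
    def run(self, task: dict) -> dict: ...

class MyAgentClient(BaseClient):
    """Hypothetical client, e.g. actbench/clients/my_agent.py."""

    def __init__(self):
        self._api_key = None

    def set_api_key(self, api_key: str) -> None:
        self._api_key = api_key

    def run(self, task: dict) -> dict:
        # A real client would call the agent's API here.
        return {"task_id": task.get("id"), "response": "stub"}

# Registration would then look roughly like:
# _CLIENT_REGISTRY["my_agent"] = MyAgentClient
```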

### Adding New Datasets

1.  **Create a new dataset class:** Create a new Python file in the `actbench/datasets/` directory (e.g., `my_dataset.py`).
2.  **Implement the `BaseDataset` interface:** Your class should inherit from `actbench.datasets.BaseDataset` and implement the `load_task_data()`, `get_all_task_ids()`, and `get_all_tasks()` methods.
3.  **Provide your dataset file:**  Place your dataset file (e.g., `my_dataset.jsonl`) in the `src/actbench/dataset/` directory.
4.  **Update `_DATASET_INSTANCE`**: If you want to use this dataset by default, update the `_DATASET_INSTANCE` variable in `src/actbench/datasets/__init__.py`.
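A minimal sketch of such a dataset class, assuming `BaseDataset` exposes only the three methods named above and that each JSONL record carries an `id` field (both are assumptions):

```python
import json
from abc import ABC, abstractmethod

class BaseDataset(ABC):
    """Stand-in for actbench.datasets.BaseDataset (interface assumed)."""

    @abstractmethod
    def load_task_data(self, task_id): ...

    @abstractmethod
    def get_all_task_ids(self): ...

    @abstractmethod
    def get_all_tasks(self): ...

class MyDataset(BaseDataset):
    """Hypothetical dataset backed by JSONL text; 'id' field is assumed."""

    def __init__(self, jsonl_text: str):
        self._tasks = {
            rec["id"]: rec
            for rec in (json.loads(line)
                        for line in jsonl_text.splitlines() if line.strip())
        }

    def load_task_data(self, task_id):
        return self._tasks[task_id]

    def get_all_task_ids(self):
        return list(self._tasks)

    def get_all_tasks(self):
        return list(self._tasks.values())

ds = MyDataset('{"id": 1, "query": "example"}')
print(ds.get_all_task_ids())  # [1]
```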

### Adding New Evaluation Metrics

You can customize the evaluation process by modifying the `Evaluator` class in `actbench/executor/evaluator.py` or by creating a new evaluator and integrating it into the `TaskExecutor`.
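Since the `Evaluator` interface is not documented here, the sketch below only illustrates the general shape of a pluggable scorer; the class and its `score()` method are hypothetical, not actbench's API:

```python
class KeywordEvaluator:
    """Hypothetical evaluator: scores by the fraction of expected
    keywords found in an agent's response. Not actbench's Evaluator API."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def score(self, response: str) -> float:
        text = response.lower()
        hits = sum(1 for k in self.keywords if k in text)
        return hits / len(self.keywords) if self.keywords else 0.0

evaluator = KeywordEvaluator(["checkout", "confirmed"])
print(evaluator.score("Order confirmed at checkout."))  # 1.0
```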

## Contributing

Contributions are welcome! Please follow these simple guidelines:

1.  Fork the repository.
2.  Create a new branch for your feature or bug fix.
3.  Write clear and concise code with appropriate comments.
4.  Submit a pull request.

            
