agent-eval

Name: agent-eval
Version: 0.1.24
Summary: Agent evaluation toolkit
Upload time: 2025-08-01 16:34:06
Home page: None
Author: None
Maintainer: None
Docs URL: None
License: None
Keywords: None
Requires Python: >=3.10
Requirements: No requirements were recorded.
Travis-CI: No Travis.
Coveralls test coverage: No coveralls.
# agent-eval

A utility for evaluating agents on a suite of [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai)-formatted evals, with the following primary benefits:
1. Task suite specifications as config.
2. Extraction of the agent's token usage from log files, with cost computed using `litellm`.
3. Submission of task suite results to a leaderboard, including submission metadata and easy upload to a HuggingFace repo for distributing scores and logs.

# Installation

To install from PyPI, use `pip install agent-eval`.

For leaderboard extras, use `pip install agent-eval[leaderboard]`.

# Usage

## Run evaluation suite
```shell
agenteval eval --config-path CONFIG_PATH --split SPLIT LOG_DIR
```
Evaluate an agent on the supplied eval suite configuration. Results are written to `agenteval.json` in the log directory. 

See [sample-config.yml](sample-config.yml) for a sample configuration file. 

For aggregation in a leaderboard, each task specifies a `primary_metric` in the form `{scorer_name}/{metric_name}`.
The scoring utilities locate the corresponding standard-error metric by searching for another metric with the same `scorer_name` whose `metric_name` contains the string "stderr".
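
As a hypothetical illustration of how a task and its primary metric might be declared (field names and nesting here are assumptions, not the actual schema; [sample-config.yml](sample-config.yml) is the authoritative reference):

```yaml
# Illustrative sketch only -- actual field names and structure are defined in sample-config.yml.
tasks:
  - name: example_task                       # hypothetical task name
    primary_metric: example_scorer/accuracy  # {scorer_name}/{metric_name}
    # A metric such as example_scorer/stderr (same scorer_name, with a
    # metric_name containing "stderr") would be treated as the matching
    # standard-error metric.
```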

### Weighted Macro Averaging with Tags

Tasks can be grouped using `tags` for computing summary statistics. The tags support weighted macro averaging, allowing you to assign different weights to tasks within a tag group.

Tags are specified as simple strings on tasks. To adjust weights for specific tag-task combinations, use the `macro_average_weight_adjustments` field at the split level. Tasks not specified in the adjustments default to a weight of 1.0.

See [sample-config.yml](sample-config.yml) for an example of the tag and weight adjustment format.
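
As a rough, hypothetical sketch of the shape this takes (names and structure are assumptions; the real format is in [sample-config.yml](sample-config.yml)):

```yaml
# Illustrative sketch only -- see sample-config.yml for the actual schema.
splits:
  - name: validation
    macro_average_weight_adjustments:   # split-level weight adjustments
      - tag: reasoning                  # tag group being adjusted
        task: hard_task                 # task within that tag group
        weight: 2.0                     # unlisted tag-task pairs default to 1.0
    tasks:
      - name: hard_task
        tags: [reasoning]
      - name: easy_task
        tags: [reasoning]               # keeps the default weight of 1.0
```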

## Score results 
```shell
agenteval score [OPTIONS] LOG_DIR
```
Compute scores for the results in `agenteval.json` and update the file with the computed scores.

## Publish scores to leaderboard
```shell
agenteval lb publish [OPTIONS] LOG_DIR
```
Upload the scored results to HuggingFace datasets.

## View leaderboard scores
```shell
agenteval lb view [OPTIONS]
```
View results from the leaderboard.

# Administer the leaderboard
Before publishing scores, set up two HuggingFace datasets: one for full submissions and one for results files.

If you want to call `load_dataset()` on the results dataset (e.g., for populating a leaderboard), you probably want to explicitly tell HuggingFace about the schema and dataset structure (otherwise, HuggingFace may fail to properly auto-convert to Parquet).
This is done by updating the `configs` attribute in the YAML metadata block at the top of the `README.md` file at the root of the results dataset (the metadata block is identified by lines with just `---` above and below it).
This attribute should contain a list of configs, each of which specifies the schema (under the `features` key) and dataset structure (under the `data_files` key).
See [sample-config-hf-readme-metadata.yml](sample-config-hf-readme-metadata.yml) for a sample metadata block corresponding to [sample-config.yml](sample-config.yml) (note that the metadata references the [raw schema data](src/agenteval/leaderboard/dataset_features.yml), which must be copied).
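
For orientation, such a metadata block might look roughly like the following (config name, features, and paths are placeholders; the real feature definitions must be copied from `dataset_features.yml`):

```yaml
# Hypothetical README.md metadata block -- copy the actual feature definitions
# from src/agenteval/leaderboard/dataset_features.yml rather than inventing them.
---
configs:
  - config_name: results        # illustrative config name
    features:                   # schema, copied from dataset_features.yml
      - name: agent_name
        dtype: string
      - name: score
        dtype: float64
    data_files:                 # dataset structure
      - split: validation       # illustrative split name
        path: validation/*.json # illustrative path pattern
---
```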

To facilitate initializing new configs, `agenteval lb publish` will automatically add this metadata if it is missing.

# Development

See [Development.md](Development.md) for development instructions.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "agent-eval",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/6e/68/6ea5b4ad5fca53a4683bcfb0ae787e7e2719b91b61ea24708ffe5faa4734/agent_eval-0.1.24.tar.gz",
    "platform": null,
    "description": "# agent-eval\n\nA utility for evaluating agents on a suite of [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai)-formatted evals, with the following primary benefits:\n1. Task suite specifications as config.\n2. Extracts the token usage of the agent from log files, and computes cost using `litellm`.\n3. Submits task suite results to a leaderboard, with submission metadata and easy upload to a HuggingFace repo for distribution of scores and logs.\n\n# Installation\n\nTo install from pypi, use `pip install agent-eval`.\n\nFor leaderboard extras, use `pip install agent-eval[leaderboard]`.\n\n# Usage\n\n## Run evaluation suite\n```shell\nagenteval eval --config-path CONFIG_PATH --split SPLIT LOG_DIR\n```\nEvaluate an agent on the supplied eval suite configuration. Results are written to `agenteval.json` in the log directory. \n\nSee [sample-config.yml](sample-config.yml) for a sample configuration file. \n\nFor aggregation in a leaderboard, each task specifies a `primary_metric` as `{scorer_name}/{metric_name}`. \nThe scoring utils will look for a corresponding stderr metric, \nby looking for another metric with the same `scorer_name` and with a `metric_name` containing the string \"stderr\".\n\n### Weighted Macro Averaging with Tags\n\nTasks can be grouped using `tags` for computing summary statistics. The tags support weighted macro averaging, allowing you to assign different weights to tasks within a tag group.\n\nTags are specified as simple strings on tasks. To adjust weights for specific tag-task combinations, use the `macro_average_weight_adjustments` field at the split level. Tasks not specified in the adjustments default to a weight of 1.0.\n\nSee [sample-config.yml](sample-config.yml) for an example of the tag and weight adjustment format.\n\n## Score results \n```shell\nagenteval score [OPTIONS] LOG_DIR\n```\nCompute scores for the results in `agenteval.json` and update the file with the computed scores.\n\n## Publish scores to leaderboard\n```shell\nagenteval lb publish [OPTIONS] LOG_DIR\n```\nUpload the scored results to HuggingFace datasets.\n\n## View leaderboard scores\n```shell\nagenteval lb view [OPTIONS]\n```\nView results from the leaderboard.\n\n# Administer the leaderboard\nPrior to publishing scores, two HuggingFace datasets should be set up, one for full submissions and one for results files.\n\nIf you want to call `load_dataset()` on the results dataset (e.g., for populating a leaderboard), you probably want to explicitly tell HuggingFace about the schema and dataset structure (otherwise, HuggingFace may fail to propertly auto-convert to Parquet).\nThis is done by updating the `configs` attribute in the YAML metadata block at the top of the `README.md` file at the root of the results dataset (the metadata block is identified by lines with just `---` above and below it).\nThis attribute should contain a list of configs, each of which specifies the schema (under the `features` key) and dataset structure (under the `data_files` key).\nSee [sample-config-hf-readme-metadata.yml](sample-config-hf-readme-metadata.yml) for a sample metadata block corresponding to [sample-comfig.yml](sample-config.yml) (note that the metadata references the [raw schema data](src/agenteval/leaderboard/dataset_features.yml), which must be copied).\n\nTo facilitate initializing new configs, `agenteval lb publish` will automatically add this metadata if it is missing.\n\n# Development\n\nSee [Development.md](Development.md) for development instructions.\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Agent evaluation toolkit",
    "version": "0.1.24",
    "project_urls": {
        "Homepage": "https://github.com/allenai/agent-eval"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2273c2fe8c90de28da6f5dca08a3d8d44b8b952777ac6e8c96d38f1a3c43cfc8",
                "md5": "39230e64e484e1dc733d9053b7b994cb",
                "sha256": "877c84dea7ea9d98e4bb72be0598dc56e0133f40ec5badda21df9d2dbc6a8518"
            },
            "downloads": -1,
            "filename": "agent_eval-0.1.24-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "39230e64e484e1dc733d9053b7b994cb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 29856,
            "upload_time": "2025-08-01T16:34:05",
            "upload_time_iso_8601": "2025-08-01T16:34:05.409643Z",
            "url": "https://files.pythonhosted.org/packages/22/73/c2fe8c90de28da6f5dca08a3d8d44b8b952777ac6e8c96d38f1a3c43cfc8/agent_eval-0.1.24-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6e686ea5b4ad5fca53a4683bcfb0ae787e7e2719b91b61ea24708ffe5faa4734",
                "md5": "850f3ba8f770aa4bc244548daa6bc423",
                "sha256": "3bafd9566d54467c6dcdefd0914d0b698e2a2eaffa8fe1aa6c31945a1f27b320"
            },
            "downloads": -1,
            "filename": "agent_eval-0.1.24.tar.gz",
            "has_sig": false,
            "md5_digest": "850f3ba8f770aa4bc244548daa6bc423",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 29140,
            "upload_time": "2025-08-01T16:34:06",
            "upload_time_iso_8601": "2025-08-01T16:34:06.366613Z",
            "url": "https://files.pythonhosted.org/packages/6e/68/6ea5b4ad5fca53a4683bcfb0ae787e7e2719b91b61ea24708ffe5faa4734/agent_eval-0.1.24.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-01 16:34:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "allenai",
    "github_project": "agent-eval",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "agent-eval"
}
        