Name | agent-eval |
Version | 0.1.24 |
home_page | None |
Summary | Agent evaluation toolkit |
upload_time | 2025-08-01 16:34:06 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.10 |
license | None |
keywords | |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# agent-eval
A utility for evaluating agents on a suite of [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai)-formatted evals, with the following primary benefits:
1. Specifies task suites as configuration files.
2. Extracts the agent's token usage from log files and computes cost using `litellm` (see the sketch after this list).
3. Submits task suite results to a leaderboard, with submission metadata and easy upload to a HuggingFace repo for distributing scores and logs.
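As an illustration of the cost step, here is a minimal sketch of turning per-model token counts into dollars with `litellm`; the model name and token counts are made-up examples, and this is not agent-eval's actual implementation.
```python
# Illustrative only: turning per-model token usage into a dollar cost with
# litellm. The model name and token counts are hypothetical examples.
from litellm import cost_per_token

usage = {"gpt-4o-2024-08-06": {"input_tokens": 12_345, "output_tokens": 2_210}}

total_cost = 0.0
for model, tokens in usage.items():
    prompt_cost, completion_cost = cost_per_token(
        model=model,
        prompt_tokens=tokens["input_tokens"],
        completion_tokens=tokens["output_tokens"],
    )
    total_cost += prompt_cost + completion_cost

print(f"estimated cost: ${total_cost:.4f}")
```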
# Installation
To install from PyPI, use `pip install agent-eval`.
For leaderboard extras, use `pip install agent-eval[leaderboard]`.
# Usage
## Run evaluation suite
```shell
agenteval eval --config-path CONFIG_PATH --split SPLIT LOG_DIR
```
Evaluate an agent on the supplied eval suite configuration. Results are written to `agenteval.json` in the log directory.
See [sample-config.yml](sample-config.yml) for a sample configuration file.
For aggregation in a leaderboard, each task specifies a `primary_metric` as `{scorer_name}/{metric_name}`.
The scoring utilities locate a corresponding standard-error metric by searching for another metric with the same `scorer_name` and a `metric_name` containing the string "stderr".
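To make the convention concrete, here is a rough sketch of that lookup; the flat `{scorer_name}/{metric_name}` dictionary and the names in it are illustrative assumptions, not agent-eval's internal data structures.
```python
# Illustrative sketch of the stderr-lookup convention described above.
# The flat "{scorer_name}/{metric_name}" mapping is an assumed layout,
# not agent-eval's internal representation.
def find_stderr_metric(metrics: dict[str, float], primary_metric: str) -> str | None:
    """Return the key of the stderr metric matching the primary metric's scorer."""
    scorer_name = primary_metric.split("/", 1)[0]
    for key in metrics:
        other_scorer, _, metric_name = key.partition("/")
        if other_scorer == scorer_name and "stderr" in metric_name:
            return key
    return None


metrics = {"accuracy_scorer/mean": 0.62, "accuracy_scorer/stderr": 0.03}
print(find_stderr_metric(metrics, "accuracy_scorer/mean"))  # accuracy_scorer/stderr
```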
### Weighted Macro Averaging with Tags
Tasks can be grouped using `tags` for computing summary statistics. Tag groups support weighted macro averaging, allowing you to assign different weights to the tasks within a group.
Tags are specified as simple strings on tasks. To adjust weights for specific tag-task combinations, use the `macro_average_weight_adjustments` field at the split level. Tasks not specified in the adjustments default to a weight of 1.0.
See [sample-config.yml](sample-config.yml) for an example of the tag and weight adjustment format.
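As a rough illustration of the weighting scheme (task names, scores, and the adjustment mapping below are hypothetical; the real format lives in the config file), a weighted macro average for a single tag group could be computed like this:
```python
# Illustrative sketch of a weighted macro average over one tag group.
# Task names, scores, and the adjustment mapping are hypothetical;
# see sample-config.yml for the actual configuration format.
task_scores = {"task_a": 0.80, "task_b": 0.50, "task_c": 0.90}  # tasks sharing a tag
weight_adjustments = {"task_b": 2.0}  # tasks not listed default to a weight of 1.0

weights = {task: weight_adjustments.get(task, 1.0) for task in task_scores}
macro_avg = sum(task_scores[t] * weights[t] for t in task_scores) / sum(weights.values())
print(f"weighted macro average: {macro_avg:.3f}")  # 0.675
```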
## Score results
```shell
agenteval score [OPTIONS] LOG_DIR
```
Compute scores for the results in `agenteval.json` and update the file with the computed scores.
## Publish scores to leaderboard
```shell
agenteval lb publish [OPTIONS] LOG_DIR
```
Upload the scored results to HuggingFace datasets.
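For orientation, the manual equivalent of such an upload with `huggingface_hub` looks roughly like the sketch below; the repo id and folder layout are placeholders, and `agenteval lb publish` handles this for you.
```python
# Roughly what publishing amounts to: pushing a log/results folder to a
# HuggingFace dataset repo. Repo id and paths are placeholders; the
# `agenteval lb publish` command wraps this kind of upload for you.
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` or HF_TOKEN
api.upload_folder(
    folder_path="logs/my-run",                 # local log directory
    repo_id="my-org/agent-eval-submissions",   # placeholder dataset repo
    repo_type="dataset",
    path_in_repo="submissions/my-run",         # placeholder layout in the repo
)
```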
## View leaderboard scores
```shell
agenteval lb view [OPTIONS]
```
View results from the leaderboard.
# Administer the leaderboard
Prior to publishing scores, two HuggingFace datasets should be set up: one for full submissions and one for results files.
If you want to call `load_dataset()` on the results dataset (e.g., for populating a leaderboard), you probably want to explicitly tell HuggingFace about the schema and dataset structure (otherwise, HuggingFace may fail to properly auto-convert to Parquet).
This is done by updating the `configs` attribute in the YAML metadata block at the top of the `README.md` file at the root of the results dataset (the metadata block is identified by lines with just `---` above and below it).
This attribute should contain a list of configs, each of which specifies the schema (under the `features` key) and dataset structure (under the `data_files` key).
See [sample-config-hf-readme-metadata.yml](sample-config-hf-readme-metadata.yml) for a sample metadata block corresponding to [sample-config.yml](sample-config.yml) (note that the metadata references the [raw schema data](src/agenteval/leaderboard/dataset_features.yml), which must be copied).
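Once the `configs` metadata is in place, consumers can load a specific config directly with `datasets`; the repo id, config name, and split below are placeholders.
```python
# Loading the results dataset once the `configs` metadata is set up.
# The repo id, config name, and split are placeholders for your own setup.
from datasets import load_dataset

results = load_dataset(
    "my-org/agent-eval-results",  # placeholder results dataset repo
    name="my_config",             # a config declared in the README metadata
    split="train",
)
print(results.column_names)
```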
To facilitate initializing new configs, `agenteval lb publish` will automatically add this metadata if it is missing.
# Development
See [Development.md](Development.md) for development instructions.
Raw data
{
"_id": null,
"home_page": null,
"name": "agent-eval",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/6e/68/6ea5b4ad5fca53a4683bcfb0ae787e7e2719b91b61ea24708ffe5faa4734/agent_eval-0.1.24.tar.gz",
"platform": null,
"description": "# agent-eval\n\nA utility for evaluating agents on a suite of [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai)-formatted evals, with the following primary benefits:\n1. Task suite specifications as config.\n2. Extracts the token usage of the agent from log files, and computes cost using `litellm`.\n3. Submits task suite results to a leaderboard, with submission metadata and easy upload to a HuggingFace repo for distribution of scores and logs.\n\n# Installation\n\nTo install from pypi, use `pip install agent-eval`.\n\nFor leaderboard extras, use `pip install agent-eval[leaderboard]`.\n\n# Usage\n\n## Run evaluation suite\n```shell\nagenteval eval --config-path CONFIG_PATH --split SPLIT LOG_DIR\n```\nEvaluate an agent on the supplied eval suite configuration. Results are written to `agenteval.json` in the log directory. \n\nSee [sample-config.yml](sample-config.yml) for a sample configuration file. \n\nFor aggregation in a leaderboard, each task specifies a `primary_metric` as `{scorer_name}/{metric_name}`. \nThe scoring utils will look for a corresponding stderr metric, \nby looking for another metric with the same `scorer_name` and with a `metric_name` containing the string \"stderr\".\n\n### Weighted Macro Averaging with Tags\n\nTasks can be grouped using `tags` for computing summary statistics. The tags support weighted macro averaging, allowing you to assign different weights to tasks within a tag group.\n\nTags are specified as simple strings on tasks. To adjust weights for specific tag-task combinations, use the `macro_average_weight_adjustments` field at the split level. Tasks not specified in the adjustments default to a weight of 1.0.\n\nSee [sample-config.yml](sample-config.yml) for an example of the tag and weight adjustment format.\n\n## Score results \n```shell\nagenteval score [OPTIONS] LOG_DIR\n```\nCompute scores for the results in `agenteval.json` and update the file with the computed scores.\n\n## Publish scores to leaderboard\n```shell\nagenteval lb publish [OPTIONS] LOG_DIR\n```\nUpload the scored results to HuggingFace datasets.\n\n## View leaderboard scores\n```shell\nagenteval lb view [OPTIONS]\n```\nView results from the leaderboard.\n\n# Administer the leaderboard\nPrior to publishing scores, two HuggingFace datasets should be set up, one for full submissions and one for results files.\n\nIf you want to call `load_dataset()` on the results dataset (e.g., for populating a leaderboard), you probably want to explicitly tell HuggingFace about the schema and dataset structure (otherwise, HuggingFace may fail to propertly auto-convert to Parquet).\nThis is done by updating the `configs` attribute in the YAML metadata block at the top of the `README.md` file at the root of the results dataset (the metadata block is identified by lines with just `---` above and below it).\nThis attribute should contain a list of configs, each of which specifies the schema (under the `features` key) and dataset structure (under the `data_files` key).\nSee [sample-config-hf-readme-metadata.yml](sample-config-hf-readme-metadata.yml) for a sample metadata block corresponding to [sample-comfig.yml](sample-config.yml) (note that the metadata references the [raw schema data](src/agenteval/leaderboard/dataset_features.yml), which must be copied).\n\nTo facilitate initializing new configs, `agenteval lb publish` will automatically add this metadata if it is missing.\n\n# Development\n\nSee [Development.md](Development.md) for development instructions.\n",
"bugtrack_url": null,
"license": null,
"summary": "Agent evaluation toolkit",
"version": "0.1.24",
"project_urls": {
"Homepage": "https://github.com/allenai/agent-eval"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "2273c2fe8c90de28da6f5dca08a3d8d44b8b952777ac6e8c96d38f1a3c43cfc8",
"md5": "39230e64e484e1dc733d9053b7b994cb",
"sha256": "877c84dea7ea9d98e4bb72be0598dc56e0133f40ec5badda21df9d2dbc6a8518"
},
"downloads": -1,
"filename": "agent_eval-0.1.24-py3-none-any.whl",
"has_sig": false,
"md5_digest": "39230e64e484e1dc733d9053b7b994cb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 29856,
"upload_time": "2025-08-01T16:34:05",
"upload_time_iso_8601": "2025-08-01T16:34:05.409643Z",
"url": "https://files.pythonhosted.org/packages/22/73/c2fe8c90de28da6f5dca08a3d8d44b8b952777ac6e8c96d38f1a3c43cfc8/agent_eval-0.1.24-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "6e686ea5b4ad5fca53a4683bcfb0ae787e7e2719b91b61ea24708ffe5faa4734",
"md5": "850f3ba8f770aa4bc244548daa6bc423",
"sha256": "3bafd9566d54467c6dcdefd0914d0b698e2a2eaffa8fe1aa6c31945a1f27b320"
},
"downloads": -1,
"filename": "agent_eval-0.1.24.tar.gz",
"has_sig": false,
"md5_digest": "850f3ba8f770aa4bc244548daa6bc423",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 29140,
"upload_time": "2025-08-01T16:34:06",
"upload_time_iso_8601": "2025-08-01T16:34:06.366613Z",
"url": "https://files.pythonhosted.org/packages/6e/68/6ea5b4ad5fca53a4683bcfb0ae787e7e2719b91b61ea24708ffe5faa4734/agent_eval-0.1.24.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-01 16:34:06",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "allenai",
"github_project": "agent-eval",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "agent-eval"
}