# Craftax LM
A wrapper around the Craftax agent benchmark, for evaluating digital agents over extremely long time horizons.
<p align="middle">
<img src="https://raw.githubusercontent.com/MichaelTMatthews/Craftax/main/images/dungeon_crawling.gif" width="200" />
</p>
## Craftax-Classic
| LM | Algorithm | Score (% max) | Code |
|:----------|---------------:|:-----------------------------------------------------------------------------------------------:|:---------------------------------------:|
| claude-3-7-sonnet-latest (default) | ReAct | 18.0 | |
| claude-3-5-sonnet-20241022 | ReAct | 17.8 | |
| claude-3-5-sonnet-20240620 | ReAct | 15.7 | |
| o3-mini | ReAct | 12.6 | |
| gpt-4o | ReAct | 7.0 | |
* Note: this is a limited evaluation in which trajectories are terminated after 30 API calls, or roughly 150 in-game steps. 10 trajectories are rolled out, yielding a log-weighted score as per the Crafter [paper](https://arxiv.org/abs/2109.06780). Reproducible code is forthcoming.
# Usage
First, install the package with ```pip install craftaxlm```. Next, import the agent-computer interface of your choice via
```
from craftaxlm import CraftaxACI, CraftaxClassicACI
```
This package is early in development, so for implementation examples, please refer to the [baseline ReAct implementation](https://github.com/JoshuaPurtell/Apropos/blob/main/apropos/bench/craftax).
# Leaderboard
To keep experiments practical to run across a range of LMs, the leaderboard currently evaluates agents as follows:
1. Five rollouts are sampled from the agent, with a hard cap of 300 actions per rollout.
2. The agent is evaluated using a modified version of the original Crafter score:
```
sum(ln(1 + P(1_achievement_obtained)) for achievement in achievements) / (ln(2) * len(achievements))
```
where `P(1_achievement_obtained)` is the probability of the achievement being obtained in a single rollout. Dividing by `ln(2) * len(achievements)` normalizes the score so that 1 (100% of max) means every achievement is obtained in every rollout. The key idea is that incremental progress towards difficult achievements ought to weigh more heavily in the score.
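
The scoring rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the leaderboard's actual evaluation code: the function name `craftax_lm_score`, the representation of a rollout as a set of unlocked achievement names, and the sample data are all assumptions made for the example. The probability of each achievement is estimated as the fraction of rollouts in which it was unlocked.

```python
import math


def craftax_lm_score(rollouts, achievements):
    """Normalized log-weighted score in [0, 1] (hypothetical helper).

    rollouts: list of sets, each holding the achievement names unlocked
              in one rollout (an assumed representation).
    achievements: list of all achievement names being scored.
    """
    if not rollouts or not achievements:
        raise ValueError("need at least one rollout and one achievement")
    total = 0.0
    for name in achievements:
        # Empirical probability that this achievement is unlocked in a single rollout.
        p = sum(name in unlocked for unlocked in rollouts) / len(rollouts)
        total += math.log1p(p)  # ln(1 + p)
    # ln(2) per achievement is the maximum contribution (p = 1).
    return total / (math.log(2) * len(achievements))


# Made-up data for illustration: five rollouts over four achievements.
achievements = ["collect_wood", "place_table", "make_wood_pickaxe", "collect_stone"]
rollouts = [
    {"collect_wood", "place_table"},
    {"collect_wood"},
    {"collect_wood", "place_table", "make_wood_pickaxe"},
    {"collect_wood"},
    {"collect_wood", "place_table"},
]
score = craftax_lm_score(rollouts, achievements)
print(f"{100 * score:.1f}% of max")
```

Because `ln(1 + p)` is concave, moving a rarely unlocked achievement from 0% to 20% of rollouts adds more to the score than moving a common one from 80% to 100%, which is the weighting the leaderboard description is after.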
## Craftax-Full
| LM | Algorithm | Score (% max) | Code |
|:----------|---------------:|:-----------------------------------------------------------------------------------------------:|:---------------------------------------:|
# Dev Instructions
```
pyenv virtualenv craftax_env
poetry install
```
When in doubt, drop a JAX breakpoint into the code you are debugging:
```
from jax import debug
...
debug.breakpoint()
```
# 📚 Citation
To learn more about Craftax, check out the paper [website](https://craftaxenv.github.io).
To cite the underlying Craftax environment, see:
```
@inproceedings{matthews2024craftax,
author={Michael Matthews and Michael Beukman and Benjamin Ellis and Mikayel Samvelyan and Matthew Jackson and Samuel Coward and Jakob Foerster},
title = {Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning},
booktitle = {International Conference on Machine Learning ({ICML})},
year = {2024}
}
```
To cite the Crafter benchmark, see:
```
@article{hafner2021crafter,
title={Benchmarking the Spectrum of Agent Capabilities},
author={Danijar Hafner},
year={2021},
journal={arXiv preprint arXiv:2109.06780},
}
```
# Contributing
## Setup
```
uv venv craftaxlm-dev
source craftaxlm-dev/bin/activate
uv sync
uv run ruff format .
```
## Help Wanted
- General code quality suggestions or improvements, especially those that improve speed or reduce token usage.
- PRs to fix issues or add affordances that help your LM agent perform well
- Leaderboard submissions that demonstrate improved performance using algorithms for learning from data