craftaxlm


| Field | Value |
|:------|:------|
| Name | craftaxlm |
| Version | 0.0.37 |
| home_page | https://github.com/JoshuaPurtell/craftaxlm |
| Summary | Add your description here |
| upload_time | 2025-03-03 04:33:59 |
| maintainer | None |
| docs_url | None |
| author | Josh Purtell |
| requires_python | >=3.10 |
| license | MIT |

# Craftax LM
A wrapper around the Craftax agent benchmark, for evaluating digital agents over extremely long time horizons.

<p align="middle">
  <img src="https://raw.githubusercontent.com/MichaelTMatthews/Craftax/main/images/dungeon_crawling.gif" width="200" />
</p>

## Craftax-Classic
| LM | Algorithm | Score (% max) | Code |
|:---|----------:|:-------------:|:----:|
| claude-3-7-sonnet-latest (default) | ReAct | 18.0 | |
| claude-3-5-sonnet-20241022 | ReAct | 17.8 | |
| claude-3-5-sonnet-20240620 | ReAct | 15.7 | |
| o3-mini | ReAct | 12.6 | |
| gpt-4o | ReAct | 7.0 | |

* Note: this is a limited evaluation in which trajectories are terminated after 30 API calls, or roughly 150 in-game steps. Ten trajectories are rolled out, yielding a log-weighted score as per the Crafter [paper](https://arxiv.org/abs/2109.06780). Reproducible code is forthcoming.

# Usage
First, install the package with ```pip install craftaxlm```. Next, import the agent-computer interface (ACI) of your choice via
```
from craftaxlm import CraftaxACI, CraftaxClassicACI
```
This package is early in development, so for implementation examples, please refer to the [baseline ReAct implementation](https://github.com/JoshuaPurtell/Apropos/blob/main/apropos/bench/craftax).

# Leaderboard
To keep experiments practical to run across a range of LMs, the leaderboard currently evaluates agents as follows:
1. Five rollouts are sampled from the agent, with a hard cap of 300 actions per rollout.
2. The agent is evaluated using a modified version of the original Crafter score:
    ```
    sum(ln(1 + P(1_achievement_obtained)) for achievement in achievements) / (ln(2) * len(achievements))
    ```
    where P(1_achievement_obtained) is the probability of the achievement being obtained in a single rollout. Dividing by ln(2) * len(achievements) normalizes the score to 1 when every achievement is obtained in every rollout. The key idea is that incremental progress towards difficult achievements ought to weigh more heavily in the score.
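
The scoring rule described above can be sketched in plain Python. This is a minimal illustration, not the leaderboard's actual code: the function name `log_weighted_score`, the representation of a rollout as a set of achievement names, and the sample achievement names are all assumptions made for the example. The probability of each achievement is estimated as the fraction of rollouts in which it was obtained.

```python
import math


def log_weighted_score(rollouts: list[set[str]], achievements: list[str]) -> float:
    """Normalized log-weighted score in [0, 1].

    rollouts: one set per rollout, holding the achievements obtained in it
              (hypothetical representation, chosen for this sketch).
    achievements: every achievement that counts towards the score.
    """
    if not rollouts or not achievements:
        raise ValueError("need at least one rollout and one achievement")

    total = 0.0
    for achievement in achievements:
        # Empirical probability of obtaining this achievement in a single rollout.
        p = sum(achievement in rollout for rollout in rollouts) / len(rollouts)
        total += math.log(1 + p)

    # ln(2) per achievement is the maximum attainable, reached when p == 1.
    return total / (math.log(2) * len(achievements))


# Illustrative data only: four rollouts scored over two achievements.
example_rollouts = [
    {"collect_wood"},
    {"collect_wood", "place_table"},
    set(),
    {"collect_wood"},
]
example_score = log_weighted_score(example_rollouts, ["collect_wood", "place_table"])
print(f"score: {example_score:.3f}")
```

Because ln(1 + p) is concave, moving an achievement from 0% to 25% of rollouts adds more to the score than moving it from 75% to 100%, which is how early progress on hard achievements ends up weighted more heavily.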

## Craftax-Full
| LM | Algorithm | Score (% max) | Code |
|:---|----------:|:-------------:|:----:|

# Dev Instructions
```
pyenv virtualenv craftax_env
poetry install
```

When in doubt, set a JAX breakpoint:

```
from jax import debug
...
debug.breakpoint()
```

# 📚 Citation
To learn more about Craftax, check out the paper's [website](https://craftaxenv.github.io).
To cite the underlying Craftax environment, see:
```
@inproceedings{matthews2024craftax,
    author={Michael Matthews and Michael Beukman and Benjamin Ellis and Mikayel Samvelyan and Matthew Jackson and Samuel Coward and Jakob Foerster},
    title = {Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning},
    booktitle = {International Conference on Machine Learning ({ICML})},
    year = {2024}
}
```
To cite the Crafter benchmark, see:
```
@article{hafner2021crafter,
  title={Benchmarking the Spectrum of Agent Capabilities},
  author={Danijar Hafner},
  year={2021},
  journal={arXiv preprint arXiv:2109.06780},
}
```

# Contributing
## Setup
```
uv venv craftaxlm-dev
source craftaxlm-dev/bin/activate
uv sync
uv run ruff format .
```
## Help Wanted
- General code quality suggestions or improvements, especially those that improve speed or reduce token usage.
- PRs that fix issues or add affordances that help your LM agent perform well.
- Leaderboard submissions that demonstrate improved performance using algorithms for learning from data.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/JoshuaPurtell/craftaxlm",
    "name": "craftaxlm",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "Josh Purtell",
    "author_email": "Josh Purtell <jmvpurtell@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/0d/b2/b35f3705c340d43baca25094ae8402b8072704212feedf37edd52d4341e8/craftaxlm-0.0.37.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Add your description here",
    "version": "0.0.37",
    "project_urls": {
        "Homepage": "https://github.com/JoshuaPurtell/craftaxlm"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "667287906428ce8d33c26418175e0f472bb1e1adb8c66f6df344d6aa3e83d4bd",
                "md5": "533f694e58184e751926b7525316ce85",
                "sha256": "65e1b458308203dd5d0571a8ca20c48315a3eecb74026b2da47ce24cb8009305"
            },
            "downloads": -1,
            "filename": "craftaxlm-0.0.37-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "533f694e58184e751926b7525316ce85",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 32072,
            "upload_time": "2025-03-03T04:33:53",
            "upload_time_iso_8601": "2025-03-03T04:33:53.487994Z",
            "url": "https://files.pythonhosted.org/packages/66/72/87906428ce8d33c26418175e0f472bb1e1adb8c66f6df344d6aa3e83d4bd/craftaxlm-0.0.37-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0db2b35f3705c340d43baca25094ae8402b8072704212feedf37edd52d4341e8",
                "md5": "6946027713fe957c678313b391470537",
                "sha256": "8a2ab5373025e33f2750032cc91e8e64082b82dea257bb3d75a7d65ca0346d0e"
            },
            "downloads": -1,
            "filename": "craftaxlm-0.0.37.tar.gz",
            "has_sig": false,
            "md5_digest": "6946027713fe957c678313b391470537",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 32149,
            "upload_time": "2025-03-03T04:33:59",
            "upload_time_iso_8601": "2025-03-03T04:33:59.020694Z",
            "url": "https://files.pythonhosted.org/packages/0d/b2/b35f3705c340d43baca25094ae8402b8072704212feedf37edd52d4341e8/craftaxlm-0.0.37.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-03-03 04:33:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "JoshuaPurtell",
    "github_project": "craftaxlm",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "craftaxlm"
}
        