<h1 align="center">Embodied Agent Interface (EAgent): Benchmarking LLMs for Embodied Decision Making</h1>
<p align="center">
<a href="https://embodied-agent-interface.github.io/">
<img src="https://img.shields.io/badge/Website-EAgent-purple?style=plastic&logo=Google%20chrome" alt="Website">
</a>
<!-- <a href="https://github.com/embodied-agent-interface/embodied-agent-interface/tree/main/dataset">
<img src="https://img.shields.io/badge/Dataset-Download-yellow?style=plastic&logo=Data" alt="Dataset">
</a> -->
<a href="https://huggingface.co/datasets/Inevitablevalor/EmbodiedAgentInterface" target="_blank">
<img src="https://img.shields.io/badge/Dataset-Download-yellow?style=plastic&logo=huggingface" alt="Download the EmbodiedAgentInterface Dataset from Hugging Face">
</a>
<a href="https://hub.docker.com/repository/docker/jameskrw/eagent-eval/general">
<img src="https://img.shields.io/badge/Docker-Eval--Embodied--Agent-blue?style=plastic&logo=Docker" alt="Docker">
</a>
<a href="https://embodied-agent-eval.readthedocs.io/en/latest/#">
<img src="https://img.shields.io/badge/Docs-Online-blue?style=plastic&logo=Read%20the%20Docs" alt="Docs">
</a>
<a href="https://opensource.org/licenses/MIT">
<img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT">
</a>
</p>
<p align="center">
<a href="https://limanling.github.io/">Manling Li</a>,
<a href="https://www.linkedin.com/in/shiyu-zhao-1124a0266/">Shiyu Zhao</a>,
<a href="https://qinengwang-aiden.github.io/">Qineng Wang</a>,
<a href="https://jameskrw.github.io/">Kangrui Wang</a>,
<a href="https://bryanzhou008.github.io/">Yu Zhou</a>,
<a href="https://example.com/sanjana-srivastava">Sanjana Srivastava</a>,
<a href="https://example.com/cem-gokmen">Cem Gokmen</a>,
<a href="https://example.com/tony-lee">Tony Lee</a>,
<a href="https://sites.google.com/site/lieranli/">Li Erran Li</a>,
<a href="https://example.com/ruohan-zhang">Ruohan Zhang</a>,
<a href="https://example.com/weiyu-liu">Weiyu Liu</a>,
<a href="https://cs.stanford.edu/~pliang/">Percy Liang</a>,
<a href="https://profiles.stanford.edu/fei-fei-li">Li Fei-Fei</a>,
<a href="https://jiayuanm.com/">Jiayuan Mao</a>,
<a href="https://jiajunwu.com/">Jiajun Wu</a>
</p>
<p align="center">Stanford Vision and Learning Lab, Stanford University</p>
<p align="center">
<a href="https://cs.stanford.edu/~manlingl/projects/embodied-eval" target="_blank">
<img src="./EAgent.png" alt="EAgent" width="80%" height="80%" border="10" />
</a>
</p>
## Dataset Highlights
- Standardized goal specifications.
- Standardized modules and interfaces.
- Broad coverage of evaluation and fine-grained metrics.
## Overview
We aim to evaluate Large Language Models (LLMs) for embodied decision-making. While many works leverage LLMs for decision-making in embodied environments, a systematic understanding of their performance is still lacking. These models are applied in different domains, for various purposes, and with diverse inputs and outputs. Current evaluations tend to rely on final success rates alone, making it difficult to pinpoint where LLMs fall short and how to leverage them effectively in embodied AI systems.
To address this gap, we propose the **Embodied Agent Interface (EAgent)**, which unifies:
1. A broad set of embodied decision-making tasks involving both state and temporally extended goals.
2. Four commonly used LLM-based modules: goal interpretation, subgoal decomposition, action sequencing, and transition modeling.
3. Fine-grained evaluation metrics, identifying errors such as hallucinations, affordance issues, and planning mistakes.
Our benchmark provides a comprehensive assessment of LLM performance across different subtasks, identifying their strengths and weaknesses in embodied decision-making contexts.
## Installation
1. **Create and Activate a Conda Environment**:
```bash
conda create -n eagent python=3.8 -y
conda activate eagent
```
2. **Install `eagent-eval`**:
Install it from PyPI with pip:
```bash
pip install eagent-eval
```
Or, install from source:
```bash
git clone https://github.com/embodied-agent-interface/embodied-agent-interface.git
cd embodied-agent-interface
pip install -e .
```
3. **(Optional) Install iGibson for behavior evaluation**:
If you need to use `behavior_eval`, install iGibson. Follow these steps to minimize installation issues:
- Make sure you are using Python 3.8 and meet the minimum system requirements in the [iGibson installation guide](https://stanfordvl.github.io/iGibson/installation.html).
- Install CMake using Conda (do not use pip):
```bash
conda install cmake
```
- Install `iGibson`:
We provide an installation script:
```bash
python -m behavior_eval.utils.install_igibson_utils
```
Alternatively, install it manually:
```bash
git clone https://github.com/embodied-agent-interface/iGibson.git --recursive
cd iGibson
pip install -e .
```
- Download assets:
```bash
python -m behavior_eval.utils.download_utils
```
We have successfully tested installation on Linux, Windows 10+, and macOS.
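The installation steps above can be sanity-checked with a short script. The following is a minimal sketch, not part of `eagent-eval`: the function name `check_environment` is hypothetical, the module names `eagent_eval` and `behavior_eval` are taken from the `python -m ...` commands in this README, and `igibson` as the import name of iGibson is an assumption.

```python
import importlib.util
import shutil
import sys


def check_environment(require_behavior=False):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = []

    # The conda environment above is created with python=3.8.
    if sys.version_info[:2] != (3, 8):
        problems.append(
            "Python %d.%d detected; the instructions above use 3.8"
            % sys.version_info[:2]
        )

    # Module names taken from the `python -m ...` commands in this README.
    required = ["eagent_eval"]
    if require_behavior:
        # `igibson` as an import name is an assumption, not confirmed here.
        required += ["behavior_eval", "igibson"]
        # CMake is needed for the iGibson build (installed via conda above).
        if shutil.which("cmake") is None:
            problems.append("cmake not found on PATH (try: conda install cmake)")

    for name in required:
        if importlib.util.find_spec(name) is None:
            problems.append("module %r is not importable" % name)

    return problems


if __name__ == "__main__":
    for problem in check_environment(require_behavior=True):
        print("WARNING:", problem)
```

Pass `require_behavior=False` to skip the optional iGibson checks if you only evaluate on VirtualHome.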
## Quick Start
1. **Arguments**:
```bash
eagent-eval \
--dataset {virtualhome,behavior} \
--mode {generate_prompts,evaluate_results} \
--eval-type {action_sequencing,transition_modeling,goal_interpretation,subgoal_decomposition} \
--llm-response-path <path_to_responses> \
--output-dir <output_directory> \
--num-workers <number_of_workers>
```
Run the following command for further information:
```bash
eagent-eval --help
```
2. **Examples**:
- ***Evaluate Results***
If you do not want to pass `--llm-response-path <path_to_responses>`, download our results first:
```bash
python -m eagent_eval.utils.download_utils
```
Then, run the commands below:
```bash
eagent-eval --dataset virtualhome --eval-type action_sequencing --mode evaluate_results
eagent-eval --dataset virtualhome --eval-type transition_modeling --mode evaluate_results
eagent-eval --dataset virtualhome --eval-type goal_interpretation --mode evaluate_results
eagent-eval --dataset virtualhome --eval-type subgoal_decomposition --mode evaluate_results
eagent-eval --dataset behavior --eval-type action_sequencing --mode evaluate_results
eagent-eval --dataset behavior --eval-type transition_modeling --mode evaluate_results
eagent-eval --dataset behavior --eval-type goal_interpretation --mode evaluate_results
eagent-eval --dataset behavior --eval-type subgoal_decomposition --mode evaluate_results
```
- ***Generate Prompts***
To generate prompts, you can run:
```bash
eagent-eval --dataset virtualhome --eval-type action_sequencing --mode generate_prompts
eagent-eval --dataset virtualhome --eval-type transition_modeling --mode generate_prompts
eagent-eval --dataset virtualhome --eval-type goal_interpretation --mode generate_prompts
eagent-eval --dataset virtualhome --eval-type subgoal_decomposition --mode generate_prompts
eagent-eval --dataset behavior --eval-type action_sequencing --mode generate_prompts
eagent-eval --dataset behavior --eval-type transition_modeling --mode generate_prompts
eagent-eval --dataset behavior --eval-type goal_interpretation --mode generate_prompts
eagent-eval --dataset behavior --eval-type subgoal_decomposition --mode generate_prompts
```
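The sixteen commands above are the full cross product of two datasets, four evaluation types, and two modes. As a convenience, they can be generated rather than typed out. The sketch below is illustrative and not shipped with the package: `build_commands` and `run_all` are hypothetical helper names, and only flags listed under **Arguments** are used. It defaults to a dry run that prints the commands without executing them.

```python
import itertools
import shlex
import subprocess

# Choices copied from the argument list in this README.
DATASETS = ("virtualhome", "behavior")
EVAL_TYPES = (
    "action_sequencing",
    "transition_modeling",
    "goal_interpretation",
    "subgoal_decomposition",
)
MODES = ("generate_prompts", "evaluate_results")


def build_commands(mode, output_dir=None):
    """Build one eagent-eval argv list per (dataset, eval type) pair."""
    if mode not in MODES:
        raise ValueError("mode must be one of %s" % (MODES,))
    commands = []
    for dataset, eval_type in itertools.product(DATASETS, EVAL_TYPES):
        argv = [
            "eagent-eval",
            "--dataset", dataset,
            "--eval-type", eval_type,
            "--mode", mode,
        ]
        if output_dir is not None:
            argv += ["--output-dir", output_dir]
        commands.append(argv)
    return commands


def run_all(mode, dry_run=True):
    """Print every command; execute it only when dry_run is False."""
    for argv in build_commands(mode):
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing evaluation.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all("generate_prompts")
```

Call `run_all("evaluate_results", dry_run=False)` to actually launch the evaluations once `eagent-eval` is installed and the results are downloaded.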
## Docker
We provide a ready-to-use Docker image for easy installation and usage.
First, pull the Docker image from Docker Hub:
```bash
docker pull jameskrw/eagent-eval
```
Next, run the Docker container interactively:
```bash
docker run -it jameskrw/eagent-eval
```
Inside the container, stay in the `/opt/iGibson` directory; do not change to another directory.
To check the available arguments for the `eagent-eval` CLI, use the following command:
```bash
python3 -m eagent_eval.cli --help
```
You can run:
```bash
python3 -m eagent_eval.cli
```
With no arguments, this generates prompts for goal interpretation on the `behavior` dataset.
`python3 -m eagent_eval.cli` is equivalent to the `eagent-eval` command introduced above, but inside the Docker container only the `python3 -m eagent_eval.cli` form is currently supported.
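For scripted use, the interactive steps above can be folded into a single non-interactive `docker run` invocation. The sketch below only builds the command line. It assumes the image accepts a command override (that is, it defines no conflicting entrypoint), which this README does not confirm. `docker_cli_command` is a hypothetical helper name, and `--workdir` is used to keep the working directory at `/opt/iGibson` as required above.

```python
import shlex

IMAGE = "jameskrw/eagent-eval"
WORKDIR = "/opt/iGibson"  # the README says to stay in this directory


def docker_cli_command(cli_args=()):
    """Return an argv list that runs the CLI in a throwaway container."""
    return [
        "docker", "run", "--rm",   # --rm removes the container on exit
        "--workdir", WORKDIR,
        IMAGE,
        "python3", "-m", "eagent_eval.cli",
        *cli_args,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(docker_cli_command(["--help"])))
```

Pass the resulting list to `subprocess.run` if you want to execute it. Outputs written inside a `--rm` container are discarded on exit unless you also mount a volume.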