webllama


| Field | Value |
| :-- | :-- |
| Name | webllama |
| Version | 0.1.0 |
| Home page | https://github.com/McGill-NLP/webllama |
| Summary | Llama-powered agents for automatic web browsing |
| Upload time | 2024-06-25 18:21:49 |
| Maintainer | None |
| Docs URL | None |
| Author | Xing Han Lù |
| Requires Python | >=3.8 |
| License | None |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

<div align="center">

<h1>🖥️ WebLlama🦙</h1>

<i>Building <b>agents</b> that can browse the web by following instructions and talking to you</i>

| 💻 [**GitHub**](https://github.com/McGill-NLP/webllama) | 🏠 [**Homepage**](https://webllama.github.io) | 🤗 [**`Llama-3-8B-Web`**](https://huggingface.co/McGill-NLP/Llama-3-8B-Web) |
| :--: | :--: | :--: |


<img src="assets/WebLlamaLogo.png" style="width: 400px;" />


</div>

> [!IMPORTANT] 
> **We are thrilled to release [`Llama-3-8B-Web`](https://huggingface.co/McGill-NLP/Llama-3-8B-Web), the most capable agent built with 🦙 Llama 3 and finetuned for web navigation with dialogue. You can download the agent from the 🤗 [Hugging Face Model Hub](https://huggingface.co/McGill-NLP/Llama-3-8B-Web).**

| `WebLlama` helps you build powerful agents, powered by Meta Llama 3, for browsing the web on your behalf | Our first model, [`Llama-3-8B-Web`](https://huggingface.co/McGill-NLP/Llama-3-8B-Web), surpasses GPT-4V (`*`zero-shot) by 18% on [`WebLINX`](https://mcgill-nlp.github.io/weblinx/) |
|:---: | :---: |
| ![Built with Meta Llama 3](assets/llama-3.jpg) | ![Comparison with GPT-4V](assets/LlamaAndGPT.png) |

## About the project

| `WebLlama` | The goal of our project is to build effective human-centric agents for browsing the web. We don't want to replace users, but equip them with powerful assistants. |
|:---: | :---|
| Modeling | We build on top of cutting-edge libraries for training Llama agents on web navigation tasks. We will provide training scripts, optimized configs, and instructions for training cutting-edge Llamas. |
| Evaluation | Benchmarks for testing Llama models on real-world web browsing. This includes *human-centric* browsing through dialogue ([`WebLINX`](https://mcgill-nlp.github.io/weblinx/)), and we will soon add more benchmarks for automatic web navigation (e.g. Mind2Web). |
| Data | Our first model is finetuned on over 24K instances of web interactions, including `click`, `textinput`, `submit`, and dialogue acts. We want to continuously curate, compile and release datasets for training better agents. |
| Deployment | We want to make it easy to integrate Llama models with existing deployment platforms, including Playwright, Selenium, and BrowserGym. We are currently focusing on making this a reality. |


## Modeling

> [!NOTE]
> The model is available on the 🤗 Hugging Face Model Hub as [`McGill-NLP/Llama-3-8B-Web`](https://huggingface.co/McGill-NLP/Llama-3-8B-Web). The training and evaluation data is available on [Hugging Face Hub as `McGill-NLP/WebLINX`](https://huggingface.co/datasets/McGill-NLP/WebLINX).

Our first agent is a finetuned [`Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model, which was recently released by the Meta GenAI team. We have finetuned this model on the [`WebLINX`](https://mcgill-nlp.github.io/weblinx/) dataset, which contains over 100K instances of web navigation and dialogue, each collected and verified by expert annotators. We use a curated 24K subset for training.

![Comparison of Llama-3-Web, GPT-4V, GPT-3.5 and MindAct](assets/LlamaAndGPTAndMindAct.png)

**It surpasses GPT-4V (zero-shot `*`) by over 18 percentage points on the [`WebLINX`](https://mcgill-nlp.github.io/weblinx/) benchmark**, achieving an overall score of 28.8% on the out-of-domain test splits (compared to 10.5% for GPT-4V). It chooses more useful links (34.1% vs 18.9% *seg-F1*), clicks on more relevant elements (27.1% vs 13.6% *IoU*) and formulates more aligned responses (37.5% vs 3.1% *chr-F1*).
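
As a quick sanity check on the headline gap, the two overall scores quoted above can be compared directly. This is only arithmetic on the reported numbers; the variable names are ours and are not part of the `webllama` API.

```python
# Overall WebLINX scores quoted above (out-of-domain test splits), in percent.
llama_3_8b_web = 28.8
gpt_4v_zero_shot = 10.5

# Absolute gap between the two scores, and the ratio between them.
absolute_gap = llama_3_8b_web - gpt_4v_zero_shot
ratio = llama_3_8b_web / gpt_4v_zero_shot

print(f"Absolute gap: {absolute_gap:.1f} points")
print(f"Ratio: {ratio:.2f}x")
```

The gap is an absolute difference between two percentages, which is why it is best read as points rather than as a relative improvement.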

The model is straightforward to use via Hugging Face's `transformers`, `datasets` and `huggingface_hub` libraries:

```python
from pathlib import Path

from datasets import load_dataset
from huggingface_hub import snapshot_download
from transformers import pipeline

# We use validation data, but you can use your own data here
valid = load_dataset("McGill-NLP/WebLINX", split="validation")

# snapshot_download returns the local folder that holds the downloaded files
data_dir = snapshot_download("McGill-NLP/WebLINX", repo_type="dataset", allow_patterns="templates/*")
template = (Path(data_dir) / "templates" / "llama.txt").read_text()

# Run the agent on a single state (text representation) and get the action
state = template.format(**valid[0])
agent = pipeline(model="McGill-NLP/Llama-3-8B-Web")  # the task is inferred from the model
out = agent(state, return_full_text=False)[0]
print("Action:", out['generated_text'])

# Here, you can use the predictions on platforms like playwright or browsergym.
# `process_pred` and `env` are placeholders that you implement for your platform:
# action = process_pred(out['generated_text'])
# env.step(action)  # execute the action in your environment
```
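
The `process_pred` step in the snippet above is left for you to implement. As a rough sketch, and assuming the generated text is a single call-like string such as `click(uid="a1b2")` (this format and the returned dictionary layout are our assumptions, not something this README specifies), it could be parsed with the standard library:

```python
import ast


def process_pred(text: str) -> dict:
    """Parse a call-like action string into an intent name and keyword arguments.

    Assumes the prediction looks like `intent(key="value", ...)`; adapt this
    to the actual output format of your model and template.
    """
    tree = ast.parse(text.strip(), mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a call-like action: {text!r}")
    # literal_eval only accepts plain literals, so no code is executed.
    args = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return {"intent": call.func.id, "args": args}


print(process_pred('click(uid="a1b2")'))
# {'intent': 'click', 'args': {'uid': 'a1b2'}}
```

Parsing with `ast` rather than `eval` keeps the model's output from being executed as code, which matters when the text comes from a generative model.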

## Evaluation

We believe short demo videos showing how well an agent performs are NOT enough to judge an agent. Simply put, **we do not know if we have a good agent if we do not have good benchmarks.** We need to systematically evaluate agents on a wide range of tasks, spanning from simple instruction-following web navigation to complex dialogue-guided browsing.

<img src="assets/WebLINXTestSplits.png" style="width: 100%"/>

This is why we chose [`WebLINX`](https://mcgill-nlp.github.io/weblinx/) as our first benchmark. In addition to the training split, the benchmark has 4 real-world splits, with the goal of testing multiple dimensions of generalization: new websites, new domains, unseen geographic locations, and scenarios where the *user cannot see the screen and relies on dialogue*. It also covers 150 websites, including booking, shopping, writing, knowledge lookup, and even complex tasks like manipulating spreadsheets. Evaluating on this benchmark is very straightforward:

```bash
cd modeling/

# After installing dependencies, downloading the dataset, and training/evaluating your model, you can evaluate:
python -m weblinx.eval # automatically find all `results.jsonl` and generate an `aggregated_results.json` file

# Visualize your results with our app:
cd ..
streamlit run app/Results.py
```
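
To inspect raw outputs yourself before aggregation, the following sketch locates every `results.jsonl` under a directory and loads its records. It is an illustration only and not how `weblinx.eval` is implemented; the function name `collect_results` is hypothetical, and no assumption is made about the fields inside each record.

```python
import json
from pathlib import Path


def collect_results(root: str) -> dict:
    """Map each `results.jsonl` found under `root` to its list of parsed records."""
    found = {}
    for path in sorted(Path(root).rglob("results.jsonl")):
        with path.open(encoding="utf-8") as f:
            # JSON Lines: one JSON value per non-empty line.
            found[str(path)] = [json.loads(line) for line in f if line.strip()]
    return found
```

Calling `collect_results(...)` on the directory that holds your evaluation outputs returns a dictionary keyed by file path, so `len(records)` gives the number of predictions stored in each file.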

> 👷‍♀️ **Next steps**\
> We are planning to evaluate our models on more benchmarks, including Mind2Web, a benchmark for automatic web navigation. We believe that a good agent should be able to navigate the web both through dialogue and autonomously, and potentially attain even broader ranges of capabilities useful for real-world web browsing.


## Data

Although the 24K training examples from [`WebLINX`](https://mcgill-nlp.github.io/weblinx/) provide a good starting point for training a capable agent, we believe that more data is needed to train agents that can generalize to a wide range of web navigation tasks. Although the model has been trained and evaluated on 150 websites, there are millions of websites that it has never seen, with new ones being created every day.

**This motivates us to continuously curate, compile and release datasets for training better agents.** As an immediate next step, we will incorporate the training data from `Mind2Web`, which also covers over 100 websites.


## Deployment

We are working hard to make it easy for you to deploy Llama web agents to the web. We want to integrate `WebLlama` with existing deployment platforms, including Microsoft's Playwright, ServiceNow Research's BrowserGym, and other partners.

At the moment, we offer the following integrations:
* `BrowserGym`: Please find more information in [`examples/README.md`](examples/README.md) and [`docs/README.md`](docs/README.md).
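
For platforms without a ready-made integration, such as Playwright, a parsed action has to be dispatched to the browser by hand. The sketch below assumes a hypothetical action dictionary of the form `{"intent": ..., "args": {...}}` and assumes that target elements carry a `data-webtasks-id` attribute holding the uid; both are assumptions for illustration. It relies only on the `click(selector)` and `fill(selector, value)` methods that Playwright's `Page` object provides.

```python
from typing import Any, Dict


def uid_selector(uid: str) -> str:
    """CSS selector for an element tagged with a uid (attribute name is an assumption)."""
    return f'[data-webtasks-id="{uid}"]'


def execute_action(page: Any, action: Dict[str, Any]) -> None:
    """Run one parsed action on a Playwright-style page object."""
    intent, args = action["intent"], action["args"]
    if intent == "click":
        page.click(uid_selector(args["uid"]))
    elif intent == "textinput":
        page.fill(uid_selector(args["uid"]), args["text"])
    else:
        # Other intents (for example submit or dialogue acts) are left to the reader.
        raise NotImplementedError(f"unsupported intent: {intent}")
```

Keeping the dispatcher independent of any browser import makes it easy to test with a stub page object before pointing it at a real browser session.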

## Code

The code for finetuning the model and evaluating it on the [`WebLINX`](https://mcgill-nlp.github.io/weblinx/) benchmark is available now. 
* **Modeling**: You can find the detailed instructions in [modeling](modeling/README.md) for training `Llama-3-8B-Web` on the `WebLINX` dataset.
* **Examples**: We provide a few examples for using the `webllama` API and models, including web API, end-to-end, and BrowserGym integration. You can find them in [examples](examples/README.md).
* **App**: We provide a simple Streamlit app for visualizing the results of your model on the `WebLINX` benchmark. You can find the code in [app](app/Results.py).
* **Docs**: We provide detailed documentation for the code in [docs](docs/README.md).


> 👷‍♀️ **Next steps**\
> We are actively working on new data and evaluation at the moment! If you want to help, please create an issue describing what you would like to contribute, and we will be happy to help you get started.

## Citation

If you use `WebLlama` in your research, please cite the following paper (on which the data, training and evaluation are based):

```
@misc{lù2024weblinx,
      title={WebLINX: Real-World Website Navigation with Multi-Turn Dialogue}, 
      author={Xing Han Lù and Zdeněk Kasner and Siva Reddy},
      year={2024},
      eprint={2402.05930},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## License

The code in this repository is licensed under the MIT license, unless otherwise specified in the header of the file. Other materials (models, data, images) have their own licenses, which are specified in the original pages.

## FAQ

### How can I contribute to the project?

We are actively looking for collaborators to help us build the best Llama-3 web agents! To get started, open an issue about what you would like to contribute, and once it has been discussed, you can submit a pull request.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/McGill-NLP/webllama",
    "name": "webllama",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "Xing Han L\u00f9",
    "author_email": "webllama@googlegroups.com",
    "download_url": "https://files.pythonhosted.org/packages/15/7e/437b0b5e5d34ef43dad4304298400f0b903d58250c6b0eb77b13f8651f27/webllama-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Llama-powered agents for automatic web browsing",
    "version": "0.1.0",
    "project_urls": {
        "Homepage": "https://github.com/McGill-NLP/webllama"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "840043553321ef83993309566a4cafb78db35c160e9a3a0906cb86dec2e65816",
                "md5": "600d370a0c92fdb1acdaa499d666e380",
                "sha256": "a53f287c5c68a8acc35318bcc5659a60fb3e0ffa848baa8cdeb9c76fbb4cfa62"
            },
            "downloads": -1,
            "filename": "webllama-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "600d370a0c92fdb1acdaa499d666e380",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 27179,
            "upload_time": "2024-06-25T18:21:47",
            "upload_time_iso_8601": "2024-06-25T18:21:47.085916Z",
            "url": "https://files.pythonhosted.org/packages/84/00/43553321ef83993309566a4cafb78db35c160e9a3a0906cb86dec2e65816/webllama-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "157e437b0b5e5d34ef43dad4304298400f0b903d58250c6b0eb77b13f8651f27",
                "md5": "f346b3a89c6c704059e85f8c28113ed8",
                "sha256": "11b7f3d5dc39837091a215d9f5a408748dd06b57ff7ec446550ce6c6d4243c56"
            },
            "downloads": -1,
            "filename": "webllama-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "f346b3a89c6c704059e85f8c28113ed8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 26998,
            "upload_time": "2024-06-25T18:21:49",
            "upload_time_iso_8601": "2024-06-25T18:21:49.261246Z",
            "url": "https://files.pythonhosted.org/packages/15/7e/437b0b5e5d34ef43dad4304298400f0b903d58250c6b0eb77b13f8651f27/webllama-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-25 18:21:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "McGill-NLP",
    "github_project": "webllama",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "webllama"
}
        