browserpilot


- Name: browserpilot
- Version: 0.2.53
- Summary: Natural language browser automation
- Upload time: 2024-12-21 06:45:34
- Author: Andrew Han
- Requires Python: <4.0,>=3.10

# 🛫 BrowserPilot

An intelligent web browsing agent controlled by natural language.

![demo](assets/demo_buffalo.gif)

Language is the most natural interface through which humans give and receive instructions. Instead of writing bespoke automation or scraping code which is brittle to changes, creating and adding agents should be as simple as writing plain English.

## 🏗️ Installation

1. `pip install browserpilot`
2. Download Chromedriver (latest stable release) from [here](https://sites.google.com/chromium.org/driver/), unzip it, and place it in the same folder as this file. On macOS, right-click the unpacked `chromedriver` in Finder and click "Open"; this removes the restrictive default permissions and allows Python to access it.
3. Set the `OPENAI_API_KEY` environment variable to your OpenAI API key.
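
Before constructing an agent, it can help to fail fast when the key is missing. The following is a minimal sketch using only the standard library; the helper name `require_openai_key` is hypothetical and not part of BrowserPilot.

```python
import os


def require_openai_key(environ=os.environ):
    """Return the OpenAI API key, or raise a clear error if it is unset."""
    key = environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running BrowserPilot."
        )
    return key
```

Calling `require_openai_key()` at the top of a script surfaces a missing key before any browser is started.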


## 🦭 Usage
### πŸ—ΊοΈ API
The form factor is fairly simple (see below).

```python
from browserpilot.agents.gpt_selenium_agent import GPTSeleniumAgent

instructions = """Go to Google.com
Find all textareas.
Find the first visible textarea.
Click on the first visible textarea.
Type in "buffalo buffalo buffalo buffalo buffalo" and press enter.
Wait 2 seconds.
Find all anchor elements that link to Wikipedia.
Click on the first one.
Wait for 10 seconds."""

agent = GPTSeleniumAgent(instructions, "/path/to/chromedriver")
agent.run()
```

The harder (but more fun) part is writing the natural language prompts.


### 📑 Writing Prompts

It helps to be familiar with Selenium and with programming in general, because this project uses GPT-3 to translate natural language into code, so you should be as precise as you can. Writing prompts is more like writing code with Copilot than talking to a friend. For instance, it helps to refer to elements as `input`s or `textarea`s (rather than "text box" or "search box"), and to say "button which says 'Log in'" rather than "the login button". The model will sometimes miss specific words that are important, so it helps to break them out into separate lines: instead of "find all the visible textareas", write "find all the textareas" and then "find the first visible textarea".
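
To make the advice concrete, here is a small sketch contrasting a vague prompt with a more precise one. The wording of both prompts is illustrative only; the variable names are assumptions made for this example.

```python
# Vague: bundles several actions into one line and uses informal element names.
vague = 'Go to Google, click the search box and search for "buffalo".'

# More precise: one action per line, HTML element names, explicit waits.
precise_steps = [
    "Go to Google.com",
    "Find all textareas.",
    "Find the first visible textarea.",
    "Click on the first visible textarea.",
    'Type in "buffalo" and press enter.',
    "Wait 2 seconds.",
]

# Join the steps into a single newline-separated instruction string.
instructions = "\n".join(precise_steps)
print(instructions)
```

Keeping the steps in a list makes it easy to reorder or split a line when the model misses a word.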

You can look at some examples in `prompts/examples` to get started.

Create "functions" by enclosing instructions in `BEGIN_FUNCTION func_name` and `END_FUNCTION`, and then call them by starting a line with `RUN_FUNCTION` or `INJECT_FUNCTION`. Below is an example: 

```
BEGIN_FUNCTION search_buffalo
Go to Google.com
Find all textareas.
Find the first visible textarea.
Click on the first visible textarea.
Type in "buffalo buffalo buffalo buffalo buffalo" and press enter.
Wait 2 seconds.
Get all anchors on the page that contain the word "buffalo".
Click on the first link.
END_FUNCTION

RUN_FUNCTION search_buffalo
Wait for 10 seconds.
```
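
As a mental model of the syntax above, the sketch below inlines each function body wherever `RUN_FUNCTION` appears. It is an illustration written for this README, not BrowserPilot's own implementation; the library's actual handling (including any difference between `RUN_FUNCTION` and `INJECT_FUNCTION`) may work differently, and `expand_functions` is a hypothetical name.

```python
def expand_functions(prompt: str) -> list[str]:
    """Inline BEGIN_FUNCTION/END_FUNCTION bodies wherever RUN_FUNCTION appears."""
    functions: dict[str, list[str]] = {}
    expanded: list[str] = []
    current = None  # name of the function currently being defined, if any

    for raw in prompt.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no instruction
        if line.startswith("BEGIN_FUNCTION"):
            current = line.split(maxsplit=1)[1]
            functions[current] = []
        elif line == "END_FUNCTION":
            current = None
        elif current is not None:
            functions[current].append(line)  # body line of the open function
        elif line.startswith("RUN_FUNCTION"):
            name = line.split(maxsplit=1)[1]
            expanded.extend(functions[name])  # substitute the stored body
        else:
            expanded.append(line)
    return expanded


example = """BEGIN_FUNCTION search
Go to Google.com
Find all textareas.
END_FUNCTION

RUN_FUNCTION search
Wait for 10 seconds."""

for step in expand_functions(example):
    print(step)
```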

You may also choose to create a YAML or JSON file with a list of instructions. In general, it needs an `instructions` field and, optionally, a `compiled` field which holds the processed code.

See [buffalo wikipedia example](prompts/examples/buffalo_wikipedia.yaml).
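
A prompt file of this kind can be produced with the standard library. The sketch below assumes that `instructions` is a list of strings with one step per entry; check the linked example for the exact shape. The helper `write_prompt_file` and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_prompt_file(path: Path, steps: list[str]) -> None:
    """Write a JSON prompt file whose `instructions` field holds the steps."""
    # Assumption: `instructions` is a list of strings, one step per entry.
    path.write_text(json.dumps({"instructions": steps}, indent=2))


steps = [
    "Go to Google.com",
    "Find all textareas.",
    "Find the first visible textarea.",
]

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    prompt_path = Path(tmp) / "example_prompt.json"
    write_prompt_file(prompt_path, steps)
    loaded = json.loads(prompt_path.read_text())

print(loaded["instructions"])
```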

You may pass an `instruction_output_file` to the constructor of `GPTSeleniumAgent`, which will output a YAML file with the compiled instructions from GPT-3, to avoid having to pay API costs for compiling them again.
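
A minimal sketch of passing that argument follows. It assumes `instruction_output_file` is accepted as a keyword argument with exactly that name, as the sentence above suggests; the wrapper `run_and_cache` and its parameter names are hypothetical.

```python
def run_and_cache(instructions: str, chromedriver_path: str, output_file: str) -> str:
    """Run the agent once and ask it to write the compiled instructions to a file."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from browserpilot.agents.gpt_selenium_agent import GPTSeleniumAgent

    agent = GPTSeleniumAgent(
        instructions,
        chromedriver_path,
        # Assumption: the keyword name matches the one mentioned in this README.
        instruction_output_file=output_file,
    )
    agent.run()
    return output_file
```

Calling `run_and_cache(instructions, "/path/to/chromedriver", "compiled.yaml")` would start a real browser session and call the OpenAI API, so it is left uninvoked here.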

## ✋🏼 Contributing
There are two ways I envision folks contributing.

- **Adding to the Prompt Library**: Read "Writing Prompts" above and simply make a pull request to add something to `prompts/`! At some point, I will figure out a protocol for folder naming conventions and the evaluation of submitted code (for security, accuracy, etc). This would be a particularly attractive option for those who aren't as familiar with coding.
- **Contributing code**: I am happy to take suggestions! The main way to add to the repository is to extend the capabilities of the agent, or to create new agents entirely. The best way to do this is to familiarize yourself with "Architecture and Prompt Patterns" below, and to (a) expand the list of capabilities in the base prompt in `InstructionCompiler` and (b) write the corresponding method in `GPTSeleniumAgent`.

## ⛩️ Architecture and Prompt Patterns

This repo was inspired by the work of [Yihui He](https://github.com/yihui-he/ActGPT), [Adept.ai](https://adept.ai/), and [Nat Friedman](https://github.com/nat/natbot). In particular, the basic abstractions and prompts used were built off of Yihui's hackathon code. The idea to preprocess HTML and use GPT-3 to intelligently pick elements out is from Nat. 

- The prompts used can be found in [instruction compiler](browserpilot/agents/compilers/instruction_compiler.py). The base prompt describes in plain English a set of actions that the browsing agent can take, some general conventions on how to write code, and some constraints on its behavior. **These actions correspond one-for-one with methods in `GPTSeleniumAgent`**. Those actions, to date, include:
    - `env.driver`, the Selenium webdriver.
    - `env.find_elements(by='id', value=None)` finds and returns list of elements.
    - `env.find_element(by='id', value=None)` is similar to `env.find_elements()` except it only returns the first element.
    - `env.find_nearest(e, xpath)` can be used to locate an element near another one.
    - `env.send_keys(element, text)` sends `text` to element.
    - `env.get(url)` goes to url.
    - `env.click(element)` clicks the element.
    - `env.wait(seconds)` waits for `seconds` seconds.
    - `env.scroll(direction, iframe=None)` scrolls the page. Will switch to `iframe` if given. `direction` can be "up", "down", "left", or "right". 
    - `env.get_llm_response(text)` asks AI about a string `text`.
    - `env.retrieve_information(prompt)` returns a string containing information from the page, given a prompt. Use `prompt="Summarize:"` for summaries. Invoked with commands like "retrieve", "find in the page", or similar.
    - `env.ask_llm_to_find_element(description)` asks AI to find an element that matches the description.
    - `env.query_memory(prompt)` asks AI with a prompt to query its memory (an embeddings index) of the web pages it has browsed. Invoked with "Query memory".
    - `env.save(text, filename)` saves the string `text` to a file `filename`.
    - `env.get_text_from_page()` returns the free text from the page.
- The rest of the code is basically middleware which exposes a Selenium object to GPT-3. **For each action mentioned in the base prompt, there is a corresponding method in GPTSeleniumAgent.**
    - An `InstructionCompiler` is used to parse user input into semantically cogent blocks of actions.
- The agent has a `Memory` which enables it to synthesize what it sees.
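
The pattern described above, where generated code calls methods on an `env` object, can be sketched with a stand-in environment. Everything below is illustrative: `RecordingEnv` is a hypothetical stub that records calls instead of driving a browser, and `generated_code` is a made-up example of what a model might emit, not actual output from BrowserPilot.

```python
class RecordingEnv:
    """Stand-in for the agent: records calls instead of driving a browser."""

    def __init__(self):
        self.calls = []

    def get(self, url):
        self.calls.append(("get", url))

    def wait(self, seconds):
        self.calls.append(("wait", seconds))

    def find_elements(self, by="id", value=None):
        self.calls.append(("find_elements", by, value))
        return ["<textarea #0>"]  # placeholder standing in for a web element

    def click(self, element):
        self.calls.append(("click", element))

    def send_keys(self, element, text):
        self.calls.append(("send_keys", element, text))


# Hypothetical code a model might produce for a few natural language steps.
generated_code = '''
env.get("https://www.google.com")
textareas = env.find_elements(by="tag name", value="textarea")
env.click(textareas[0])
env.send_keys(textareas[0], "buffalo")
env.wait(2)
'''

env = RecordingEnv()
# The generated code receives `env` (plus Python builtins) as its globals.
# This is not a sandbox; see the disclaimer about `exec`.
exec(generated_code, {"env": env})

for call in env.calls:
    print(call)
```

Swapping the stub for a real agent is what turns the recorded calls into browser actions, which is why each action in the base prompt needs a matching method.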


## 🎉 Finished
0.2.51 
- Thanks to @rapatel0, you can now run BrowserPilot with Selenium Grid, remotely.

0.2.42 - 0.2.44
- Small changes in `examples.py` and dependencies.
- Refactor for the big Llama Index upgrade.

0.2.38 - 0.2.41
- Change `enable_memory` to `memory_file` to enable more control over what the memory is called. Allow users to load memory as well.
- Make `get_text_from_page` simpler.


0.2.26 - 0.2.37
- Bit the bullet and switched the default model to gpt-3.5-turbo. Will be much cheaper!
- Also fixed retries. I wasn't actually getting the retry action!
- Fiddle with the prompt a bit for GPT-3.5.
- Concerningly, gpt-3.5-turbo keeps trying to import modules. I manually remove lines that try to import modules.
- Compatibility with new Llama Index updates.

0.2.14 - 0.2.25
- Add options to avoid website detection of bots.
- Add more OpenAI API error handling.
- Improve stack trace prompt and a few other prompts.
- Add "displayed in viewport" capability. 
- Make `from browserpilot.agents import <agent>` possible.
- Make `find_element` and `find_elements` search only for displayed elements.
- Save memory once finished running.
- Add scroll option to top and bottom.

0.2.10 - 0.2.13
- Add more error handling for OpenAI exceptions.
- Change all the embedding querying to use ChatGPT.
- Get rid of the nltk dependency! Good riddance.
- Expand the max token window for asking the LLM a question on a web page. 
- Fix an issue with the Memory module which tried to access OpenAI API key before it's initialized. Change the prompt slightly.
- ❗️ Enable ChatGPT use with GPT Index, so that we can use GPT3.5-turbo to query embeddings.

0.2.7 - 0.2.9
- Vacillating on the default model. ChatGPT does not work well for writing code, as it takes too many liberties with what it returns.
- Also, I tried condensing the prompt a bit, which is growing a bit long.

0.2.4 - 0.2.6
- Give the agent a memory (still very experimental and not very good). Add capability to screenshot elements.
- Bug fixes around versioning and prompt injection.

0.2.3
- Move `chrome_options` to somewhere more sensible. Just keep the yaml clean, you know?

0.2.2
- ChatGPT support.

0.2.1
- Dict support for loading instructions.

0.2.0
- 🎬 a `Studio` CLI which helps iteratively test prompts!
- JSON loading.
- Basic iframe support.

<0.2.0
- GPTSeleniumAgent should be able to load prompts and cached successful runs in the form of yaml files. InstructionCompiler should be able to save instructions to yaml.
- 💭 Add a summarization capability to the agent.
- Demo/test something where it has to ask the LLM to synthesize something it reads online.
- 🚨 Figured out how to feed the content of the HTML page into the GPT-3 context window and have it reliably pick out specific elements from it.

## 🚨 Disclaimer 

This package runs code output from the OpenAI API in Python using `exec`. **This is not considered a safe convention**. Accordingly, you should be extra careful when using this package. The standard disclaimer follows.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "browserpilot",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "Andrew Han",
    "author_email": "handrew11@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/c8/b6/41a3304cb749897d7dc63a650a5b697429170bd175e9250cd10a24d466eb/browserpilot-0.2.53.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Natural language browser automation",
    "version": "0.2.53",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "069ff9c2778ae8ab19fd859aed75f085c4f3aeff00fbdd00491cdc50ea88b626",
                "md5": "50d20ed32395f68bea0bd2fd3b997f68",
                "sha256": "473615c58ffc8dd633e76aa62c0930b0d900beab85fab5243c8d4d1c9a32e45e"
            },
            "downloads": -1,
            "filename": "browserpilot-0.2.53-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "50d20ed32395f68bea0bd2fd3b997f68",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 21118,
            "upload_time": "2024-12-21T06:45:32",
            "upload_time_iso_8601": "2024-12-21T06:45:32.316406Z",
            "url": "https://files.pythonhosted.org/packages/06/9f/f9c2778ae8ab19fd859aed75f085c4f3aeff00fbdd00491cdc50ea88b626/browserpilot-0.2.53-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c8b641a3304cb749897d7dc63a650a5b697429170bd175e9250cd10a24d466eb",
                "md5": "7259efcb412851c7117edf846e867367",
                "sha256": "e46d155292723fc5b96b2ab821f0c1f303dd52051c24fc650a39636e6ac70b7d"
            },
            "downloads": -1,
            "filename": "browserpilot-0.2.53.tar.gz",
            "has_sig": false,
            "md5_digest": "7259efcb412851c7117edf846e867367",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 22552,
            "upload_time": "2024-12-21T06:45:34",
            "upload_time_iso_8601": "2024-12-21T06:45:34.714050Z",
            "url": "https://files.pythonhosted.org/packages/c8/b6/41a3304cb749897d7dc63a650a5b697429170bd175e9250cd10a24d466eb/browserpilot-0.2.53.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-21 06:45:34",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "browserpilot"
}