undetected-browser-tool


Name: undetected-browser-tool
Version: 1.0.2
Home page: https://github.com/thevgergroup/undetected-browser-tool
Summary: A langchain tool implementation of Undetected with Selenium and Chrome for page fetching, making it easier to bypass bot detectors
Upload time: 2024-08-27 21:46:31
Maintainer: None
Docs URL: None
Author: patrick o'leary
Requires Python: <4.0,>=3.9
License: MIT
Keywords: ai agents langchain undetected selenium chrome browser bot detection bypass
Requirements: No requirements were recorded.

## Undetected Browser Tool
[![Publish Python 🐍 distribution 📦 to PyPI and TestPyPI](https://github.com/thevgergroup/undetected-browser-tool/actions/workflows/python-publish.yml/badge.svg)](https://github.com/thevgergroup/undetected-browser-tool/actions/workflows/python-publish.yml)

### Introduction

The Undetected Browser Tool is an AI agent tool that simplifies access to individual web pages protected by bot detection systems, such as Cloudflare and similar services.

This tool is designed to give AI agents access to publicly available data one page at a time.
It is not a hacking tool or a paywall bypass tool, and it is not guaranteed to work in all situations.

It is not intended for web scraping or crawling.

Some WAF and security service configurations may still block access, and the tool can still be detected through methods such as checks for the Chrome DevTools Protocol (CDP) in JavaScript.


It works by creating a headless browser instance and navigating to a page, which inherently makes it slower and unsuitable for high-volume or automated data extraction.

The main aim is to make it easier for AI agents to access web pages that might otherwise be challenging due to bot protection. 

It is built as a LangChain tool using Selenium and the [undetected-chromedriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver) project.
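
As a rough illustration of the fetch cycle described above (start a browser, navigate, read the page, shut down), the sketch below works against any Selenium-style driver object. It is not this package's actual implementation: `fetch_page_source` and `DriverLike` are hypothetical names, and the only driver members assumed are `get`, `page_source`, and `quit`, which Selenium's WebDriver interface provides.

```python
from typing import Protocol


class DriverLike(Protocol):
    """The small slice of the Selenium WebDriver interface used below."""

    page_source: str

    def get(self, url: str) -> None: ...
    def quit(self) -> None: ...


def fetch_page_source(driver: DriverLike, url: str) -> str:
    """Navigate to `url`, return the page HTML, and always close the browser."""
    try:
        driver.get(url)            # navigate to the page
        return driver.page_source  # HTML as the browser currently sees it
    finally:
        driver.quit()              # one browser per page: slow, but nothing is left running
```

With undetected-chromedriver installed, a real driver could be passed in, for example `fetch_page_source(uc.Chrome(headless=True), url)` after `import undetected_chromedriver as uc`. Starting and stopping a browser for every page is what makes this approach slow.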


- [Undetected Browser Tool](#undetected-browser-tool)
  - [Introduction](#introduction)
  - [Installation](#installation)
  - [Usage](#usage)
  - [Example with CrewAI](#example-with-crewai)
  - [Tips for being undetected](#tips-for-being-undetected)
  - [Ethics](#ethics)


### Installation

```
pip install -U undetected-browser-tool
```

### Usage
The tool can be used in [LangChain agents](https://python.langchain.com/) or in [CrewAI agents](https://docs.crewai.com/core-concepts/Agents/#what-is-an-agent).


```python
from undetected_browser_tool import UndetectedBrowserTool

# for a headless browser
browser = UndetectedBrowserTool(headless=True) 

# or useful for debugging
browser = UndetectedBrowserTool(headless=False) 


# or using a proxy, change http to https or socks5 to suit your proxy settings
opts = {"proxy-server" : "http://xxx.xxx.xxx:port"}
browser = UndetectedBrowserTool(headless=False, additional_opts=opts) 

# fetch a page
page = browser.run("https://nowsecure.nl/")
```
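
The proxy placeholder above has to be filled in by hand, so a small helper can catch a malformed value before a browser is started. This is a minimal sketch, assuming the `additional_opts` dictionary keeps the `proxy-server` key shown above. `proxy_opts` and `ALLOWED_SCHEMES` are hypothetical names and not part of this package.

```python
from urllib.parse import urlsplit

# The schemes mentioned in the usage comment above.
ALLOWED_SCHEMES = {"http", "https", "socks5"}


def proxy_opts(proxy_url: str) -> dict[str, str]:
    """Validate a proxy URL and wrap it in the `additional_opts` shape."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme: {parts.scheme!r}")
    # `parts.port` itself raises ValueError when the port is not numeric.
    if not parts.hostname or parts.port is None:
        raise ValueError("proxy URL must include a host and a port")
    # Rebuild the value from its parts; any credentials in the URL are dropped.
    return {"proxy-server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
```

The result can then be passed as `UndetectedBrowserTool(headless=True, additional_opts=proxy_opts("socks5://proxy.example.com:1080"))`, where the proxy address is a placeholder.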

### Example with CrewAI
This is untested code, written as an example of usage with CrewAI.
Be aware that LangChain and CrewAI are in constant development, so these interfaces may change.


```python
from crewai import Agent, Task, Crew
from undetected_browser_tool import UndetectedBrowserTool
from langchain_community.tools import DuckDuckGoSearchResults

browser = UndetectedBrowserTool(headless=True)
search_engine = DuckDuckGoSearchResults()

# Agent that searches for sources and reads the pages it finds
researcher = Agent(
    role="Researcher",
    goal="A document reviewing the top 5 solutions in {topic}",
    backstory="You are a research analyst tasked with reviewing software for an IT firm to help them make buying decisions",
    tools=[search_engine, browser],
)

# Agent that turns the research into a report; it needs no tools
report_writer = Agent(
    role="Report writer",
    goal="An executive review of the top 5 solutions in {topic}",
    backstory="You are a report writer for an IT firm, you excel at writing summaries and detailed reports for executives in an IT firm",
)

research_task = Task(
    description="Using a search engine and Capterra, G2, and Gartner, write a review of the top 5 solutions in {topic}. Include the source link for each item you find.",
    expected_output="A list of pros and cons, features, customer reviews and pricing of the top 5 solutions in {topic}",
    tools=[search_engine, browser],
    agent=researcher,
)

report_task = Task(
    description="Based upon the research provided, write an executive summary, detailed report, and a recommendation on the software selection. The report should have a table of features, pros / cons, and pricing options.",
    expected_output="An executive style report, with a summary, detailed information and a recommendation for selecting a solution in {topic}",
    output_file="report.md",
    agent=report_writer,
    context=[research_task],  # feed the research task's output into this task
)

crew = Crew(agents=[researcher, report_writer], tasks=[research_task, report_task])

crew.kickoff(inputs={"topic" : "BI Reporting Platforms"})

```


### Tips for being undetected
These tips are not 100% effective, and staying fully undetected is hard, but they are a good start.
* Don't go through a data center; those IPs are easy to track and block.
* Use proxies, preferably residential proxies. Tor networks are easily blocked.
* Switch proxies frequently.
* Don't crawl the website; use a search service to pinpoint the page you need.

Every time a "confirm you are a human" page pops up, about 50% of users just drop out, so companies that depend on traffic try to avoid showing it. As long as you're not doing something silly, you're usually going to be OK.
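
The last tip in the list above (fetch only the page you need instead of crawling) can also be enforced in code. The sketch below wraps any fetch callable with a per-host page limit and a minimum delay between requests. `PageBudget` is a hypothetical helper, not part of this package, and the default limits are arbitrary assumptions.

```python
import time
from collections import Counter
from typing import Callable, Dict
from urllib.parse import urlsplit


class PageBudget:
    """Wrap a fetch callable with a per-host page limit and a minimum delay."""

    def __init__(self, fetch: Callable[[str], str],
                 max_pages_per_host: int = 3,
                 min_delay_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._fetch = fetch
        self._max = max_pages_per_host
        self._min_delay = min_delay_s
        self._clock = clock    # injectable so the delay logic can be tested
        self._sleep = sleep
        self._counts: Counter = Counter()
        self._last_fetch: Dict[str, float] = {}

    def run(self, url: str) -> str:
        host = urlsplit(url).hostname or ""
        # Refuse to go past the budget rather than drift into crawling.
        if self._counts[host] >= self._max:
            raise RuntimeError(f"page budget for {host} is used up")
        # Space out repeated requests to the same host.
        last = self._last_fetch.get(host)
        if last is not None:
            wait = self._min_delay - (self._clock() - last)
            if wait > 0:
                self._sleep(wait)
        self._counts[host] += 1
        self._last_fetch[host] = self._clock()
        return self._fetch(url)
```

Assuming `browser.run` accepts a URL string as in the usage example, it could be wrapped as `guarded = PageBudget(browser.run)` and called with `guarded.run(url)`.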



### Ethics
The use of automation tools to interact with websites is a nuanced topic that involves balancing access to information with respect for publishers' rights. Publishers have the right to protect their content and manage how it is accessed. They often allow search engines to index their data so it can be discovered by the general public, while also setting guidelines to regulate how their information is used.

If your intent is to occasionally access content to obtain information and properly reference the source, this tool can help streamline your workflow by automating repetitive tasks, thus saving time and effort.

However, if your goal is to scrape large amounts of data from a website, this project is not intended for that purpose. Engaging in large-scale scraping can violate the terms of service of many websites, infringe on intellectual property rights, and potentially cause harm to the website's infrastructure. 

Always ensure that your use of automation tools is ethical and complies with the website's terms of service. Respect the rights of content owners and use automation responsibly.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/thevgergroup/undetected-browser-tool",
    "name": "undetected-browser-tool",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": "AI, Agents, langchain, undetected, selenium, chrome, browser, bot, detection, bypass",
    "author": "patrick o'leary",
    "author_email": "pjaol@pjaol.com",
    "download_url": "https://files.pythonhosted.org/packages/dd/49/6e5b13aa752a802c64b91fe90bce8eef9bb7e70f124d3326122e015b428c/undetected_browser_tool-1.0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A langchain tool implementation of Undetected with Selenium and Chrome for page fetching, making it easier to bypass bot detectors",
    "version": "1.0.2",
    "project_urls": {
        "Homepage": "https://github.com/thevgergroup/undetected-browser-tool",
        "Repository": "https://github.com/thevgergroup/undetected-browser-tool.git"
    },
    "split_keywords": [
        "ai",
        " agents",
        " langchain",
        " undetected",
        " selenium",
        " chrome",
        " browser",
        " bot",
        " detection",
        " bypass"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8e3a96cc1efca20eed64b133a1dfa9fd76fec88db19c7ced8f01020e44b7c2d7",
                "md5": "c7db65c3a6a56151a8b6de65c0633c2c",
                "sha256": "6df2e629cdb9c9ca357721243235d0b34aa9a2b9c4456fffcdec540f2e1e3023"
            },
            "downloads": -1,
            "filename": "undetected_browser_tool-1.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c7db65c3a6a56151a8b6de65c0633c2c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 7164,
            "upload_time": "2024-08-27T21:46:30",
            "upload_time_iso_8601": "2024-08-27T21:46:30.006414Z",
            "url": "https://files.pythonhosted.org/packages/8e/3a/96cc1efca20eed64b133a1dfa9fd76fec88db19c7ced8f01020e44b7c2d7/undetected_browser_tool-1.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "dd496e5b13aa752a802c64b91fe90bce8eef9bb7e70f124d3326122e015b428c",
                "md5": "d940cb55e2f7693515cbeeefe2099d70",
                "sha256": "7447f9d560babfd1079832e53f630cbbf03a9e9a7db2b45965bd26705ef00298"
            },
            "downloads": -1,
            "filename": "undetected_browser_tool-1.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "d940cb55e2f7693515cbeeefe2099d70",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 6540,
            "upload_time": "2024-08-27T21:46:31",
            "upload_time_iso_8601": "2024-08-27T21:46:31.728737Z",
            "url": "https://files.pythonhosted.org/packages/dd/49/6e5b13aa752a802c64b91fe90bce8eef9bb7e70f124d3326122e015b428c/undetected_browser_tool-1.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-27 21:46:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "thevgergroup",
    "github_project": "undetected-browser-tool",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "undetected-browser-tool"
}
        