# Scrape-AI
`Scrape-AI` is a Python library that intelligently scrapes data from websites by combining LLMs (Large Language Models) with Selenium for dynamic web interactions. You configure it to use a specific LLM provider (such as OpenAI, Anthropic, or Azure OpenAI) and fetch data from websites in real time based on a user query. At its core, it enables agent-like scraping driven by natural language queries.
## Key Features
- **LLM Integration**: Supports multiple LLM models (OpenAI, Anthropic, Azure, Google, etc.) through a flexible factory pattern.
- **Dynamic Web Scraping**: Utilizes Selenium WebDriver to interact with dynamic content on websites.
- **Agent-Like Functionality**: Acts as an intelligent agent that can process user queries and fetch relevant data from the web.
- **Configurable**: Customizable LLM model settings, verbosity, headless browsing, and target URLs.
- **Modular Design**: Structured in a modular way to extend scraping strategies and LLM integrations.
## Installation
1. Clone the repository, or install via pip (if available on PyPI):
```bash
pip install scrapeAI
```
2. **Selenium WebDriver Dependencies**:
Selenium requires a browser driver to interact with the chosen browser. For example, if you're using Chrome, you need to install **ChromeDriver**. Make sure the driver is on your system's PATH (e.g., `/usr/bin` or `/usr/local/bin` on Linux/macOS). If this step is skipped, you'll see an error like the following:
```bash
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH.
```
Here are links to some of the popular browser drivers:
- **Chrome**: [Download ChromeDriver](https://chromedriver.chromium.org/downloads)
- **Edge**: [Download EdgeDriver](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)
- **Firefox**: [Download GeckoDriver](https://github.com/mozilla/geckodriver/releases)
- **Safari**: [WebDriver support in Safari](https://webkit.org/blog/6900/webdriver-support-in-safari-10/)
3. Set up your preferred LLM API keys: Ensure you have API keys ready for the LLM model you intend to use (e.g., OpenAI, Azure, Google, Anthropic).
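Before running the scraper, you can sanity-check that the browser driver and an API key are discoverable. The snippet below is a minimal sketch using only the standard library; the driver name `chromedriver` and the environment variable `AZURE_OPENAI_API_KEY` are illustrative assumptions, not names required by `scrapeAI`.

```python
import os
import shutil

def driver_on_path(driver_name: str = "chromedriver") -> bool:
    """Return True if the given browser driver executable is found on PATH."""
    return shutil.which(driver_name) is not None

# Illustrative only: scrapeAI does not mandate this variable name.
api_key = os.environ.get("AZURE_OPENAI_API_KEY")

if not driver_on_path():
    print("Warning: 'chromedriver' not found on PATH; Selenium will fail to start.")
if api_key is None:
    print("Warning: no API key found in AZURE_OPENAI_API_KEY.")
```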
## Usage
Here is a basic example of using `scrapeAI` to scrape data from a website based on a user query:
```python
from scrapeAI import WebScraper
config = {
    "llm": {
        "api_key": "<Azure OpenAI API key>",
        "model": "<Azure OpenAI deployment name>",
        "api_version": "<Azure OpenAI API version>",
        "endpoint": "<Azure OpenAI endpoint>"
    },
    "verbose": False,
    "headless": False,
    "url": "https://pypi.org/search/?q=genai",
    "prompt": "Provide all the libraries and their installation commands"
}
scraper = WebScraper(config)
# Invoke the scraping process
result = scraper.invoke()
# Output the result
print(result)
```
The output will be a JSON-like list of records, for example:
```python
[
    {
        'library': 'genai',
        'installation_command': 'pip install genai'
    },
    {
        'library': 'bookworm_genai',
        'installation_command': 'pip install bookworm_genai'
    },
    {
        'library': 'ada-genai',
        'installation_command': 'pip install ada-genai'
    },
    ...
    {
        'library': 'semantix-genai-serve',
        'installation_command': 'pip install semantix-genai-serve'
    },
    {
        'library': 'platform-gen-ai',
        'installation_command': 'pip install platform-gen-ai'
    }
]
```
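Since the result is a list of records, it can be post-processed with ordinary Python. The sketch below assumes the field names `library` and `installation_command` shown in the example output above:

```python
def to_requirements(records):
    """Extract bare package names from scrape records like the output above."""
    return [r["library"] for r in records]

sample = [
    {"library": "genai", "installation_command": "pip install genai"},
    {"library": "ada-genai", "installation_command": "pip install ada-genai"},
]
print(to_requirements(sample))  # ['genai', 'ada-genai']
```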
### Configuration Options
- **llm**: Defines the configuration for the LLM model, including the API key, model version, and endpoint.
- **verbose**: If set to `True`, enables detailed logging of operations.
- **headless**: If set to `True`, runs the web scraping in headless mode (without opening a browser window).
- **url**: The target URL for scraping.
- **prompt**: The natural language query to ask the LLM and fetch relevant content from the page.
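The options above can be checked up front before handing the dictionary to `WebScraper`. The validator below is a hypothetical helper written for this README, not part of the `scrapeAI` API:

```python
REQUIRED_TOP = {"llm", "url", "prompt"}
REQUIRED_LLM = {"api_key", "model", "api_version", "endpoint"}

def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the config looks usable."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_TOP) if k not in config]
    llm = config.get("llm", {})
    problems += [f"missing llm key: {k}" for k in sorted(REQUIRED_LLM) if k not in llm]
    if not isinstance(config.get("verbose", False), bool):
        problems.append("verbose must be a bool")
    if not isinstance(config.get("headless", False), bool):
        problems.append("headless must be a bool")
    return problems
```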
## Project Structure

The project is organized as follows:

```text
├── README.md
├── scrapeAI/
│   ├── __init__.py
│   ├── core/
│   │   ├── __init__.py
│   │   ├── base_scraper.py
│   │   ├── direct_scraper.py
│   │   ├── scraper_factory.py
│   │   └── search_scraper.py
│   ├── llms/
│   │   ├── __init__.py
│   │   ├── anthropic_llm.py
│   │   ├── azure_openai_llm.py
│   │   ├── base.py
│   │   ├── google_llm.py
│   │   ├── llm_factory.py
│   │   └── openai_llm.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── html_utils.py
│   │   └── logging.py
│   └── web_scraper.py
├── setup.py
├── tests/
│   └── tests_operations.py
```
### Core Components
- **core/**: Contains the base scraper classes and factory design patterns for scraping strategies.
- **llms/**: Includes different LLM integration classes such as OpenAI, Anthropic, and Google.
- **utils/**: Utility functions for HTML parsing and logging.
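The README does not show the factory internals, so the following is only a schematic sketch of how a module like `llm_factory` could dispatch on a provider name; the class and function names here are illustrative, not the library's actual API.

```python
class BaseLLM:
    """Minimal base class every provider wrapper shares in this sketch."""
    def __init__(self, api_key: str):
        self.api_key = api_key

class OpenAILLM(BaseLLM): ...
class AnthropicLLM(BaseLLM): ...
class AzureOpenAILLM(BaseLLM): ...

_PROVIDERS = {
    "openai": OpenAILLM,
    "anthropic": AnthropicLLM,
    "azure": AzureOpenAILLM,
}

def create_llm(provider: str, api_key: str) -> BaseLLM:
    """Look up the provider name and instantiate the matching LLM wrapper."""
    try:
        return _PROVIDERS[provider](api_key)
    except KeyError:
        raise ValueError(f"Unsupported provider: {provider!r}") from None
```

A registry-plus-lookup factory like this keeps adding a new provider down to one class and one dictionary entry.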
## Contributing
We welcome contributions! If you'd like to improve the project, feel free to fork the repository and submit a pull request. Please follow the existing code structure and ensure that your contribution includes proper tests.
## License
This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.
<div style="text-align: center;">
<a href="https://github.com/deBUGger404" target="_blank">
<img src="https://raw.githubusercontent.com/deBUGger404/Python-Course-From-Beginner-to-Expert/main/Data/happy_code.webp" alt="Happy Code" style="width:200px; border-radius:12px;">
</a>
</div>