# scrapeAI 0.3.0

- **Summary**: A Python library to scrape web data using LLMs and Selenium
- **Homepage**: https://github.com/deBUGger404/Scrape-AI
- **Author**: Rakesh Kumar
- **Requires Python**: >=3.8
- **Uploaded**: 2024-09-08 12:49:13
            # Scrape-AI

`Scrape-AI` is a Python library for intelligently scraping data from websites, combining LLMs (Large Language Models) with Selenium for dynamic web interactions. You configure it with a specific LLM (such as OpenAI, Anthropic, or Azure OpenAI), and it fetches data from websites in real time based on a user query. At its core, it provides agent-like scraping driven by natural language queries.

## Key Features

- **LLM Integration**: Supports multiple LLM models (OpenAI, Anthropic, Azure, Google, etc.) through a flexible factory pattern.
- **Dynamic Web Scraping**: Utilizes Selenium WebDriver to interact with dynamic content on websites.
- **Agent-Like Functionality**: Acts as an intelligent agent that can process user queries and fetch relevant data from the web.
- **Configurable**: Customizable LLM model settings, verbosity, headless browsing, and target URLs.
- **Modular Design**: Structured in a modular way to extend scraping strategies and LLM integrations.

## Installation

1. Install via pip from PyPI (or clone the repository):

```bash
pip install scrapeAI
```

2. **Selenium WebDriver Dependencies**:
   Selenium requires a browser driver to interact with the chosen browser. For example, if you're using Chrome, you need to install **ChromeDriver**. Make sure the driver is placed in your system's PATH (`/usr/bin` or `/usr/local/bin`). If this step is skipped, you'll encounter the following error:

```bash
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH.
```
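To fail fast with a clearer message, you can check for the driver yourself before starting Selenium. This is an optional sketch using only the standard library; the driver name is whatever binary your browser needs (e.g. `chromedriver`):

```python
import shutil

def driver_on_path(name="chromedriver"):
    """Return the driver's resolved path, or None if Selenium won't find it."""
    return shutil.which(name)

if driver_on_path() is None:
    print("chromedriver not found on PATH -- download it and move it to /usr/local/bin")
```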

Here are links to some of the popular browser drivers:

- **Chrome**: [Download ChromeDriver](https://chromedriver.chromium.org/downloads)
- **Edge**: [Download EdgeDriver](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)
- **Firefox**: [Download GeckoDriver](https://github.com/mozilla/geckodriver/releases)
- **Safari**: [WebDriver support in Safari](https://webkit.org/blog/6900/webdriver-support-in-safari-10/)

3. **LLM API keys**: Make sure you have an API key for the LLM provider you intend to use (e.g., OpenAI, Azure, Google, Anthropic).
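Rather than hard-coding keys into your script, a common pattern is to read them from environment variables. The variable names below are illustrative, not something scrapeAI requires:

```python
import os

# Illustrative variable names -- use whatever convention your team follows.
llm_config = {
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY", ""),
    "model": os.environ.get("AZURE_OPENAI_DEPLOYMENT", ""),
    "api_version": os.environ.get("AZURE_OPENAI_API_VERSION", ""),
    "endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT", ""),
}
```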

## Usage

Here is a basic usage example of how to use `scrapeAI` to scrape data from a website based on a user query:

```python
from scrapeAI import WebScraper

config = {
    "llm": {
        "api_key": '<Azure OpenAI API KEY>',
        "model": "<Azure OpenAI Deplyement Name>",
        "api_version": "<Azure Open AI API Version>",
        "endpoint": '<Azure OpenAI Endpoint Name>'
    },
    "verbose": False,
    "headless": False,
    "url" : "https://pypi.org/search/?q=genai",
    "prompt" : "Provide all the libraries and their installation commands"
}

scraper = WebScraper(config)

# Invoke the scraping process
result = scraper.invoke()

# Output the result
print(result)
```
The output is a list of dictionaries, for example:
```python
[
  {
    'library': 'genai',
    'installation_command': 'pip install genai'
  },
  {
    'library': 'bookworm_genai',
    'installation_command': 'pip install bookworm_genai'
  },
  {
    'library': 'ada-genai',
    'installation_command': 'pip install ada-genai'
  },
  ...
  {
    'library': 'semantix-genai-serve',
    'installation_command': 'pip install semantix-genai-serve'
  },
  {
    'library': 'platform-gen-ai',
    'installation_command': 'pip install platform-gen-ai'
  }
]
```
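Since the result is an ordinary Python list of dictionaries, it can be post-processed directly. A small illustration (the sample rows below are hard-coded stand-ins for a real `scraper.invoke()` result):

```python
sample_result = [
    {"library": "genai", "installation_command": "pip install genai"},
    {"library": "ada-genai", "installation_command": "pip install ada-genai"},
]

# Collect just the install commands, e.g. to write them into a setup script.
commands = [row["installation_command"] for row in sample_result]
for cmd in commands:
    print(cmd)
```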

### Configuration Options

- **llm**: Defines the configuration for the LLM model, including the API key, model version, and endpoint.
- **verbose**: If set to `True`, enables detailed logging of operations.
- **headless**: If set to `True`, runs the web scraping in headless mode (without opening a browser window).
- **url**: The target URL for scraping.
- **prompt**: The natural language query to ask the LLM and fetch relevant content from the page.
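For instance, a run with detailed logs and no visible browser window flips two flags from the earlier example (placeholders still need real values; this assumes the same `WebScraper` interface shown above):

```python
config = {
    "llm": {
        "api_key": "<API Key>",
        "model": "<Deployment Name>",
        "api_version": "<API Version>",
        "endpoint": "<Endpoint>",
    },
    "verbose": True,    # log each step for debugging
    "headless": True,   # browser runs without opening a window
    "url": "https://pypi.org/search/?q=genai",
    "prompt": "Provide all the libraries and their installation commands",
}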

## Project Structure

The project is organized as follows:

```text
├── README.md
├── scrapeAI/
│   ├── __init__.py
│   ├── core/
│   │   ├── __init__.py
│   │   ├── base_scraper.py
│   │   ├── direct_scraper.py
│   │   ├── scraper_factory.py
│   │   └── search_scraper.py
│   ├── llms/
│   │   ├── __init__.py
│   │   ├── anthropic_llm.py
│   │   ├── azure_openai_llm.py
│   │   ├── base.py
│   │   ├── google_llm.py
│   │   ├── llm_factory.py
│   │   └── openai_llm.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── html_utils.py
│   │   └── logging.py
│   └── web_scraper.py
├── setup.py
├── tests/
│   └── tests_operations.py
```

### Core Components

- **core/**: Contains the base scraper classes and factory design patterns for scraping strategies.
- **llms/**: Includes different LLM integration classes such as OpenAI, Anthropic, and Google.
- **utils/**: Utility functions for HTML parsing and logging.
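As a rough illustration of the factory pattern this layout suggests (class and function names here are hypothetical, not scrapeAI's actual API), a provider string selects which LLM class to instantiate:

```python
class OpenAILLM:
    provider = "openai"

class AnthropicLLM:
    provider = "anthropic"

# Map a provider key to its client class; new providers register here.
_LLM_REGISTRY = {"openai": OpenAILLM, "anthropic": AnthropicLLM}

def create_llm(provider):
    """Instantiate the LLM client registered for `provider`."""
    try:
        return _LLM_REGISTRY[provider]()
    except KeyError:
        raise ValueError(f"Unsupported LLM provider: {provider!r}")
```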

## Contributing

We welcome contributions! If you'd like to improve the project, feel free to fork the repository and submit a pull request. Please follow the existing code structure and ensure that your contribution includes proper tests.

## License

This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.



<div style="text-align: center;">
  <a href="https://github.com/deBUGger404" target="_blank">
    <img src="https://raw.githubusercontent.com/deBUGger404/Python-Course-From-Beginner-to-Expert/main/Data/happy_code.webp" alt="Happy Code" style="width:200px; border-radius:12px;">
  </a>
</div>

            
