scrapegen 0.1.0

Summary: AI-driven web scraping framework
Author: Affan Shaikhsurab <affanshaikhsurab@gmail.com>
Homepage: https://github.com/affanshaikhsurab/scrapegen
License: MIT
Requires Python: >=3.7
Keywords: ai, bing, search, scraper, web scraping, automation
Requirements: requests>=2.26.0, beautifulsoup4>=4.9.3, langchain-google-genai>=0.0.3, pydantic>=2.0.0, lxml>=4.9.0
Uploaded: 2025-02-02 17:45:02
# ScrapeGen

<img src="https://github.com/user-attachments/assets/2f458a05-66f9-47a4-bc40-6069e3c9e849" alt="Logo" width="80" height="80">

ScrapeGen 🚀 is a powerful Python library that combines web scraping with AI-driven data extraction to collect and structure information from websites efficiently. It leverages Google's Gemini models for intelligent data processing and provides a flexible, configurable framework for web scraping operations.

## ✨ Features

- **🤖 AI-Powered Data Extraction**: Utilizes Google's Gemini models for intelligent parsing.
- **⚙️ Configurable Web Scraping**: Supports depth control and flexible extraction rules.
- **📊 Structured Data Modeling**: Uses Pydantic for well-defined data structures.
- **🛡️ Robust Error Handling**: Implements retry mechanisms and detailed error reporting.
- **🔧 Customizable Scraping Configurations**: Adjust settings dynamically based on needs.
- **🌐 Comprehensive URL Handling**: Supports both relative and absolute URLs.
- **📦 Modular Architecture**: Ensures clear separation of concerns for maintainability.

## 📥 Installation

```bash
pip install scrapegen
```

## 📌 Requirements

- Python 3.7+
- Google API key (for Gemini models; see the snippet after this list for one way to supply it)
- Required Python packages (installed automatically with the package):
  - requests
  - beautifulsoup4
  - langchain-google-genai
  - pydantic
  - lxml
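
The API key is best kept out of source code. Below is a minimal sketch assuming the key lives in a `GOOGLE_API_KEY` environment variable; the variable name is an illustrative choice, not something ScrapeGen itself reads:

```python
import os

from scrapegen import ScrapeGen

# Assumption: the key was exported beforehand, e.g.
#   export GOOGLE_API_KEY="your-google-api-key"
api_key = os.environ["GOOGLE_API_KEY"]

scraper = ScrapeGen(api_key=api_key, model="gemini-1.5-pro")
```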

## 🚀 Quick Start

```python
from scrapegen import ScrapeGen, CompanyInfo, CompaniesInfo

# Initialize ScrapeGen with your Google API key
scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-pro")

# Define the target URL
url = "https://example.com"

# Scrape and extract company information
companies_data = scraper.scrape(url, CompaniesInfo)

# Display extracted data
for company in companies_data.companies:
    print(f"🏢 Company Name: {company.company_name}")
    print(f"📄 Description: {company.company_description}")
```
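
`CompanyInfo` and `CompaniesInfo` are imported from the library, so their exact definitions live in ScrapeGen itself. Judging from the loop above, they plausibly resemble the Pydantic sketch below; only `companies`, `company_name`, and `company_description` are confirmed by the Quick Start, the rest is assumption:

```python
from typing import List, Optional

from pydantic import BaseModel

# Hypothetical reconstruction of the bundled models. Only the field
# names used in the Quick Start loop are confirmed; types are guesses.
class CompanyInfo(BaseModel):
    company_name: str
    company_description: Optional[str] = None

class CompaniesInfo(BaseModel):
    companies: List[CompanyInfo]
```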

## ⚙️ Configuration

### 🔹 ScrapeConfig Options

```python
from scrapegen import ScrapeConfig

config = ScrapeConfig(
    max_pages=20,      # Max pages to scrape per depth level
    max_subpages=2,    # Max subpages to scrape per page
    max_depth=1,       # Max depth to follow links
    timeout=30,        # Request timeout in seconds
    retries=3,         # Number of retry attempts
    user_agent="Mozilla/5.0 (compatible; ScrapeGen/1.0)",
    headers=None       # Additional HTTP headers
)
```

### 🔄 Updating Configuration

```python
scraper = ScrapeGen(api_key="your-api-key", model="gemini-1.5-pro", config=config)

# Dynamically update configuration
scraper.update_config(max_pages=30, timeout=45)
```

## 📌 Custom Data Models

Define Pydantic models to structure extracted data:

```python
from pydantic import BaseModel
from typing import Optional, List

class CustomDataModel(BaseModel):
    title: str
    description: Optional[str] = None
    date: str
    tags: List[str]

class CustomDataCollection(BaseModel):
    items: List[CustomDataModel]

# Scrape using the custom model
data = scraper.scrape(url, CustomDataCollection)
```
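
Assuming the scrape succeeds, `data` is an instance of `CustomDataCollection`, so the extracted items can be iterated like any Pydantic model:

```python
for item in data.items:
    print(f"{item.title} ({item.date}): {', '.join(item.tags)}")
```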

## 🤖 Supported Gemini Models

- gemini-1.5-flash-8b
- gemini-1.5-pro
- gemini-2.0-flash-exp
- gemini-1.5-flash
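
The model is chosen via the `model` argument shown in the Quick Start; for example, one of the lighter models above could be swapped in for high-volume runs:

```python
# Any name from the list above should be accepted as `model`.
fast_scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-flash")
```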

## ⚠️ Error Handling

ScrapeGen provides specific exception classes for detailed error handling:

- **❗ ScrapeGenError**: Base exception class.
- **⚙️ ConfigurationError**: Errors related to scraper configuration.
- **🕷️ ScrapingError**: Issues encountered during web scraping.
- **🔍 ExtractionError**: Problems with AI-driven data extraction.

Example usage:

```python
# Assumption: the exception classes are importable from the top-level package.
from scrapegen import (
    ConfigurationError,
    ExtractionError,
    ScrapeGenError,
    ScrapingError,
)

try:
    data = scraper.scrape(url, CustomDataCollection)
except ConfigurationError as e:
    print(f"⚙️ Configuration error: {e}")
except ScrapingError as e:
    print(f"🕷️ Scraping error: {e}")
except ExtractionError as e:
    print(f"🔍 Extraction error: {e}")
except ScrapeGenError as e:
    print(f"❗ Unexpected ScrapeGen error: {e}")  # base class catches anything else
```

## 🏗️ Architecture

ScrapeGen follows a modular design for scalability and maintainability (a toy sketch of how the pieces might fit together follows the list):

1. **🕷️ WebsiteScraper**: Handles core web scraping logic.
2. **📑 InfoExtractorAi**: Performs AI-driven content extraction.
3. **🤖 LlmManager**: Manages interactions with language models.
4. **🔗 UrlParser**: Parses and normalizes URLs.
5. **📥 ContentExtractor**: Extracts structured data from HTML elements.
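
None of these classes' signatures are documented on this page, so the following is purely a toy re-imagining of the five components, built from the dependencies the package declares (`requests`, `beautifulsoup4`, `lxml`). It illustrates the separation of concerns, not ScrapeGen's actual implementation:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Toy stand-ins for the five components; the class names mirror the list
# above, but every method here is a hypothetical illustration.

class UrlParser:
    def normalize(self, base: str, href: str) -> str:
        # Resolves relative links against the page URL; absolute URLs pass through.
        return urljoin(base, href)

class WebsiteScraper:
    def fetch(self, url: str, timeout: int = 30) -> str:
        # Fetches raw HTML for one page.
        return requests.get(url, timeout=timeout).text

class ContentExtractor:
    def extract(self, html: str) -> str:
        # Strips markup down to visible text.
        return BeautifulSoup(html, "lxml").get_text(" ", strip=True)

class LlmManager:
    def complete(self, prompt: str) -> str:
        # Would route the prompt to the configured Gemini model.
        raise NotImplementedError

class InfoExtractorAi:
    def __init__(self, llm: LlmManager) -> None:
        self.llm = llm

    def extract(self, text: str, schema_name: str) -> str:
        # Asks the model to map free text onto the target schema.
        return self.llm.complete(f"Extract {schema_name} fields from:\n{text}")
```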

## ✅ Best Practices

### 1️⃣ Rate Limiting

- ⏳ Use delays between requests (a minimal sketch follows this list).
- 📜 Respect robots.txt guidelines.
- ⚖️ Configure max_pages and max_depth responsibly.
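
`ScrapeConfig` exposes no documented delay option, so one simple approach (a sketch, with illustrative URLs and an arbitrary 2-second pause) is to sleep between successive scrape calls:

```python
import time

urls = ["https://example.com/a", "https://example.com/b"]  # illustrative targets

for target in urls:
    data = scraper.scrape(target, CustomDataCollection)
    time.sleep(2.0)  # polite pause between requests; tune to the site's limits
```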

### 2️⃣ Error Handling

- 🔄 Wrap scraping operations in try-except blocks.
- 📋 Implement proper logging for debugging (see the sketch after this list).
- 🔁 Handle network timeouts and retries effectively.
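
A minimal logging setup might look like the sketch below, combining the documented exception classes with Python's standard `logging` module; the logger name and level are arbitrary choices:

```python
import logging

from scrapegen import ScrapingError  # documented in the Error Handling section

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scrapegen.demo")

try:
    data = scraper.scrape(url, CustomDataCollection)
except ScrapingError:
    # logger.exception records the full traceback for debugging
    logger.exception("scrape failed for %s", url)
```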

### 3️⃣ Resource Management

- 🖥️ Monitor memory usage for large-scale operations.
- 📚 Implement pagination or batching for large datasets (see the sketch after this list).
- ⏱️ Adjust timeout settings based on expected response times.
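
For large runs, results can be flushed to disk as they arrive instead of accumulating in memory. A sketch, assuming the returned object is a Pydantic v2 model (the declared dependency is `pydantic>=2.0.0`) and using an illustrative JSONL output path:

```python
def scrape_in_batches(scraper, urls, schema, batch_size=10, out_path="results.jsonl"):
    """Scrape URLs in batches, appending each result to a JSONL file."""
    with open(out_path, "a", encoding="utf-8") as out:
        for start in range(0, len(urls), batch_size):
            for target in urls[start:start + batch_size]:
                data = scraper.scrape(target, schema)
                out.write(data.model_dump_json() + "\n")  # Pydantic v2 serialization
            out.flush()  # checkpoint after each batch
```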

## 🤝 Contributing

Contributions are welcome! 🎉 Feel free to submit a Pull Request to improve ScrapeGen.

            
