# ScrapeGen
<img src="https://github.com/user-attachments/assets/2f458a05-66f9-47a4-bc40-6069e3c9e849" alt="Logo" width="80" height="80">
ScrapeGen 🚀 is a powerful Python library that combines web scraping with AI-driven data extraction to collect and structure information from websites efficiently. It leverages Google's Gemini models for intelligent data processing and provides a flexible, configurable framework for web scraping operations.
## ✨ Features
- **🤖 AI-Powered Data Extraction**: Utilizes Google's Gemini models for intelligent parsing.
- **⚙️ Configurable Web Scraping**: Supports depth control and flexible extraction rules.
- **📊 Structured Data Modeling**: Uses Pydantic for well-defined data structures.
- **🛡️ Robust Error Handling**: Implements retry mechanisms and detailed error reporting.
- **🔧 Customizable Scraping Configurations**: Adjust settings dynamically based on needs.
- **🌐 Comprehensive URL Handling**: Supports both relative and absolute URLs.
- **📦 Modular Architecture**: Ensures clear separation of concerns for maintainability.
## 📥 Installation
```bash
pip install scrapegen
```
## 📌 Requirements
- Python 3.7+
- Google API Key (for Gemini models)
- Required Python packages:
  - requests
  - beautifulsoup4
  - langchain
  - langchain-google-genai
  - pydantic
  - lxml
## 🚀 Quick Start
```python
from scrapegen import ScrapeGen, CompanyInfo, CompaniesInfo

# Initialize ScrapeGen with your Google API key
scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-pro")

# Define the target URL
url = "https://example.com"

# Scrape and extract company information
companies_data = scraper.scrape(url, CompaniesInfo)

# Display extracted data
for company in companies_data.companies:
    print(f"🏢 Company Name: {company.company_name}")
    print(f"📄 Description: {company.company_description}")
```
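`CompanyInfo` and `CompaniesInfo` ship with the library, so their definitions are not shown here. Judging from the fields accessed in the loop above, they are shaped roughly as follows (an illustrative sketch, not the library's actual source):
```python
from typing import List, Optional

from pydantic import BaseModel

# Illustrative only: shapes inferred from the attributes used in the
# Quick Start. The real models live inside scrapegen and may carry
# additional fields.
class CompanyInfo(BaseModel):
    company_name: str
    company_description: Optional[str] = None

class CompaniesInfo(BaseModel):
    companies: List[CompanyInfo]
```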
## ⚙️ Configuration
### 🔹 ScrapeConfig Options
```python
from scrapegen import ScrapeConfig
config = ScrapeConfig(
    max_pages=20,     # Max pages to scrape per depth level
    max_subpages=2,   # Max subpages to scrape per page
    max_depth=1,      # Max depth to follow links
    timeout=30,       # Request timeout in seconds
    retries=3,        # Number of retry attempts
    user_agent="Mozilla/5.0 (compatible; ScrapeGen/1.0)",
    headers=None      # Additional HTTP headers
)
```
### 🔄 Updating Configuration
```python
scraper = ScrapeGen(api_key="your-api-key", model="gemini-1.5-pro", config=config)

# Dynamically update configuration
scraper.update_config(max_pages=30, timeout=45)
```
## 📌 Custom Data Models
Define Pydantic models to structure extracted data:
```python
from typing import List, Optional

from pydantic import BaseModel

class CustomDataModel(BaseModel):
    title: str
    description: Optional[str] = None
    date: str
    tags: List[str]

class CustomDataCollection(BaseModel):
    items: List[CustomDataModel]

# Scrape using the custom model
data = scraper.scrape(url, CustomDataCollection)
```
## 🤖 Supported Gemini Models
- gemini-1.5-flash-8b
- gemini-1.5-pro
- gemini-2.0-flash-exp
- gemini-1.5-flash
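Any of these names can be passed as the `model` argument when constructing the scraper. For example, to trade some extraction quality for speed and cost (model names as listed above):
```python
from scrapegen import ScrapeGen

# Same API as the Quick Start; only the model name changes
fast_scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-flash")
```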
## ⚠️ Error Handling
ScrapeGen provides specific exception classes for detailed error handling:
- **❗ ScrapeGenError**: Base exception class.
- **⚙️ ConfigurationError**: Errors related to scraper configuration.
- **🕷️ ScrapingError**: Issues encountered during web scraping.
- **🔍 ExtractionError**: Problems with AI-driven data extraction.
Example usage:
```python
from scrapegen import ConfigurationError, ScrapingError, ExtractionError

try:
    data = scraper.scrape(url, CustomDataCollection)
except ConfigurationError as e:
    print(f"⚙️ Configuration error: {e}")
except ScrapingError as e:
    print(f"🕷️ Scraping error: {e}")
except ExtractionError as e:
    print(f"🔍 Extraction error: {e}")
```
## 🏗️ Architecture
ScrapeGen follows a modular design for scalability and maintainability (a sketch of how a call flows through these components follows the list):
1. **🕷️ WebsiteScraper**: Handles core web scraping logic.
2. **📑 InfoExtractorAi**: Performs AI-driven content extraction.
3. **🤖 LlmManager**: Manages interactions with language models.
4. **🔗 UrlParser**: Parses and normalizes URLs.
5. **📥 ContentExtractor**: Extracts structured data from HTML elements.
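For orientation, a single `scrape()` call can be pictured flowing through these components as follows. This is an illustrative sketch only; the internal hand-offs are not part of the public API and may differ from the real implementation:
```python
# Illustrative data flow (not real scrapegen internals):
#
#   scraper.scrape(url, CompaniesInfo)
#     -> UrlParser          normalizes the URL and resolves relative links
#     -> WebsiteScraper     fetches pages within max_pages / max_depth limits
#     -> ContentExtractor   pulls candidate text and attributes from the HTML
#     -> LlmManager         sends the content to the configured Gemini model
#     -> InfoExtractorAi    maps the model output onto the Pydantic schema
```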
## ✅ Best Practices
### 1️⃣ Rate Limiting
- ⏳ Use delays between requests (see the sketch after this list).
- 📜 Respect robots.txt guidelines.
- ⚖️ Configure max_pages and max_depth responsibly.
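A minimal sketch combining all three points, assuming only the scraper API shown earlier: consult robots.txt via the standard library, keep the page limits modest, and sleep between successive scrapes.
```python
import time
from urllib import robotparser

from scrapegen import ScrapeGen, CompaniesInfo  # models as in the Quick Start

scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-pro")
scraper.update_config(max_pages=5, max_depth=1)   # conservative limits

# Check robots.txt once per site before scraping
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

for url in ["https://example.com/a", "https://example.com/b"]:
    if not rp.can_fetch("ScrapeGen/1.0", url):
        continue                                  # disallowed by robots.txt
    data = scraper.scrape(url, CompaniesInfo)
    time.sleep(2)                                 # fixed delay between requests
```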
### 2️⃣ Error Handling
- 🔄 Wrap scraping operations in try-except blocks.
- 📋 Implement proper logging for debugging (a sketch follows this list).
- 🔁 Handle network timeouts and retries effectively.
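For example, a bounded retry loop with logging around a single scrape (a sketch; `ScrapingError` is the exception class documented above):
```python
import logging
import time

from scrapegen import ScrapeGen, CompaniesInfo, ScrapingError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scrape-demo")

scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-pro")

data = None
for attempt in range(1, 4):                 # at most three attempts
    try:
        data = scraper.scrape("https://example.com", CompaniesInfo)
        break
    except ScrapingError as e:
        log.warning("attempt %d failed: %s", attempt, e)
        time.sleep(2 ** attempt)            # exponential backoff: 2s, 4s, 8s
else:
    log.error("giving up after three failed attempts")
```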
### 3️⃣ Resource Management
- 🖥️ Monitor memory usage for large-scale operations (see the sketch below).
- 📚 Implement pagination for large datasets.
- ⏱️ Adjust timeout settings based on expected response times.
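One lightweight way to watch memory during a large run, using only the standard library (a sketch; `tracemalloc` tracks Python allocations only, so treat the numbers as indicative):
```python
import tracemalloc

from scrapegen import ScrapeGen, CompaniesInfo

scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-pro")
scraper.update_config(timeout=60)     # allow more time for slow sites

tracemalloc.start()
data = scraper.scrape("https://example.com", CompaniesInfo)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```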
## 🤝 Contributing
Contributions are welcome! 🎉 Feel free to submit a Pull Request to improve ScrapeGen.