<h1 align="center"> Universal Scraper</h1>
<h2 align="center"> The Python package for scraping data from any website</h2>
<p align="center">
<a href="https://pypi.org/project/universal-scraper/"><img alt="pypi" src="https://img.shields.io/pypi/v/universal-scraper.svg"></a>
<a href="https://pepy.tech/project/universal-scraper?versions=1*&versions=2*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/universal-scraper"></a>
<a href="https://pepy.tech/project/universal-scraper?versions=1*&versions=2*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/universal-scraper/month"></a>
<a href="https://github.com/WitesoAI/universal-scraper/commits/main"><img alt="GitHub latest commit" src="https://img.shields.io/github/last-commit/WitesoAI/universal-scraper?color=blue&style=flat-square"></a>
<a href="#"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/universal-scraper?style=flat-square"></a>
</p>
## Table of Contents
- [🔄 How Universal Scraper Works](#-how-universal-scraper-works)
- [💻 Live Working Example](#-live-working-example)
- [How It Works](#how-it-works)
- [Features](#features)
- [🧹 Smart HTML Cleaner](#-smart-html-cleaner)
- [Installation (Recommended)](#installation-recommended)
- [Installation (from Source)](#installation-from-source)
- [Quick Start](#quick-start)
- [1. Set up your API key](#1-set-up-your-api-key)
- [2. Basic Usage](#2-basic-usage)
- [3. Convenience Function](#3-convenience-function)
- [📁 Export Formats](#-export-formats)
- [CLI Usage](#cli-usage)
- [Cache Management](#cache-management)
- [Advanced Usage](#advanced-usage)
- [API Reference](#api-reference)
- [Output Format](#output-format)
- [Common Field Examples](#common-field-examples)
- [🤖 Multi-Provider AI Support](#-multi-provider-ai-support)
- [Troubleshooting](#troubleshooting)
- [Roadmap](#roadmap)
- [Contributors](#contributors)
- [Contributing](#contributing)
- [License](#license)
- [Changelog](#changelog)
--------------------------------------------------------------------------
A Python module for AI-powered web scraping with customizable field extraction using multiple AI providers (Gemini, OpenAI, Anthropic, and more via LiteLLM).
## 🔄 How Universal Scraper Works
```mermaid
graph TB
A[🌐 Input URL] --> B[📥 HTML Fetcher]
B --> B1[CloudScraper Anti-Bot Protection]
B1 --> C[🧹 Smart HTML Cleaner]
C --> C1[Remove Scripts & Styles]
C1 --> C2[Remove Ads & Analytics]
C2 --> C3[Remove Navigation Elements]
C3 --> C4[Detect Repeating Structures]
C4 --> C5[Keep 2 Samples, Remove Others]
C5 --> C6[Remove Empty Divs]
C6 --> D[📊 98% Size Reduction]
D --> E{🔍 Check Code Cache}
E -->|Cache Hit| F[♻️ Use Cached Code]
E -->|Cache Miss| G[🤖 AI Code Generation]
G --> G1[🧠 Choose AI Provider]
G1 --> G2[Gemini 2.5-Flash Default]
G1 --> G3[OpenAI GPT-4/GPT-4o]
G1 --> G4[Claude 3 Opus/Sonnet/Haiku]
G1 --> G5[100+ Other Models via LiteLLM]
G2 --> H[📝 Generate BeautifulSoup Code]
G3 --> H
G4 --> H
G5 --> H
H --> I[💾 Cache Generated Code]
F --> J[⚡ Execute Code on Original HTML]
I --> J
J --> K[📋 Extract Structured Data]
K --> L{📁 Output Format}
L -->|JSON| M[💾 Save as JSON]
L -->|CSV| N[📊 Save as CSV]
M --> O[✅ Complete with Metadata]
N --> O
style A fill:#e1f5fe
style D fill:#4caf50,color:#fff
style E fill:#ff9800,color:#fff
style F fill:#4caf50,color:#fff
style G1 fill:#9c27b0,color:#fff
style O fill:#2196f3,color:#fff
```
**Key Performance Benefits:**
- 🚀 **98% HTML Size Reduction** → Massive token savings
- ⚡ **Smart Caching** → 90%+ API cost reduction on repeat scraping
- 🤖 **Multi-Provider Support** → Choose the best AI for your use case
- 🔄 **Dual HTML Processing** → Clean HTML for AI analysis, original HTML for complete data extraction
## 💻 Live Working Example
Here's a real working example showing Universal Scraper in action with Google Gemini 2.5 Pro:
```python
>>> from universal_scraper import UniversalScraper
>>> scraper = UniversalScraper(api_key="AIzxxxxxxxxxxxxxxxxxxxxx", model_name="gemini-2.5-pro")
2025-09-11 16:49:31 - code_cache - INFO - CodeCache initialized with database: temp/extraction_cache.db
2025-09-11 16:49:31 - data_extractor - INFO - Code caching enabled
2025-09-11 16:49:31 - data_extractor - INFO - Using Google Gemini API with model: gemini-2.5-pro
2025-09-11 16:49:31 - data_extractor - INFO - Initialized DataExtractor with model: gemini-2.5-pro
>>> # Set fields for e-commerce laptop scraping
>>> scraper.set_fields(["product_name", "product_price", "product_rating", "product_description", "availability"])
2025-09-11 16:52:45 - universal_scraper - INFO - Extraction fields updated: ['product_name', 'product_price', 'product_rating', 'product_description', 'availability']
>>> result = scraper.scrape_url("https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops", save_to_file=True, format='csv')
2025-09-11 16:52:55 - universal_scraper - INFO - Starting scraping for: https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
2025-09-11 16:52:55 - html_fetcher - INFO - Starting to fetch HTML for: https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
2025-09-11 16:52:55 - html_fetcher - INFO - Fetching https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops with cloudscraper...
2025-09-11 16:52:57 - html_fetcher - INFO - Successfully fetched content with cloudscraper. Length: 163496
2025-09-11 16:52:57 - html_fetcher - INFO - Successfully fetched HTML with cloudscraper
2025-09-11 16:52:57 - html_cleaner - INFO - Starting HTML cleaning process...
2025-09-11 16:52:57 - html_cleaner - INFO - Removed noise. Length: 142614
2025-09-11 16:52:57 - html_cleaner - INFO - Removed headers/footers. Length: 135883
2025-09-11 16:52:57 - html_cleaner - INFO - Focused on main content. Length: 135646
2025-09-11 16:52:57 - html_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-11 16:52:57 - html_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-11 16:52:57 - html_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-11 16:52:57 - html_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-11 16:52:57 - html_cleaner - INFO - Removed 115 repeating structure elements
2025-09-11 16:52:57 - html_cleaner - INFO - Removed repeating structures. Length: 2933
2025-09-11 16:52:57 - html_cleaner - INFO - Limited select options. Length: 2933
2025-09-11 16:52:57 - html_cleaner - INFO - Removed 3 empty div elements in 1 iterations
2025-09-11 16:52:57 - html_cleaner - INFO - Removed empty divs. Length: 2844
2025-09-11 16:52:57 - html_cleaner - INFO - Removed 0 non-essential attributes (71 → 71)
2025-09-11 16:52:57 - html_cleaner - INFO - Removed non-essential attributes. Length: 2844
2025-09-11 16:52:57 - html_cleaner - INFO - Removed whitespace between tags. Length: 2844 → 2619 (7.9% reduction)
2025-09-11 16:52:57 - html_cleaner - INFO - HTML cleaning completed. Original: 150742, Final: 2619
2025-09-11 16:52:57 - html_cleaner - INFO - Reduction: 98.3%
2025-09-11 16:52:57 - data_extractor - INFO - Using HTML separation: cleaned for code generation, original for execution
2025-09-11 16:52:57 - code_cache - INFO - Cache MISS for https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
2025-09-11 16:52:57 - data_extractor - INFO - Generating BeautifulSoup code with gemini-2.5-pro for fields: ['product_name', 'product_price', 'product_rating', 'product_description', 'availability']
2025-09-11 16:53:39 - code_cache - INFO - Code cached for https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops (hash: bd0ed6e62683fcfb...)
2025-09-11 16:53:39 - data_extractor - INFO - Successfully generated BeautifulSoup code
2025-09-11 16:53:39 - data_extractor - INFO - Executing generated extraction code...
2025-09-11 16:53:39 - data_extractor - INFO - Successfully extracted data with 117 items
2025-09-11 16:53:39 - universal_scraper - INFO - Successfully extracted data from https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
>>>
# ✨ Results: 117 laptop products extracted from 163KB of HTML (fetch + clean took ~2 seconds; the one-time AI code generation took ~42 seconds and is cached for future runs)
# 🎯 98.3% HTML size reduction (163KB → 2.6KB sent to the AI to generate the BeautifulSoup4 code)
# 💾 Data automatically saved as CSV with product_name, product_price, product_rating, etc.
```
**🔥 What Just Happened:**
1. **Fields Configured** for e-commerce: product_name, product_price, product_rating, etc.
2. **HTML Fetched** with anti-bot protection (163KB)
3. **Smart Cleaning** reduced size by 98.3% (163KB → 2.6KB)
4. **AI Generated** custom extraction code with Gemini 2.5 Pro for the specified fields
5. **Code Cached** for future use (90%+ cost savings on re-runs)
6. **117 Laptop Products Extracted** from the original HTML with complete data
7. **Saved as CSV** ready for analysis with all specified product fields
## How It Works
1. **HTML Fetching**: Uses cloudscraper or selenium to fetch HTML content, handling anti-bot measures
2. **Smart HTML Cleaning**: Removes the vast majority of noise (scripts, ads, navigation, repeated structures, empty divs) while preserving the data-bearing structure
3. **Structure-Based Caching**: Creates a structural hash and checks the cache for existing extraction code (see the sketch after this list)
4. **AI Code Generation**: Uses your chosen AI provider (Gemini, OpenAI, Claude, etc.) to generate custom BeautifulSoup code on cleaned HTML (only when not cached)
5. **Code Execution**: Runs the cached/generated code on original HTML to extract ALL data items
6. **JSON Output**: Returns complete, structured data with metadata and performance stats
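To make step 3 concrete, here is a minimal, hypothetical sketch of a structure-based cache key (the function name and details are illustrative, not the package's internal implementation). Hashing only tag names and classes means two pages with the same layout but different text map to the same cached extraction code:

```python
# Illustrative only: a structural cache key that ignores text content,
# so pages sharing the same markup shape reuse the same generated code.
import hashlib
from bs4 import BeautifulSoup

def structural_cache_key(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    skeleton = [
        (tag.name, tuple(sorted(tag.get("class", []))))  # shape, not text
        for tag in soup.find_all(True)
    ]
    return hashlib.sha256(repr(skeleton).encode("utf-8")).hexdigest()
```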
## Features
- 🤖 **Multi-Provider AI Support**: Uses Google Gemini by default, with support for OpenAI, Anthropic, and 100+ other models via LiteLLM
- 🎯 **Customizable Fields**: Define exactly which fields you want to extract (e.g., company name, job title, salary)
- 🚀 **Smart Caching**: Automatically caches extraction code based on HTML structure - saves 90%+ API tokens on repeat scraping
- 🧹 **Smart HTML Cleaner**: Removes noise and reduces HTML by 98%+ - significantly cuts token usage for AI processing
- 🔧 **Easy to Use**: Simple API for both quick scraping and advanced use cases
- 📦 **Modular Design**: Built with clean, modular components
- 🛡️ **Robust**: Handles edge cases, missing data, and various HTML structures
- 💾 **Multiple Output Formats**: Support for both JSON (default) and CSV export formats
- 📊 **Structured Output**: Clean, structured data output with comprehensive metadata
## 🧹 Smart HTML Cleaner
### What Gets Removed
- **Scripts & Styles**: JavaScript, CSS, and style blocks
- **Ads & Analytics**: Advertisement content and tracking scripts
- **Navigation**: Headers, footers, sidebars, and menu elements
- **Metadata**: Meta tags, SEO tags, and hidden elements
- **Empty Elements**: Recursively removes empty div elements that don't contain meaningful content
- **Noise**: Comments, unnecessary attributes, and whitespace (a minimal sketch of these steps is shown below)
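As a rough illustration of the first few cleaning steps, here is a simplified sketch using BeautifulSoup (the tag list and helper name are illustrative assumptions, not the package's exact code):

```python
from bs4 import BeautifulSoup, Comment

# Tags that rarely carry scrapable data (illustrative list)
NOISE_TAGS = ["script", "style", "noscript", "iframe", "meta", "link",
              "header", "footer", "nav", "aside"]

def strip_noise(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove noisy tags one at a time so nested matches are never revisited
    while True:
        tag = soup.find(NOISE_TAGS)
        if tag is None:
            break
        tag.decompose()
    # Drop HTML comments
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)
```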
### Repeating Structure Reduction (NEW!)
The cleaner now intelligently detects and reduces repeated HTML structures:
- **Pattern Detection**: Uses structural hashing + similarity algorithms to find repeated elements
- **Smart Sampling**: Keeps 2 samples from groups of 3+ similar structures (e.g., 20 job cards → 2 samples)
- **Structure Preservation**: Maintains document flow and parent-child relationships
- **AI Optimization**: Provides enough samples for pattern recognition without overwhelming the AI (see the sketch below)
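A minimal sketch of the sampling idea, assuming BeautifulSoup (the real cleaner combines structural hashing with similarity scoring; the signature function here is a simplified stand-in):

```python
from collections import defaultdict
from bs4 import BeautifulSoup, Tag

def structure_signature(tag: Tag) -> str:
    # Describe a node by its tag/classes and descendant tag names, ignoring text
    head = tag.name + "." + ".".join(sorted(tag.get("class", [])))
    return "|".join([head] + [child.name for child in tag.find_all(True)])

def sample_repeating_structures(html: str, keep: int = 2) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for parent in soup.find_all(True):
        groups = defaultdict(list)
        for child in parent.find_all(True, recursive=False):
            groups[structure_signature(child)].append(child)
        for similar in groups.values():
            if len(similar) >= 3:             # only reduce groups of 3+
                for extra in similar[keep:]:  # keep the first `keep` samples
                    extra.decompose()
    return str(soup)
```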
### Empty Element Removal (NEW!)
The cleaner now intelligently removes empty div elements:
- **Recursive Processing**: Starts from innermost divs and works outward
- **Content Detection**: Preserves divs with text, images, inputs, or interactive elements
- **Structure Preservation**: Maintains parent-child relationships and avoids breaking important structural elements
- **Smart Analysis**: Removes placeholder/skeleton divs while keeping functional containers
**Example**: Removes empty animation placeholders like `<div class="animate-pulse"></div>` while preserving divs containing actual content.
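A simplified sketch of this pass (assuming BeautifulSoup; the "meaningful content" check here is deliberately naive compared to the real cleaner):

```python
from bs4 import BeautifulSoup

MEANINGFUL = ["img", "input", "button", "a", "select", "textarea", "video", "svg"]

def remove_empty_divs(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Process the deepest divs first, so parents are evaluated only after
    # their empty children have already been removed (inside-out)
    divs = sorted(soup.find_all("div"),
                  key=lambda d: len(list(d.parents)), reverse=True)
    for div in divs:
        has_text = div.get_text(strip=True) != ""
        has_widget = div.find(MEANINGFUL) is not None
        if not has_text and not has_widget:
            div.decompose()   # e.g. <div class="animate-pulse"></div>
    return str(soup)
```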
## Installation (Recommended)
```bash
pip install universal-scraper
```
## Installation (from Source)
1. **Clone the repository**:
```bash
git clone <repository-url>
cd Universal_Scrapper
```
2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
Or install manually:
```bash
pip install cloudscraper litellm google-generativeai beautifulsoup4 requests selenium lxml html5lib fake-useragent
```
3. **Install the module**:
```bash
pip install -e .
```
## Quick Start
### 1. Set up your API key
**Option A: Use Gemini (Default - Recommended)**
Get a Gemini API key from [Google AI Studio](https://makersuite.google.com/app/apikey):
```bash
export GEMINI_API_KEY="your_gemini_api_key_here"
```
**Option B: Use OpenAI**
```bash
export OPENAI_API_KEY="your_openai_api_key_here"
```
**Option C: Use Anthropic Claude**
```bash
export ANTHROPIC_API_KEY="your_anthropic_api_key_here"
```
**Option D: Pass API key directly**
```python
# For any provider - just pass the API key directly
scraper = UniversalScraper(api_key="your_api_key")
```
### 2. Basic Usage
```python
from universal_scraper import UniversalScraper
# Option 1: Auto-detect provider (uses Gemini by default)
scraper = UniversalScraper(api_key="your_gemini_api_key")
# Option 2: Specify Gemini model explicitly
scraper = UniversalScraper(api_key="your_gemini_api_key", model_name="gemini-2.5-flash")
# Option 3: Use OpenAI
scraper = UniversalScraper(api_key="your_openai_api_key", model_name="gpt-4")
# Option 4: Use Anthropic Claude
scraper = UniversalScraper(api_key="your_anthropic_api_key", model_name="claude-3-sonnet-20240229")
# Option 5: Use any other provider supported by LiteLLM
scraper = UniversalScraper(api_key="your_api_key", model_name="llama-2-70b-chat")
# Set the fields you want to extract
scraper.set_fields([
"company_name",
"job_title",
"apply_link",
"salary_range",
"location"
])
# Check current model
print(f"Using model: {scraper.get_model_name()}")
# Scrape a URL (default JSON format)
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)
print(f"Extracted {result['metadata']['items_extracted']} items")
print(f"Data saved to: {result.get('saved_to')}")
# Scrape and save as CSV
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='csv')
print(f"CSV data saved to: {result.get('saved_to')}")
```
### 3. Convenience Function
For quick one-off scraping:
```python
from universal_scraper import scrape
# Quick scraping with default JSON format
data = scrape(
url="https://example.com/jobs",
api_key="your_gemini_api_key",
fields=["company_name", "job_title", "apply_link"]
)
# Quick scraping with CSV format
data = scrape(
url="https://example.com/jobs",
api_key="your_gemini_api_key",
fields=["company_name", "job_title", "apply_link"],
format="csv"
)
# Quick scraping with OpenAI
data = scrape(
url="https://example.com/jobs",
api_key="your_openai_api_key",
fields=["company_name", "job_title", "apply_link"],
model_name="gpt-4"
)
# Quick scraping with Anthropic Claude
data = scrape(
url="https://example.com/jobs",
api_key="your_anthropic_api_key",
fields=["company_name", "job_title", "apply_link"],
model_name="claude-3-haiku-20240307"
)
print(data['data']) # The extracted data
```
## 📁 Export Formats
Universal Scraper supports multiple output formats to suit your data processing needs:
### JSON Export (Default)
```python
# JSON is the default format
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)
# or explicitly specify
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='json')
```
**JSON Output Structure:**
```json
{
"url": "https://example.com",
"timestamp": "2025-01-01T12:00:00",
"fields": ["company_name", "job_title", "apply_link"],
"data": [
{
"company_name": "Example Corp",
"job_title": "Software Engineer",
"apply_link": "https://example.com/apply/123"
}
],
"metadata": {
"raw_html_length": 50000,
"cleaned_html_length": 15000,
"items_extracted": 1
}
}
```
### CSV Export
```python
# Export as CSV for spreadsheet analysis
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='csv')
```
**CSV Output:**
- Clean tabular format with headers
- All fields as columns, missing values filled with empty strings
- Perfect for Excel, Google Sheets, or pandas processing
- Automatically handles varying field structures across items (see the sketch below)
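Conceptually, the CSV export does something like the following (a simplified sketch, not the package's exact writer): the header is the union of keys across all items, and missing fields are filled with empty strings.

```python
import csv
from typing import Dict, List

def save_items_as_csv(items: List[Dict], path: str) -> None:
    fieldnames: List[str] = []
    for item in items:            # header = union of keys, first-seen order
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()      # missing values become empty strings
        writer.writerows(items)

# e.g. save_items_as_csv(result["data"], "jobs.csv")
```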
### Multiple URLs with Format Choice
```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
# Save all as JSON (default)
results = scraper.scrape_multiple_urls(urls, save_to_files=True)
# Save all as CSV
results = scraper.scrape_multiple_urls(urls, save_to_files=True, format='csv')
```
## CLI Usage
**🎉 NEW in v1.6.0**: Full multi-provider CLI support!
```bash
# Gemini (default) - auto-detects from environment
universal-scraper https://example.com/jobs --output jobs.json
# OpenAI GPT models
universal-scraper https://example.com/products --api-key YOUR_OPENAI_KEY --model gpt-4 --format csv
# Anthropic Claude models
universal-scraper https://example.com/data --api-key YOUR_ANTHROPIC_KEY --model claude-3-haiku-20240307
# Custom fields extraction
universal-scraper https://example.com/listings --fields product_name product_price product_rating
# Batch processing multiple URLs
universal-scraper --urls urls.txt --output-dir results --format csv --model gpt-4o-mini
# Verbose logging with any provider
universal-scraper https://example.com --api-key YOUR_KEY --model gpt-4 --verbose
```
**🔧 Advanced CLI Options:**
```bash
# Set custom extraction fields
universal-scraper URL --fields title price description availability
# Use environment variables (auto-detected)
export OPENAI_API_KEY="your_key"
universal-scraper URL --model gpt-4
# Multiple output formats
universal-scraper URL --format json # Default
universal-scraper URL --format csv # Spreadsheet-ready
# Batch processing
echo -e "https://site1.com\nhttps://site2.com" > urls.txt
universal-scraper --urls urls.txt --output-dir batch_results
```
**🔗 Provider Support**: All 100+ models supported by LiteLLM work in CLI! See [LiteLLM Providers](https://docs.litellm.ai/docs/providers) for complete list.
**📝 Development Usage** (from cloned repo):
```bash
python main.py https://example.com/jobs --api-key YOUR_KEY --model gpt-4
```
## Cache Management
```python
scraper = UniversalScraper(api_key="your_key")
# View cache statistics
stats = scraper.get_cache_stats()
print(f"Cached entries: {stats['total_entries']}")
print(f"Total cache hits: {stats['total_uses']}")
# Clear old entries (30+ days)
removed = scraper.cleanup_old_cache(30)
print(f"Removed {removed} old entries")
# Clear entire cache
scraper.clear_cache()
# Disable/enable caching
scraper.disable_cache() # For testing
scraper.enable_cache() # Re-enable
```
## Advanced Usage
### Multiple URLs
```python
scraper = UniversalScraper(api_key="your_api_key")
scraper.set_fields(["title", "price", "description"])
urls = [
"https://site1.com/products",
"https://site2.com/items",
"https://site3.com/listings"
]
# Scrape all URLs and save as JSON (default)
results = scraper.scrape_multiple_urls(urls, save_to_files=True)
# Scrape all URLs and save as CSV for analysis
results = scraper.scrape_multiple_urls(urls, save_to_files=True, format='csv')
for result in results:
if result.get('error'):
print(f"Failed {result['url']}: {result['error']}")
else:
print(f"Success {result['url']}: {result['metadata']['items_extracted']} items")
```
### Custom Configuration
```python
scraper = UniversalScraper(
api_key="your_api_key",
temp_dir="custom_temp", # Custom temporary directory
output_dir="custom_output", # Custom output directory
log_level=logging.DEBUG, # Enable debug logging
model_name="gpt-4" # Custom model (OpenAI, Gemini, Claude, etc.)
)
# Configure for e-commerce scraping
scraper.set_fields([
"product_name",
"product_price",
"product_rating",
"product_reviews_count",
"product_availability",
"product_description"
])
# Check and change model dynamically
print(f"Current model: {scraper.get_model_name()}")
scraper.set_model_name("gpt-4") # Switch to OpenAI
print(f"Switched to: {scraper.get_model_name()}")
# Or switch to Claude
scraper.set_model_name("claude-3-sonnet-20240229")
print(f"Switched to: {scraper.get_model_name()}")
result = scraper.scrape_url("https://ecommerce-site.com", save_to_file=True)
```
## API Reference
### UniversalScraper Class
#### Constructor
```python
UniversalScraper(api_key=None, temp_dir="temp", output_dir="output", log_level=logging.INFO, model_name=None)
```
- `api_key`: AI provider API key (auto-detects provider, or set specific env vars)
- `temp_dir`: Directory for temporary files
- `output_dir`: Directory for output files
- `log_level`: Logging level
- `model_name`: AI model name (default: 'gemini-2.5-flash', supports 100+ models via LiteLLM)
- See [LiteLLM Providers](https://docs.litellm.ai/docs/providers) for complete model list and setup
#### Methods
- `set_fields(fields: List[str])`: Set the fields to extract
- `get_fields() -> List[str]`: Get current fields configuration
- `get_model_name() -> str`: Get the current AI model name
- `set_model_name(model_name: str)`: Change the AI model (any provider supported by LiteLLM)
- `scrape_url(url: str, save_to_file=False, output_filename=None, format='json') -> Dict`: Scrape a single URL
- `scrape_multiple_urls(urls: List[str], save_to_files=True, format='json') -> List[Dict]`: Scrape multiple URLs
### Convenience Function
```python
scrape(url: str, api_key: str, fields: List[str], model_name: Optional[str] = None, format: str = 'json') -> Dict
```
Quick scraping function for simple use cases. Auto-detects AI provider from API key pattern.
**Note**: For model names and provider-specific setup, refer to the [LiteLLM Providers Documentation](https://docs.litellm.ai/docs/providers).
## Output Format
The scraped data is returned in a structured format:
```json
{
"url": "https://example.com",
"timestamp": "2025-01-01T12:00:00",
"fields": ["company_name", "job_title", "apply_link"],
"data": [
{
"company_name": "Example Corp",
"job_title": "Software Engineer",
"apply_link": "https://example.com/apply/123"
}
],
"metadata": {
"raw_html_length": 50000,
"cleaned_html_length": 15000,
"items_extracted": 1
}
}
```
## Common Field Examples
### Job Listings
```python
scraper.set_fields([
"company_name",
"job_title",
"apply_link",
"salary_range",
"location",
"job_description",
"employment_type",
"experience_level"
])
```
### E-commerce Products
```python
scraper.set_fields([
"product_name",
"product_price",
"product_rating",
"product_reviews_count",
"product_availability",
"product_image_url",
"product_description"
])
```
### News Articles
```python
scraper.set_fields([
"article_title",
"article_content",
"article_author",
"publish_date",
"article_url",
"article_category"
])
```
## 🤖 Multi-Provider AI Support
Universal Scraper now supports multiple AI providers through LiteLLM integration:
### Supported Providers
- **Google Gemini** (Default): `gemini-2.5-flash`, `gemini-1.5-pro`, etc.
- **OpenAI**: `gpt-4`, `gpt-4-turbo`, `gpt-3.5-turbo`, etc.
- **Anthropic**: `claude-3-opus-20240229`, `claude-3-sonnet-20240229`, `claude-3-haiku-20240307`
- **100+ Other Models**: Via LiteLLM including Llama, PaLM, Cohere, and more
**📚 For complete model names and provider setup**: See [LiteLLM Providers Documentation](https://docs.litellm.ai/docs/providers)
### Usage Examples
```python
# Gemini (Default - Free tier available)
scraper = UniversalScraper(api_key="your_gemini_key")
# Auto-detects as gemini-2.5-flash
# OpenAI
scraper = UniversalScraper(api_key="sk-...", model_name="gpt-4")
# Anthropic Claude
scraper = UniversalScraper(api_key="sk-ant-...", model_name="claude-3-haiku-20240307")
# Environment variable approach
# Set GEMINI_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY
scraper = UniversalScraper() # Auto-detects from env vars
# Any other provider from LiteLLM (see link above for model names)
scraper = UniversalScraper(api_key="your_api_key", model_name="llama-2-70b-chat")
```
### Model Configuration Guide
**📋 Quick Reference for Popular Models:**
```python
# Gemini Models
model_name="gemini-2.5-flash" # Fast, efficient
model_name="gemini-1.5-pro" # More capable
# OpenAI Models
model_name="gpt-4" # Most capable
model_name="gpt-4o-mini" # Fast, cost-effective
model_name="gpt-3.5-turbo" # Legacy but reliable
# Anthropic Models
model_name="claude-3-opus-20240229" # Most capable
model_name="claude-3-sonnet-20240229" # Balanced
model_name="claude-3-haiku-20240307" # Fast, efficient
# Other Popular Models (see LiteLLM docs for setup)
model_name="llama-2-70b-chat" # Meta Llama
model_name="command-nightly" # Cohere
model_name="palm-2-chat-bison" # Google PaLM
```
**🔗 Complete Model List**: Visit [LiteLLM Providers Documentation](https://docs.litellm.ai/docs/providers) for:
- All available model names
- Provider-specific API key setup
- Environment variable configuration
- Rate limits and pricing information
### Model Auto-Detection
If you don't specify a model, the scraper automatically selects a provider (a simplified sketch of this detection follows the list):
- **Gemini**: If `GEMINI_API_KEY` is set or API key contains "AIza"
- **OpenAI**: If `OPENAI_API_KEY` is set or API key starts with "sk-"
- **Anthropic**: If `ANTHROPIC_API_KEY` is set or API key starts with "sk-ant-"
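Here is a simplified sketch of how such detection could be implemented (illustrative only; the package's actual logic and precedence rules may differ):

```python
import os
from typing import Optional

def detect_provider(api_key: Optional[str] = None) -> str:
    key = api_key or ""
    # Check the more specific "sk-ant-" prefix before the generic "sk-" one
    if key.startswith("sk-ant-") or os.getenv("ANTHROPIC_API_KEY"):
        return "anthropic"
    if key.startswith("sk-") or os.getenv("OPENAI_API_KEY"):
        return "openai"
    if "AIza" in key or os.getenv("GEMINI_API_KEY"):
        return "gemini"
    return "gemini"   # fall back to the default provider
```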
## Troubleshooting
### Common Issues
1. **API Key Error**: Make sure your API key is valid and set correctly:
- Gemini: Set `GEMINI_API_KEY` or pass directly
- OpenAI: Set `OPENAI_API_KEY` or pass directly
- Anthropic: Set `ANTHROPIC_API_KEY` or pass directly
2. **Model Not Found**: Ensure you're using the correct model name for your provider
3. **Empty Results**: The AI might need more specific field names or the page might not contain the expected data
4. **Network Errors**: Some sites block scrapers - the tool uses cloudscraper to handle most cases
5. **Model Name Issues**: Check [LiteLLM Providers](https://docs.litellm.ai/docs/providers) for correct model names and setup instructions
### Debug Mode
Enable debug logging to see what's happening:
```python
import logging
scraper = UniversalScraper(api_key="your_key", log_level=logging.DEBUG)
```
## Roadmap
See [ROADMAP.md](ROADMAP.md) for planned features and improvements.
## Contributors
[Contributors List](https://github.com/WitesoAI/universal-scraper/graphs/contributors)
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Submit a pull request
## License
MIT License - see LICENSE file for details.
## Changelog
See [CHANGELOG.md](CHANGELOG.md) for detailed version history and release notes.