Name: universal-scraper
Version: 1.2.0
Summary: AI-powered web scraping with customizable field extraction
Home page: https://github.com/Ayushi0405/Universal_Scrapper
Author: Ayushi Gupta & Pushpender Singh
Requires Python: >=3.7
Uploaded: 2025-09-02 07:16:11
Keywords: web scraping, ai, data extraction, beautifulsoup, gemini, automation, html parsing, structured data, caching, performance
Requirements: cloudscraper, selenium, beautifulsoup4, lxml, google-generativeai, requests, urllib3, webdriver-manager, python-dateutil, cchardet, html5lib
            <h1 align="center"> Universal Scraper</h1>

<h2 align="center"> The Python package for scraping data from any website</h2>

<p align="center">
<a href="https://pypi.org/project/universal-scraper/"><img alt="pypi" src="https://img.shields.io/pypi/v/universal-scraper.svg"></a>
<a href="https://pepy.tech/project/universal-scraper?versions=1*&versions=2*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/universal-scraper/month"></a>
<a href="https://github.com/Ayushi0405/Universal_Scrapper/commits/main"><img alt="GitHub lastest commit" src="https://img.shields.io/github/last-commit/Ayushi0405/Universal_Scrapper?color=blue&style=flat-square"></a>
<a href="#"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/universal-scraper?style=flat-square"></a>
</p>

--------------------------------------------------------------------------

A Python module for AI-powered web scraping with customizable field extraction using Google's Gemini AI.

## Features

- 🤖 **AI-Powered Extraction**: Uses Google Gemini to intelligently extract structured data
- 🎯 **Customizable Fields**: Define exactly which fields you want to extract (e.g., company name, job title, salary)
- 🚀 **Smart Caching**: Automatically caches extraction code based on HTML structure - saves 90%+ API tokens on repeat scraping
- 🧹 **Smart HTML Cleaner**: Removes noise and reduces HTML by 65%+ - significantly cuts token usage for AI processing
- 🔧 **Easy to Use**: Simple API for both quick scraping and advanced use cases
- 📦 **Modular Design**: Built with clean, modular components
- 🛡️ **Robust**: Handles edge cases, missing data, and various HTML structures
- 💾 **JSON Output**: Clean, structured JSON output with metadata

## Installation (Recommended)

```bash
pip install universal-scraper
```

## Installation (From Source)

1. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd Universal_Scrapper
   ```

2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

   Or install manually:
   ```bash
   pip install google-generativeai beautifulsoup4 requests selenium lxml html5lib fake-useragent
   ```

3. **Install the module**:
   ```bash
   pip install -e .
   ```

## Quick Start

### 1. Set up your API key

Get a Gemini API key from [Google AI Studio](https://makersuite.google.com/app/apikey) and set it as an environment variable:

```bash
export GEMINI_API_KEY="your_gemini_api_key_here"
```
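With the variable set, the `api_key` argument can be omitted entirely; the constructor falls back to `GEMINI_API_KEY`, as noted in the API Reference below. A minimal sketch:

```python
import os

from universal_scraper import UniversalScraper

# The constructor falls back to the GEMINI_API_KEY environment
# variable when api_key is not passed explicitly.
assert os.environ.get("GEMINI_API_KEY"), "Set GEMINI_API_KEY first"
scraper = UniversalScraper()
```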

### 2. Basic Usage

```python
from universal_scraper import UniversalScraper

# Initialize the scraper (uses default model: gemini-2.5-flash)
scraper = UniversalScraper(api_key="your_gemini_api_key")

# Or initialize with a custom model
scraper = UniversalScraper(api_key="your_gemini_api_key", model_name="gemini-pro")

# Set the fields you want to extract
scraper.set_fields([
    "company_name", 
    "job_title", 
    "apply_link", 
    "salary_range",
    "location"
])

# Check current model
print(f"Using model: {scraper.get_model_name()}")

# Scrape a URL
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)

print(f"Extracted {result['metadata']['items_extracted']} items")
print(f"Data saved to: {result.get('saved_to')}")
```

### 3. Convenience Function

For quick one-off scraping:

```python
from universal_scraper import scrape

# Quick scraping with default model
data = scrape(
    url="https://example.com/jobs",
    api_key="your_gemini_api_key",
    fields=["company_name", "job_title", "apply_link"]
)

# Quick scraping with custom model
data = scrape(
    url="https://example.com/jobs",
    api_key="your_gemini_api_key",
    fields=["company_name", "job_title", "apply_link"],
    model_name="gemini-1.5-pro"
)

print(data['data'])  # The extracted data
```

## 🧹 Smart HTML Cleaning

**Reduces HTML size by 65%+** before sending pages to the AI, dramatically cutting token usage:

### What Gets Removed
- **Scripts & Styles**: JavaScript, CSS, and style blocks
- **Ads & Analytics**: Advertisement content and tracking scripts
- **Navigation**: Headers, footers, sidebars, and menu elements  
- **Metadata**: Meta tags, SEO tags, and hidden elements
- **Noise**: Comments, unnecessary attributes, and whitespace

### Benefits
- **Token Reduction**: 65%+ smaller HTML means 65%+ fewer tokens to process
- **Better AI Focus**: Clean HTML helps AI generate more accurate extraction code
- **Faster Processing**: Less data to analyze means faster response times
- **Cost Savings**: Fewer tokens = lower API costs per extraction

### Example Impact
```
Original HTML: 150KB → Cleaned HTML: 45KB (70% reduction)
Before: ~38,000 tokens → After: ~11,000 tokens (saves 27K tokens per request!)
```
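The cleaner itself is internal to the package, but the kind of pruning described above can be sketched with BeautifulSoup. This is an illustration only, not the package's actual implementation; `clean_html_sketch` is a name invented here:

```python
# Illustrative sketch of structure-preserving HTML cleanup.
# Not the package's internal cleaner.
from bs4 import BeautifulSoup, Comment

def clean_html_sketch(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "lxml")
    # Drop scripts, styles, metadata, and common navigation containers
    for tag in soup(["script", "style", "noscript", "nav",
                     "header", "footer", "aside", "meta", "link"]):
        tag.decompose()
    # Drop HTML comments
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Collapse whitespace while keeping the data-bearing markup
    return " ".join(str(soup.body or soup).split())

html = "<html><head><script>track()</script></head><body><p>Job: Engineer</p></body></html>"
print(clean_html_sketch(html))  # -> <body><p>Job: Engineer</p></body>
```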

## 🚀 Smart Caching (NEW!)

**Saves 90%+ API tokens** by reusing extraction code for similar HTML structures:

### Key Benefits
- **Token Savings**: Avoids regenerating BeautifulSoup code for similar pages
- **Performance**: 5-10x faster scraping on cached structures  
- **Cost Reduction**: Significant API cost savings for repeated scraping
- **Automatic**: Works transparently - no code changes needed

### How It Works
- **Structural Hashing**: Creates a hash based on HTML structure (not content); see the sketch after this list
- **Smart Matching**: Reuses code when URL domain + structure + fields match
- **Local SQLite DB**: Stores cached extraction codes permanently
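A minimal illustration of structure-only hashing (an assumption about the general idea, not the package's exact algorithm): reduce the page to its tag skeleton, discard the text, and hash the result, so pages that share a layout map to the same cache key.

```python
# Illustrative sketch: hash HTML structure, ignoring text content.
import hashlib

from bs4 import BeautifulSoup

def structural_hash(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    # Keep only tag names and class attributes, in document order
    skeleton = [
        (tag.name, tuple(sorted(tag.get("class", []))))
        for tag in soup.find_all(True)
    ]
    return hashlib.sha256(repr(skeleton).encode()).hexdigest()

# Two pages with the same layout but different text hash identically
a = '<div class="job"><h2>Engineer</h2></div>'
b = '<div class="job"><h2>Designer</h2></div>'
assert structural_hash(a) == structural_hash(b)
```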

### Cache Management
```python
scraper = UniversalScraper(api_key="your_key")

# View cache statistics
stats = scraper.get_cache_stats()
print(f"Cached entries: {stats['total_entries']}")
print(f"Total cache hits: {stats['total_uses']}")

# Clear old entries (30+ days)
removed = scraper.cleanup_old_cache(30)
print(f"Removed {removed} old entries")

# Clear entire cache
scraper.clear_cache()

# Disable/enable caching
scraper.disable_cache()  # For testing
scraper.enable_cache()   # Re-enable
```

## Advanced Usage

### Multiple URLs

```python
scraper = UniversalScraper(api_key="your_api_key")
scraper.set_fields(["title", "price", "description"])

urls = [
    "https://site1.com/products",
    "https://site2.com/items", 
    "https://site3.com/listings"
]

results = scraper.scrape_multiple_urls(urls, save_to_files=True)

for result in results:
    if result.get('error'):
        print(f"Failed {result['url']}: {result['error']}")
    else:
        print(f"Success {result['url']}: {result['metadata']['items_extracted']} items")
```

### Custom Configuration

```python
import logging

scraper = UniversalScraper(
    api_key="your_api_key",
    temp_dir="custom_temp",      # Custom temporary directory
    output_dir="custom_output",  # Custom output directory  
    log_level=logging.DEBUG,     # Enable debug logging
    model_name="gemini-pro"      # Custom Gemini model
)

# Configure for e-commerce scraping
scraper.set_fields([
    "product_name",
    "product_price", 
    "product_rating",
    "product_reviews_count",
    "product_availability",
    "product_description"
])

# Check and change model dynamically
print(f"Current model: {scraper.get_model_name()}")
scraper.set_model_name("gemini-1.5-pro")
print(f"Switched to: {scraper.get_model_name()}")

result = scraper.scrape_url("https://ecommerce-site.com", save_to_file=True)
```

## API Reference

### UniversalScraper Class

#### Constructor
```python
UniversalScraper(api_key=None, temp_dir="temp", output_dir="output", log_level=logging.INFO, model_name=None)
```

- `api_key`: Gemini API key (optional if GEMINI_API_KEY env var is set)
- `temp_dir`: Directory for temporary files
- `output_dir`: Directory for output files
- `log_level`: Logging level
- `model_name`: Gemini model name (default: 'gemini-2.5-flash')

#### Methods

- `set_fields(fields: List[str])`: Set the fields to extract
- `get_fields() -> List[str]`: Get current fields configuration
- `get_model_name() -> str`: Get current Gemini model name
- `set_model_name(model_name: str)`: Change the Gemini model
- `scrape_url(url: str, save_to_file=False, output_filename=None) -> Dict`: Scrape a single URL
- `scrape_multiple_urls(urls: List[str], save_to_files=True) -> List[Dict]`: Scrape multiple URLs

### Convenience Function

```python
scrape(url: str, api_key: str, fields: List[str], model_name: Optional[str] = None) -> Dict
```

Quick scraping function for simple use cases.

## Output Format

The scraped data is returned in a structured format:

```json
{
  "url": "https://example.com",
  "timestamp": "2025-01-01T12:00:00",
  "fields": ["company_name", "job_title", "apply_link"],
  "data": [
    {
      "company_name": "Example Corp",
      "job_title": "Software Engineer", 
      "apply_link": "https://example.com/apply/123"
    }
  ],
  "metadata": {
    "raw_html_length": 50000,
    "cleaned_html_length": 15000,
    "items_extracted": 1
  }
}
```
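Because every result follows this shape, downstream code can iterate `data` and read `metadata` directly. A short example, assuming `scraper` is configured as in the Quick Start:

```python
import json

result = scraper.scrape_url("https://example.com/jobs")

# Walk the extracted rows
for item in result["data"]:
    print(item["company_name"], "-", item["job_title"])

# Inspect how much the cleaner trimmed
meta = result["metadata"]
ratio = 1 - meta["cleaned_html_length"] / meta["raw_html_length"]
print(f"HTML reduced by {ratio:.0%}")

# Persist the full structured result
with open("jobs.json", "w") as f:
    json.dump(result, f, indent=2)
```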

## Common Field Examples

### Job Listings
```python
scraper.set_fields([
    "company_name",
    "job_title", 
    "apply_link",
    "salary_range",
    "location",
    "job_description",
    "employment_type",
    "experience_level"
])
```

### E-commerce Products
```python
scraper.set_fields([
    "product_name",
    "product_price",
    "product_rating", 
    "product_reviews_count",
    "product_availability",
    "product_image_url",
    "product_description"
])
```

### News Articles
```python
scraper.set_fields([
    "article_title",
    "article_content",
    "article_author",
    "publish_date", 
    "article_url",
    "article_category"
])
```

## Testing

Run the test suite to verify everything works:

```bash
python test_module.py
```

## Example Files

- `example_usage.py`: Comprehensive examples of different usage patterns
- `test_module.py`: Test suite for the module

## How It Works

1. **HTML Fetching**: Uses cloudscraper to fetch HTML content, handling anti-bot measures (see the snippet after this list)
2. **Smart HTML Cleaning**: Removes 65%+ of noise (scripts, ads, navigation) while preserving data structure
3. **Structure-Based Caching**: Creates structural hash and checks cache for existing extraction code
4. **AI Code Generation**: Uses Google Gemini to generate custom BeautifulSoup code (only when not cached)
5. **Code Execution**: Runs the cached/generated code to extract structured data
6. **JSON Output**: Returns clean, structured data with metadata and performance stats
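Step 1 corresponds to cloudscraper's standard usage; a stand-alone fetch, independent of this package, looks like:

```python
# Stand-alone fetch with cloudscraper (step 1 above), shown for context.
import cloudscraper

fetcher = cloudscraper.create_scraper()  # behaves like a requests.Session
response = fetcher.get("https://example.com/jobs", timeout=30)
response.raise_for_status()
raw_html = response.text
print(f"Fetched {len(raw_html)} characters of HTML")
```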

## Troubleshooting

### Common Issues

1. **API Key Error**: Make sure your Gemini API key is valid and set correctly
2. **Empty Results**: The AI might need more specific field names or the page might not contain the expected data
3. **Network Errors**: Some sites block scrapers - the tool uses cloudscraper to handle most cases; a defensive wrapper covering all three issues is sketched below
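
A sketch of such a wrapper follows. The package's exact exception types aren't documented here, so it catches broadly:

```python
# Sketch of a defensive wrapper for the common issues above.
import logging

def safe_scrape(scraper, url):
    try:
        result = scraper.scrape_url(url)
    except Exception as exc:  # invalid API key, network errors, blocked requests
        logging.error("Scrape failed for %s: %s", url, exc)
        return None
    if not result.get("data"):
        logging.warning(
            "No items extracted from %s; try more specific field names", url
        )
    return result
```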

### Debug Mode

Enable debug logging to see what's happening:

```python
import logging
scraper = UniversalScraper(api_key="your_key", log_level=logging.DEBUG)
```

## Core Contributors

<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
<!-- prettier-ignore-start -->
<!-- markdownlint-disable -->

<table>
<tr>

<td align="center">
    <a href="https://github.com/PushpenderIndia">
        <kbd><img src="https://avatars3.githubusercontent.com/PushpenderIndia?size=400" width="100px;" alt=""/></kbd><br />
        <sub><b>Pushpender Singh</b></sub>
    </a><br />
    <a href="https://github.com/Ayushi0405/Universal_Scrapper/commits?author=PushpenderIndia" title="Code"> :computer: </a> 
</td>

<td align="center">
    <a href="https://github.com/Ayushi0405">
        <kbd><img src="https://avatars3.githubusercontent.com/Ayushi0405?size=400" width="100px;" alt=""/></kbd><br />
        <sub><b>Ayushi Gupta</b></sub>
    </a><br />
    <a href="https://github.com/Ayushi0405/Universal_Scrapper/commits?author=Ayushi0405" title="Code"> :computer: </a> 
</td>

</tr>
</table>

<!-- markdownlint-enable -->
<!-- prettier-ignore-end -->
<!-- ALL-CONTRIBUTORS-LIST:END -->

Contributions of any kind welcome!

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run tests: `python test_module.py`
5. Submit a pull request

## License

MIT License - see LICENSE file for details.

## Changelog

### v1.2.0 - Smart Caching & HTML Optimization Release
- 🚀 **NEW**: Intelligent code caching system - **saves 90%+ API tokens**
- 🧹 **HIGHLIGHT**: Smart HTML cleaner reduces payload by 65%+ - **massive token savings**
- 🔧 **NEW**: Structural HTML hashing for cache key generation
- 🔧 **NEW**: SQLite-based cache storage with metadata
- 🔧 **NEW**: Cache management methods: `get_cache_stats()`, `clear_cache()`, `cleanup_old_cache()`
- 🔧 **NEW**: Automatic cache hit/miss detection and logging
- 🔧 **NEW**: URL normalization (removes query params) for better cache matching
- ⚡ **PERF**: 5-10x faster scraping on cached HTML structures
- 💰 **COST**: Significant API cost reduction (HTML cleaning + caching combined)
- 📁 **ORG**: Moved sample code to `sample_code/` directory

### v1.1.0
- ✨ **NEW**: Gemini model selection functionality
- 🔧 Added `model_name` parameter to `UniversalScraper()` constructor
- 🔧 Added `get_model_name()` and `set_model_name()` methods
- 🔧 Enhanced convenience `scrape()` function with `model_name` parameter
- 🔄 Updated default model to `gemini-2.5-flash`
- 📚 Updated documentation with model examples
- ✅ Fixed missing `cloudscraper` dependency

### v1.0.0
- Initial release
- AI-powered field extraction
- Customizable field configuration
- Multiple URL support
- Comprehensive test suite

            
