# CloudflarePeek 🔍
### Made by Talha Ali
A powerful Python utility that can scrape **any website**, even those protected by Cloudflare. When traditional scraping fails, CloudflarePeek automatically falls back to taking a full-page screenshot and extracting text using Google's Gemini OCR.
## ✨ Features
- **🛡️ Cloudflare Detection**: Automatically detects Cloudflare-protected sites
- **📸 Screenshot Fallback**: Takes full-page screenshots when traditional scraping fails
- **🤖 AI-Powered OCR**: Uses Google Gemini to extract clean text from screenshots
- **⚡ Smart Switching**: Tries fast scraping first, falls back to OCR only when needed
- **🔄 Auto-scrolling**: Scrolls pages to the bottom to capture all content
- **🎯 Zero Config**: Works out of the box with minimal setup
- **⚙️ Event Loop Safe**: Automatically handles asyncio conflicts in Jupyter/existing loops
## 🚀 Installation
### From GitHub
```bash
# Install in development mode
pip install -e git+https://github.com/Talha-Ali-5365/CloudflarePeek.git#egg=cloudflare-peek
# Or clone and install locally
git clone https://github.com/Talha-Ali-5365/CloudflarePeek.git
cd CloudflarePeek
pip install -e .
```
### Additional Setup
1. **Install Playwright browsers** (required for screenshot functionality):
```bash
playwright install chromium
```
2. **Get a Gemini API key** (required for OCR):
- Go to [Google AI Studio](https://makersuite.google.com/app/apikey)
- Create a new API key
- Set it as an environment variable:
```bash
export GEMINI_API_KEY="your-gemini-api-key-here"
```
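The library reads the key from the environment when OCR is needed. If you prefer to fail fast at startup instead of mid-scrape, a small check like the following can help. This is a sketch; the helper name `require_gemini_key` is ours and not part of the package:

```python
import os

def require_gemini_key() -> str:
    """Return the Gemini API key, failing fast with a clear message.

    Hypothetical helper (not part of cloudflare_peek itself).
    """
    key = os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; the OCR fallback will be unavailable."
        )
    return key
```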
## 📖 Quick Start
### Basic Usage
```python
from cloudflare_peek import peek
# Scrape any website - automatically handles Cloudflare
text = peek("https://example.com")
print(text)
```
### Advanced Usage
```python
from cloudflare_peek import peek, behind_cloudflare
# Check if a site is behind Cloudflare
if behind_cloudflare("https://example.com"):
    print("Site is protected by Cloudflare")
# Force OCR method (useful for dynamic content)
text = peek("https://example.com", force_ocr=True)
# Use with custom API key and timeout (5 minutes)
text = peek("https://example.com", api_key="your-gemini-key", timeout=300000)
```
### CLI Usage
CloudflarePeek also comes with a powerful command-line interface.
**Scrape a website:**
```bash
cloudflare-peek scrape https://example.com
```
**Check if a site is behind Cloudflare:**
```bash
cloudflare-peek check-cloudflare https://example.com
```
**Save content to a file:**
```bash
cloudflare-peek scrape https://example.com -o content.txt
```
**Advanced options:**
```bash
# Force OCR, run in non-headless mode, and set a 60s timeout
cloudflare-peek scrape https://example.com --force-ocr --no-headless --timeout 60
# See all commands and options
cloudflare-peek --help
cloudflare-peek scrape --help
```
### Environment Variables
```bash
# Required for OCR functionality
export GEMINI_API_KEY="your-gemini-api-key"
```
### ⏱️ Timeout Configuration
CloudflarePeek uses a default timeout of **2 minutes (120,000ms)** for page loading during OCR extraction. You can customize this:
```python
# Quick timeout (30 seconds) for fast sites
text = peek("https://example.com", timeout=30000)
# Extended timeout (5 minutes) for slow/complex sites
text = peek("https://example.com", timeout=300000)
# Very long timeout (10 minutes) for extremely slow sites
text = peek("https://example.com", timeout=600000)
```
### 📋 Progress Logging
CloudflarePeek provides detailed progress information during scraping:
```python
import logging
from cloudflare_peek import peek
# Enable detailed debug logging to see all steps
logging.getLogger('cloudflare_peek').setLevel(logging.DEBUG)
# You'll see progress like:
# 🎯 Starting CloudflarePeek for: https://example.com
# 🔍 Checking if https://example.com is behind Cloudflare...
# 🚀 No Cloudflare detected - attempting fast scraping...
# ✅ Fast scraping successful! (1234 characters extracted)
text = peek("https://example.com")
```
### 🔧 Event Loop Compatibility
CloudflarePeek automatically handles asyncio event loop conflicts, so it works seamlessly in:
- **Jupyter Notebooks** ✅
- **IPython environments** ✅
- **Web frameworks** (FastAPI, Django, etc.) ✅
- **Standalone scripts** ✅
No need for `nest_asyncio.apply()` or other workarounds; it's all handled internally!
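If you would rather manage the scheduling yourself in an async application, the usual pattern is to push a blocking call onto a worker thread with `asyncio.to_thread` (Python 3.9+). The sketch below uses a stand-in function in place of `peek` so it stays self-contained; swap in the real call in your own code:

```python
import asyncio

def blocking_scrape(url: str) -> str:
    # Stand-in for a synchronous, potentially slow call such as peek(url).
    return f"<content of {url}>"

async def scrape_in_thread(url: str) -> str:
    # Run the blocking call on a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_scrape, url)

if __name__ == "__main__":
    print(asyncio.run(scrape_in_thread("https://example.com")))
```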
## 🛠️ API Reference
### `peek(url, api_key=None, force_ocr=False, timeout=120000)`
The main function that intelligently chooses between fast scraping and OCR extraction.
**Parameters:**
- `url` (str): The URL to scrape
- `api_key` (str, optional): Gemini API key (uses `GEMINI_API_KEY` env var if not provided)
- `force_ocr` (bool): Skip fast scraping and use OCR method directly
- `timeout` (int): Page load timeout in milliseconds for OCR method (default: 120000 = 2 minutes)
**Returns:** Extracted text content as string
### `behind_cloudflare(url)`
Check if a website is protected by Cloudflare.
**Parameters:**
- `url` (str): The URL to check
**Returns:** `True` if behind Cloudflare, `False` otherwise
## 📝 Examples
### Example 1: Basic Website Scraping
```python
from cloudflare_peek import peek
# Works with any website
websites = [
    "https://httpbin.org/html",
    "https://quotes.toscrape.com",
    "https://scrapethissite.com"
]

for url in websites:
    content = peek(url)
    print(f"Content from {url}:")
    print(content[:200] + "...")
    print("-" * 50)
```
### Example 2: Batch Processing URLs
```python
import asyncio
from cloudflare_peek import peek
async def scrape_multiple(urls):
    results = {}
    for url in urls:
        try:
            content = peek(url)
            results[url] = content
            print(f"✅ Successfully scraped {url}")
        except Exception as e:
            print(f"❌ Failed to scrape {url}: {e}")
            results[url] = None
    return results
urls = ["https://example1.com", "https://example2.com"]
results = asyncio.run(scrape_multiple(urls))
```
### Example 3: Error Handling
```python
from cloudflare_peek import peek
def safe_scrape(url):
    try:
        return peek(url)
    except ValueError as e:
        if "API key" in str(e):
            print("❌ Gemini API key not found. Please set GEMINI_API_KEY environment variable.")
        return None
    except Exception as e:
        print(f"❌ Scraping failed: {e}")
        return None

content = safe_scrape("https://example.com")
if content:
    print("Scraping successful!")
```
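Building on the error handling above, a retry wrapper can smooth over transient failures such as timeouts or flaky networks. This is a sketch, not part of the package; the scraping callable is injected so the example stays self-contained, and in practice you would pass `peek`:

```python
import time

def scrape_with_retry(url, scrape, attempts=3, delay=1.0):
    """Call `scrape(url)`, retrying with linear backoff on failure.

    `scrape` is any callable that takes a URL, e.g. cloudflare_peek.peek.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return scrape(url)
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay * attempt)  # wait 1x, 2x, ... the base delay
    raise last_error
```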
## 🔧 Development
### Setting up for Development
```bash
# Clone the repository
git clone https://github.com/your-username/CloudflarePeek.git
cd CloudflarePeek
# Install in development mode with dev dependencies
pip install -e ".[dev]"
# Install Playwright browsers
playwright install chromium
# Set up your API key
export GEMINI_API_KEY="your-key-here"
```
### Running Tests
```bash
# Run tests
pytest
# Run tests with coverage
pytest --cov=cloudflare_peek
```
### Code Formatting
```bash
# Format code
black cloudflare_peek/
# Check types
mypy cloudflare_peek/
```
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
## 📄 License
This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.
## ⚠️ Disclaimer
This tool is for educational purposes only. The author is not responsible for any misuse or damage caused by this tool.
This tool is intended for legitimate web scraping purposes only. Always respect websites' robots.txt files and terms of service. Be mindful of rate limiting and don't overload servers with requests.
## 🙏 Acknowledgments
- [Playwright](https://playwright.dev/) for browser automation
- [Google Gemini](https://ai.google.dev/) for OCR capabilities
- [LangChain](https://python.langchain.com/) for traditional web scraping
- The open source community for inspiration and support