scrava

Name: scrava
Version: 0.1.1
Summary: The Definitive Web Scraping Framework for Python
Upload time: 2025-10-12 18:32:03
Author / maintainer: Nextract Data Solutions <hello@nextract.dev>
Homepage: https://github.com/nextractdevelopers/Scrava
Requires Python: >=3.8
License: MIT
Keywords: web-scraping, scraping, crawler, spider, async, asyncio, http, playwright, data-extraction, web-automation
# Scrava

**Scrava** is a powerful, composable web scraping framework for Python that provides a unified API for building scalable web scrapers by orchestrating the best tools in the Python ecosystem.

> 🏢 **Built by [Nextract Data Solutions](https://nextract.dev)**, your partner for enterprise web scraping and data extraction.

[![PyPI version](https://badge.fury.io/py/scrava.svg)](https://pypi.org/project/scrava/)
[![Python versions](https://img.shields.io/pypi/pyversions/scrava.svg)](https://pypi.org/project/scrava/)
[![License](https://img.shields.io/pypi/l/scrava.svg)](https://github.com/nextractdevelopers/Scrava/blob/main/LICENSE)

## 🎯 Philosophy

Scrava doesn't reinvent the wheel. Instead, it provides a **composition-over-invention** approach:

- **Unifying Force**: Eliminates boilerplate and integration complexity
- **Battle-Tested Libraries**: Built on httpx, Playwright, parsel, and more
- **Developer Experience**: Designed to be intuitive and "a piece of cake" for newcomers
- **Production-Ready**: Structured logging, statistics, error handling, and more

## ✨ Features

- 🚀 **Async-First**: Built on asyncio for maximum performance
- 🔄 **Dual-Mode Fetching**: HTTP (httpx) and Browser (Playwright) support
- 📦 **Flexible Queuing**: In-memory or Redis-backed with duplicate filtering
- 🪝 **Powerful Hooks**: Intercept and modify requests, responses, and data flow
- 💾 **Pipeline System**: MongoDB, JSON, or custom data storage
- 🎯 **Pydantic Integration**: Type-safe data models with validation
- 📊 **Structured Logging**: Production-grade logging with structlog
- ⚙️ **Config Management**: YAML + Pydantic for type-safe configuration
- 🛠️ **CLI Tools**: Project scaffolding, bot runner, and interactive shell

## 📦 Installation

### Prerequisites
- Python 3.8 or higher
- pip (latest version recommended)

### Platform-Specific Notes

**macOS (Apple Silicon - M1/M2/M3/M4):**
```bash
# Use native ARM64 Python for best performance
arch -arm64 pip install scrava
```
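
To confirm you are running a native ARM64 interpreter rather than an x86_64 build under Rosetta, a quick check using only the Python standard library:

```bash
# Prints "arm64" for a native Apple Silicon Python, "x86_64" under Rosetta
python3 -c "import platform; print(platform.machine())"
```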

**macOS (Intel):**
```bash
pip install scrava
```

**Windows:**
```bash
pip install scrava
```

**Linux:**
```bash
pip install scrava
```

### Installation Options

```bash
# Basic installation (works on all platforms)
pip install scrava

# With browser support (Playwright)
pip install scrava[browser]

# With Redis queue support
pip install scrava[redis]

# With MongoDB pipeline support
pip install scrava[mongodb]

# Install everything
pip install scrava[all]
```
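
The `browser` extra installs the Playwright Python package, but Playwright downloads its browser binaries separately. If the quick-install scripts below don't handle that step for you, the standard one-time setup is:

```bash
# Download the Chromium build that Playwright drives (one-time setup)
playwright install chromium
```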

### Development Installation

```bash
# Clone and install in editable mode
git clone https://github.com/nextractdevelopers/Scrava.git
cd Scrava
pip install -e .

# With all optional dependencies
pip install -e ".[all]"
```

### Quick Installation Scripts

For easier installation, use our platform-specific scripts:

**macOS/Linux:**
```bash
# Auto-detects architecture and installs correctly
curl -sSL https://raw.githubusercontent.com/nextractdevelopers/Scrava/main/install.sh | bash

# Or download and run manually
chmod +x install.sh
./install.sh
```

**Windows (PowerShell):**
```powershell
# Download and run the installation script
iwr -useb https://raw.githubusercontent.com/nextractdevelopers/Scrava/main/install.ps1 | iex

# Or download and run manually
.\install.ps1
```

### Verify Installation

```bash
# Check if Scrava is properly installed
scrava version

# Run the welcome screen
scrava
```

### Troubleshooting

If you encounter installation issues, see [PLATFORM.md](PLATFORM.md) for detailed platform-specific instructions.

## 🚀 Quick Start

### 1. Create a New Project

```bash
scrava new my_project
cd my_project
```

### 2. Define Your Bot

```python
# bots/book_bot.py
from pydantic import BaseModel, HttpUrl
from scrava import BaseBot, Request, Response


class Book(BaseModel):
    """A scraped book record."""
    title: str
    price: float
    url: HttpUrl
    in_stock: bool = True


class BookBot(BaseBot):
    """Bot for scraping books.toscrape.com"""
    
    start_urls = ['https://books.toscrape.com']
    
    async def process(self, response: Response):
        """Extract book data from the page."""
        # Extract books using parsel selectors
        for book in response.selector.css('article.product_pod'):
            title = book.css('h3 a::attr(title)').get()
            price_text = book.css('.price_color::text').get()
            price = float(price_text.replace('£', ''))
            url = response.urljoin(book.css('h3 a::attr(href)').get())
            
            yield Book(
                title=title,
                price=price,
                url=url
            )
        
        # Follow pagination
        next_page = response.selector.css('.next a::attr(href)').get()
        if next_page:
            yield Request(response.urljoin(next_page))
```

### 3. Run Your Bot

```bash
scrava run book_bot
```

## 🏗️ Core Components

### Request & Response

```python
from scrava import Request, Response

# Create a request
request = Request(
    url='https://example.com',
    method='GET',
    headers={'User-Agent': 'MyBot/1.0'},
    priority=10,  # Higher priority = processed first
    meta={'browser': True}  # Use browser rendering
)

# Response provides powerful selectors
async def process(self, response: Response):
    # CSS selectors
    title = response.selector.css('h1::text').get()
    
    # XPath selectors
    links = response.selector.xpath('//a/@href').getall()
    
    # Join relative URLs
    absolute_url = response.urljoin('/path')
```

### Bot Lifecycle

```python
from scrava import BaseBot, Request, Response

class MyBot(BaseBot):
    start_urls = ['https://example.com']
    
    async def setup(self):
        """Called before crawling starts."""
        self.session_data = {}
    
    async def process(self, response: Response):
        """Main processing method."""
        yield Record(...)   # a Pydantic model instance (your scraped data)
        yield Request(...)  # or a follow-up request to crawl
    
    async def teardown(self):
        """Called after crawling completes."""
        pass
```

### Queue System

```python
from scrava import Crawler
from scrava.queue import MemoryQueue, RedisQueue

# In-memory queue (default)
crawler = Crawler(queue=MemoryQueue())

# Redis-backed queue for distributed crawls
crawler = Crawler(queue=RedisQueue(redis_url="redis://localhost:6379/0"))
```

### Fetchers

```python
# HTTP fetcher (default)
from scrava.fetchers import HttpxFetcher

crawler = Crawler(
    fetcher=HttpxFetcher(
        timeout=30.0,
        follow_redirects=True,
        verify_ssl=True
    )
)

# Browser fetcher for JavaScript-heavy sites
from scrava.fetchers import PlaywrightFetcher

crawler = Crawler(
    browser_fetcher=PlaywrightFetcher(
        headless=True,
        browser_type='chromium',
        context_pool_size=5
    ),
    enable_browser=True
)

# Use browser rendering for specific requests (inside a bot's process() method)
yield Request(url, meta={'browser': True})
```

### Hooks

#### Request Hooks

```python
from scrava.hooks import RequestHook

class UserAgentHook(RequestHook):
    async def process_req(self, request, bot):
        # Modify request before fetching
        request.headers['User-Agent'] = 'MyBot/1.0'
        return None
    
    async def process_res(self, request, response, bot):
        # Process response after fetching
        print(f"Got {response.status} from {response.url}")
        return None

crawler = Crawler(request_hooks=[UserAgentHook()])
```

#### Built-in Cache Hook

```python
from scrava.hooks import CacheHook

# Enable caching
crawler = Crawler(
    request_hooks=[
        CacheHook(expiration=86400)  # Cache for 1 day
    ]
)

# Disable caching for specific requests
yield Request(url, meta={'cache': False})
```

### Pipelines

```python
from scrava.pipelines import JsonPipeline, MongoPipeline

# JSON output
crawler = Crawler(
    pipelines=[JsonPipeline(output_file='output.jsonl')]
)

# MongoDB with batching
crawler = Crawler(
    pipelines=[
        MongoPipeline(
            uri='mongodb://localhost:27017',
            database='scrava',
            batch_size=100,
            batch_timeout=5.0
        )
    ]
)

# Custom pipeline
from scrava.pipelines import BasePipeline

class CustomPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Process and store the record; save_to_db stands in for your own persistence logic
        await self.save_to_db(record)
        return record
```

### Configuration

```yaml
# config/settings.yaml
project_name: "my_project"

scrava:
  concurrent_reqs: 16
  download_delay: 0.0
  enable_browser: false

cache:
  enabled: true
  path: ".scrava_cache"
  expiration_secs: 86400

queue:
  backend: "scrava.queue.memory.MemoryQueue"
  redis_url: "redis://localhost:6379/0"

pipeline:
  enabled:
    - scrava.pipelines.json.JsonPipeline
  mongodb_uri: "mongodb://localhost:27017"
  mongodb_database: "scrava"

logging:
  level: "INFO"
  format: "console"  # or "json" for production
  use_colors: true
```

```python
from scrava.config import load_settings

settings = load_settings('config/settings.yaml')
```

### Logging

```python
from scrava.logging import setup_logging, get_logger

# Setup logging
setup_logging(
    level="INFO",
    format="console",  # "json" for production
    use_colors=True
)

# Get logger
logger = get_logger(__name__)

logger.info("Bot started", bot_name="my_bot", url="https://example.com")
# Output: 2024-10-27 10:30:05 [info] Bot started bot_name=my_bot url=https://example.com
```

## 🔧 CLI Commands

```bash
# Create a new project
scrava new <project_name>

# Run a bot
scrava run <bot_name>

# List all bots
scrava list

# Interactive selector shell
scrava shell <url>
scrava shell <url> --browser  # Use browser rendering

# Show version
scrava version
```

## 📚 Advanced Examples

### Custom Callback Methods

```python
from scrava import BaseBot, Request, Response

# `Product` is a Pydantic model defined the same way as `Book` in the Quick Start.
class ProductBot(BaseBot):
    start_urls = ['https://shop.example.com']
    
    async def process(self, response: Response):
        # Extract category links
        for category in response.selector.css('.category'):
            url = response.urljoin(category.css('a::attr(href)').get())
            yield Request(url, callback=self.parse_category)
    
    async def parse_category(self, response: Response):
        # Extract products
        for product in response.selector.css('.product'):
            yield Request(
                response.urljoin(product.css('a::attr(href)').get()),
                callback=self.parse_product
            )
    
    async def parse_product(self, response: Response):
        yield Product(
            name=response.selector.css('h1::text').get(),
            price=float(response.selector.css('.price::text').get())
        )
```

### Browser Automation

```python
async def process(self, response: Response):
    # Render a JavaScript-heavy page in a real browser, wait for dynamic content, and scroll
    yield Request(
        url='https://spa-site.com',
        meta={
            'browser': True,
            'wait_for': '.dynamic-content',
            'scroll': True
        }
    )
```

### Error Handling Hook

```python
from scrava.hooks import RequestHook

class RetryHook(RequestHook):
    async def process_exc(self, request, exception, bot):
        if request.meta.get('retry_count', 0) < 3:
            # Retry with incremented counter
            request.meta['retry_count'] = request.meta.get('retry_count', 0) + 1
            await bot.queue.push(request)
        return None
```

### Data Validation Pipeline

```python
from scrava.logging import get_logger
from scrava.pipelines import BasePipeline

logger = get_logger(__name__)


class ValidationPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Pydantic validates field types at creation; add business-rule checks here
        if record.price < 0:
            logger.warning("Invalid price", record=record)
            return None  # Returning None filters the record out
        return record
```

## 🎯 Best Practices

1. **Use Pydantic Models**: Define clear schemas for your scraped data
2. **Leverage Hooks**: Keep bot logic clean by using hooks for cross-cutting concerns
3. **Configure Delays**: Be respectful with `download_delay` to avoid overwhelming servers (see the config sketch after this list)
4. **Enable Caching**: Speed up development with the built-in CacheHook
5. **Structure Logs**: Use structured logging for easy debugging and monitoring
6. **Handle Errors**: Implement retry logic and error hooks for robust crawls
7. **Test Selectors**: Use `scrava shell <url>` to test CSS/XPath selectors interactively
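
For items 3 and 4, both knobs live in `config/settings.yaml`; a minimal sketch using the keys shown in the Configuration section (values are illustrative):

```yaml
scrava:
  concurrent_reqs: 8      # fewer parallel requests
  download_delay: 1.0     # pause one second between requests

cache:
  enabled: true           # reuse cached responses during development
  expiration_secs: 86400  # cache entries expire after one day
```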

## 🔗 Architecture

```
┌─────────────┐
│    Bot      │  ← Your scraping logic
└──────┬──────┘
       │
       ↓
┌─────────────┐
│    Core     │  ← Orchestrator (asyncio event loop)
└──────┬──────┘
       │
       ├→ Queue      (MemoryQueue / RedisQueue)
       ├→ Fetcher    (HttpxFetcher / PlaywrightFetcher)
       ├→ Hooks      (RequestHook / BotHook)
       └→ Pipelines  (MongoPipeline / JsonPipeline)
```
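
The diagram maps directly onto the `Crawler` constructor arguments used throughout this README. A rough wiring sketch follows (the component choices are illustrative; how a bot is bound to the crawler and executed, typically via `scrava run <bot_name>`, is handled by the framework and not shown here):

```python
from scrava import Crawler
from scrava.fetchers import HttpxFetcher
from scrava.hooks import CacheHook
from scrava.pipelines import JsonPipeline
from scrava.queue import MemoryQueue

crawler = Crawler(
    queue=MemoryQueue(),                                  # swap for RedisQueue in distributed crawls
    fetcher=HttpxFetcher(timeout=30.0),                   # HTTP fetching via httpx
    request_hooks=[CacheHook(expiration=86400)],          # cache responses for one day
    pipelines=[JsonPipeline(output_file='books.jsonl')],  # write records as JSON Lines
)
```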

## 📖 Documentation

For full documentation, visit: [https://scrava.readthedocs.io](https://scrava.readthedocs.io)

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📄 License

MIT License - see LICENSE file for details

## 🙏 Acknowledgments

Scrava is built on the shoulders of giants:
- [httpx](https://www.python-httpx.org/) - HTTP client
- [Playwright](https://playwright.dev/python/) - Browser automation
- [parsel](https://parsel.readthedocs.io/) - Data extraction
- [Pydantic](https://pydantic.dev/) - Data validation
- [structlog](https://www.structlog.org/) - Structured logging
- [Typer](https://typer.tiangolo.com/) - CLI framework

---

## 🏢 About Nextract Data Solutions

Scrava is developed and maintained by [**Nextract Data Solutions**](https://nextract.dev), a leading provider of enterprise web scraping and data extraction services.

**Need enterprise-grade data extraction?**

While Scrava is perfect for developers building their own scrapers, Nextract Data Solutions offers done-for-you web scraping and data pipelines for businesses that need:

- ✅ Custom enterprise scraping solutions
- ✅ Data-as-a-Service (DaaS) subscriptions
- ✅ Data enrichment and validation
- ✅ 99.9% accuracy and reliability
- ✅ Dedicated support and SLA guarantees

### 📞 Contact Nextract

- **Website**: [https://nextract.dev](https://nextract.dev)
- **Email**: [hello@nextract.dev](mailto:hello@nextract.dev)
- **Phone**: +91 85110-98799
- **GitHub**: [@nextractdevelopers](https://github.com/nextractdevelopers)

[**Schedule a Free Strategy Call**](https://nextract.dev) | [**Download Capabilities Deck**](https://nextract.dev)

---

**Happy Scraping! 🕷️**



            
