# Scrava
**Scrava** is a composable web scraping framework for Python that provides a unified API for building scalable scrapers by orchestrating best-in-class tools from the Python ecosystem.
> 🏢 **Built by [Nextract Data Solutions](https://nextract.dev)** - Your partner for enterprise web scraping and data extraction.
[PyPI](https://pypi.org/project/scrava/) · [License](https://github.com/nextractdevelopers/Scrava/blob/main/LICENSE)
## 🎯 Philosophy
Scrava doesn't reinvent the wheel. Instead, it takes a **composition-over-invention** approach:
- **Unifying Force**: Eliminates boilerplate and integration complexity
- **Battle-Tested Libraries**: Built on httpx, Playwright, parsel, and more
- **Developer Experience**: Designed to be intuitive and a "piece of cake" for newcomers
- **Production-Ready**: Structured logging, statistics, error handling, and more
## ✨ Features
- 🚀 **Async-First**: Built on asyncio for maximum performance
- 🔄 **Dual-Mode Fetching**: HTTP (httpx) and Browser (Playwright) support
- 📦 **Flexible Queuing**: In-memory or Redis-backed with duplicate filtering
- 🪝 **Powerful Hooks**: Intercept and modify requests, responses, and data flow
- 💾 **Pipeline System**: MongoDB, JSON, or custom data storage
- 🎯 **Pydantic Integration**: Type-safe data models with validation
- 📊 **Structured Logging**: Production-grade logging with structlog
- ⚙️ **Config Management**: YAML + Pydantic for type-safe configuration
- 🛠️ **CLI Tools**: Project scaffolding, bot runner, and interactive shell
## 📦 Installation
### Prerequisites
- Python 3.8 or higher
- pip (latest version recommended)
### Platform-Specific Notes
**macOS (Apple Silicon - M1/M2/M3/M4):**
```bash
# Use native ARM64 Python for best performance
arch -arm64 pip install scrava
```
**macOS (Intel):**
```bash
pip install scrava
```
**Windows:**
```bash
pip install scrava
```
**Linux:**
```bash
pip install scrava
```
### Installation Options
```bash
# Basic installation (works on all platforms)
pip install scrava

# With browser support (Playwright)
pip install "scrava[browser]"

# With Redis queue support
pip install "scrava[redis]"

# With MongoDB pipeline support
pip install "scrava[mongodb]"

# Install everything (quotes keep shells like zsh from globbing the brackets)
pip install "scrava[all]"
```
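If you installed the `browser` extra, Playwright usually also needs its browser binaries downloaded once. This step is an assumption based on standard Playwright setup rather than something Scrava automates; adjust to your environment:
```bash
# One-time download of the Chromium binary used by Playwright
playwright install chromium
```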
### Development Installation
```bash
# Clone and install in editable mode
git clone https://github.com/nextractdevelopers/Scrava.git
cd Scrava
pip install -e .
# With all optional dependencies
pip install -e ".[all]"
```
### Quick Installation Scripts
For easier installation, use our platform-specific scripts:
**macOS/Linux:**
```bash
# Auto-detects architecture and installs correctly
curl -sSL https://raw.githubusercontent.com/nextractdevelopers/Scrava/main/install.sh | bash
# Or download and run manually
chmod +x install.sh
./install.sh
```
**Windows (PowerShell):**
```powershell
# Download and run the installation script
iwr -useb https://raw.githubusercontent.com/nextractdevelopers/Scrava/main/install.ps1 | iex
# Or download and run manually
.\install.ps1
```
### Verify Installation
```bash
# Check if Scrava is properly installed
scrava version
# Run the welcome screen
scrava
```
### Troubleshooting
If you encounter installation issues, see [PLATFORM.md](PLATFORM.md) for detailed platform-specific instructions.
## 🚀 Quick Start
### 1. Create a New Project
```bash
scrava new my_project
cd my_project
```
### 2. Define Your Bot
```python
# bots/book_bot.py
from pydantic import BaseModel, HttpUrl
from scrava import BaseBot, Request, Response


class Book(BaseModel):
    """A scraped book record."""
    title: str
    price: float
    url: HttpUrl
    in_stock: bool = True


class BookBot(BaseBot):
    """Bot for scraping books.toscrape.com"""

    start_urls = ['https://books.toscrape.com']

    async def process(self, response: Response):
        """Extract book data from the page."""
        # Extract books using parsel selectors
        for book in response.selector.css('article.product_pod'):
            title = book.css('h3 a::attr(title)').get()
            price_text = book.css('.price_color::text').get()
            price = float(price_text.replace('£', ''))
            url = response.urljoin(book.css('h3 a::attr(href)').get())

            yield Book(
                title=title,
                price=price,
                url=url
            )

        # Follow pagination
        next_page = response.selector.css('.next a::attr(href)').get()
        if next_page:
            yield Request(response.urljoin(next_page))
```
### 3. Run Your Bot
```bash
scrava run book_bot
```
## 🏗️ Core Components
### Request & Response
```python
from scrava import Request, Response

# Create a request
request = Request(
    url='https://example.com',
    method='GET',
    headers={'User-Agent': 'MyBot/1.0'},
    priority=10,             # Higher priority = processed first
    meta={'browser': True}   # Use browser rendering
)

# Response provides powerful selectors (inside a bot's process method)
async def process(self, response: Response):
    # CSS selectors
    title = response.selector.css('h1::text').get()

    # XPath selectors
    links = response.selector.xpath('//a/@href').getall()

    # Join relative URLs
    absolute_url = response.urljoin('/path')
```
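Under the hood the selector API is parsel (see Acknowledgments), so the same CSS/XPath syntax can be exercised standalone. A minimal sketch using parsel directly, assuming you already have some HTML text in hand:
```python
from parsel import Selector

html = "<html><body><h1>Demo</h1><a href='/a'>A</a><a href='/b'>B</a></body></html>"
sel = Selector(text=html)

# Same .css()/.xpath() calls as response.selector above
print(sel.css('h1::text').get())        # "Demo"
print(sel.xpath('//a/@href').getall())  # ["/a", "/b"]
```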
### Bot Lifecycle
```python
from scrava import BaseBot, Response

class MyBot(BaseBot):
    start_urls = ['https://example.com']

    async def setup(self):
        """Called before crawling starts."""
        self.session_data = {}

    async def process(self, response: Response):
        """Main processing method."""
        yield Record(...)
        yield Request(...)

    async def teardown(self):
        """Called after crawling completes."""
        pass
```
### Queue System
```python
from scrava import Crawler
from scrava.queue import MemoryQueue, RedisQueue
# In-memory queue (default)
crawler = Crawler(queue=MemoryQueue())
# Redis-backed queue for distributed crawls
crawler = Crawler(queue=RedisQueue(redis_url="redis://localhost:6379/0"))
```
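The duplicate filtering mentioned in the features is, conceptually, a fingerprint set kept alongside the queue. The following is an illustrative sketch of that idea rather than Scrava's actual implementation, using a SHA-1 of the method and URL as the fingerprint:
```python
import hashlib

class DedupFilter:
    """Tracks request fingerprints so each URL is enqueued at most once."""

    def __init__(self):
        self._seen = set()

    def fingerprint(self, method: str, url: str) -> str:
        return hashlib.sha1(f"{method} {url}".encode()).hexdigest()

    def is_new(self, method: str, url: str) -> bool:
        fp = self.fingerprint(method, url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True

# Usage: only push requests whose fingerprint has not been seen before
dedup = DedupFilter()
assert dedup.is_new("GET", "https://example.com/")
assert not dedup.is_new("GET", "https://example.com/")
```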
### Fetchers
```python
# HTTP fetcher (default)
from scrava.fetchers import HttpxFetcher

crawler = Crawler(
    fetcher=HttpxFetcher(
        timeout=30.0,
        follow_redirects=True,
        verify_ssl=True
    )
)

# Browser fetcher for JavaScript-heavy sites
from scrava.fetchers import PlaywrightFetcher

crawler = Crawler(
    browser_fetcher=PlaywrightFetcher(
        headless=True,
        browser_type='chromium',
        context_pool_size=5
    ),
    enable_browser=True
)

# Use browser for specific requests
yield Request(url, meta={'browser': True})
```
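The HttpxFetcher's parameters map onto the underlying httpx client. If you want to see what a fetch looks like with httpx alone, here is a rough sketch under the assumption that the fetcher wraps `httpx.AsyncClient`:
```python
import asyncio
import httpx

async def fetch(url: str) -> str:
    # Same knobs as HttpxFetcher above: timeout, redirects, TLS verification
    async with httpx.AsyncClient(timeout=30.0, follow_redirects=True, verify=True) as client:
        resp = await client.get(url, headers={"User-Agent": "MyBot/1.0"})
        resp.raise_for_status()
        return resp.text

print(asyncio.run(fetch("https://example.com"))[:80])
```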
### Hooks
#### Request Hooks
```python
from scrava.hooks import RequestHook

class UserAgentHook(RequestHook):
    async def process_req(self, request, bot):
        # Modify request before fetching
        request.headers['User-Agent'] = 'MyBot/1.0'
        return None

    async def process_res(self, request, response, bot):
        # Process response after fetching
        print(f"Got {response.status} from {response.url}")
        return None

crawler = Crawler(request_hooks=[UserAgentHook()])
```
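The same interface lends itself to other cross-cutting concerns. As an illustration, here is a hypothetical hook that spaces out requests with a fixed delay; the class and the delay value are assumptions, but the methods follow the pattern shown above:
```python
import asyncio
from scrava.hooks import RequestHook

class ThrottleHook(RequestHook):
    """Sleeps briefly before every request to stay polite to the target site."""

    def __init__(self, delay: float = 0.5):
        self.delay = delay

    async def process_req(self, request, bot):
        await asyncio.sleep(self.delay)  # simple fixed delay; compare download_delay in config
        return None

crawler = Crawler(request_hooks=[ThrottleHook(delay=1.0)])
```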
#### Built-in Cache Hook
```python
from scrava.hooks import CacheHook

# Enable caching
crawler = Crawler(
    request_hooks=[
        CacheHook(expiration=86400)  # Cache for 1 day
    ]
)

# Disable caching for specific requests
yield Request(url, meta={'cache': False})
```
### Pipelines
```python
from scrava.pipelines import JsonPipeline, MongoPipeline

# JSON output
crawler = Crawler(
    pipelines=[JsonPipeline(output_file='output.jsonl')]
)

# MongoDB with batching
crawler = Crawler(
    pipelines=[
        MongoPipeline(
            uri='mongodb://localhost:27017',
            database='scrava',
            batch_size=100,
            batch_timeout=5.0
        )
    ]
)

# Custom pipeline
from scrava.pipelines import BasePipeline

class CustomPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Process and store the record (save_to_db is your own persistence helper)
        await self.save_to_db(record)
        return record
```
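`save_to_db` above is just a placeholder. One possible way to fill it in uses Motor (MongoDB's async driver); the sketch below is an assumption-laden example, not Scrava's MongoPipeline, and presumes Pydantic v2 models plus a local MongoDB instance:
```python
from motor.motor_asyncio import AsyncIOMotorClient
from scrava.pipelines import BasePipeline

class MotorPipeline(BasePipeline):
    def __init__(self, uri="mongodb://localhost:27017", database="scrava", collection="records"):
        # One client per pipeline; Motor handles its own connection pooling
        self._col = AsyncIOMotorClient(uri)[database][collection]

    async def process_rec(self, record, bot):
        # model_dump(mode="json") serialises HttpUrl and similar types to plain strings
        await self._col.insert_one(record.model_dump(mode="json"))
        return record
```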
### Configuration
```yaml
# config/settings.yaml
project_name: "my_project"

scrava:
  concurrent_reqs: 16
  download_delay: 0.0
  enable_browser: false

cache:
  enabled: true
  path: ".scrava_cache"
  expiration_secs: 86400

queue:
  backend: "scrava.queue.memory.MemoryQueue"
  redis_url: "redis://localhost:6379/0"

pipeline:
  enabled:
    - scrava.pipelines.json.JsonPipeline
  mongodb_uri: "mongodb://localhost:27017"
  mongodb_database: "scrava"

logging:
  level: "INFO"
  format: "console"  # or "json" for production
  use_colors: true
```
```python
from scrava.config import load_settings
settings = load_settings('config/settings.yaml')
```
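If you are curious how the YAML-plus-Pydantic combination works in general, it boils down to parsing the YAML and validating it against a settings model. The sketch below illustrates the pattern only; it is not Scrava's internal `load_settings`, and the model fields shown are a subset of the config above:
```python
import yaml
from pydantic import BaseModel

class ScravaSection(BaseModel):
    concurrent_reqs: int = 16
    download_delay: float = 0.0
    enable_browser: bool = False

class Settings(BaseModel):
    project_name: str
    scrava: ScravaSection = ScravaSection()

def load_yaml_settings(path: str) -> Settings:
    # Parse the YAML file, then let Pydantic coerce and validate the types
    with open(path) as fh:
        return Settings.model_validate(yaml.safe_load(fh))

settings = load_yaml_settings("config/settings.yaml")
print(settings.scrava.concurrent_reqs)  # validated, typed access
```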
### Logging
```python
from scrava.logging import setup_logging, get_logger

# Setup logging
setup_logging(
    level="INFO",
    format="console",  # "json" for production
    use_colors=True
)

# Get logger
logger = get_logger(__name__)

logger.info("Bot started", bot_name="my_bot", url="https://example.com")
# Output: 2024-10-27 10:30:05 [info] Bot started bot_name=my_bot url=https://example.com
```
## 🔧 CLI Commands
```bash
# Create a new project
scrava new <project_name>
# Run a bot
scrava run <bot_name>
# List all bots
scrava list
# Interactive selector shell
scrava shell <url>
scrava shell <url> --browser # Use browser rendering
# Show version
scrava version
```
## 📚 Advanced Examples
### Custom Callback Methods
```python
class ProductBot(BaseBot):
    start_urls = ['https://shop.example.com']

    async def process(self, response: Response):
        # Extract category links
        for category in response.selector.css('.category'):
            url = response.urljoin(category.css('a::attr(href)').get())
            yield Request(url, callback=self.parse_category)

    async def parse_category(self, response: Response):
        # Extract products
        for product in response.selector.css('.product'):
            yield Request(
                response.urljoin(product.css('a::attr(href)').get()),
                callback=self.parse_product
            )

    async def parse_product(self, response: Response):
        yield Product(
            name=response.selector.css('h1::text').get(),
            price=float(response.selector.css('.price::text').get())
        )
```
### Browser Automation
```python
async def process(self, response: Response):
    # Render the page in a browser, wait for dynamic content, scroll, etc.
    yield Request(
        url='https://spa-site.com',
        meta={
            'browser': True,
            'wait_for': '.dynamic-content',
            'scroll': True
        }
    )
```
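For orientation, the `wait_for` and `scroll` hints correspond to the kinds of calls Playwright exposes directly. The raw Playwright sketch below is shown only to illustrate what a browser fetcher has to do under the hood; it is not how Scrava implements these options:
```python
from playwright.async_api import async_playwright

async def render(url: str) -> str:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        await page.wait_for_selector('.dynamic-content')                       # 'wait_for'
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")  # 'scroll'
        html = await page.content()
        await browser.close()
        return html
```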
### Error Handling Hook
```python
class RetryHook(RequestHook):
    async def process_exc(self, request, exception, bot):
        if request.meta.get('retry_count', 0) < 3:
            # Retry with incremented counter
            request.meta['retry_count'] = request.meta.get('retry_count', 0) + 1
            await bot.queue.push(request)
        return None
```
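A small variant adds an exponential backoff before re-queuing, which tends to play nicer with rate-limited sites. Same hook interface as above; the delay schedule is an arbitrary choice for illustration:
```python
import asyncio

class BackoffRetryHook(RequestHook):
    async def process_exc(self, request, exception, bot):
        retries = request.meta.get('retry_count', 0)
        if retries < 3:
            request.meta['retry_count'] = retries + 1
            await asyncio.sleep(2 ** retries)  # 1s, 2s, 4s
            await bot.queue.push(request)
        return None
```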
### Data Validation Pipeline
```python
class ValidationPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Pydantic already validates types; this adds a domain-level sanity check
        if record.price < 0:
            logger.warning("Invalid price", record=record)
            return None  # Filter out
        return record
```
## 🎯 Best Practices
1. **Use Pydantic Models**: Define clear schemas for your scraped data
2. **Leverage Hooks**: Keep bot logic clean by using hooks for cross-cutting concerns
3. **Configure Delays**: Be respectful with `download_delay` to avoid overwhelming servers
4. **Enable Caching**: Speed up development with the built-in CacheHook
5. **Structure Logs**: Use structured logging for easy debugging and monitoring
6. **Handle Errors**: Implement retry logic and error hooks for robust crawls
7. **Test Selectors**: Use `scrava shell <url>` to test CSS/XPath selectors interactively
## 🔗 Architecture
```
┌─────────────┐
│     Bot     │  ← Your scraping logic
└──────┬──────┘
       │
       ↓
┌─────────────┐
│    Core     │  ← Orchestrator (asyncio event loop)
└──────┬──────┘
       │
       ├→ Queue     (MemoryQueue / RedisQueue)
       ├→ Fetcher   (HttpxFetcher / PlaywrightFetcher)
       ├→ Hooks     (RequestHook / BotHook)
       └→ Pipelines (MongoPipeline / JsonPipeline)
```
## 📖 Documentation
For full documentation, visit: [https://scrava.readthedocs.io](https://scrava.readthedocs.io)
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## 📄 License
MIT License - see LICENSE file for details
## 🙏 Acknowledgments
Scrava is built on the shoulders of giants:
- [httpx](https://www.python-httpx.org/) - HTTP client
- [Playwright](https://playwright.dev/python/) - Browser automation
- [parsel](https://parsel.readthedocs.io/) - Data extraction
- [Pydantic](https://pydantic.dev/) - Data validation
- [structlog](https://www.structlog.org/) - Structured logging
- [Typer](https://typer.tiangolo.com/) - CLI framework
---
## 🏢 About Nextract Data Solutions
Scrava is developed and maintained by [**Nextract Data Solutions**](https://nextract.dev), a leading provider of enterprise web scraping and data extraction services.
**Need enterprise-grade data extraction?**
While Scrava is perfect for developers building their own scrapers, Nextract Data Solutions offers done-for-you web scraping and data pipelines for businesses that need:
- ✅ Custom enterprise scraping solutions
- ✅ Data-as-a-Service (DaaS) subscriptions
- ✅ Data enrichment and validation
- ✅ 99.9% accuracy and reliability
- ✅ Dedicated support and SLA guarantees
### 📞 Contact Nextract
- **Website**: [https://nextract.dev](https://nextract.dev)
- **Email**: [hello@nextract.dev](mailto:hello@nextract.dev)
- **Phone**: +91 85110-98799
- **GitHub**: [@nextractdevelopers](https://github.com/nextractdevelopers)
[**Schedule a Free Strategy Call**](https://nextract.dev) | [**Download Capabilities Deck**](https://nextract.dev)
---
**Happy Scraping! 🕷️**