# RapidCrawl
<p align="center">
<img src="https://img.shields.io/pypi/v/rapid-crawl.svg" alt="PyPI version">
<img src="https://img.shields.io/pypi/pyversions/rapid-crawl.svg" alt="Python versions">
<img src="https://img.shields.io/github/license/aoneahsan/rapid-crawl.svg" alt="License">
<img src="https://img.shields.io/github/stars/aoneahsan/rapid-crawl.svg" alt="Stars">
</p>
A powerful Python SDK for web scraping, crawling, and data extraction. RapidCrawl provides a comprehensive toolkit for extracting data from websites, handling dynamic content, and converting web pages into clean, structured formats suitable for AI and LLM applications.
## 🚀 Features
- **🔍 Scrape**: Convert any URL into clean markdown, HTML, text, or structured data
- **🕷️ Crawl**: Recursively crawl websites with depth control and filtering
- **🗺️ Map**: Quickly discover all URLs on a website
- **🔎 Search**: Web search with automatic result scraping
- **📸 Screenshot**: Capture full-page screenshots
- **🎭 Dynamic Content**: Handle JavaScript-rendered pages with Playwright
- **📄 Multiple Formats**: Support for Markdown, HTML, PDF, images, and more
- **🚄 Async Support**: High-performance asynchronous operations
- **🛡️ Error Handling**: Comprehensive error handling and retry logic
- **📦 CLI Tool**: Feature-rich command-line interface
## 📋 Table of Contents
- [Installation](#-installation)
- [Quick Start](#-quick-start)
- [Features](#-features-1)
- [Scraping](#scraping)
- [Crawling](#crawling)
- [Mapping](#mapping)
- [Searching](#searching)
- [CLI Usage](#-cli-usage)
- [Configuration](#-configuration)
- [API Reference](#-api-reference)
- [Examples](#-examples)
- [Advanced Usage](#-advanced-usage)
- [Performance](#-performance)
- [Troubleshooting](#-troubleshooting)
- [Development](#-development)
- [Contributing](#-contributing)
- [Security](#-security)
- [License](#-license)
- [Support](#-support)
## 📦 Installation
### Using pip
```bash
pip install rapid-crawl
```
### Using pip with development dependencies
```bash
pip install "rapid-crawl[dev]"
```
### From source
```bash
git clone https://github.com/aoneahsan/rapid-crawl.git
cd rapid-crawl
pip install -e .
```
### Install Playwright browsers (required for dynamic content)
```bash
playwright install chromium
```
## 🚀 Quick Start
### Python SDK
```python
from rapidcrawl import RapidCrawlApp
# Initialize the client
app = RapidCrawlApp()
# Scrape a single page
result = app.scrape_url("https://example.com")
print(result.content["markdown"])
# Crawl a website
crawl_result = app.crawl_url(
"https://example.com",
max_pages=10,
max_depth=2
)
# Map all URLs
map_result = app.map_url("https://example.com")
print(f"Found {map_result.total_urls} URLs")
# Search and scrape
search_result = app.search(
"python web scraping",
num_results=5,
scrape_results=True
)
```
### Command Line
```bash
# Scrape a URL
rapidcrawl scrape https://example.com
# Crawl a website
rapidcrawl crawl https://example.com --max-pages 10
# Map URLs
rapidcrawl map https://example.com --limit 100
# Search
rapidcrawl search "python tutorials" --scrape
```
## 🎯 Features
### Scraping
Convert any web page into clean, structured data:
```python
from rapidcrawl import RapidCrawlApp
app = RapidCrawlApp()
# Basic scraping
result = app.scrape_url("https://example.com")
# Multiple formats
result = app.scrape_url(
"https://example.com",
formats=["markdown", "html", "screenshot"],
wait_for=".content", # Wait for element
timeout=60000, # 60 seconds timeout
)
# Extract structured data
result = app.scrape_url(
"https://example.com/product",
extract_schema=[
{"name": "title", "selector": "h1"},
{"name": "price", "selector": ".price", "type": "number"},
{"name": "description", "selector": ".description"}
]
)
print(result.structured_data)
# {'title': 'Product Name', 'price': 29.99, 'description': '...'}
# Mobile viewport
result = app.scrape_url(
"https://example.com",
mobile=True
)
# With actions (click, type, scroll)
result = app.scrape_url(
"https://example.com",
actions=[
{"type": "click", "selector": ".load-more"},
{"type": "wait", "value": 2000},
{"type": "scroll", "value": 1000}
]
)
```
### Crawling
Recursively crawl websites with advanced filtering:
```python
# Basic crawling
result = app.crawl_url(
"https://example.com",
max_pages=50,
max_depth=3
)
# With URL filtering
result = app.crawl_url(
"https://example.com",
include_patterns=[r"/blog/.*", r"/docs/.*"],
exclude_patterns=[r".*\.pdf$", r".*/tag/.*"]
)
# Async crawling for better performance
import asyncio
async def crawl_async():
result = await app.crawl_url_async(
"https://example.com",
max_pages=100,
max_depth=5,
allow_subdomains=True
)
return result
result = asyncio.run(crawl_async())
# With webhook notifications
result = app.crawl_url(
"https://example.com",
webhook_url="https://your-webhook.com/progress"
)
```
### Mapping
Quickly discover all URLs on a website:
```python
# Basic mapping
result = app.map_url("https://example.com")
print(f"Found {result.total_urls} URLs")
# Filter URLs by search term
result = app.map_url(
"https://example.com",
search="product",
limit=1000
)
# Include subdomains
result = app.map_url(
"https://example.com",
include_subdomains=True,
ignore_sitemap=False # Use sitemap.xml if available
)
# Access the URLs
for url in result.urls[:10]:
print(url)
```
### Searching
Search the web and optionally scrape results:
```python
# Basic search
result = app.search("python web scraping tutorial")
# Search with scraping
result = app.search(
"latest AI news",
num_results=10,
scrape_results=True,
formats=["markdown", "text"]
)
# Access results
for item in result.results:
print(f"{item.position}. {item.title}")
print(f" URL: {item.url}")
if item.scraped_content:
print(f" Content: {item.scraped_content.content['markdown'][:200]}...")
# Different search engines
result = app.search(
"machine learning",
engine="duckduckgo", # or "google", "bing"
num_results=20
)
# With date filtering
from datetime import datetime, timedelta
result = app.search(
"tech news",
start_date=datetime.now() - timedelta(days=7),
end_date=datetime.now()
)
```
## 💻 CLI Usage
RapidCrawl provides a comprehensive command-line interface:
### Setup Wizard
```bash
# Interactive setup
rapidcrawl setup
```
### Scraping
```bash
# Basic scrape
rapidcrawl scrape https://example.com
# Save to file
rapidcrawl scrape https://example.com -o output.md
# Multiple formats
rapidcrawl scrape https://example.com -f markdown -f html -f screenshot
# Wait for element
rapidcrawl scrape https://example.com --wait-for ".content"
# Extract structured data
rapidcrawl scrape https://example.com \
--extract-schema '[{"name": "title", "selector": "h1"}]'
```
### Crawling
```bash
# Basic crawl
rapidcrawl crawl https://example.com
# Advanced crawl
rapidcrawl crawl https://example.com \
--max-pages 100 \
--max-depth 3 \
--include "*/blog/*" \
--exclude "*.pdf" \
--output ./crawl-results/
```
### Mapping
```bash
# Map all URLs
rapidcrawl map https://example.com
# Filter and save
rapidcrawl map https://example.com \
--search "product" \
--limit 1000 \
--output urls.txt
```
### Searching
```bash
# Basic search
rapidcrawl search "python tutorials"
# Search and scrape
rapidcrawl search "machine learning" \
--scrape \
--num-results 20 \
--engine google \
--output results/
```
## ⚙️ Configuration
### Environment Variables
Create a `.env` file in your project root:
```env
# API Configuration
RAPIDCRAWL_API_KEY=your_api_key_here
RAPIDCRAWL_BASE_URL=https://api.rapidcrawl.io/v1
RAPIDCRAWL_TIMEOUT=30
# Optional
RAPIDCRAWL_MAX_RETRIES=3
```
### Python Configuration
```python
from rapidcrawl import RapidCrawlApp
# Custom configuration
app = RapidCrawlApp(
api_key="your_api_key",
base_url="https://custom-api.example.com",
timeout=60.0,
max_retries=5,
debug=True
)
```
### Manual Configuration Options
If the automated setup doesn't work, you can configure RapidCrawl manually (a sketch follows the list below):
1. **API Key**: Set via environment variable or pass to constructor
2. **Base URL**: For self-hosted instances
3. **Timeout**: Request timeout in seconds
4. **SSL Verification**: Disable for self-signed certificates
5. **Debug Mode**: Enable verbose logging
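A minimal sketch of wiring these options up by hand, assuming you prefer explicit constructor arguments over the setup wizard; the environment variable names match the `.env` example above, and the defaults shown are illustrative:

```python
import os
from rapidcrawl import RapidCrawlApp

# Read the documented environment variables and pass them explicitly.
app = RapidCrawlApp(
    api_key=os.getenv("RAPIDCRAWL_API_KEY"),
    base_url=os.getenv("RAPIDCRAWL_BASE_URL", "https://api.rapidcrawl.io/v1"),
    timeout=float(os.getenv("RAPIDCRAWL_TIMEOUT", "30")),
    max_retries=int(os.getenv("RAPIDCRAWL_MAX_RETRIES", "3")),
    verify_ssl=True,   # set to False only for self-signed certificates in development
    debug=False,       # set to True for verbose logging
)
```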
## 📚 API Reference
### RapidCrawlApp
The main client class for interacting with RapidCrawl.
#### Constructor
```python
RapidCrawlApp(
api_key: Optional[str] = None,
base_url: Optional[str] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
verify_ssl: bool = True,
debug: bool = False
)
```
#### Methods
- `scrape_url(url, **options)`: Scrape a single URL
- `crawl_url(url, **options)`: Crawl a website
- `crawl_url_async(url, **options)`: Async crawl
- `map_url(url, **options)`: Map website URLs
- `search(query, **options)`: Search the web
- `extract(urls, schema, prompt)`: Extract structured data from multiple URLs (see the sketch below)
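The `extract` method is the only one not demonstrated elsewhere in this README. The sketch below is a hedged example that assumes it accepts the same schema entries used by `scrape_url`; check the API docs for the exact return shape:

```python
from rapidcrawl import RapidCrawlApp

app = RapidCrawlApp()

# Hypothetical usage of extract(); the schema entries mirror the
# extract_schema format shown in the Scraping section.
results = app.extract(
    urls=[
        "https://example.com/product/1",
        "https://example.com/product/2",
    ],
    schema=[
        {"name": "title", "selector": "h1"},
        {"name": "price", "selector": ".price", "type": "number"},
    ],
    prompt="Extract the product title and price from each page.",
)
print(results)
```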
### Models
#### ScrapeOptions
```python
from rapidcrawl.models import ScrapeOptions, OutputFormat
options = ScrapeOptions(
url="https://example.com",
formats=[OutputFormat.MARKDOWN, OutputFormat.HTML],
wait_for=".content",
timeout=30000,
mobile=False,
actions=[...],
extract_schema=[...],
headers={"User-Agent": "Custom UA"}
)
```
#### CrawlOptions
```python
from rapidcrawl.models import CrawlOptions
options = CrawlOptions(
url="https://example.com",
max_pages=100,
max_depth=3,
include_patterns=["*/blog/*"],
exclude_patterns=["*.pdf"],
allow_subdomains=False,
webhook_url="https://webhook.example.com"
)
```
## 🔧 Examples
For comprehensive examples, see the [examples directory](examples/):
- [Basic Scraping](examples/basic_scraping.py) - Getting started with web scraping
- [Web Crawling](examples/web_crawling.py) - Crawling websites recursively
- [Search and Map](examples/search_and_map.py) - Search and URL mapping
- [Data Extraction](examples/data_extraction.py) - Structured data extraction
- [Advanced Usage](examples/advanced_usage.py) - Production patterns
### E-commerce Price Monitoring
```python
from rapidcrawl import RapidCrawlApp
import json
app = RapidCrawlApp()
# Define extraction schema
schema = [
{"name": "title", "selector": "h1.product-title"},
{"name": "price", "selector": ".price-now", "type": "number"},
{"name": "stock", "selector": ".availability"},
{"name": "image", "selector": "img.product-image", "attribute": "src"}
]
# Monitor multiple products
products = [
"https://shop.example.com/product1",
"https://shop.example.com/product2",
]
results = []
for url in products:
result = app.scrape_url(url, extract_schema=schema)
if result.success:
results.append({
"url": url,
"data": result.structured_data,
"timestamp": result.scraped_at
})
# Save results
with open("prices.json", "w") as f:
json.dump(results, f, indent=2, default=str)
```
### Content Aggregation
```python
import asyncio
from rapidcrawl import RapidCrawlApp
app = RapidCrawlApp()
async def aggregate_news():
# Search multiple queries
queries = [
"artificial intelligence breakthroughs",
"quantum computing news",
"robotics innovation"
]
all_articles = []
for query in queries:
result = app.search(
query,
num_results=5,
scrape_results=True,
formats=["markdown"]
)
for item in result.results:
if item.scraped_content and item.scraped_content.success:
all_articles.append({
"title": item.title,
"url": item.url,
"content": item.scraped_content.content["markdown"],
"query": query
})
return all_articles
# Run aggregation
articles = asyncio.run(aggregate_news())
```
### Website Change Detection
```python
import hashlib
import time
from rapidcrawl import RapidCrawlApp
app = RapidCrawlApp()
def monitor_changes(url, interval=3600):
"""Monitor a webpage for changes."""
previous_hash = None
while True:
result = app.scrape_url(url, formats=["text"])
if result.success:
content = result.content["text"]
current_hash = hashlib.md5(content.encode()).hexdigest()
if previous_hash and current_hash != previous_hash:
print(f"Change detected at {url}!")
# Send notification, save diff, etc.
previous_hash = current_hash
time.sleep(interval)
# Monitor a page
monitor_changes("https://example.com/status", interval=300) # Check every 5 minutes
```
## 🚀 Advanced Usage
### Rate Limiting
```python
import time
from rapidcrawl import RapidCrawlApp
class RateLimitedScraper:
def __init__(self, requests_per_second=2):
self.app = RapidCrawlApp()
self.min_interval = 1.0 / requests_per_second
self.last_request = 0
def scrape_url(self, url):
current = time.time()
elapsed = current - self.last_request
if elapsed < self.min_interval:
time.sleep(self.min_interval - elapsed)
self.last_request = time.time()
return self.app.scrape_url(url)
```
### Caching Results
```python
import hashlib
import time
from rapidcrawl import RapidCrawlApp
class CachedScraper:
def __init__(self):
self.app = RapidCrawlApp()
self.cache = {}
def scrape_with_cache(self, url, max_age_hours=24):
cache_key = hashlib.md5(url.encode()).hexdigest()
if cache_key in self.cache:
cached_time, cached_result = self.cache[cache_key]
age_hours = (time.time() - cached_time) / 3600
if age_hours < max_age_hours:
return cached_result
result = self.app.scrape_url(url)
self.cache[cache_key] = (time.time(), result)
return result
```
### Error Handling
```python
import time
from rapidcrawl import RapidCrawlApp
from rapidcrawl.exceptions import (
RateLimitError,
TimeoutError,
NetworkError
)
def robust_scrape(url, max_retries=3):
app = RapidCrawlApp()
for attempt in range(max_retries):
try:
return app.scrape_url(url)
except RateLimitError as e:
wait_time = e.retry_after or 60
print(f"Rate limited. Waiting {wait_time}s...")
time.sleep(wait_time)
except TimeoutError:
print(f"Timeout on attempt {attempt + 1}")
if attempt == max_retries - 1:
raise
except NetworkError as e:
print(f"Network error: {e}")
time.sleep(2 ** attempt) # Exponential backoff
```
### Concurrent Scraping
```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from rapidcrawl import RapidCrawlApp
def concurrent_scrape(urls, max_workers=5):
app = RapidCrawlApp()
results = {}
with ThreadPoolExecutor(max_workers=max_workers) as executor:
future_to_url = {
executor.submit(app.scrape_url, url): url
for url in urls
}
for future in as_completed(future_to_url):
url = future_to_url[future]
try:
results[url] = future.result()
except Exception as e:
results[url] = {"error": str(e)}
return results
```
For more advanced patterns, see the [Advanced Usage Guide](docs/ADVANCED.md).
## ⚡ Performance
### Benchmarks
| Operation | URLs | Time | Speed |
|-----------|------|------|-------|
| Sequential Scraping | 10 | 12.3s | 0.8 pages/sec |
| Concurrent Scraping | 10 | 3.1s | 3.2 pages/sec |
| Async Crawling | 100 | 28.5s | 3.5 pages/sec |
| URL Mapping | 1000 | 5.2s | 192 URLs/sec |
### Optimization Tips
1. **Use Async Operations**: For crawling large sites, use `crawl_url_async()`
2. **Enable Connection Pooling**: Reuse HTTP connections
3. **Limit Concurrent Requests**: Prevent overwhelming servers
4. **Cache Results**: Avoid re-scraping unchanged content
5. **Use Specific Formats**: Only request the output formats you need (see the sketch below)
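A minimal sketch combining tips 1 and 5, using only parameters shown earlier in this README: an async crawl that excludes PDFs, followed by a single-format scrape of one page.

```python
import asyncio
from rapidcrawl import RapidCrawlApp

app = RapidCrawlApp()

async def crawl_docs():
    # Tip 1: use async crawling for larger sites.
    return await app.crawl_url_async(
        "https://example.com/docs",
        max_pages=200,
        max_depth=4,
        exclude_patterns=[r".*\.pdf$"],  # skip content you do not need to fetch
    )

crawl_result = asyncio.run(crawl_docs())

# Tip 5: when scraping individual pages, request only the formats you use.
page = app.scrape_url("https://example.com/docs/intro", formats=["markdown"])
```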
## 🔍 Troubleshooting
### Common Issues
#### Installation Problems
```bash
# Update pip
python -m pip install --upgrade pip
# Install in virtual environment
python -m venv venv
source venv/bin/activate
pip install rapid-crawl
```
#### Playwright Issues
```bash
# Install browser dependencies
playwright install-deps chromium
# Or use Firefox
playwright install firefox
```
#### SSL Certificate Errors
```python
# For self-signed certificates (development only!)
app = RapidCrawlApp(verify_ssl=False)
```
#### Rate Limiting
```python
# Handle rate limits gracefully
import time
from rapidcrawl.exceptions import RateLimitError
try:
result = app.scrape_url(url)
except RateLimitError as e:
time.sleep(e.retry_after or 60)
result = app.scrape_url(url)
```
For detailed troubleshooting, see the [Troubleshooting Guide](docs/TROUBLESHOOTING.md).
## 🛠️ Development
### Setting up development environment
```bash
# Clone the repository
git clone https://github.com/aoneahsan/rapid-crawl.git
cd rapid-crawl
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
```
### Running tests
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=rapidcrawl
# Run specific test file
pytest tests/test_scrape.py
```
### Code formatting
```bash
# Format code
black src/rapidcrawl
# Run linter
ruff check src/rapidcrawl
# Type checking
mypy src/rapidcrawl
```
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
### Quick Start
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
### Development Guidelines
- Write tests for new features
- Follow PEP 8 style guide
- Update documentation
- Add type hints
- Run tests before submitting
## 🔒 Security
Security is important to us. Please see our [Security Policy](SECURITY.md) for details on:
- Reporting vulnerabilities
- Security best practices
- API key management
- Data privacy
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 💬 Support
### Documentation
- 📖 [Full Documentation](docs/)
- 🚀 [API Reference](docs/API.md)
- 💡 [Examples](docs/EXAMPLES.md)
- 🔧 [Advanced Usage](docs/ADVANCED.md)
- ❓ [Troubleshooting](docs/TROUBLESHOOTING.md)
### Community
- 🐛 [Report Issues](https://github.com/aoneahsan/rapid-crawl/issues)
- 💬 [Discussions](https://github.com/aoneahsan/rapid-crawl/discussions)
- 📧 [Email Support](mailto:aoneahsan@gmail.com)
### Resources
- 📝 [Changelog](CHANGELOG.md)
- 🔒 [Security Policy](SECURITY.md)
- 🤝 [Contributing Guide](CONTRIBUTING.md)
- ⚖️ [License](LICENSE)
## 👨‍💻 Developer
**Ahsan Mahmood**
- 🌐 Website: [https://aoneahsan.com](https://aoneahsan.com)
- 📧 Email: [aoneahsan@gmail.com](mailto:aoneahsan@gmail.com)
- 💼 LinkedIn: [https://linkedin.com/in/aoneahsan](https://linkedin.com/in/aoneahsan)
- 🐦 Twitter: [@aoneahsan](https://twitter.com/aoneahsan)
## 🙏 Acknowledgments
- Inspired by [Firecrawl](https://www.firecrawl.dev/)
- Built with [Playwright](https://playwright.dev/) for dynamic content handling
- Uses [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Click](https://click.palletsprojects.com/) for the CLI interface
- [Rich](https://rich.readthedocs.io/) for beautiful terminal output
---
<p align="center">
<strong>RapidCrawl</strong> - Fast, reliable web scraping for Python<br>
Made with ❤️ by <a href="https://aoneahsan.com">Ahsan Mahmood</a>
</p>
<p align="center">
<a href="https://github.com/aoneahsan/rapid-crawl/stargazers">β Star us on GitHub</a> β’
<a href="https://pypi.org/project/rapid-crawl/">π¦ Install from PyPI</a> β’
<a href="https://github.com/aoneahsan/rapid-crawl/issues/new/choose">π Report a Bug</a>
</p>