opencrawler 1.0.2

- **Name**: opencrawler
- **Version**: 1.0.2
- **Summary**: Production-ready, enterprise-grade web scraping and crawling framework with advanced AI integration
- **Uploaded**: 2025-07-16 16:04:28
- **Requires Python**: >=3.8
- **License**: MIT
- **Keywords**: web-scraping, crawling, ai, llm, automation, data-extraction, playwright, selenium, fastapi, microservices, enterprise, production
# OpenCrawler

<div align="center">
  <img src="assets/opencrawler-logo.svg" alt="OpenCrawler Logo" width="200" height="200">
  <br>
  <em>AI-Powered Web Intelligence</em>
</div>

[![Python](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![PyPI](https://img.shields.io/badge/pypi-1.0.2-blue.svg)](https://pypi.org/project/opencrawler/)
[![Tests](https://img.shields.io/badge/tests-passing-brightgreen.svg)](tests/)
[![Code Style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

**OpenCrawler** is a production-ready, enterprise-grade web scraping and crawling framework with advanced AI integration, comprehensive monitoring, and scalable architecture.

## 🚀 Quick Installation

```bash
# Install from PyPI
pip install opencrawler

# Install with AI capabilities
pip install "opencrawler[ai]"

# Install with all features
pip install "opencrawler[all]"
```

## Features

### Core Capabilities
- **Multi-Engine Support**: Playwright, Selenium, Requests, CloudScraper
- **AI-Powered Extraction**: OpenAI Agents SDK integration for intelligent data extraction
- **Stealth Technology**: Advanced anti-detection and bot bypass capabilities
- **Distributed Processing**: Scalable architecture for high-volume operations
- **Real-time Monitoring**: Comprehensive metrics and health monitoring
- **Enterprise Security**: RBAC, audit trails, and compliance features

### Advanced Features
- **LLM Integration**: Support for OpenAI, Anthropic, and local models
- **Microservice Architecture**: FastAPI-based REST API with auto-documentation
- **Database Support**: PostgreSQL, TimescaleDB, Redis integration
- **Container Ready**: Docker and Kubernetes deployment configurations
- **Performance Optimization**: Intelligent caching, rate limiting, and resource management
- **Error Recovery**: Sophisticated error handling and retry mechanisms

## Quick Start

### Basic Usage

```python
import asyncio
from webscraper.core.advanced_scraper import AdvancedWebScraper

async def main():
    # Initialize scraper
    scraper = AdvancedWebScraper()
    await scraper.setup()
    
    # Scrape a webpage
    result = await scraper.scrape_url("https://example.com")
    print(f"Title: {result.get('title')}")
    print(f"Content length: {len(result.get('content', ''))}")
    
    # Cleanup
    await scraper.cleanup()

asyncio.run(main())
```

### CLI Usage

```bash
# Basic scraping
opencrawler scrape https://example.com

# Advanced scraping with AI
opencrawler scrape https://example.com --ai-extract --model gpt-4

# Start API server
opencrawler api --host 0.0.0.0 --port 8000

# Run system validation
opencrawler-validate --level production
```

## Architecture

OpenCrawler follows a modular, microservice-oriented architecture:

```
OpenCrawler/
├── webscraper/
│   ├── core/           # Core scraping engines
│   ├── ai/             # AI/LLM integration
│   ├── api/            # FastAPI REST API
│   ├── engines/        # Scraping engines (Playwright, Selenium, etc.)
│   ├── processors/     # Data processing pipelines
│   ├── monitoring/     # System monitoring and metrics
│   ├── security/       # Authentication and security
│   ├── utils/          # Utilities and helpers
│   └── orchestrator/   # System orchestration
├── tests/              # Comprehensive test suite
├── deployment/         # Docker and Kubernetes configs
├── docs/               # Documentation
└── examples/           # Usage examples
```

## Configuration

### Environment Variables

```bash
# OpenAI API (optional)
export OPENAI_API_KEY="your-api-key-here"

# Database (optional)
export DATABASE_URL="postgresql://user:pass@localhost/opencrawler"

# Redis (optional)
export REDIS_URL="redis://localhost:6379"

# Test mode
export OPENCRAWLER_TEST_MODE=true
```
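All of these variables are optional. As a minimal sketch (for your own scripts, not OpenCrawler's internal configuration loader), you can check what is available before starting a run:

```python
import os

# Optional settings from the list above; os.environ.get returns None
# when a variable is unset, so nothing here is required for a basic run.
openai_key = os.environ.get("OPENAI_API_KEY")
database_url = os.environ.get("DATABASE_URL")
redis_url = os.environ.get("REDIS_URL")
test_mode = os.environ.get("OPENCRAWLER_TEST_MODE", "false").lower() == "true"

print(f"AI configured: {openai_key is not None}, test mode: {test_mode}")
```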

### Configuration File

Create a `config.yaml` file:

```yaml
scraper:
  engines: ["playwright", "requests"]
  stealth_level: "medium"
  javascript_enabled: true
  
ai:
  enabled: true
  model: "gpt-4"
  temperature: 0.7
  
database:
  url: "postgresql://localhost/opencrawler"
  pool_size: 10
  
monitoring:
  enabled: true
  metrics_port: 9090
  
security:
  enable_auth: true
  rate_limit: 100
```
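How the CLI picks up this file is not covered here, so the sketch below loads it manually with PyYAML and forwards two scraper options to `AdvancedWebScraper`. Treating the YAML keys as constructor keyword arguments is an assumption; only `stealth_level` and `javascript_enabled` appear as such in the Optimization section further down.

```python
import asyncio
import yaml  # PyYAML

from webscraper.core.advanced_scraper import AdvancedWebScraper

async def main():
    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)

    scraper_cfg = cfg.get("scraper", {})
    # Forward the options documented as constructor arguments; handling of
    # the "engines" list is omitted because its parameter name is not
    # documented here.
    scraper = AdvancedWebScraper(
        stealth_level=scraper_cfg.get("stealth_level", "medium"),
        javascript_enabled=scraper_cfg.get("javascript_enabled", True),
    )
    await scraper.setup()
    result = await scraper.scrape_url("https://example.com")
    print(result.get("title"))
    await scraper.cleanup()

asyncio.run(main())
```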

## API Reference

### REST API

Start the API server:

```bash
opencrawler-api --port 8000
```

#### Endpoints

- `GET /health` - Health check
- `POST /scrape` - Scrape a single URL
- `POST /crawl` - Crawl multiple URLs
- `GET /metrics` - System metrics
- `GET /docs` - API documentation

#### Example Request

```bash
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "extract_ai": true}'
```
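The same request from Python, using the widely available `requests` library (the payload mirrors the curl example above):

```python
import requests

# POST a single-URL scrape job to a locally running API server.
response = requests.post(
    "http://localhost:8000/scrape",
    json={"url": "https://example.com", "extract_ai": True},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```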

### Python API

```python
from webscraper.api.complete_api import OpenCrawlerAPI

# Initialize the API (the awaits below assume an async context,
# e.g. inside an async def entry point run with asyncio.run)
api = OpenCrawlerAPI()
await api.initialize()

# Scrape with AI
result = await api.scrape_with_ai(
    url="https://example.com",
    schema={"title": "string", "content": "string"}
)

# Cleanup
await api.cleanup()
```

## Advanced Usage

### AI-Powered Extraction

```python
from webscraper.ai.llm_scraper import LLMScraper

scraper = LLMScraper()
await scraper.initialize()

# Extract structured data
result = await scraper.run(
    url="https://news.example.com",
    schema={
        "title": "string",
        "author": "string", 
        "date": "date",
        "content": "string"
    }
)
```

### Distributed Processing

```python
from webscraper.core.distributed_processor import DistributedProcessor

processor = DistributedProcessor(worker_count=16)
await processor.initialize()

# Process multiple URLs
results = await processor.process_batch([
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
])
```

### Custom Engines

```python
from webscraper.engines.base_engine import BaseEngine

class CustomEngine(BaseEngine):
    async def fetch(self, url: str, **kwargs) -> dict:
        # Custom implementation
        return {"content": "...", "status": 200}

# Register the custom engine on an existing scraper instance
scraper.register_engine("custom", CustomEngine())
```

## Monitoring and Metrics

### Built-in Monitoring

```python
from webscraper.monitoring.advanced_monitoring import AdvancedMonitoringSystem

monitor = AdvancedMonitoringSystem()
await monitor.initialize()

# Get system metrics
metrics = await monitor.get_system_metrics()
print(f"CPU: {metrics['cpu_usage']}%")
print(f"Memory: {metrics['memory_usage']}%")
```

### Prometheus Integration

OpenCrawler exports metrics to Prometheus:

```bash
# Start with monitoring
python master_cli.py api --enable-metrics --metrics-port 9090
```

Metrics are available at `http://localhost:9090/metrics`.
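Assuming the endpoint serves the standard Prometheus text exposition format (as the Prometheus integration suggests), you can point a scrape job at it or inspect it directly; a quick check from Python, with the port matching the `--metrics-port` flag above:

```python
import requests

# Fetch the raw exposition-format metrics and show the first few lines.
text = requests.get("http://localhost:9090/metrics", timeout=10).text
for line in text.splitlines()[:20]:
    print(line)
```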

## Deployment

### Docker

```bash
# Build image
docker build -t opencrawler .

# Run container
docker run -p 8000:8000 opencrawler
```

### Docker Compose

```bash
# Start all services
docker-compose up -d

# Production deployment
docker-compose -f docker-compose.production.yml up -d
```

### Kubernetes

```bash
# Deploy to Kubernetes
kubectl apply -f kubernetes/

# Check deployment
kubectl get pods -l app=opencrawler
```

### Production Deployment

```python
from deployment.production_deployment import ProductionDeploymentSystem

deployment = ProductionDeploymentSystem()
await deployment.initialize()

# Deploy to production
result = await deployment.deploy(
    environment="production",
    config_overrides={"replicas": 5}
)
```

## Testing

### Running Tests

```bash
# Run all tests
pytest

# Run specific test suite
pytest tests/test_complete_system.py

# Run with coverage
pytest --cov=webscraper

# Run in test mode
OPENCRAWLER_TEST_MODE=true pytest
```

### Test Categories

- **Unit Tests**: Core component testing
- **Integration Tests**: Service integration testing
- **Performance Tests**: Load and performance testing
- **Security Tests**: Security validation
- **End-to-End Tests**: Complete workflow testing

### Validation

```bash
# Run comprehensive validation
python webscraper/utils/comprehensive_validator.py --level production

# Check system health
python -c "
from webscraper.orchestrator.system_orchestrator import SystemOrchestrator
import asyncio

async def main():
    orchestrator = SystemOrchestrator()
    await orchestrator.initialize()
    health = await orchestrator.get_system_health()
    print(f'System Status: {health[\"status\"]}')
    await orchestrator.shutdown()

asyncio.run(main())
"
```

## Performance

### Benchmarks

- **Single Page**: ~2-5 seconds per page
- **Concurrent Crawling**: 50-100 pages/minute
- **Memory Usage**: <1GB for typical workloads
- **CPU Usage**: Optimized for multi-core systems

### Optimization

```python
# Enable performance optimizations
scraper = AdvancedWebScraper(
    stealth_level="low",  # Faster but less stealthy
    javascript_enabled=False,  # Skip JS rendering
    cache_enabled=True,  # Enable caching
    concurrent_requests=10  # Increase concurrency
)
```

## Security

### Authentication

```python
from webscraper.security.authentication import AuthManager

auth = AuthManager()
await auth.initialize()

# Create user
user = await auth.create_user("username", "password", ["scraper"])

# Authenticate
token = await auth.authenticate("username", "password")
```

### Rate Limiting

```python
from webscraper.security.rate_limiter import RateLimiter

limiter = RateLimiter(requests_per_minute=60)
await limiter.check_rate_limit(user_id="user123")
```

## Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

### Development Setup

```bash
# Clone and install
git clone https://github.com/llamasearch/opencrawler.git
cd opencrawler
pip install -e ".[dev]"

# Run pre-commit hooks
pre-commit install

# Run tests
pytest
```

### Code Style

We use [Black](https://github.com/psf/black) for code formatting:

```bash
# Format code
black webscraper/

# Check formatting
black --check webscraper/
```

## License

OpenCrawler is licensed under the MIT License. See [LICENSE](LICENSE) for details.

## Support

- **Documentation**: [docs/](docs/)
- **Examples**: [examples/](examples/)
- **Issues**: [GitHub Issues](https://github.com/llamasearch/opencrawler/issues)
- **Discussions**: [GitHub Discussions](https://github.com/llamasearch/opencrawler/discussions)

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for version history and updates.

## Assets

OpenCrawler includes a complete set of professional logo assets:

### Logo Variants

- **`assets/opencrawler-logo.svg`** - Main logo with full branding (light theme)
- **`assets/opencrawler-logo-dark.svg`** - Dark variant for light backgrounds
- **`assets/opencrawler-icon.svg`** - Icon version for app icons and buttons
- **`assets/favicon.svg`** - Favicon optimized for small sizes

### Design Features

- **Spider/Crawler Theme**: Represents web crawling and data extraction
- **AI/Neural Network Elements**: Symbolizes AI-powered intelligence
- **Modern Gradients**: Professional blue, green, and orange color scheme
- **Scalable Vector Graphics**: Perfect quality at any size
- **Multiple Formats**: SVG sources for the web; convert to PNG/ICO as needed

### Usage Guidelines

```html
<!-- Main logo for documentation -->
<img src="assets/opencrawler-logo.svg" alt="OpenCrawler" width="200">

<!-- Dark variant for light backgrounds -->
<img src="assets/opencrawler-logo-dark.svg" alt="OpenCrawler" width="200">

<!-- Icon for buttons/navigation -->
<img src="assets/opencrawler-icon.svg" alt="OpenCrawler" width="32">

<!-- Favicon -->
<link rel="icon" type="image/svg+xml" href="assets/favicon.svg">
```

## Acknowledgments

OpenCrawler is built with these excellent libraries:

- [Playwright](https://playwright.dev/) - Modern web automation
- [FastAPI](https://fastapi.tiangolo.com/) - High-performance API framework
- [OpenAI](https://openai.com/) - AI/LLM integration
- [PostgreSQL](https://www.postgresql.org/) - Database backend
- [Docker](https://www.docker.com/) - Containerization
- [Kubernetes](https://kubernetes.io/) - Container orchestration

---

**Author**: Nik Jois <nikjois@llamasearch.ai>  
**Organization**: LlamaSearch.ai  
**Version**: 1.0.2  
**Status**: Production Ready 

            
