# Multi-Browser Crawler
Enterprise-grade browser automation with session isolation, proxy rotation, JavaScript execution, and multiprocess crawling.
## Features
- **Session Management**: Unique browser sessions with automatic cleanup
- **Proxy Support**: File-based proxy rotation with health monitoring
- **JavaScript Execution**: Execute custom JavaScript with DOM state waiting
- **Image Processing**: Bulk image download with format conversion
- **Multiprocess Support**: True multiprocess crawling with isolated sessions
- **Anti-Detection**: Stealth techniques to avoid bot detection
- **Clean API**: Simple, intuitive interface for common operations
## Installation
```bash
pip install multi-browser-crawler
```
### Install Playwright browsers
```bash
playwright install chromium
```
## Quick Start
### Simple Usage
```python
import asyncio
from multi_browser_crawler import BrowserCrawler

async def main():
    async with BrowserCrawler() as crawler:
        result = await crawler.fetch("https://example.com")
        print(result.html)
        print(result.title)

asyncio.run(main())
```
### Advanced Configuration
```python
from multi_browser_crawler import BrowserCrawler, BrowserConfig, ProxyConfig

config = BrowserConfig(
    headless=True,
    max_sessions=5,
    proxy=ProxyConfig(
        enabled=True,
        proxy_file="proxies.txt"
    )
)

async with BrowserCrawler(config) as crawler:
    result = await crawler.fetch(
        "https://example.com",
        execute_js="document.querySelector('#load-more').click()",
        wait_for="#dynamic-content",
        download_images=True
    )
```
### Batch Processing
```python
urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
]

async with BrowserCrawler() as crawler:
    results = await crawler.fetch_batch(urls, max_concurrent=3)

    for result in results:
        if result.success:
            print(f"✅ {result.url}: {len(result.html)} chars")
        else:
            print(f"❌ {result.url}: {result.error}")
```
### JavaScript Execution
```python
async with BrowserCrawler() as crawler:
    result = await crawler.execute_js(
        "https://spa-site.com",
        """
        return {
            title: document.title,
            links: Array.from(document.querySelectorAll('a')).length,
            images: Array.from(document.querySelectorAll('img')).length
        };
        """
    )
    print(result['result'])
```
## Configuration
### Browser Configuration
```python
from multi_browser_crawler import BrowserConfig
config = BrowserConfig(
    headless=True,                 # Run in headless mode
    max_sessions=10,               # Maximum concurrent sessions
    fetch_limit=1000,              # Maximum fetches per session
    cache_expiry=3600,             # Cache expiry in seconds
    stealth_mode=True,             # Enable anti-detection
    download_images=False,         # Whether to download images by default
    browser_args=['--no-sandbox']  # Additional browser arguments
)
```
### Proxy Configuration
```python
from multi_browser_crawler import ProxyConfig
proxy_config = ProxyConfig(
    enabled=True,
    proxy_file="proxies.txt",     # File containing proxy list
    recycle_threshold=0.5,        # Recycle when 50% fail
    test_timeout=10,              # Proxy test timeout
    allow_direct_connection=True  # Fallback to direct connection
)
```
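As in the Advanced Configuration example above, the resulting `ProxyConfig` is passed to the crawler through the `proxy` parameter of `BrowserConfig`:

```python
from multi_browser_crawler import BrowserConfig

config = BrowserConfig(
    headless=True,
    proxy=proxy_config  # the ProxyConfig defined above
)
```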
### Environment Variables
You can also configure the crawler with environment variables:
```bash
export BROWSER_HEADLESS=true
export BROWSER_MAX_SESSIONS=10
export PROXY_ENABLED=true
export PROXY_FILE=proxies.txt
```
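Whether the library picks these variables up on its own is not documented here; if it does not, a minimal sketch that maps them onto the config objects by hand could look like this (the helper name `config_from_env` is hypothetical):

```python
import os

from multi_browser_crawler import BrowserConfig, ProxyConfig

def config_from_env():
    """Build a BrowserConfig from the environment variables shown above."""
    return BrowserConfig(
        headless=os.getenv("BROWSER_HEADLESS", "true").lower() == "true",
        max_sessions=int(os.getenv("BROWSER_MAX_SESSIONS", "10")),
        proxy=ProxyConfig(
            enabled=os.getenv("PROXY_ENABLED", "false").lower() == "true",
            proxy_file=os.getenv("PROXY_FILE", "proxies.txt"),
        ),
    )
```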
## Proxy File Format
Create a `proxies.txt` file with one proxy per line:
```
proxy1.example.com:8080
proxy2.example.com:3128
user:pass@proxy3.example.com:8080
```
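`ProxyConfig` consumes this file directly, so no extra parsing is required; the sketch below only illustrates the expected format (`host:port` with an optional `user:pass@` prefix), and the helper name is hypothetical:

```python
from pathlib import Path

def load_proxies(path="proxies.txt"):
    """Read proxies.txt, skipping blank lines, and sanity-check each entry."""
    proxies = [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]
    for proxy in proxies:
        host_port = proxy.rsplit("@", 1)[-1]      # drop the optional user:pass@ prefix
        host, _, port = host_port.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"Malformed proxy entry: {proxy}")
    return proxies
```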
## Multiprocess Usage
```python
import asyncio
import multiprocessing

from multi_browser_crawler import BrowserCrawler, BrowserConfig

async def crawl_worker(urls):
    config = BrowserConfig(max_sessions=2)
    async with BrowserCrawler(config) as crawler:
        results = []
        for url in urls:
            result = await crawler.fetch(url)
            results.append(result)
        return results

def crawl_chunk(urls):
    # Pool workers are synchronous, so each process runs its own event loop
    # (and its own isolated browser sessions) via asyncio.run.
    return asyncio.run(crawl_worker(urls))

def main():
    urls = ["https://example.com"] * 100
    url_chunks = [urls[i:i+25] for i in range(0, len(urls), 25)]

    with multiprocessing.Pool(4) as pool:
        results = pool.map(crawl_chunk, url_chunks)

if __name__ == "__main__":
    main()
```
## API Reference
### BrowserCrawler
Main class for browser automation.
#### Methods
- `fetch(url, **options)` - Fetch content from a single URL
- `fetch_batch(urls, **options)` - Fetch content from multiple URLs
- `execute_js(url, js_code, **options)` - Execute JavaScript on a page
- `get_stats()` - Get crawler statistics
- `cleanup()` - Clean up all resources
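
A sketch of the bookkeeping methods, assuming `get_stats()` and `cleanup()` are coroutines like `fetch()` (the return shape of `get_stats()` is not documented here):

```python
from multi_browser_crawler import BrowserCrawler

async def crawl_and_report(urls):
    async with BrowserCrawler() as crawler:
        results = await crawler.fetch_batch(urls)
        print(await crawler.get_stats())  # e.g. session and fetch counters
        # The context manager cleans up on exit; the explicit call below is
        # shown only to illustrate the method.
        await crawler.cleanup()
        return results
```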
#### Options
- `session_name` - Browser session name
- `use_cache` - Use cached content
- `clean_html` - Clean HTML content
- `download_images` - Download images
- `execute_js` - JavaScript to execute
- `wait_for` - CSS selector to wait for
- `max_concurrent` - Maximum concurrent requests (batch only)
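
These options are passed as keyword arguments to `fetch()` (or `fetch_batch()` in the case of `max_concurrent`). A sketch combining several of them, with the session name and selector chosen purely for illustration:

```python
from multi_browser_crawler import BrowserCrawler

async def fetch_catalog():
    async with BrowserCrawler() as crawler:
        return await crawler.fetch(
            "https://example.com/catalog",
            session_name="catalog",    # reuse one named browser session
            use_cache=True,            # serve cached content when available
            clean_html=True,           # return cleaned HTML
            wait_for="#product-list",  # wait for this selector before returning
        )
```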
### CrawlResult
Result object returned by fetch operations.
#### Properties
- `url` - The fetched URL
- `html` - HTML content
- `title` - Page title
- `success` - Whether the fetch was successful
- `error` - Error message (if any)
- `metadata` - Additional metadata
- `images` - Downloaded images
- `timestamp` - Fetch timestamp
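
A sketch of consuming a `CrawlResult`; the exact contents of `metadata` and `images` (paths vs. raw bytes) are not specified above, so they are only printed here:

```python
from pathlib import Path

from multi_browser_crawler import BrowserCrawler

async def save_page(url):
    async with BrowserCrawler() as crawler:
        result = await crawler.fetch(url, download_images=True)
        if not result.success:
            print(f"{result.url} failed: {result.error}")
            return
        Path("page.html").write_text(result.html, encoding="utf-8")
        print(f"'{result.title}' fetched at {result.timestamp}")
        print(f"metadata: {result.metadata}")
        print(f"{len(result.images or [])} images downloaded")
```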
## Development
### Setup Development Environment
```bash
git clone https://github.com/spider-mcp/multi-browser-crawler.git
cd multi-browser-crawler
pip install -e ".[dev]"
playwright install chromium
```
### Run Tests
```bash
pytest
```
### Code Formatting
```bash
black .
isort .
flake8 .
```
## License
MIT License - see LICENSE file for details.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Run the test suite
6. Submit a pull request
## Support
- GitHub Issues: https://github.com/spider-mcp/multi-browser-crawler/issues
- Documentation: https://multi-browser-crawler.readthedocs.io/