# Multi-Browser Crawler
A focused browser automation package for web scraping and content extraction.
## Features
- **Browser Pool Management**: Auto-scaling browser pools with session management
- **Proxy Support**: Built-in proxy rotation and management
- **Image Download**: Automatic image capture and local storage
- **API Discovery**: Network request capture and pattern matching
- **Session Persistence**: Stateful browsing with cookie/session support
## Installation
```bash
pip install multi-browser-crawler
```
## Quick Start
```python
import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig
async def main():
    # Simple configuration
    config = BrowserConfig(headless=True, timeout=30)
    pool = BrowserPoolManager(config.to_dict())

    try:
        await pool.initialize()

        # Fetch HTML
        result = await pool.fetch_html(
            url="https://example.com",
            session_id="my_session"
        )

        if result['status']['success']:
            print(f"✅ Success! Title: {result.get('title', 'N/A')}")
            print(f"HTML size: {len(result.get('html', ''))} characters")
        else:
            print(f"❌ Error: {result['status'].get('error')}")

    finally:
        await pool.shutdown()

if __name__ == "__main__":
    asyncio.run(main())
```
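The `initialize()`/`shutdown()` pair above is easy to leak if an exception is raised before the `try` block. The helper below is a minimal sketch that keeps the lifecycle in one place; it assumes only the `initialize()` and `shutdown()` methods shown in the Quick Start, and the `browser_pool` name is our own, not part of the package.

```python
import asyncio
from contextlib import asynccontextmanager

from multi_browser_crawler import BrowserPoolManager, BrowserConfig

@asynccontextmanager
async def browser_pool(config: BrowserConfig):
    """Initialize a pool and guarantee shutdown, even on errors."""
    pool = BrowserPoolManager(config.to_dict())
    await pool.initialize()
    try:
        yield pool
    finally:
        await pool.shutdown()

async def main():
    async with browser_pool(BrowserConfig(headless=True, timeout=30)) as pool:
        result = await pool.fetch_html(url="https://example.com", session_id=None)
        print(result['status'])

if __name__ == "__main__":
    asyncio.run(main())
```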
## Configuration Options
```python
config = BrowserConfig(
    headless=True,                     # Run in headless mode
    timeout=30,                        # Page load timeout (seconds)
    min_browsers=1,                    # Minimum browsers in pool
    max_browsers=5,                    # Maximum browsers in pool
    proxy_url="http://proxy:8080",     # Optional proxy URL
    download_images_dir="/tmp/images"  # Image download directory
)
```
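Because the pool scales between `min_browsers` and `max_browsers`, independent fetches can run concurrently. The sketch below assumes only the `fetch_html()` call shown in the next section and the Quick Start lifecycle; the URLs are placeholders.

```python
import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def fetch_many(urls):
    config = BrowserConfig(headless=True, timeout=30, min_browsers=1, max_browsers=5)
    pool = BrowserPoolManager(config.to_dict())
    await pool.initialize()
    try:
        # Each fetch is independent (session_id=None), so the pool is free
        # to spread the work across up to max_browsers browsers.
        results = await asyncio.gather(
            *(pool.fetch_html(url=url, session_id=None) for url in urls)
        )
    finally:
        await pool.shutdown()
    return results

# asyncio.run(fetch_many(["https://example.com", "https://example.org"]))
```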
## API Methods
### fetch_html()
```python
result = await pool.fetch_html(
url="https://example.com",
session_id="optional_session", # For persistent sessions
timeout=30, # Request timeout
api_patterns=["*/api/*"], # Capture API calls
images_to_capture=["*.jpg", "*.png"] # Download images
)
```
**Response format:**
```python
{
    'status': {'success': True, 'url': '...', 'load_time': 1.23},
    'html': '<html>...</html>',
    'title': 'Page Title',
    'api_calls': [...],  # Captured API requests
    'images': [...]      # Downloaded images
}
```
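A short sketch of consuming this structure, checking `status` before touching the other fields. The per-entry shape of `api_calls` and `images` is not spelled out above, so the code only reports counts and prints the raw entries.

```python
def summarize(result: dict) -> None:
    """Print a short summary of a fetch_html() result."""
    status = result.get('status', {})
    if not status.get('success'):
        print(f"Fetch failed: {status.get('error')}")
        return

    print(f"{result.get('title', 'N/A')}: "
          f"{len(result.get('html', ''))} chars in {status.get('load_time', 0):.2f}s")

    # Entry structure is not documented above, so just report what came back.
    print(f"{len(result.get('api_calls', []))} API calls, "
          f"{len(result.get('images', []))} images")
    for call in result.get('api_calls', []):
        print("  api:", call)
    for image in result.get('images', []):
        print("  img:", image)
```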
## Session Management
```python
# Persistent session - maintains cookies/state
result1 = await pool.fetch_html(url="https://site.com/login", session_id="user1")
result2 = await pool.fetch_html(url="https://site.com/profile", session_id="user1")
# Non-persistent - fresh browser each time
result3 = await pool.fetch_html(url="https://site.com", session_id=None)
```
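Sessions are keyed by `session_id`, so one pool can keep several users' cookie state isolated at the same time. A sketch along those lines, assuming only the `fetch_html()` call shown above; the URLs and user labels are illustrative.

```python
async def crawl_as_user(pool, user, urls):
    """Fetch a sequence of pages while reusing one session's cookies."""
    results = []
    for url in urls:
        results.append(await pool.fetch_html(url=url, session_id=user))
    return results

# Two users can browse concurrently; their cookies never mix because each
# sequence sticks to its own session_id.
# await asyncio.gather(
#     crawl_as_user(pool, "user1", ["https://site.com/login", "https://site.com/profile"]),
#     crawl_as_user(pool, "user2", ["https://site.com/login", "https://site.com/profile"]),
# )
```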
## Proxy Support
```python
# Single proxy
config = BrowserConfig(proxy_url="http://proxy:8080")
# The package integrates with rotating-mitmproxy for advanced proxy rotation
```
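One way to get rotation, sketched here as an assumption rather than a documented integration, is to run a rotating-mitmproxy instance locally and point every pooled browser at it via `proxy_url`; the address and port below are placeholders for wherever that proxy actually listens.

```python
from multi_browser_crawler import BrowserConfig

# Assumed setup: a rotating-mitmproxy server is already running on
# localhost:8080 and forwards each request through a different upstream proxy.
config = BrowserConfig(
    headless=True,
    proxy_url="http://127.0.0.1:8080",  # address of the local rotating proxy (assumed)
)
```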
## Testing
```bash
# Run all tests
python -m pytest tests/ -v
# Run specific test categories
python -m pytest tests/ -m "not slow" -v
```
## License
MIT License - see LICENSE file for details.