multi-browser-crawler

Name: multi-browser-crawler
Version: 0.5.5
Summary: Focused browser automation package for web scraping and content extraction
Homepage: https://github.com/spider-mcp/multi-browser-crawler
Upload time: 2025-09-04 02:34:54
Requires Python: >=3.8
Keywords: browser, automation, crawling, scraping, playwright, selenium, web, testing, multiprocess, proxy
Requirements: No requirements were recorded.
# Multi-Browser Crawler

A focused browser automation package for web scraping and content extraction.

## Features

- **Browser Pool Management**: Auto-scaling browser pools with session management
- **Proxy Support**: Built-in proxy rotation and management  
- **Image Download**: Automatic image capture and localization
- **API Discovery**: Network request capture and pattern matching
- **Session Persistence**: Stateful browsing with cookie/session support

## Installation

```bash
pip install multi-browser-crawler
```

## Quick Start

```python
import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def main():
    # Simple configuration
    config = BrowserConfig(headless=True, timeout=30)
    pool = BrowserPoolManager(config.to_dict())

    try:
        await pool.initialize()
        
        # Fetch HTML
        result = await pool.fetch_html(
            url="https://example.com",
            session_id="my_session"
        )

        if result['status']['success']:
            print(f"✅ Success! Title: {result.get('title', 'N/A')}")
            print(f"HTML size: {len(result.get('html', ''))} characters")
        else:
            print(f"❌ Error: {result['status'].get('error')}")

    finally:
        await pool.shutdown()

if __name__ == "__main__":
    asyncio.run(main())
```

## Configuration Options

```python
config = BrowserConfig(
    headless=True,              # Run in headless mode
    timeout=30,                 # Page load timeout (seconds)
    min_browsers=1,             # Minimum browsers in pool
    max_browsers=5,             # Maximum browsers in pool
    proxy_url="http://proxy:8080",  # Optional proxy URL
    download_images_dir="/tmp/images"  # Image download directory
)
```
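
The configuration object is handed to the pool as a plain dict via `to_dict()`, exactly as in the Quick Start. A minimal sketch of wiring a tuned configuration into a pool (the values below are illustrative, not package defaults):

```python
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

config = BrowserConfig(
    headless=True,
    timeout=45,                 # larger page-load budget for heavy pages
    min_browsers=2,
    max_browsers=4,
    download_images_dir="/tmp/images",
)

# The pool consumes the configuration as a plain dict, as shown in the Quick Start.
pool = BrowserPoolManager(config.to_dict())
```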

## API Methods

### fetch_html()

```python
result = await pool.fetch_html(
    url="https://example.com",
    session_id="optional_session",      # For persistent sessions
    timeout=30,                         # Request timeout
    api_patterns=["*/api/*"],          # Capture API calls
    images_to_capture=["*.jpg", "*.png"] # Download images
)
```

**Response format:**
```python
{
    'status': {'success': True, 'url': '...', 'load_time': 1.23},
    'html': '<html>...</html>',
    'title': 'Page Title',
    'api_calls': [...],  # Captured API requests
    'images': [...]      # Downloaded images
}
```
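
A sketch of consuming this response: check `status`, write the HTML to disk, and walk the captured `api_calls` and `images` lists. The structure of the individual captured entries is not documented in this README, so the loops simply print whatever comes back.

```python
result = await pool.fetch_html(
    url="https://example.com",
    api_patterns=["*/api/*"],
    images_to_capture=["*.jpg", "*.png"],
)

if result['status']['success']:
    # Persist the rendered HTML for later parsing.
    with open("page.html", "w", encoding="utf-8") as fh:
        fh.write(result.get('html', ''))

    # Entry layout of api_calls/images is not specified here; just report them.
    for call in result.get('api_calls', []):
        print("captured API call:", call)
    for image in result.get('images', []):
        print("downloaded image:", image)
else:
    print("fetch failed:", result['status'].get('error'))
```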

## Session Management

```python
# Persistent session - maintains cookies/state
result1 = await pool.fetch_html(url="https://site.com/login", session_id="user1")
result2 = await pool.fetch_html(url="https://site.com/profile", session_id="user1")

# Non-persistent - fresh browser each time  
result3 = await pool.fetch_html(url="https://site.com", session_id=None)
```
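
Since `fetch_html()` is a coroutine and the pool auto-scales, independent sessions can also be crawled concurrently. A sketch assuming the pool accepts overlapping calls (bounded by `max_browsers`):

```python
import asyncio

async def crawl_user(pool, user_id):
    # Each user_id doubles as a session_id, so cookies/state stay isolated per user.
    await pool.fetch_html(url="https://site.com/login", session_id=user_id)
    profile = await pool.fetch_html(url="https://site.com/profile", session_id=user_id)
    return user_id, profile['status']['success']

results = await asyncio.gather(*(crawl_user(pool, uid) for uid in ["user1", "user2", "user3"]))
print(results)
```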

## Proxy Support

```python
# Single proxy
config = BrowserConfig(proxy_url="http://proxy:8080")

# The package integrates with rotating-mitmproxy for advanced proxy rotation
```
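
The rotating-mitmproxy integration is not detailed in this README; a conservative approach is to run the rotating proxy as a separate process and point `proxy_url` at whatever local address it exposes (the port below is only an example):

```python
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

# Assumes a rotating-mitmproxy instance is already listening locally;
# the address is illustrative, not a package default.
config = BrowserConfig(
    headless=True,
    proxy_url="http://localhost:8080",
)
pool = BrowserPoolManager(config.to_dict())
```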

## Testing

```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test categories
python -m pytest tests/ -m "not slow" -v
```

## License

MIT License - see LICENSE file for details.