crawlstudio

Name: crawlstudio
Version: 0.1.1
Summary: Unified Python library for web crawling tools - supports Firecrawl, Crawl4AI, Scrapy, and Browser-Use backends
Upload time: 2025-10-06 12:30:07
Requires Python: >=3.11
Keywords: browser automation, crawl4ai, crawling, firecrawl, scrapy, web scraping
Homepage / Repository: https://github.com/aiapicore/CrawlStudio
Requirements: pydantic, pydantic-settings, python-dotenv, crawl4ai, firecrawl-py, scrapy, langchain-anthropic, langchain-openai, browser-use, aiohttp, diskcache, playwright
# 🕷️ CrawlStudio

A unified wrapper around multiple web crawling tools, built with a modular, community-driven design.

## 🎯 Vision

CrawlStudio provides a unified Python API for various web crawling backends including Firecrawl, Crawl4AI, Scrapy, and Browser-Use (AI-driven). It emphasizes modularity, ease of use, and intelligent extraction capabilities.

## 📦 Installation

```bash
pip install crawlstudio
```

From source (recommended for contributors):
```bash
git clone https://github.com/aiapicore/CrawlStudio.git
cd CrawlStudio
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e .[dev]
```

Optional extras for the AI browser backend (run from a source checkout):
```bash
pip install .[browser-use]
```
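
To sanity-check an install, import the package and invoke the console script. This is a smoke test, assuming the `crawlstudio` entry point landed on your `PATH`; both `click`- and `argparse`-based CLIs accept `--help`:
```bash
python -c "import crawlstudio"  # exits silently if the package is importable
crawlstudio --help              # shows usage for the CLI documented below
```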

## ⚡ Quick Start
- CLI:
```bash
crawlstudio https://example.com --backend firecrawl --format markdown --print markdown
```
- Python:
```python
import asyncio
from crawlstudio import CrawlConfig, FirecrawlBackend

async def main():
    cfg = CrawlConfig()
    res = await FirecrawlBackend(cfg).crawl("https://example.com", format="markdown")
    print(res.markdown)

asyncio.run(main())
```

## 🔐 Environment
Create a `.env` file in the project root if you use external services or backends:
```env
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
```

The Browser-Use backend requires at least one of `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`.
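
A quick way to confirm those variables are visible to your process before invoking the Browser-Use backend; this sketch uses only the standard library and `python-dotenv` (already a declared dependency), with the variable names listed above:
```python
import os
from dotenv import load_dotenv  # python-dotenv is a declared dependency

load_dotenv()  # read .env from the current working directory

if not (os.getenv("OPENAI_API_KEY") or os.getenv("ANTHROPIC_API_KEY")):
    raise SystemExit("Browser-Use needs OPENAI_API_KEY or ANTHROPIC_API_KEY set")
```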

If you use headless browsers (via Browser-Use), install the Playwright runtime and browser binaries:
```bash
python -m pip install playwright
python -m playwright install
```

## 🚀 CLI Usage
After installation, use the CLI to crawl a URL with different backends and formats:
```bash
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend firecrawl --format markdown --print markdown
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend crawl4ai --format html --print html
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend scrapy --format markdown --print markdown
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend browser-use --format markdown --print markdown
```
- `--backend`: one of `firecrawl`, `crawl4ai`, `scrapy`, `browser-use`
- `--format`: one of `markdown`, `html`, `structured`
- `--print`: choose what to print: `summary` (default), `markdown`, `html`, `structured`
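
The CLI writes the selected content to stdout, so ordinary shell redirection captures a crawl; this assumes `--print markdown` emits only the page content:
```bash
# Save the Markdown rendering of a page to a file
crawlstudio https://example.com --backend crawl4ai --format markdown --print markdown > page.md
```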

## 📘 Usage Examples (API)
The library exposes a unified interface; below are end-to-end examples for each backend.

### Firecrawl Example
```python
import asyncio
from crawlstudio import CrawlConfig, FirecrawlBackend

async def main():
    config = CrawlConfig()
    backend = FirecrawlBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
    print(result.markdown)

asyncio.run(main())
```

### Crawl4AI Example
```python
import asyncio
from crawlstudio import CrawlConfig, Crawl4AIBackend

async def main():
    config = CrawlConfig()
    backend = Crawl4AIBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
    print(result.markdown)  # Markdown rendering of the page

asyncio.run(main())
```

### Scrapy Example
```python
import asyncio
from crawlstudio import CrawlConfig, ScrapyBackend

async def main():
    config = CrawlConfig()
    backend = ScrapyBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="html")
    print(result.raw_html)

asyncio.run(main())
```

### Browser-Use (AI-Driven) Example
```python
import asyncio
from crawlstudio import CrawlConfig, BrowserUseBackend

async def main():
    config = CrawlConfig()
    backend = BrowserUseBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
    print(result.markdown)  # AI-extracted data

asyncio.run(main())
```
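
Because every backend exposes the same async `crawl` coroutine, you can fan a URL out to several backends at once with `asyncio.gather`. A minimal sketch, assuming the constructors and `crawl` signature shown in the examples above:
```python
import asyncio
from crawlstudio import CrawlConfig, FirecrawlBackend, Crawl4AIBackend, ScrapyBackend

async def main():
    config = CrawlConfig()
    backends = [FirecrawlBackend(config), Crawl4AIBackend(config), ScrapyBackend(config)]
    url = "https://en.wikipedia.org/wiki/Switzerland"
    # Run all backends concurrently; exceptions are returned instead of raised
    results = await asyncio.gather(
        *(b.crawl(url, format="markdown") for b in backends),
        return_exceptions=True,
    )
    for backend, result in zip(backends, results):
        name = type(backend).__name__
        if isinstance(result, Exception):
            print(f"{name}: failed ({result})")
        else:
            print(f"{name}: {len(result.markdown or '')} chars of markdown")

asyncio.run(main())
```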

## 🧪 Tests & Checks
Run the test suite (pytest) and local checks (flake8, mypy):
```bash
pytest -q
flake8
mypy crawlstudio
```
Notes:
- The package requires Python 3.11+, so PEP 604 `X | Y` unions are used freely in type hints.
- Third-party libraries without type stubs are ignored by mypy (`ignore_missing_imports = true`).
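
For reference, the mypy behavior described above could be expressed like this in `pyproject.toml`; the exact location and keys in this repo are an assumption, but these are standard mypy settings:
```toml
[tool.mypy]
python_version = "3.11"
ignore_missing_imports = true   # skip third-party libraries without type stubs
```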

## 🛠️ Contributing Quickstart
- Fork and clone the repo, create a virtual env, then install dev deps:
```bash
git clone https://github.com/aiapicore/CrawlStudio.git
cd CrawlStudio
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1
pip install -e .[dev]
```
- Optional: install pre-commit hooks
```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files
```
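
A minimal `.pre-commit-config.yaml` wiring up the same checks; the hook repositories are the standard public ones, and the pinned revisions below are illustrative assumptions, not taken from this repo:
```yaml
repos:
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0          # illustrative pin; use the current release
    hooks:
      - id: flake8
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0        # illustrative pin
    hooks:
      - id: mypy
```
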
- Run the suite before submitting a PR:
```bash
flake8
mypy crawlstudio
pytest -q
```

## ⚡ Backend Comparison

| Backend | Speed | Cost | AI Intelligence | Best For |
|---------|-------|------|----------------|----------|
| **Firecrawl** | Fast | API costs | Medium | Production scraping |
| **Crawl4AI** | Medium | Free | Medium | Development & testing |
| **Scrapy** | Fastest | Free | Low | Simple HTML extraction |
| **Browser-Use** | Slower | AI costs | High | Complex dynamic sites |
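
The CLI's `--backend` flag implies a simple name-to-class mapping. Here is a sketch of selecting a backend by name in your own code, using only the classes shown in the examples above; the `BACKENDS` table and `make_backend` helper are hypothetical:
```python
from crawlstudio import (
    CrawlConfig,
    FirecrawlBackend,
    Crawl4AIBackend,
    ScrapyBackend,
    BrowserUseBackend,
)

# Mirrors the CLI's --backend choices
BACKENDS = {
    "firecrawl": FirecrawlBackend,
    "crawl4ai": Crawl4AIBackend,
    "scrapy": ScrapyBackend,
    "browser-use": BrowserUseBackend,
}

def make_backend(name: str, config: CrawlConfig):
    try:
        return BACKENDS[name](config)
    except KeyError:
        raise ValueError(f"unknown backend {name!r}; expected one of {sorted(BACKENDS)}")
```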

## 🔮 Future Enhancements

### Recursive Crawling (Planned)
```python
# Future API - configurable depth and page limits
config = CrawlConfig(
    max_depth=3,                    # Crawl up to 3 levels deep
    max_pages_per_level=5,          # Max 5 pages per depth level
    recursive_delay=1.0,            # 1 second delay between requests
    follow_external_links=False     # Stay within same domain
)

# Recursive crawling with depth control
result = await backend.crawl_recursive("https://example.com", format="markdown")
print(f"Crawled {len(result.pages)} pages across {result.max_depth_reached} levels")
```
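
Until `crawl_recursive` lands, bounded recursive crawling can be approximated on top of the existing single-page `crawl` API. A rough sketch using the stdlib `html.parser` for link discovery; `LinkParser` and `crawl_bounded` are hypothetical names, and it assumes `format="html"` yields a result with `raw_html`, as in the Scrapy example:
```python
import asyncio
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

async def crawl_bounded(backend, start_url: str, max_depth: int = 2, max_pages: int = 10):
    """Breadth-first crawl that stays on the start URL's domain."""
    domain = urlparse(start_url).netloc
    seen, frontier, pages = {start_url}, [(start_url, 0)], {}
    while frontier and len(pages) < max_pages:
        url, depth = frontier.pop(0)
        result = await backend.crawl(url, format="html")
        pages[url] = result
        if depth >= max_depth:
            continue  # don't discover links beyond the depth limit
        parser = LinkParser()
        parser.feed(result.raw_html or "")
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                frontier.append((absolute, depth + 1))
    return pages
```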

### Additional Crawler Backends (Roadmap)

#### High Priority
- **[Playwright](https://github.com/microsoft/playwright-python)** - Fast browser automation, excellent for SPAs
- **[Selenium](https://github.com/SeleniumHQ/selenium)** - Industry standard, huge ecosystem
- **[BeautifulSoup + Requests](https://github.com/psf/requests)** - Lightweight, simple parsing

#### Specialized Crawlers  
- **[Apify SDK](https://github.com/apify/apify-sdk-python)** - Cloud scraping platform
- **[Colly](https://github.com/gocolly/colly)** (via Python bindings) - High-performance Go crawler
- **[Puppeteer](https://github.com/puppeteer/puppeteer)** (via pyppeteer) - Headless Chrome control

#### AI-Enhanced Crawlers
- **[ScrapeGraphAI](https://github.com/VinciGit00/ScrapeGraphAI)** - LLM-powered scraping
- **[AutoScraper](https://github.com/alirezamika/autoscraper)** - Machine learning-based pattern detection
- **[WebGPT](https://github.com/sukhadagholba/webgpt)** - GPT-powered web interaction

#### Enterprise/Commercial
- **[ScrapingBee](https://www.scrapingbee.com/)** - Anti-bot bypass service
- **[Bright Data](https://brightdata.com/)** - Proxy + scraping platform  
- **[Zyte](https://www.zyte.com/)** - Enterprise web data platform

### Advanced Features (Future Versions)
- Multi-page crawling with link discovery
- Batch processing for multiple URLs
- Expanded CLI (e.g. a `crawlstudio crawl <url>` subcommand)
- Content deduplication and similarity detection
- Rate limiting and respectful crawling policies
- Caching system with Redis/disk storage (see the sketch after this list)
- Webhook integrations for real-time notifications
- GraphQL API for programmatic access
- Docker containerization for easy deployment
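
diskcache is already a declared dependency, so a disk-backed cache for crawl results can be layered on today. A minimal sketch, assuming results are picklable and keying on backend, URL, and format; `cached_crawl` is a hypothetical helper:
```python
from diskcache import Cache  # already in the dependency list

cache = Cache("./.crawlstudio_cache")  # on-disk cache directory

async def cached_crawl(backend, url: str, format: str = "markdown", ttl: int = 3600):
    """Return a cached result if present, otherwise crawl and store it."""
    key = (type(backend).__name__, url, format)
    result = cache.get(key)          # None on a cache miss
    if result is None:
        result = await backend.crawl(url, format=format)
        cache.set(key, result, expire=ttl)  # evict after `ttl` seconds
    return result
```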

### Development Roadmap
1. **Core Features** (Current): 4 working backends
2. **Recursive Crawling**: Depth-based multi-page crawling  
3. **CLI Tool** (available now): `pip install crawlstudio` → command-line usage
4. **Additional Backends**: Playwright, Selenium, BeautifulSoup
5. **Enterprise Features**: Batch processing, advanced caching
6. **AI Integration**: More AI-powered extraction capabilities
7. **Cloud Platform**: SaaS offering with web interface
            
