# 🕷️ CrawlStudio
Unified wrapper for web crawling tools, inspired by modular, community-driven design.
## 🎯 Vision
CrawlStudio provides a unified Python API for various web crawling backends including Firecrawl, Crawl4AI, Scrapy, and Browser-Use (AI-driven). It emphasizes modularity, ease of use, and intelligent extraction capabilities.
## 📦 Installation
```bash
pip install crawlstudio
```
From source (recommended for contributors):
```bash
git clone https://github.com/aiapicore/CrawlStudio.git
cd CrawlStudio
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e .[dev]
```
Optional extras for AI browser backend:
```bash
pip install .[browser-use]
```
## ⚡ Quick Start
- CLI:
```bash
crawlstudio https://example.com --backend firecrawl --format markdown --print markdown
```
- Python:
```python
import asyncio
from crawlstudio import CrawlConfig, FirecrawlBackend

async def main():
    cfg = CrawlConfig()
    res = await FirecrawlBackend(cfg).crawl("https://example.com", format="markdown")
    print(res.markdown)

asyncio.run(main())
```
## 🔐 Environment
Create a `.env` in the project root if using external services/backends:
```env
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
```
For the Browser-Use backend, set at least one of `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` (a quick pre-flight check is sketched below).
If you use headless browsers (via browser-use), install the Playwright runtime:
```bash
python -m pip install playwright
python -m playwright install
```
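Before invoking the Browser-Use backend, it can be useful to confirm the required keys are actually loaded. The snippet below is a minimal sketch, not part of crawlstudio; it assumes `python-dotenv` is installed and that the key names match the `.env` example above.
```python
# Hypothetical pre-flight check before using the Browser-Use backend.
# Assumes python-dotenv is installed; key names follow the .env example above.
import os

from dotenv import load_dotenv

load_dotenv()  # read .env from the project root into the process environment

if not (os.getenv("OPENAI_API_KEY") or os.getenv("ANTHROPIC_API_KEY")):
    raise RuntimeError("Browser-Use needs OPENAI_API_KEY or ANTHROPIC_API_KEY to be set")
```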
## 🚀 CLI Usage
After installation, use the CLI to crawl a URL with different backends and output formats (a scripted example follows the option list):
```bash
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend firecrawl --format markdown --print markdown
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend crawl4ai --format html --print html
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend scrapy --format markdown --print markdown
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend browser-use --format markdown --print markdown
```
- `--backend`: one of `firecrawl`, `crawl4ai`, `scrapy`, `browser-use`
- `--format`: one of `markdown`, `html`, `structured`
- `--print`: choose what to print: `summary` (default), `markdown`, `html`, `structured`
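If you want to drive the CLI from Python, for example to save the printed markdown to a file, something like the following works. This is an illustrative sketch using only the flags documented above; the output path is an arbitrary choice.
```python
# Sketch: invoke the crawlstudio CLI from Python and save the printed markdown.
import subprocess
from pathlib import Path

proc = subprocess.run(
    [
        "crawlstudio",
        "https://en.wikipedia.org/wiki/Switzerland",
        "--backend", "crawl4ai",
        "--format", "markdown",
        "--print", "markdown",
    ],
    capture_output=True,
    text=True,
    check=True,
)
Path("switzerland.md").write_text(proc.stdout, encoding="utf-8")
```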
## 📘 Usage Examples (API)
The library exposes a unified interface; below are end-to-end examples for each backend.
### 🧑‍💻 Python Usage
```python
import asyncio
from crawlstudio import CrawlConfig, FirecrawlBackend

async def main():
    config = CrawlConfig()
    backend = FirecrawlBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
    print(result.markdown)

asyncio.run(main())
```
### Firecrawl Example
```python
import asyncio
from crawlstudio import CrawlConfig, FirecrawlBackend

async def main():
    config = CrawlConfig()
    backend = FirecrawlBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
    print(result.markdown)

asyncio.run(main())
```
### Crawl4AI Example
```python
import asyncio
from crawlstudio import CrawlConfig, Crawl4AIBackend

async def main():
    config = CrawlConfig()
    backend = Crawl4AIBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
    print(result.markdown)  # Outputs title, summary, keywords

asyncio.run(main())
```
### Scrapy Example
```python
import asyncio
from crawlstudio import CrawlConfig, ScrapyBackend

async def main():
    config = CrawlConfig()
    backend = ScrapyBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="html")
    print(result.raw_html)

asyncio.run(main())
```
### Browser-Use (AI-Driven) Example
```python
import asyncio
from crawlstudio import CrawlConfig, BrowserUseBackend

async def main():
    config = CrawlConfig()
    backend = BrowserUseBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
    print(result.markdown)  # AI-extracted data

asyncio.run(main())
```
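Because every backend exposes the same `crawl(url, format=...)` coroutine, the same URL can be fanned out across several backends in one event loop. The snippet below is an illustrative sketch, not part of the documented API; it only uses the classes and result fields shown in the examples above, and Firecrawl still requires `FIRECRAWL_API_KEY` to be set.
```python
# Sketch: crawl one URL with several backends concurrently and compare output sizes.
import asyncio

from crawlstudio import Crawl4AIBackend, CrawlConfig, FirecrawlBackend, ScrapyBackend

async def main():
    config = CrawlConfig()
    backends = {
        "firecrawl": FirecrawlBackend(config),  # needs FIRECRAWL_API_KEY
        "crawl4ai": Crawl4AIBackend(config),
        "scrapy": ScrapyBackend(config),
    }
    results = await asyncio.gather(
        *(b.crawl("https://example.com", format="markdown") for b in backends.values())
    )
    for name, result in zip(backends, results):
        print(name, len(result.markdown or ""))

asyncio.run(main())
```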
## 🧪 Tests & Checks
Run the test suite (pytest) and local checks (flake8, mypy):
```bash
pytest -q
flake8
mypy crawlstudio
```
Notes:
- We target Python 3.10+ for typing (PEP 604 `X | Y` unions); a short illustration follows these notes.
- Third-party libraries without type stubs are ignored by mypy (`ignore_missing_imports = true`).
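For context, this is the union style the note above refers to. The helper is hypothetical and not part of crawlstudio:
```python
# Hypothetical helper illustrating PEP 604 unions (Python 3.10+); not part of crawlstudio.
def pick_output(markdown: str | None, html: str | None) -> str:
    """Prefer markdown when available, otherwise fall back to HTML."""
    return markdown if markdown is not None else (html or "")
```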
## 🛠️ Contributing Quickstart
- Fork and clone the repo, create a virtual env, then install dev deps:
```bash
git clone https://github.com/aiapicore/CrawlStudio.git
cd CrawlStudio
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\Activate.ps1
pip install -e .[dev]
```
- Optional: install pre-commit hooks
```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files
```
- Run the suite before submitting a PR:
```bash
flake8
mypy crawlstudio
pytest -q
```
## ⚡ Backend Comparison
| Backend | Speed | Cost | AI Intelligence | Best For |
|---------|-------|------|----------------|----------|
| **Firecrawl** | Fast | API costs | Medium | Production scraping |
| **Crawl4AI** | Medium | Free | Medium | Development & testing |
| **Scrapy** | Fastest | Free | Low | Simple HTML extraction |
| **Browser-Use** | Slower | AI costs | High | Complex dynamic sites |
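If you want to choose a backend at runtime based on the trade-offs above, a small registry keyed by the CLI's `--backend` names is one way to do it. The helper below is a hypothetical sketch, not a crawlstudio API; it only imports classes documented earlier in this README.
```python
# Hypothetical registry mapping CLI backend names to classes; not part of the library API.
from crawlstudio import (
    BrowserUseBackend,
    Crawl4AIBackend,
    CrawlConfig,
    FirecrawlBackend,
    ScrapyBackend,
)

BACKENDS = {
    "firecrawl": FirecrawlBackend,     # fast, paid API, production scraping
    "crawl4ai": Crawl4AIBackend,       # free, good default for development
    "scrapy": ScrapyBackend,           # fastest for plain HTML extraction
    "browser-use": BrowserUseBackend,  # slower, AI-driven, dynamic sites
}

def make_backend(name: str, config: CrawlConfig):
    """Instantiate a backend by its CLI name (e.g. 'crawl4ai')."""
    return BACKENDS[name](config)
```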
## 🔮 Future Enhancements
### Recursive Crawling (Planned)
```python
# Future API - configurable depth and page limits
config = CrawlConfig(
    max_depth=3,                  # Crawl up to 3 levels deep
    max_pages_per_level=5,        # Max 5 pages per depth level
    recursive_delay=1.0,          # 1 second delay between requests
    follow_external_links=False,  # Stay within the same domain
)
# Recursive crawling with depth control
result = await backend.crawl_recursive("https://example.com", format="markdown")
print(f"Crawled {len(result.pages)} pages across {result.max_depth_reached} levels")
```
### Additional Crawler Backends (Roadmap)
#### High Priority
- **[Playwright](https://github.com/microsoft/playwright-python)** - Fast browser automation, excellent for SPAs
- **[Selenium](https://github.com/SeleniumHQ/selenium)** - Industry standard, huge ecosystem
- **[BeautifulSoup + Requests](https://github.com/psf/requests)** - Lightweight, simple parsing
#### Specialized Crawlers
- **[Apify SDK](https://github.com/apify/apify-sdk-python)** - Cloud scraping platform
- **[Colly](https://github.com/gocolly/colly)** (via Python bindings) - High-performance Go crawler
- **[Puppeteer](https://github.com/puppeteer/puppeteer)** (via pyppeteer) - Headless Chrome control
#### AI-Enhanced Crawlers
- **[ScrapeGraphAI](https://github.com/VinciGit00/ScrapeGraphAI)** - LLM-powered scraping
- **[AutoScraper](https://github.com/alirezamika/autoscraper)** - Machine learning-based pattern detection
- **[WebGPT](https://github.com/sukhadagholba/webgpt)** - GPT-powered web interaction
#### Enterprise/Commercial
- **[ScrapingBee](https://www.scrapingbee.com/)** - Anti-bot bypass service
- **[Bright Data](https://brightdata.com/)** - Proxy + scraping platform
- **[Zyte](https://www.zyte.com/)** - Enterprise web data platform
### Advanced Features (Future Versions)
- Multi-page crawling with link discovery
- Batch processing for multiple URLs
- CLI tool (`crawlstudio crawl <url>`)
- Content deduplication and similarity detection
- Rate limiting and respectful crawling policies
- Caching system with Redis/disk storage
- Webhook integrations for real-time notifications
- GraphQL API for programmatic access
- Docker containerization for easy deployment
### Development Roadmap
1. **Core Features** (Current): 4 working backends
2. **Recursive Crawling**: Depth-based multi-page crawling
3. **CLI Tool**: `pip install crawlstudio` → command line usage
4. **Additional Backends**: Playwright, Selenium, BeautifulSoup
5. **Enterprise Features**: Batch processing, advanced caching
6. **AI Integration**: More AI-powered extraction capabilities
7. **Cloud Platform**: SaaS offering with web interface