# Pydantic Scrape
A modern web scraping framework that combines AI-powered content extraction with intelligent workflow orchestration. Built on pydantic-ai for reliable, type-safe scraping operations.
## Why Pydantic Scrape?
Web scraping is complex. You need to handle dynamic content, extract meaningful information, and orchestrate multi-step workflows. Most tools force you to choose between simple scrapers and complex frameworks with steep learning curves.
Pydantic Scrape bridges this gap by providing:
- **AI-powered extraction** - Let AI understand and extract what you need instead of writing brittle selectors
- **Type-safe workflows** - Structured data with validation built in (see the sketch after this list)
- **Academic research focus** - First-class support for papers, citations, and research workflows
- **Browser automation** - Handle JavaScript, authentication, and complex interactions seamlessly
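To make the type-safety point concrete, here is a minimal sketch using plain pydantic; the `Article` model and its fields are invented for illustration and are not part of the package:

```python
from pydantic import BaseModel, HttpUrl, ValidationError

# Declare the shape scraped records must take; pydantic validates each one.
class Article(BaseModel):
    title: str
    url: HttpUrl
    word_count: int

try:
    Article(title="Example", url="not-a-url", word_count=1200)
except ValidationError as exc:
    print(exc)  # malformed records fail loudly instead of polluting your data
```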
## Installation
### 1. Install Chawan Terminal Browser (Required)
Pydantic Scrape uses [Chawan](https://sr.ht/~bptato/chawan/) for advanced web automation and JavaScript-heavy sites.
**Option 1: Homebrew (macOS) - Stable Release**
```bash
brew install chawan
```
**Option 2: From Source - Latest Development Version**
```bash
# Install Nim compiler (if not already installed)
brew install nim # macOS
# OR for Linux: curl https://nim-lang.org/choosenim/init.sh -sSf | sh -s -- -y
# Build latest Chawan from source
git clone https://git.sr.ht/~bptato/chawan
cd chawan && make
# Install locally (recommended for development)
mkdir -p ~/.local/bin ~/.local/libexec
cp target/release/bin/cha ~/.local/bin/
cp -r target/release/libexec/chawan ~/.local/libexec/
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc  # use ~/.bashrc on Linux
source ~/.zshrc
```
**Verify installation:**
```bash
cha --version
```
### 2. Install Pydantic Scrape
**Lean Installation (Recommended)**
```bash
# Core functionality only (~50MB)
pip install pydantic-scrape
```
**Extended Installation with Optional Features**
```bash
# Quote extras so shells like zsh don't expand the square brackets

# YouTube video processing
pip install "pydantic-scrape[youtube]"

# AI services (OpenAI, Google AI)
pip install "pydantic-scrape[ai]"

# Academic research (OpenAlex, CrossRef)
pip install "pydantic-scrape[academic]"

# Document processing (PDF, Word, eBooks)
pip install "pydantic-scrape[documents]"

# Advanced content extraction
pip install "pydantic-scrape[content]"

# Everything included (~300MB)
pip install "pydantic-scrape[all]"

# Development tools (if contributing)
pip install "pydantic-scrape[dev]"
```
## Quick Start
Get a comprehensive research answer in under 10 lines:
```python
import asyncio
from pydantic_scrape.graphs.search_answer import search_answer

async def research():
    result = await search_answer(
        query="latest advances in quantum computing",
        max_search_results=5
    )

    print(f"Found {len(result['answer']['sources'])} sources")
    print(result['answer']['answer'])

asyncio.run(research())
```
This searches academic sources, extracts content, and generates a structured answer with citations - all automatically.
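Only the `answer` and `sources` keys appear in the snippet above, so treat anything beyond them as an assumption; a small illustrative follow-up:

```python
# Walk the cited sources from the quick-start result (the shape of each
# source entry is assumed for illustration).
answer = result["answer"]
for i, source in enumerate(answer["sources"], start=1):
    print(f"[{i}] {source}")
```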
## Core Features
### 🔍 Smart Content Extraction
```python
from pydantic_scrape.dependencies.fetch import fetch_url
# Automatically handles JavaScript, selects best extraction method
content = await fetch_url("https://example.com/article")
print(content.title, content.text, content.metadata)
```
### 🤖 AI-Powered Scraping
```python
from pydantic_scrape.agents.bs4_scrape_script_agent import get_bs4_scrape_script_agent
# AI writes the scraping code for you
agent = get_bs4_scrape_script_agent()
result = await agent.run(
    "Extract product prices from this e-commerce page",
    html_content=page_html,
)
```
### 📚 Academic Research
```python
from pydantic_scrape.dependencies.openalex import OpenAlexDependency
# Search papers by topic, author, or DOI
openalex = OpenAlexDependency()
papers = await openalex.search_papers("machine learning healthcare")
```
### 📄 Document Processing
```python
from pydantic_scrape.dependencies.document import DocumentDependency
# Extract text from PDFs, Word docs, EPUBs
doc = DocumentDependency()
content = await doc.extract_text("research_paper.pdf")
```
### 🌐 Advanced Browser Automation
```python
from pydantic_scrape.agents.search_and_browse import SearchAndBrowseAgent
# Intelligent search + browse with memory and geographic targeting
agent = SearchAndBrowseAgent()
result = await agent.run(
    "Find 5 cabinet refacing services in North West England with contact details"
)
# Automatically handles: cookie popups, JavaScript, geographic targeting,
# content caching, parallel browsing, and form detection
```
## Common Use Cases
- **Literature Reviews** - Automatically search, extract, and summarize academic papers
- **Market Research** - Monitor competitor content, pricing, and product updates
- **News Monitoring** - Track mentions, trends, and breaking news across sources
- **Content Migration** - Extract structured data from legacy systems or websites
- **Research Workflows** - Build custom pipelines for domain-specific content extraction
## Architecture
Pydantic Scrape organizes code into three layers:
- **Dependencies** (`pydantic_scrape.dependencies.*`) - Reusable components for specific tasks
- **Agents** (`pydantic_scrape.agents.*`) - AI-powered workers that make decisions
- **Graphs** (`pydantic_scrape.graphs.*`) - Orchestrate multi-step workflows
This makes it easy to compose complex workflows from simple, tested components.
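To make the layering concrete, here is a toy sketch built directly on pydantic-graph (a declared dependency of the package); the node names and logic are invented for illustration and are not the package's actual graph definitions:

```python
from dataclasses import dataclass

from pydantic_graph import BaseNode, End, Graph, GraphRunContext

@dataclass
class Summarize(BaseNode[None, None, str]):
    """Stands in for an AI agent: turns raw content into an answer."""
    html: str

    async def run(self, ctx: GraphRunContext) -> End[str]:
        return End(f"summary of {len(self.html)} chars")  # placeholder agent call

@dataclass
class Fetch(BaseNode):
    """Stands in for a fetch dependency: grabs raw content for a URL."""
    url: str

    async def run(self, ctx: GraphRunContext) -> Summarize:
        html = f"<html>content of {self.url}</html>"  # placeholder fetch
        return Summarize(html=html)

# The graph layer wires the steps together and runs them end to end.
toy_graph = Graph(nodes=(Fetch, Summarize))
result = toy_graph.run_sync(Fetch(url="https://example.com"))
print(result.output)
```

Real workflows follow the same pattern: dependencies do the I/O inside nodes, agents make the decisions, and the graph sequences the steps.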
## Configuration
### 1. Set API Keys
Create a `.env` file in your project root:
```bash
# AI Providers (choose one or more)
OPENAI_API_KEY=your_openai_key
GOOGLE_GENAI_API_KEY=your_google_key
ANTHROPIC_API_KEY=your_anthropic_key
# Google Search (for enhanced search capabilities)
GOOGLE_SEARCH_API_KEY=your_google_search_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
```
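If you load the keys yourself, python-dotenv (a declared dependency) makes it a one-liner; the provider check below is just an illustrative guard, not something the package requires:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

# Fail fast if no AI provider is configured (illustrative check).
providers = ("OPENAI_API_KEY", "GOOGLE_GENAI_API_KEY", "ANTHROPIC_API_KEY")
if not any(os.getenv(key) for key in providers):
    raise RuntimeError("Set at least one AI provider key in .env")
```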
### 2. Chawan Configuration
The package includes an optimized Chawan configuration in `.chawan/config.toml` that provides:
- **7x faster** web automation vs default settings
- **Cookie popup handling** without JavaScript overhead
- **Content caching** for instant subsequent operations
- **Geographic search targeting** for accurate local results
No additional Chawan setup required - works out of the box!
## Documentation
- [Installation Guide](INSTALLATION.md) - Detailed setup instructions
- [Examples](examples/) - Working code samples for common tasks
- [API Reference](https://github.com/philmade/pydantic_scrape) - Full documentation
## Contributing
We welcome contributions! The framework is designed to be extensible - add new content sources, AI agents, or workflow patterns.
See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.
## License
MIT License - see [LICENSE](LICENSE) for details.
---
**Ready to build intelligent scraping workflows?** Start with `pip install pydantic-scrape` and try the examples above.