Name: pydantic-scrape
Version: 0.2.2
Summary: Advanced web automation framework with AI-powered agents, Chawan terminal browser integration, and geographic search targeting
Upload time: 2025-07-31 20:38:32
Author: Pydantic Scrape Contributors
Requires Python: >=3.10
License: MIT
Keywords: scraping, web-scraping, pydantic, chawan, automation, browser-automation, ai-agents, geographic-search
Requirements: pydantic-ai, pydantic-graph, camoufox, loguru, newspaper3k, beautifulsoup4, httpx, pyalex, habanero, goose3, PyMuPDF, python-docx, EbookLib, yt-dlp, lxml, lxml_html_clean, rapidfuzz, python-dotenv, google-generativeai, openai, platformdirs, searchthescience
# Pydantic Scrape

A modern web scraping framework that combines AI-powered content extraction with intelligent workflow orchestration. Built on pydantic-ai for reliable, type-safe scraping operations.

## Why Pydantic Scrape?

Web scraping is complex. You need to handle dynamic content, extract meaningful information, and orchestrate multi-step workflows. Most tools force you to choose between simple scrapers and complex frameworks with steep learning curves.

Pydantic Scrape bridges this gap by providing:

- **AI-powered extraction** - Let AI understand and extract what you need instead of writing brittle selectors
- **Type-safe workflows** - Structured data with built-in validation (a quick sketch follows this list)  
- **Academic research focus** - First-class support for papers, citations, and research workflows
- **Browser automation** - Handle JavaScript, authentication, and complex interactions seamlessly
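
The type-safe part builds on plain Pydantic models: scraped output is parsed into a schema and validated on construction. A minimal sketch of the idea (the `Article` model and its fields are illustrative, not part of this package's API):

```python
from pydantic import BaseModel, HttpUrl


class Article(BaseModel):
    # Illustrative schema: malformed scrape output fails validation
    # immediately instead of propagating downstream.
    title: str
    url: HttpUrl
    word_count: int


article = Article(title="Example", url="https://example.com/a", word_count=1200)
print(article.model_dump())
```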

## Installation

### 1. Install Chawan Terminal Browser (Required)

Pydantic Scrape uses [Chawan](https://sr.ht/~bptato/chawan/) for advanced web automation and JavaScript-heavy sites.

**Option 1: Homebrew (macOS) - Stable Release**
```bash
brew install chawan
```

**Option 2: From Source - Latest Development Version**
```bash
# Install Nim compiler (if not already installed)
brew install nim  # macOS
# OR for Linux: curl https://nim-lang.org/choosenim/init.sh -sSf | sh -s -- -y

# Build latest Chawan from source
git clone https://git.sr.ht/~bptato/chawan
cd chawan && make

# Install locally (recommended for development)
mkdir -p ~/.local/bin ~/.local/libexec
cp target/release/bin/cha ~/.local/bin/
cp -r target/release/libexec/chawan ~/.local/libexec/
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
```

**Verify installation:**
```bash
cha --version
```

### 2. Install Pydantic Scrape

**Lean Installation (Recommended)**
```bash
# Core functionality only (~50MB)
pip install pydantic-scrape
```

**Extended Installation with Optional Features**
```bash
# YouTube video processing
pip install pydantic-scrape[youtube]

# AI services (OpenAI, Google AI)  
pip install pydantic-scrape[ai]

# Academic research (OpenAlex, CrossRef)
pip install pydantic-scrape[academic]

# Document processing (PDF, Word, eBooks)
pip install pydantic-scrape[documents]

# Advanced content extraction
pip install pydantic-scrape[content]

# Everything included (~300MB)
pip install pydantic-scrape[all]

# Development tools (if contributing)
pip install pydantic-scrape[dev]
```
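
Extras can be combined in a single install. Note that shells like zsh expand square brackets, so quoting the requirement is the portable form:

```bash
# Quotes keep zsh from treating the brackets as a glob pattern
pip install "pydantic-scrape[ai,academic]"
```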

## Quick Start

Get a comprehensive research answer in under 10 lines:

```python
import asyncio
from pydantic_scrape.graphs.search_answer import search_answer

async def research():
    result = await search_answer(
        query="latest advances in quantum computing",
        max_search_results=5
    )
    
    print(f"Found {len(result['answer']['sources'])} sources")
    print(result['answer']['answer'])

asyncio.run(research())
```

This searches academic sources, extracts content, and generates a structured answer with citations - all automatically.

## Core Features

### 🔍 Smart Content Extraction
```python
from pydantic_scrape.dependencies.fetch import fetch_url

# Automatically handles JavaScript, selects best extraction method
content = await fetch_url("https://example.com/article")
print(content.title, content.text, content.metadata)
```

### 🤖 AI-Powered Scraping
```python
from pydantic_scrape.agents.bs4_scrape_script_agent import get_bs4_scrape_script_agent

# AI writes the scraping code for you
agent = get_bs4_scrape_script_agent()
result = agent.run_sync("Extract product prices from this e-commerce page",
                        html_content=page_html)
```

### 📚 Academic Research
```python
from pydantic_scrape.dependencies.openalex import OpenAlexDependency

# Search papers by topic, author, or DOI
openalex = OpenAlexDependency()
papers = await openalex.search_papers("machine learning healthcare")
```

### 📄 Document Processing
```python
from pydantic_scrape.dependencies.document import DocumentDependency

# Extract text from PDFs, Word docs, EPUBs
doc = DocumentDependency()
content = await doc.extract_text("research_paper.pdf")
```

### 🌐 Advanced Browser Automation
```python
from pydantic_scrape.agents.search_and_browse import SearchAndBrowseAgent

# Intelligent search + browse with memory and geographic targeting
agent = SearchAndBrowseAgent()
result = await agent.run(
    "Find 5 cabinet refacing services in North West England with contact details"
)

# Automatically handles: cookie popups, JavaScript, geographic targeting, 
# content caching, parallel browsing, and form detection
```

## Common Use Cases

- **Literature Reviews** - Automatically search, extract, and summarize academic papers
- **Market Research** - Monitor competitor content, pricing, and product updates  
- **News Monitoring** - Track mentions, trends, and breaking news across sources
- **Content Migration** - Extract structured data from legacy systems or websites
- **Research Workflows** - Build custom pipelines for domain-specific content extraction

## Architecture

Pydantic Scrape organizes code into three layers:

- **Dependencies** (`pydantic_scrape.dependencies.*`) - Reusable components for specific tasks
- **Agents** (`pydantic_scrape.agents.*`) - AI-powered workers that make decisions  
- **Graphs** (`pydantic_scrape.graphs.*`) - Orchestrate multi-step workflows

This makes it easy to compose complex workflows from simple, tested components.
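
For a flavor of the graph layer, here is a minimal sketch using pydantic-graph (a declared dependency of this package). The `Fetch` and `Summarize` nodes are hypothetical stand-ins, not pydantic-scrape APIs; real graphs would call the dependency and agent layers inside each node:

```python
from __future__ import annotations

from dataclasses import dataclass

from pydantic_graph import BaseNode, End, Graph, GraphRunContext


@dataclass
class Fetch(BaseNode):
    """Hypothetical node: fetch a page, then hand off to Summarize."""

    url: str

    async def run(self, ctx: GraphRunContext) -> Summarize:
        html = f"<html>stub for {self.url}</html>"  # a real node would call a fetch dependency
        return Summarize(html=html)


@dataclass
class Summarize(BaseNode[None, None, str]):
    """Hypothetical terminal node: End(...) carries the workflow's output."""

    html: str

    async def run(self, ctx: GraphRunContext) -> End[str]:
        return End(self.html[:40])


graph = Graph(nodes=(Fetch, Summarize))
# result = await graph.run(Fetch(url="https://example.com"))
# result.output holds the value passed to End(...)
```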

## Configuration

### 1. Set API Keys
Create a `.env` file in your project root:

```bash
# AI Providers (choose one or more)
OPENAI_API_KEY=your_openai_key
GOOGLE_GENAI_API_KEY=your_google_key  
ANTHROPIC_API_KEY=your_anthropic_key

# Google Search (for enhanced search capabilities)
GOOGLE_SEARCH_API_KEY=your_google_search_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
```
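
If you need the keys in your own code, python-dotenv (already a dependency) loads the `.env` file into the process environment; a minimal sketch:

```python
import os

from dotenv import load_dotenv

# Reads key=value pairs from .env into os.environ
# (existing environment variables win by default)
load_dotenv()

if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
```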

### 2. Chawan Configuration
The package includes an optimized Chawan configuration in `.chawan/config.toml` that provides:

- **7x faster** web automation vs default settings
- **Cookie popup handling** without JavaScript overhead  
- **Content caching** for instant subsequent operations
- **Geographic search targeting** for accurate local results

No additional Chawan setup required - works out of the box!

## Documentation

- [Installation Guide](INSTALLATION.md) - Detailed setup instructions
- [Examples](examples/) - Working code samples for common tasks
- [API Reference](https://github.com/philmade/pydantic_scrape) - Full documentation

## Contributing

We welcome contributions! The framework is designed to be extensible - add new content sources, AI agents, or workflow patterns.

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.

## License

MIT License - see [LICENSE](LICENSE) for details.

---

**Ready to build intelligent scraping workflows?** Start with `pip install pydantic-scrape` and try the examples above.
