scrapai

Name: scrapai
Version: 0.6.0
Home page: https://github.com/zohaib3249/scrapai
Summary: AI-powered web scraping SDK with intelligent configuration generation
Upload time: 2025-11-05 09:20:45
Author: Zohaib Yousaf
Requires Python: >=3.8
License: MIT
Keywords: web-scraping, ai, scraping, automation, data-extraction, scraping-sdk, ai-agent, web-crawler

# ScrapAI - AI-Powered Web Scraping Made Simple

**Extract data from any website or API using natural language - no coding required!**

ScrapAI uses artificial intelligence to understand what data you need and automatically extracts it from websites and APIs. Just describe what you want, and ScrapAI handles the rest.

---

## ✨ What Can You Achieve?

- **Extract structured data** from websites and APIs using simple descriptions
- **Create reusable scraping configurations** for repeated data collection
- **Get instant results** with one-off data extraction (SmartScraper)
- **Automate data pipelines** with scheduled scraping configurations
- **Use multiple AI services** including OpenAI, Ollama, Anthropic, Grok, and more
- **No manual configuration** - AI discovers APIs, tests paths, and creates optimal configs automatically

---

## 🚀 Quick Start

### Installation

```bash
pip install scrapai
```

### Option 1: Direct Data Extraction (SmartScraper)

Get structured data immediately without creating configuration files:

```python
import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",  # or "openai", "grok", "anthropic", etc.
        service_key="your-api-key",  # not needed for local Ollama
    )
    
    result = await client.smartscraper(
        url="https://example.com/data",
        description="Get product name, price, and rating"
    )
    
    if result["success"]:
        print(result["data"])  # Structured JSON output
    
    await client.close()

asyncio.run(main())
```

**Output:**
```json
{
  "product_name": "Example Product",
  "price": 29.99,
  "rating": 4.5
}
```
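
If you want to keep one-off results around, you can dump them to disk with the standard library. A minimal sketch, assuming the `result` dict from the example above:

```python
import json

# Persist the structured output (the "data" key holds the JSON shown above).
if result["success"]:
    with open("product.json", "w") as f:
        json.dump(result["data"], f, indent=2)
```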

### Option 2: Create Reusable Configuration

Generate a reusable scraping configuration for repeated data collection:

```python
import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",
        service_key="your-api-key",
    )
    
    # AI creates the configuration automatically
    result = client.add_config(
        url="https://api.example.com/metrics",
        description="Get transaction count and total volume"
    )
    
    config_name = result["config_name"]
    
    # Execute the configuration
    data = await client.execute_config(config_name)
    
    # Results are in structured format
    if data["success"]:
        for item in data["data"]:
            print(f"{item['name']}: {item['metric']} = {item['value']}")
    
    await client.close()

asyncio.run(main())
```

**Once created, you can run configurations anytime - perfect for scheduled jobs!**

```python
# Run existing configuration (no AI needed - already configured)
async def scheduled_data_collection():
    client = ScrapAIClient(service_name="ollama", service_key="your-key")
    
    # Execute any existing configuration
    data = await client.execute_config("my_config_name")
    
    # Process the data (save to database, send alerts, etc.)
    if data.get("success"):
        for item in data["data"]:
            print(f"Collected: {item['name']} - {item['metric']} = {item['value']}")
    
    await client.close()

# Use with cron jobs, task schedulers, or automation tools
# This runs without AI - just executes the saved configuration
```
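
If you would rather keep scheduling in-process than rely on cron, a plain asyncio loop is enough. A minimal sketch, assuming an hourly cadence and the `scheduled_data_collection` function defined above:

```python
import asyncio

async def run_forever(interval_seconds: int = 3600):
    # Re-run the saved-config collection on a fixed interval.
    while True:
        await scheduled_data_collection()
        await asyncio.sleep(interval_seconds)

# asyncio.run(run_forever())  # long-lived process; stop with Ctrl+C
```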

---

## 📋 Use Cases

### For Data Engineers
- Rapidly create scraping configs for data pipelines
- Automate data collection from multiple sources
- **Schedule recurring extractions** - Run saved configurations anytime (cron jobs, task schedulers, etc.)
- No AI calls needed for execution - configs run independently

### For Analysts
- Extract metrics from APIs and websites without coding
- Get structured data ready for analysis
- No need to learn XPath, CSS selectors, or API endpoints

### For Developers
- Integrate intelligent scraping into applications
- Support multiple AI services with unified API
- Handle complex pages with automatic fallback strategies

---

## 🔧 Supported AI Services

ScrapAI works with any OpenAI-compatible API:

- **OpenAI** - GPT-4, GPT-3.5
- **Ollama** - Local models (llama3, qwen, mistral, etc.)
- **Anthropic** - Claude models
- **Grok** - xAI's Grok
- **Google** - Gemini models
- **Mistral AI** - Mistral models
- **Custom Services** - Any OpenAI-compatible endpoint

```python
# Using OpenAI
client = ScrapAIClient(
    service_name="openai",
    service_key="sk-...",
    service_model="gpt-4"
)

# Using Ollama (local)
client = ScrapAIClient(
    service_name="ollama",
    service_key="not-needed",  # Local Ollama doesn't need key
    service_model="llama3:latest"
)

# Using custom service
client = ScrapAIClient(
    service_name="custom",
    service_key="your-key",
    service_base_url="https://your-api.com/v1",
    service_model="your-model"
)
```
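
Hardcoding keys in source is easy to leak. A common pattern (not specific to ScrapAI) is to read them from environment variables; the variable names below are illustrative, not required by the SDK:

```python
import os

from scrapai import ScrapAIClient

# SCRAPAI_SERVICE / SCRAPAI_KEY / SCRAPAI_MODEL are hypothetical names you
# would export in your shell or deployment environment.
client = ScrapAIClient(
    service_name=os.environ.get("SCRAPAI_SERVICE", "ollama"),
    service_key=os.environ.get("SCRAPAI_KEY", "not-needed"),
    service_model=os.environ.get("SCRAPAI_MODEL", "llama3:latest"),
)
```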

---

## 💡 Key Features

### Intelligent Resource Selection
- **API-first approach** - Automatically discovers and uses APIs when available
- **HTML fallback** - Falls back to HTML scraping if API fails
- **Multiple resources** - Configures automatic fallback strategies (see the sketch below)
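
To picture the strategy, here is an illustrative sketch of the API-first-then-HTML pattern. This is not ScrapAI's internal code, just the general shape of a prioritized fallback:

```python
from typing import Any, Callable, List, Optional

def extract_with_fallback(resources: List[Callable[[], Any]]) -> Any:
    # Try each configured resource in priority order (API first, HTML last)
    # and return the first result that succeeds.
    last_error: Optional[Exception] = None
    for fetch in resources:
        try:
            return fetch()
        except Exception as exc:  # a real implementation would narrow this
            last_error = exc
    raise RuntimeError("all resources failed") from last_error
```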

### Automatic Configuration Generation
- AI analyzes URLs and discovers APIs
- Tests extraction paths before creating configs
- Iteratively refines until config works correctly
- Creates reusable configuration files

### Production-Ready
- Error handling and automatic retries (a caller-side retry sketch follows this list)
- Proxy rotation support
- Browser rendering for JavaScript-heavy pages
- Structured data output with metadata
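
The library retries internally, but you can add another layer at the call site for long-running pipelines. A minimal sketch with exponential backoff, assuming a `client` like the ones in earlier examples:

```python
import asyncio

async def execute_with_retry(client, config_name, attempts=3, base_delay=1.0):
    # Caller-side retry with exponential backoff; complements the
    # library's built-in error handling rather than replacing it.
    for attempt in range(attempts):
        try:
            return await client.execute_config(config_name)
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))
```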

---

## 📖 Basic Usage

### List Available Configurations

```python
configs = client.list_configs()
print(configs)  # ['config1', 'config2', ...]
```

### Execute a Configuration

```python
result = await client.execute_config("config_name")
if result["success"]:
    for item in result["data"]:
        print(f"{item['name']}: {item['metric']} = {item['value']}")
```

### Remove a Configuration

```python
client.remove_config("config_name")
```

---

## 📊 Output Format

Configuration executions return records in a consistent structure; the list below is the `data` payload from `execute_config`:

```python
[
    {
        "name": "entity_name",
        "metric": "metric_name",
        "value": 12345,
        "date": "2024-01-15T10:30:00Z",
        "config_name": "my_config"
    },
    ...
]
```
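
Because every record carries the same keys, the output drops straight into tabular tools. A small sketch that writes the records to CSV with the standard library, assuming `records` holds the list shown above:

```python
import csv

def records_to_csv(records, path="metrics.csv"):
    # All records share the same fields, so the first one defines the header.
    if not records:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
```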

---

## 🔗 Additional Resources

- **GitHub Repository**: [https://github.com/zohaib3249/scrapai](https://github.com/zohaib3249/scrapai)
- **Issue Tracker**: [https://github.com/zohaib3249/scrapai/issues](https://github.com/zohaib3249/scrapai/issues)
- **Documentation**: See GitHub README for detailed architecture and examples

---

## 📄 License

MIT License - See LICENSE file for details

---

## 🤝 Contributing

Contributions are welcome! Please see the GitHub repository for contribution guidelines.

---

## 👨‍💻 About the Author

**Zohaib Yousaf** - Full Stack Developer & Data Engineer

Passionate about building intelligent systems that automate complex workflows. ScrapAI combines my expertise in web scraping, AI integration, and data engineering to make data extraction accessible to everyone.

- **GitHub**: [@zohaib3249](https://github.com/zohaib3249)
- **Email**: chzohaib136@gmail.com

---

**Version**: 0.6.0  
**Last Updated**: November 2025


            
