# ScrapAI - AI-Powered Web Scraping Made Simple
**Extract data from any website or API using natural language - no coding required!**
ScrapAI uses artificial intelligence to understand what data you need and automatically extracts it from websites and APIs. Just describe what you want, and ScrapAI handles the rest.
---
## ✨ What Can You Achieve?
- **Extract structured data** from websites and APIs using simple descriptions
- **Create reusable scraping configurations** for repeated data collection
- **Get instant results** with one-off data extraction (SmartScraper)
- **Automate data pipelines** with scheduled scraping configurations
- **Use the AI service you prefer** - OpenAI, Ollama, Anthropic, Grok, and more are supported
- **No manual configuration** - AI discovers APIs, tests paths, and creates optimal configs automatically
---
## 🚀 Quick Start
### Installation
```bash
pip install scrapai
```
### Option 1: Direct Data Extraction (SmartScraper)
Get structured data immediately without creating configuration files:
```python
import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",       # or "openai", "grok", "anthropic", etc.
        service_key="your-api-key",  # not needed for local Ollama
    )

    result = await client.smartscraper(
        url="https://example.com/data",
        description="Get product name, price, and rating"
    )

    if result["success"]:
        print(result["data"])  # Structured JSON output

    await client.close()

asyncio.run(main())
```
**Output:**
```json
{
  "product_name": "Example Product",
  "price": 29.99,
  "rating": 4.5
}
```
### Option 2: Create Reusable Configuration
Generate a reusable scraping configuration for repeated data collection:
```python
import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",
        service_key="your-api-key",
    )

    # AI creates the configuration automatically
    result = client.add_config(
        url="https://api.example.com/metrics",
        description="Get transaction count and total volume"
    )

    config_name = result["config_name"]

    # Execute the configuration
    data = await client.execute_config(config_name)

    # Results are in structured format
    if data["success"]:
        for item in data["data"]:
            print(f"{item['name']}: {item['metric']} = {item['value']}")

    await client.close()

asyncio.run(main())
```
**Once created, you can run configurations anytime - perfect for scheduled jobs!**
```python
import asyncio
from scrapai import ScrapAIClient

# Run an existing configuration (no AI needed - already configured)
async def scheduled_data_collection():
    client = ScrapAIClient(service_name="ollama", service_key="your-key")

    # Execute any existing configuration
    data = await client.execute_config("my_config_name")

    # Process the data (save to database, send alerts, etc.)
    if data.get("success"):
        for item in data["data"]:
            print(f"Collected: {item['name']} - {item['metric']} = {item['value']}")

    await client.close()

# Use with cron jobs, task schedulers, or automation tools - see the
# entry-point sketch below. This runs without AI; it just executes the
# saved configuration.
```
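As a concrete (hypothetical) example, the script below wraps that pattern in a standalone entry point that a scheduler can invoke; the file path, schedule, and config name are placeholders, not ScrapAI conventions:

```python
# collect.py - hypothetical standalone entry point for a scheduler.
# Example crontab line (runs every 30 minutes):
#   */30 * * * * /usr/bin/python3 /path/to/collect.py
import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(service_name="ollama", service_key="your-key")
    data = await client.execute_config("my_config_name")  # no AI call, just execution
    if data.get("success"):
        for item in data["data"]:
            print(f"{item['name']}: {item['metric']} = {item['value']}")
    await client.close()

if __name__ == "__main__":
    asyncio.run(main())
```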
---
## 📋 Use Cases
### For Data Engineers
- Rapidly create scraping configs for data pipelines
- Automate data collection from multiple sources
- **Schedule recurring extractions** - Run saved configurations anytime (cron jobs, task schedulers, etc.)
- No AI calls needed for execution - configs run independently
### For Analysts
- Extract metrics from APIs and websites without coding
- Get structured data ready for analysis
- No need to learn XPath, CSS selectors, or API endpoints
### For Developers
- Integrate intelligent scraping into applications
- Support multiple AI services through a unified API
- Handle complex pages with automatic fallback strategies
---
## 🔧 Supported AI Services
ScrapAI works with any OpenAI-compatible API:
- **OpenAI** - GPT-4, GPT-3.5
- **Ollama** - Local models (llama3, qwen, mistral, etc.)
- **Anthropic** - Claude models
- **Grok** - xAI's Grok
- **Google** - Gemini models
- **Mistral AI** - Mistral models
- **Custom Services** - Any OpenAI-compatible endpoint
```python
# Using OpenAI
client = ScrapAIClient(
    service_name="openai",
    service_key="sk-...",
    service_model="gpt-4"
)

# Using Ollama (local)
client = ScrapAIClient(
    service_name="ollama",
    service_key="not-needed",  # local Ollama doesn't need a key
    service_model="llama3:latest"
)

# Using a custom service
client = ScrapAIClient(
    service_name="custom",
    service_key="your-key",
    service_base_url="https://your-api.com/v1",
    service_model="your-model"
)
```
---
## 💡 Key Features
### Intelligent Resource Selection
- **API-first approach** - Automatically discovers and uses APIs when available
- **HTML fallback** - Falls back to HTML scraping if the API fails
- **Multiple resources** - Configures automatic fallback strategies across resources (sketched below)
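To picture the ordering (this is an illustration only; ScrapAI's actual config schema is generated internally and may differ), think of a config as an ordered list of resources that are tried until one succeeds:

```python
# Hypothetical illustration of ordered fallback - NOT ScrapAI's real schema.
resources = [
    {"type": "api",  "url": "https://api.example.com/metrics"},  # preferred: tried first
    {"type": "html", "url": "https://example.com/metrics"},      # fallback if the API fails
]

def extract_with_fallback(resources, fetch):
    """Try each resource in order and return the first successful result."""
    for resource in resources:
        result = fetch(resource)  # 'fetch' stands in for the real extraction step
        if result is not None:
            return result
    return None
```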
### Automatic Configuration Generation
- AI analyzes URLs and discovers APIs
- Tests extraction paths before creating configs
- Iteratively refines until config works correctly
- Creates reusable configuration files (verified in the snippet below)
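Once generation finishes, you can sanity-check the result yourself using the client methods shown elsewhere in this README (run inside an async function):

```python
# Verify a freshly generated configuration
result = client.add_config(
    url="https://api.example.com/metrics",
    description="Get transaction count and total volume"
)
config_name = result["config_name"]

# The config now exists and can be listed and executed without further AI calls
assert config_name in client.list_configs()
data = await client.execute_config(config_name)
print(data)
```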
### Production-Ready
- Error handling and automatic retries (an extra application-level guard is sketched below)
- Proxy rotation support
- Browser rendering for JavaScript-heavy pages
- Structured data output with metadata
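ScrapAI retries internally; if you want an additional application-level guard around scheduled runs anyway, a plain wrapper like this sketch works (the attempt count and delay are arbitrary examples, not library defaults):

```python
import asyncio

async def execute_with_retry(client, config_name, attempts=3, delay=5.0):
    """Hypothetical extra guard around execute_config; numbers are arbitrary."""
    result = None
    for attempt in range(attempts):
        result = await client.execute_config(config_name)
        if result.get("success"):
            break
        if attempt < attempts - 1:
            await asyncio.sleep(delay)  # back off before retrying
    return result  # caller inspects result["success"] / result["data"]
```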
---
## 📖 Basic Usage
### List Available Configurations
```python
configs = client.list_configs()
print(configs) # ['config1', 'config2', ...]
```
### Execute a Configuration
```python
result = await client.execute_config("config_name")
if result["success"]:
    for item in result["data"]:
        print(f"{item['name']}: {item['metric']} = {item['value']}")
```
### Remove a Configuration
```python
client.remove_config("config_name")
```
---
## 📊 Output Format
All extractions return structured data:
```python
[
    {
        "name": "entity_name",
        "metric": "metric_name",
        "value": 12345,
        "date": "2024-01-15T10:30:00Z",
        "config_name": "my_config"
    },
    ...
]
```
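Because every record carries the same fields, the list drops straight into a table. A minimal sketch with pandas (not a ScrapAI dependency; install it separately):

```python
import pandas as pd

# result = await client.execute_config("my_config")  # as shown above
df = pd.DataFrame(result["data"])        # one row per extracted record
df["date"] = pd.to_datetime(df["date"])  # ISO timestamps -> datetime
print(df[["name", "metric", "value"]])
```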
---
## 🔗 Additional Resources
- **GitHub Repository**: [https://github.com/zohaib3249/scrapai](https://github.com/zohaib3249/scrapai)
- **Issue Tracker**: [https://github.com/zohaib3249/scrapai/issues](https://github.com/zohaib3249/scrapai/issues)
- **Documentation**: See GitHub README for detailed architecture and examples
---
## 📄 License
MIT License - See LICENSE file for details
---
## 🤝 Contributing
Contributions are welcome! Please see the GitHub repository for contribution guidelines.
---
## 👨‍💻 About the Author
**Zohaib Yousaf** - Full Stack Developer & Data Engineer
Passionate about building intelligent systems that automate complex workflows. ScrapAI combines my expertise in web scraping, AI integration, and data engineering to make data extraction accessible to everyone.
- **GitHub**: [@zohaib3249](https://github.com/zohaib3249)
- **Email**: chzohaib136@gmail.com
---
**Version**: 0.6.0
**Last Updated**: November 2025