# 🌐 ScrapeGraph Python SDK
<p align="left">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/api-banner.png" alt="ScrapeGraph API Banner" style="width: 70%;">
</p>
Official [Python SDK](https://scrapegraphai.com) for the ScrapeGraph API - Smart web scraping powered by AI.
## 📦 Installation
```bash
pip install scrapegraph-py
```
## 🚀 Features
- 🤖 AI-powered web scraping and search
- 🕷️ Smart crawling with both AI extraction and markdown conversion modes
- 💰 Cost-effective markdown conversion (80% savings vs AI mode)
- 🔄 Both sync and async clients
- 📊 Structured output with Pydantic schemas
- 🔍 Detailed logging
- ⚡ Automatic retries
- 🔐 Secure authentication
## 🎯 Quick Start
```python
from scrapegraph_py import Client
client = Client(api_key="your-api-key-here")
```
> [!NOTE]
> You can set the `SGAI_API_KEY` environment variable and initialize the client without parameters: `client = Client()`
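
As a minimal sketch of the environment-variable route, the helper below checks that `SGAI_API_KEY` is set before a client is constructed, so a missing key fails early with a clear message. The name `resolve_api_key` is hypothetical and not part of the SDK.

```python
import os


def resolve_api_key(env_var: str = "SGAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Hypothetical helper for illustration; not part of scrapegraph_py.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return api_key
```

Once the check passes, `Client()` can be created without parameters as described in the note above.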
## 📚 Available Endpoints
### 🤖 SmartScraper
Extract structured data from any webpage or HTML content using AI.
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

# Using a URL
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading and description"
)

# Or using HTML content
html_content = """
<html>
    <body>
        <h1>Company Name</h1>
        <p>We are a technology company focused on AI solutions.</p>
    </body>
</html>
"""

response = client.smartscraper(
    website_html=html_content,
    user_prompt="Extract the company description"
)

print(response)
```
<details>
<summary>Output Schema (Optional)</summary>

```python
from pydantic import BaseModel, Field
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")


class WebsiteData(BaseModel):
    title: str = Field(description="The page title")
    description: str = Field(description="The meta description")


response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the title and description",
    output_schema=WebsiteData
)
```

</details>
<details>
<summary>🍪 Cookies Support</summary>

Use cookies for authentication and session management:

```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

# Define cookies for authentication
cookies = {
    "session_id": "abc123def456",
    "auth_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
    "user_preferences": "dark_mode,usd"
}

response = client.smartscraper(
    website_url="https://example.com/dashboard",
    user_prompt="Extract user profile information",
    cookies=cookies
)
```

**Common Use Cases:**
- **E-commerce sites**: User authentication, shopping cart persistence
- **Social media**: Session management, user preferences
- **Banking/Financial**: Secure authentication, transaction history
- **News sites**: User preferences, subscription content
- **API endpoints**: Authentication tokens, API keys

</details>
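
If the cookies come from a browser session as a single `Cookie` header string, they need to be split into the name/value dictionary that the `cookies` parameter takes in the example above. The sketch below does that with the standard library only. `cookies_from_header` is a hypothetical helper name, not an SDK function, and the header value is made up.

```python
from http.cookies import SimpleCookie


def cookies_from_header(header: str) -> dict[str, str]:
    """Parse a raw Cookie header string into a name -> value dict."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


# Illustrative header value; real values come from your own session.
cookies = cookies_from_header("session_id=abc123def456; user_preferences=dark_mode")
print(cookies)
```

The resulting dictionary can then be passed as `cookies=cookies` to `client.smartscraper`.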
<details>
<summary>🔄 Advanced Features</summary>

**Infinite Scrolling:**
```python
response = client.smartscraper(
    website_url="https://example.com/feed",
    user_prompt="Extract all posts from the feed",
    cookies=cookies,
    number_of_scrolls=10  # Scroll 10 times to load more content
)
```

**Pagination:**
```python
response = client.smartscraper(
    website_url="https://example.com/products",
    user_prompt="Extract all product information",
    cookies=cookies,
    total_pages=5  # Scrape 5 pages
)
```

**Combined with Cookies:**
```python
response = client.smartscraper(
    website_url="https://example.com/dashboard",
    user_prompt="Extract user data from all pages",
    cookies=cookies,
    number_of_scrolls=5,
    total_pages=3
)
```

</details>
### 🔍 SearchScraper
Perform AI-powered web searches with structured results and reference URLs.
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

response = client.searchscraper(
    user_prompt="What is the latest version of Python and its main features?"
)

print(f"Answer: {response['result']}")
print(f"Sources: {response['reference_urls']}")
```
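For readable output, the answer and its sources can be rendered as a numbered list. This is a small sketch that relies only on the `result` and `reference_urls` keys used above. `format_search_response` is a hypothetical helper, not part of the SDK.

```python
def format_search_response(response: dict) -> str:
    """Render the answer followed by a numbered list of its sources."""
    lines = [f"Answer: {response['result']}", "Sources:"]
    for index, url in enumerate(response.get("reference_urls", []), start=1):
        lines.append(f"  {index}. {url}")
    return "\n".join(lines)
```

Calling `print(format_search_response(response))` then replaces the two `print` statements in the example above.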
<details>
<summary>Output Schema (Optional)</summary>

```python
from pydantic import BaseModel, Field
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")


class PythonVersionInfo(BaseModel):
    version: str = Field(description="The latest Python version number")
    release_date: str = Field(description="When this version was released")
    major_features: list[str] = Field(description="List of main features")


response = client.searchscraper(
    user_prompt="What is the latest version of Python and its main features?",
    output_schema=PythonVersionInfo
)
```

</details>
### 📝 Markdownify
Convert any webpage into clean, formatted markdown.
```python
from scrapegraph_py import Client
client = Client(api_key="your-api-key-here")
response = client.markdownify(
    website_url="https://example.com"
)
print(response)
```
### 🕷️ Crawler
Intelligently crawl and extract data from multiple pages with support for both AI extraction and markdown conversion modes.
#### AI Extraction Mode (Default)
Extract structured data from multiple pages using AI:
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

# Define the data schema for extraction
schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "founders": {
            "type": "array",
            "items": {"type": "string"}
        },
        "description": {"type": "string"}
    }
}

response = client.crawl(
    url="https://scrapegraphai.com",
    prompt="extract the company information and founders",
    data_schema=schema,
    depth=2,
    max_pages=5,
    same_domain_only=True
)

# Poll for results (crawl is asynchronous)
crawl_id = response.get("crawl_id")
result = client.get_crawl(crawl_id)
```
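Because the crawl runs asynchronously, a single `get_crawl` call may return before the job has finished. The sketch below wraps `client.get_crawl` in a polling loop with a timeout. It assumes the response is a dict with a `status` key and that `"success"` and `"failed"` mark a finished job. Neither the key nor the values are confirmed by this README, so check them against the API documentation. `wait_for_crawl` is a hypothetical helper, not an SDK function.

```python
import time


def wait_for_crawl(client, crawl_id, interval=5.0, timeout=300.0,
                   done_states=("success", "failed")):
    """Call client.get_crawl(crawl_id) until the job reports a final state.

    Assumption: the response is a dict with a "status" key, and the values
    in done_states mark a finished job. Adjust both to the real API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = client.get_crawl(crawl_id)
        if result.get("status") in done_states:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Crawl {crawl_id} did not finish in {timeout}s")
        time.sleep(interval)  # wait before asking again
```

With this in place, `result = wait_for_crawl(client, crawl_id)` can stand in for the single `client.get_crawl(crawl_id)` call.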
#### Markdown Conversion Mode (Cost-Effective)
Convert pages to clean markdown without AI processing (80% cheaper):
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

response = client.crawl(
    url="https://scrapegraphai.com",
    extraction_mode=False,  # Markdown conversion mode
    depth=2,
    max_pages=5,
    same_domain_only=True,
    sitemap=True  # Use sitemap for better page discovery
)

# Poll for results
crawl_id = response.get("crawl_id")
result = client.get_crawl(crawl_id)

# Access markdown content
for page in result["result"]["pages"]:
    print(f"URL: {page['url']}")
    print(f"Markdown: {page['markdown']}")
    print(f"Metadata: {page['metadata']}")
```
<details>
<summary>🔧 Crawl Parameters</summary>

- **url** (required): Starting URL for the crawl
- **extraction_mode** (default: True):
  - `True` = AI extraction mode (requires prompt and data_schema)
  - `False` = Markdown conversion mode (no AI, 80% cheaper)
- **prompt** (required for AI mode): AI prompt to guide data extraction
- **data_schema** (required for AI mode): JSON schema defining extracted data structure
- **depth** (default: 2): Maximum crawl depth (1-10)
- **max_pages** (default: 2): Maximum pages to crawl (1-100)
- **same_domain_only** (default: True): Only crawl pages from the same domain
- **sitemap** (default: False): Use sitemap for better page discovery
- **cache_website** (default: True): Cache website content
- **batch_size** (optional): Batch size for processing pages (1-10)

**Cost Comparison:**
- AI Extraction Mode: ~10 credits per page
- Markdown Conversion Mode: ~2 credits per page (80% savings!)

</details>
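
To make the cost comparison concrete, the sketch below turns the approximate per-page figures quoted above into an upper-bound estimate for a crawl. The constants mirror the "~10" and "~2" credit figures and are not exact prices. `estimate_crawl_credits` is a hypothetical helper, not part of the SDK.

```python
AI_CREDITS_PER_PAGE = 10       # approximate figure quoted in the README
MARKDOWN_CREDITS_PER_PAGE = 2  # approximate figure quoted in the README


def estimate_crawl_credits(max_pages: int, extraction_mode: bool = True) -> int:
    """Upper-bound credit estimate for a crawl of at most max_pages pages."""
    if not 1 <= max_pages <= 100:
        raise ValueError("max_pages must be between 1 and 100")
    per_page = AI_CREDITS_PER_PAGE if extraction_mode else MARKDOWN_CREDITS_PER_PAGE
    return max_pages * per_page


ai_cost = estimate_crawl_credits(5, extraction_mode=True)
markdown_cost = estimate_crawl_credits(5, extraction_mode=False)
savings = 1 - markdown_cost / ai_cost
print(ai_cost, markdown_cost, f"{savings:.0%}")  # prints: 50 10 80%
```

For a 5-page crawl this gives roughly 50 credits in AI extraction mode against 10 in markdown mode, which is where the 80% figure comes from.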
## ⚡ Async Support
All endpoints support async operations:
```python
import asyncio
from scrapegraph_py import AsyncClient


async def main():
    async with AsyncClient() as client:
        response = await client.smartscraper(
            website_url="https://example.com",
            user_prompt="Extract the main content"
        )
        print(response)


asyncio.run(main())
```
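The async client is most useful when several requests run concurrently. As a sketch, the helper below fans out one `smartscraper` call per URL with `asyncio.gather`. `scrape_many` is a hypothetical name, and it takes the client as an argument so it works with any object that exposes an awaitable `smartscraper` method.

```python
import asyncio


async def scrape_many(client, urls, user_prompt):
    """Run one smartscraper request per URL concurrently.

    `client` is expected to be an AsyncClient (or any object with the same
    awaitable smartscraper method). Results come back in the order of `urls`.
    """
    tasks = [
        client.smartscraper(website_url=url, user_prompt=user_prompt)
        for url in urls
    ]
    return await asyncio.gather(*tasks)
```

Inside `async with AsyncClient() as client:`, call it as `results = await scrape_many(client, urls, "Extract the main content")`. By default `asyncio.gather` raises the first exception it meets, so pass `return_exceptions=True` if a single failing URL should not abort the whole batch.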
## 📖 Documentation
For detailed documentation, visit [docs.scrapegraphai.com](https://docs.scrapegraphai.com)
## 🛠️ Development
For information about setting up the development environment and contributing to the project, see our [Contributing Guide](CONTRIBUTING.md).
## 💬 Support & Feedback
- 📧 Email: support@scrapegraphai.com
- 💻 GitHub Issues: [Create an issue](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues)
- 🌟 Feature Requests: [Request a feature](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues/new)
- ⭐ API Feedback: You can also submit feedback programmatically using the feedback endpoint:
  ```python
  from scrapegraph_py import Client

  client = Client(api_key="your-api-key-here")

  client.submit_feedback(
      request_id="your-request-id",
      rating=5,
      feedback_text="Great results!"
  )
  ```
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🔗 Links
- [Website](https://scrapegraphai.com)
- [Documentation](https://docs.scrapegraphai.com)
- [GitHub](https://github.com/ScrapeGraphAI/scrapegraph-sdk)
---
Made with ❤️ by [ScrapeGraph AI](https://scrapegraphai.com)
Raw data
{
"_id": null,
"home_page": null,
"name": "scrapegraph-py",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.10",
"maintainer_email": null,
"keywords": "ai, api, artificial intelligence, gpt, graph, machine learning, natural language processing, nlp, openai, scraping, sdk, web scraping tool, webscraping",
"author": null,
"author_email": "Marco Vinciguerra <mvincig11@gmail.com>, perinim.98@gmail.com, Lorenzo Padoan <lorenzo.padoan977@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/f0/f3/1303ce8f52aabcaaf3fd4098a35eaacfbca7e2f68d31503dedf4f7b5e301/scrapegraph_py-1.24.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "ScrapeGraph Python SDK for API",
"version": "1.24.0",
"project_urls": null,
"split_keywords": [
"ai",
" api",
" artificial intelligence",
" gpt",
" graph",
" machine learning",
" natural language processing",
" nlp",
" openai",
" scraping",
" sdk",
" web scraping tool",
" webscraping"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "84a61c46ea797066e07877fcf4e4b5d5d123debf0d4ce6caba149a1196f169f8",
"md5": "fa6927b379e15de835f9b3f74f5b0b3f",
"sha256": "27728a71c006ea28eb7aea2b21763994fff034931b1416e86a2218f3d507cf2e"
},
"downloads": -1,
"filename": "scrapegraph_py-1.24.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "fa6927b379e15de835f9b3f74f5b0b3f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.10",
"size": 27060,
"upload_time": "2025-09-03T07:08:31",
"upload_time_iso_8601": "2025-09-03T07:08:31.250678Z",
"url": "https://files.pythonhosted.org/packages/84/a6/1c46ea797066e07877fcf4e4b5d5d123debf0d4ce6caba149a1196f169f8/scrapegraph_py-1.24.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f0f31303ce8f52aabcaaf3fd4098a35eaacfbca7e2f68d31503dedf4f7b5e301",
"md5": "d6d507dc0a69749bf65425062da824aa",
"sha256": "843a73909b06bf7b2d82d4f9968a02d929a85cab96e95ecf3ba5e32e28300717"
},
"downloads": -1,
"filename": "scrapegraph_py-1.24.0.tar.gz",
"has_sig": false,
"md5_digest": "d6d507dc0a69749bf65425062da824aa",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.10",
"size": 210093,
"upload_time": "2025-09-03T07:08:32",
"upload_time_iso_8601": "2025-09-03T07:08:32.463597Z",
"url": "https://files.pythonhosted.org/packages/f0/f3/1303ce8f52aabcaaf3fd4098a35eaacfbca7e2f68d31503dedf4f7b5e301/scrapegraph_py-1.24.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-03 07:08:32",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "scrapegraph-py"
}