# WebCrawler API Python SDK
A Python SDK for interacting with the WebCrawlerAPI.
## To use the API, first get an API key from [WebCrawlerAPI](https://dash.webcrawlerapi.com/access)
## Installation
```bash
pip install webcrawlerapi
```
## Usage
### Crawling
```python
from webcrawlerapi import WebCrawlerAPI

# Initialize the client
crawler = WebCrawlerAPI(api_key="your_api_key")

# Synchronous crawling (blocks until completion)
job = crawler.crawl(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10,
    webhook_url="https://yourserver.com/webhook",
    allow_subdomains=False,
    max_polls=100  # Optional: maximum number of status checks. Use higher for bigger websites
)
print(f"Job completed with status: {job.status}")

# Access job items and their content
for item in job.job_items:
    print(f"Page title: {item.title}")
    print(f"Original URL: {item.original_url}")
    print(f"Item status: {item.status}")

    # Get the content based on the job's scrape_type.
    # Returns None if the item is not in "done" status.
    content = item.content
    if content:
        print(f"Content length: {len(content)}")
        print(f"Content preview: {content[:200]}...")
    else:
        print("Content not available or item not done")

# Access job items and their parent job
for item in job.job_items:
    print(f"Item URL: {item.original_url}")
    print(f"Parent job status: {item.job.status}")
    print(f"Parent job URL: {item.job.url}")

# Or use asynchronous crawling
response = crawler.crawl_async(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10,
    webhook_url="https://yourserver.com/webhook",
    allow_subdomains=False
)

# Get the job ID from the response
job_id = response.id
print(f"Crawling job started with ID: {job_id}")

# Check job status and get results
job = crawler.get_job(job_id)
print(f"Job status: {job.status}")

# Access job details
print(f"Crawled URL: {job.url}")
print(f"Created at: {job.created_at}")
print(f"Number of items: {len(job.job_items)}")

# Cancel a running job if needed
cancel_response = crawler.cancel_job(job_id)
print(f"Cancellation response: {cancel_response['message']}")
```
### Scraping
See working code examples of [scraping](https://github.com/WebCrawlerAPI/webcrawlerapi-examples/tree/master/python/scraping) and [scraping with a prompt](https://github.com/WebCrawlerAPI/webcrawlerapi-examples/tree/master/python/scraping_prompt).
```python
# Returns structured data directly
response = crawler.scrape(
    url="https://webcrawlerapi.com"
)
if response.success:
    print(response.markdown)
else:
    print(f"Code: {response.error_code} Error: {response.error_message}")
```
## API Methods
### crawl()
Starts a new crawling job and waits for its completion. This method continuously polls the job status until:
- The job reaches a terminal state (done, error, or cancelled)
- The maximum number of polls is reached (default: 100)

The polling interval follows the server's `recommended_pull_delay_ms` and falls back to 5 seconds when no recommendation is provided.
### crawl_async()
Starts a new crawling job and returns immediately with a job ID. Use this when you want to handle polling and status checks yourself, or when using webhooks.
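If you handle polling yourself, you can reuse the job's `recommended_pull_delay_ms` between `get_job()` calls. A minimal sketch under the defaults described in this README (the terminal states and 5-second fallback come from the `crawl()` notes above):

```python
import time

from webcrawlerapi import WebCrawlerAPI

crawler = WebCrawlerAPI(api_key="your_api_key")
response = crawler.crawl_async(url="https://example.com", scrape_type="markdown", items_limit=10)

# Poll until the job reaches a terminal state, honoring the server-recommended delay
terminal_states = {"done", "error", "cancelled"}
job = crawler.get_job(response.id)
while job.status not in terminal_states:
    delay_ms = job.recommended_pull_delay_ms or 5000  # fall back to 5 seconds
    time.sleep(delay_ms / 1000)
    job = crawler.get_job(response.id)

print(f"Final status: {job.status}, items: {len(job.job_items)}")
```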
### get_job()
Retrieves the current status and details of a specific job.
### cancel_job()
Cancels a running job. Any items that are not yet in progress or already completed will be marked as cancelled and will not be charged.
### scrape()
Scrapes a single URL and returns the markdown, cleaned, or raw content, along with the page status code and page title.
#### Scrape Params
Read more in [API Docs](https://webcrawlerapi.com/docs/api/scrape)
- `url` (required): The URL to scrape.
- `output_format` (required): The format of the output. Can be "markdown", "cleaned", or "raw" (see the sketch below).
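A short sketch that passes `output_format` explicitly; it assumes markdown output and reuses the client and response fields from the scraping example above:

```python
from webcrawlerapi import WebCrawlerAPI

crawler = WebCrawlerAPI(api_key="your_api_key")

# Explicitly request markdown output; see the API docs for the other formats
response = crawler.scrape(
    url="https://webcrawlerapi.com",
    output_format="markdown",
)
if response.success:
    print(response.markdown[:200])
else:
    print(f"Code: {response.error_code} Error: {response.error_message}")
```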
## Parameters
### Crawl Methods (crawl and crawl_async)
- `url` (required): The seed URL where the crawler starts. Can be any valid URL.
- `scrape_type` (default: "html"): The type of scraping you want to perform. Can be "html", "cleaned", or "markdown".
- `items_limit` (default: 10): The crawler stops once it reaches this limit of pages for the job.
- `webhook_url` (optional): The URL where the server will send a POST request once the task is completed.
- `allow_subdomains` (default: False): If True, the crawler will also crawl subdomains.
- `whitelist_regexp` (optional): A regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
- `blacklist_regexp` (optional): A regular expression to blacklist URLs. URLs that match the pattern will be skipped (see the filtering sketch after this list).
- `max_polls` (optional, crawl only): Maximum number of status checks before returning (default: 100).
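The regexp filters combine naturally with `crawl_async()`. A minimal sketch in which the URL patterns are illustrative assumptions, not values from the docs:

```python
from webcrawlerapi import WebCrawlerAPI

crawler = WebCrawlerAPI(api_key="your_api_key")

# Crawl only blog pages and skip tag listings (illustrative patterns)
response = crawler.crawl_async(
    url="https://example.com/blog",
    scrape_type="markdown",
    items_limit=50,
    whitelist_regexp=r"https://example\.com/blog/.*",
    blacklist_regexp=r".*/tag/.*",
    allow_subdomains=False,
)
print(f"Started filtered crawl: {response.id}")
```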
### Responses
#### CrawlAsync Response
The `crawl_async()` method returns a `CrawlResponse` object with:
- `id`: The unique identifier of the created job
#### Job Response
The Job object contains detailed information about the crawling job:
- `id`: The unique identifier of the job
- `org_id`: Your organization identifier
- `url`: The seed URL where the crawler started
- `status`: The status of the job (new, in_progress, done, error)
- `scrape_type`: The type of scraping performed
- `created_at`: The date when the job was created
- `finished_at`: The date when the job was finished (if completed)
- `webhook_url`: The webhook URL for notifications
- `webhook_status`: The status of the webhook request
- `webhook_error`: Any error message if the webhook request failed
- `job_items`: List of JobItem objects representing crawled pages
- `recommended_pull_delay_ms`: Server-recommended delay between status checks
### JobItem Properties
Each JobItem object represents a crawled page and contains:
- `id`: The unique identifier of the item
- `job_id`: The parent job identifier
- `job`: Reference to the parent Job object
- `original_url`: The URL of the page
- `page_status_code`: The HTTP status code of the page request
- `status`: The status of the item (new, in_progress, done, error)
- `title`: The page title
- `created_at`: The date when the item was created
- `cost`: The cost of the item in $
- `referred_url`: The URL where the page was referred from
- `last_error`: Any error message if the item failed
- `error_code`: The error code if the item failed (if available)
- `content`: The page content based on the job's scrape_type (html, cleaned, or markdown). Returns None if the item's status is not "done" or if content is not available. Content is automatically fetched and cached when accessed (see the sketch after this list).
- `raw_content_url`: URL to the raw content (if available)
- `cleaned_content_url`: URL to the cleaned content (if scrape_type is "cleaned")
- `markdown_content_url`: URL to the markdown content (if scrape_type is "markdown")
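Because `content` is fetched lazily and cached on access, you can iterate over finished items and write their content to disk directly. A minimal sketch, assuming a markdown crawl like the one above (the output paths and file names are illustrative):

```python
from pathlib import Path

from webcrawlerapi import WebCrawlerAPI

crawler = WebCrawlerAPI(api_key="your_api_key")
job = crawler.crawl(url="https://example.com", scrape_type="markdown", items_limit=10)

out_dir = Path("crawl_output")
out_dir.mkdir(exist_ok=True)

for index, item in enumerate(job.job_items):
    # content is None unless the item's status is "done"
    if item.status != "done" or item.content is None:
        print(f"Skipping {item.original_url} (status: {item.status})")
        continue
    # Illustrative file naming; the crawl above requested markdown
    path = out_dir / f"page_{index}.md"
    path.write_text(item.content, encoding="utf-8")
    print(f"Saved {item.original_url} -> {path}")
```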
## Requirements
- Python 3.6+
- requests>=2.25.0
## License
MIT License