# WebCrawler API Python SDK
A Python SDK for interacting with the WebCrawlerAPI.
## To use the API, first get an API key from [WebCrawlerAPI](https://dash.webcrawlerapi.com/access)
## Installation
```bash
pip install webcrawlerapi
```
## Usage
### Crawling
```python
from webcrawlerapi import WebCrawlerAPI

# Initialize the client
crawler = WebCrawlerAPI(api_key="your_api_key")

# Synchronous crawling (blocks until completion)
job = crawler.crawl(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10,
    webhook_url="https://yourserver.com/webhook",
    allow_subdomains=False,
    max_polls=100  # Optional: maximum number of status checks. Use higher for bigger websites
)
print(f"Job completed with status: {job.status}")

# Access job items and their content
for item in job.job_items:
    print(f"Page title: {item.title}")
    print(f"Original URL: {item.original_url}")
    print(f"Item status: {item.status}")

    # Get the content based on the job's scrape_type.
    # Returns None if the item is not in "done" status.
    content = item.content
    if content:
        print(f"Content length: {len(content)}")
        print(f"Content preview: {content[:200]}...")
    else:
        print("Content not available or item not done")

# Access job items and their parent job
for item in job.job_items:
    print(f"Item URL: {item.original_url}")
    print(f"Parent job status: {item.job.status}")
    print(f"Parent job URL: {item.job.url}")

# Or use asynchronous crawling
response = crawler.crawl_async(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10,
    webhook_url="https://yourserver.com/webhook",
    allow_subdomains=False
)

# Get the job ID from the response
job_id = response.id
print(f"Crawling job started with ID: {job_id}")

# Check job status and get results
job = crawler.get_job(job_id)
print(f"Job status: {job.status}")

# Access job details
print(f"Crawled URL: {job.url}")
print(f"Created at: {job.created_at}")
print(f"Number of items: {len(job.job_items)}")

# Cancel a running job if needed
cancel_response = crawler.cancel_job(job_id)
print(f"Cancellation response: {cancel_response['message']}")
```
### Scraping
See working code examples of [scraping](https://github.com/WebCrawlerAPI/webcrawlerapi-examples/tree/master/python/scraping) and [scraping with a prompt](https://github.com/WebCrawlerAPI/webcrawlerapi-examples/tree/master/python/scraping_prompt).
```python
# Returns structured data directly
response = crawler.scrape(
    url="https://webcrawlerapi.com"
)
if response.success:
    print(response.markdown)
else:
    print(f"Code: {response.error_code} Error: {response.error_message}")
```
## API Methods
### crawl()
Starts a new crawling job and waits for its completion. This method continuously polls the job status until:
- The job reaches a terminal state (done, error, or cancelled)
- The maximum number of polls is reached (default: 100)

The polling interval follows the server's `recommended_pull_delay_ms` and falls back to 5 seconds when no recommendation is provided.
### crawl_async()
Starts a new crawling job and returns immediately with a job ID. Use this when you want to handle polling and status checks yourself, or when using webhooks.
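If you handle polling yourself, you can reuse the job's `recommended_pull_delay_ms` between `get_job()` calls. A minimal sketch under the defaults described in this README (the terminal states and 5-second fallback come from the `crawl()` notes above):

```python
import time

from webcrawlerapi import WebCrawlerAPI

crawler = WebCrawlerAPI(api_key="your_api_key")
response = crawler.crawl_async(url="https://example.com", scrape_type="markdown", items_limit=10)

# Poll until the job reaches a terminal state, honoring the server-recommended delay
terminal_states = {"done", "error", "cancelled"}
job = crawler.get_job(response.id)
while job.status not in terminal_states:
    delay_ms = job.recommended_pull_delay_ms or 5000  # fall back to 5 seconds
    time.sleep(delay_ms / 1000)
    job = crawler.get_job(response.id)

print(f"Final status: {job.status}, items: {len(job.job_items)}")
```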
### get_job()
Retrieves the current status and details of a specific job.
### cancel_job()
Cancels a running job. Any items that are not yet in progress or already completed will be marked as cancelled and will not be charged.
### scrape()
Scrapes a single URL and returns the markdown, cleaned, or raw content, along with the page status code and page title.
#### Scrape Params
Read more in [API Docs](https://webcrawlerapi.com/docs/api/scrape)
- `url` (required): The URL to scrape.
- `output_format` (required): The format of the output. Can be "markdown", "cleaned", or "raw" (see the sketch below).
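A short sketch that passes `output_format` explicitly; it assumes markdown output and reuses the client and response fields from the scraping example above:

```python
from webcrawlerapi import WebCrawlerAPI

crawler = WebCrawlerAPI(api_key="your_api_key")

# Explicitly request markdown output; see the API docs for the other formats
response = crawler.scrape(
    url="https://webcrawlerapi.com",
    output_format="markdown",
)
if response.success:
    print(response.markdown[:200])
else:
    print(f"Code: {response.error_code} Error: {response.error_message}")
```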
## Parameters
### Crawl Methods (crawl and crawl_async)
- `url` (required): The seed URL where the crawler starts. Can be any valid URL.
- `scrape_type` (default: "html"): The type of scraping you want to perform. Can be "html", "cleaned", or "markdown".
- `items_limit` (default: 10): The crawler stops once it reaches this limit of pages for the job.
- `webhook_url` (optional): The URL where the server will send a POST request once the task is completed.
- `allow_subdomains` (default: False): If True, the crawler will also crawl subdomains.
- `whitelist_regexp` (optional): A regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
- `blacklist_regexp` (optional): A regular expression to blacklist URLs. URLs that match the pattern will be skipped (see the filtering sketch after this list).
- `max_polls` (optional, crawl only): Maximum number of status checks before returning (default: 100).
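The regexp filters combine naturally with `crawl_async()`. A minimal sketch in which the URL patterns are illustrative assumptions, not values from the docs:

```python
from webcrawlerapi import WebCrawlerAPI

crawler = WebCrawlerAPI(api_key="your_api_key")

# Crawl only blog pages and skip tag listings (illustrative patterns)
response = crawler.crawl_async(
    url="https://example.com/blog",
    scrape_type="markdown",
    items_limit=50,
    whitelist_regexp=r"https://example\.com/blog/.*",
    blacklist_regexp=r".*/tag/.*",
    allow_subdomains=False,
)
print(f"Started filtered crawl: {response.id}")
```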
### Responses
#### CrawlAsync Response
The `crawl_async()` method returns a `CrawlResponse` object with:
- `id`: The unique identifier of the created job
#### Job Response
The Job object contains detailed information about the crawling job:
- `id`: The unique identifier of the job
- `org_id`: Your organization identifier
- `url`: The seed URL where the crawler started
- `status`: The status of the job (new, in_progress, done, error)
- `scrape_type`: The type of scraping performed
- `created_at`: The date when the job was created
- `finished_at`: The date when the job was finished (if completed)
- `webhook_url`: The webhook URL for notifications
- `webhook_status`: The status of the webhook request
- `webhook_error`: Any error message if the webhook request failed
- `job_items`: List of JobItem objects representing crawled pages
- `recommended_pull_delay_ms`: Server-recommended delay between status checks
### JobItem Properties
Each JobItem object represents a crawled page and contains:
- `id`: The unique identifier of the item
- `job_id`: The parent job identifier
- `job`: Reference to the parent Job object
- `original_url`: The URL of the page
- `page_status_code`: The HTTP status code of the page request
- `status`: The status of the item (new, in_progress, done, error)
- `title`: The page title
- `created_at`: The date when the item was created
- `cost`: The cost of the item in $
- `referred_url`: The URL where the page was referred from
- `last_error`: Any error message if the item failed
- `error_code`: The error code if the item failed (if available)
- `content`: The page content based on the job's scrape_type (html, cleaned, or markdown). Returns None if the item's status is not "done" or if content is not available. Content is automatically fetched and cached when accessed (see the sketch after this list).
- `raw_content_url`: URL to the raw content (if available)
- `cleaned_content_url`: URL to the cleaned content (if scrape_type is "cleaned")
- `markdown_content_url`: URL to the markdown content (if scrape_type is "markdown")
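Because `content` is fetched lazily and cached on access, you can iterate over finished items and write their content to disk directly. A minimal sketch, assuming a markdown crawl like the one above (the output paths and file names are illustrative):

```python
from pathlib import Path

from webcrawlerapi import WebCrawlerAPI

crawler = WebCrawlerAPI(api_key="your_api_key")
job = crawler.crawl(url="https://example.com", scrape_type="markdown", items_limit=10)

out_dir = Path("crawl_output")
out_dir.mkdir(exist_ok=True)

for index, item in enumerate(job.job_items):
    # content is None unless the item's status is "done"
    if item.status != "done" or item.content is None:
        print(f"Skipping {item.original_url} (status: {item.status})")
        continue
    # Illustrative file naming; the crawl above requested markdown
    path = out_dir / f"page_{index}.md"
    path.write_text(item.content, encoding="utf-8")
    print(f"Saved {item.original_url} -> {path}")
```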
## Requirements
- Python 3.6+
- requests>=2.25.0
## License
MIT License