# Spider Cloud Python SDK
The Spider Cloud Python SDK offers a toolkit for straightforward website scraping, crawling at scale, and other utilities like extracting links and taking screenshots, enabling you to collect data formatted for compatibility with language models (LLMs). It features a user-friendly interface for seamless integration with the Spider Cloud API.
## Installation
To install the Spider Cloud Python SDK, you can use pip:
```bash
pip install spider-client
```
## Usage
1. Get an API key from [spider.cloud](https://spider.cloud)
2. Set the API key as an environment variable named `SPIDER_API_KEY` or pass it as a parameter to the `Spider` class.
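For step 2, setting the environment variable in a POSIX shell looks like this (the key value is a placeholder):

```bash
export SPIDER_API_KEY=your_api_key
```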
Here's an example of how to use the SDK:
```python
from spider import Spider
# Initialize the Spider with your API key
app = Spider(api_key='your_api_key')
# Scrape a single URL
url = 'https://spider.cloud'
scraped_data = app.scrape_url(url)
# Crawl a website
crawler_params = {
'limit': 1,
'proxy_enabled': True,
'store_data': False,
'metadata': False,
'request': 'http'
}
crawl_result = app.crawl_url(url, params=crawler_params)
```
### Scraping a URL
To scrape data from a single URL:
```python
url = 'https://example.com'
scraped_data = app.scrape_url(url)
```
### Crawling a Website
To automate crawling a website:
```python
url = 'https://example.com'
crawl_params = {
'limit': 200,
'request': 'smart_mode'
}
crawl_result = app.crawl_url(url, params=crawl_params)
```
#### Crawl Streaming
Stream the crawl results in chunks to handle websites at scale:
```python
def handle_json(json_obj: dict) -> None:
assert json_obj["url"] is not None
url = 'https://example.com'
crawl_params = {
'limit': 200,
'store_data': False
}
response = app.crawl_url(
url,
params=crawl_params,
stream=True,
callback=handle_json,
)
```
### Search
Perform a search for websites to crawl or gather search results:
```python
query = 'a sports website'
crawl_params = {
'request': 'smart_mode',
'search_limit': 5,
'limit': 5,
'fetch_page_content': True
}
crawl_result = app.search(query, params=crawl_params)
```
### Retrieving Links from a URL
Extract all links from a specified URL:
```python
url = 'https://example.com'
links = app.links(url)
```
### Transform
Transform HTML to markdown or text lightning fast:
```python
data = [ { 'html': '<html><body><h1>Hello world</h1></body></html>' } ]
params = {
'readability': False,
'return_format': 'markdown',
}
result = app.transform(data, params=params)
```
### Taking Screenshots of a URL
Capture a screenshot of a given URL:
```python
url = 'https://example.com'
screenshot = app.screenshot(url)
```
### Extracting Contact Information
Extract contact details from a specified URL:
```python
url = 'https://example.com'
contacts = app.extract_contacts(url)
```
### Labeling Data from a URL
Label the data extracted from a particular URL:
```python
url = 'https://example.com'
labeled_data = app.label(url)
```
### Checking Crawl State
You can check the crawl state of the website:
```python
url = 'https://example.com'
state = app.get_crawl_state(url)
```
### Downloading Files
You can download the stored results for a website by creating a signed URL:
```python
url = 'https://example.com'
params = {
'page': 0,
'limit': 100,
'expiresIn': 3600 # Optional, add if needed
}
stream = True
state = app.create_signed_url(url, params, stream)
```
### Checking Available Credits
You can check the remaining credits on your account:
```python
credits = app.get_credits()
```
### Data Operations
The Spider client can interact with specific data tables to retrieve and delete stored data.
#### Retrieve Data from a Table
To fetch data from a specified table by applying query parameters:
```python
table_name = 'pages'
query_params = {'limit': 20 }
response = app.data_get(table_name, query_params)
print(response)
```
#### Delete Data from a Table
To delete data from a specified table based on certain conditions:
```python
table_name = 'websites'
delete_params = {'domain': 'www.example.com'}
response = app.data_delete(table_name, delete_params)
print(response)
```
## Streaming
If you need to stream the response, pass `True` as the third parameter:
```python
url = 'https://example.com'
crawler_params = {
'limit': 1,
'proxy_enabled': True,
'store_data': False,
'metadata': False,
'request': 'http'
}
links = app.links(url, crawler_params, True)
```
## Content-Type
The following `Content-Type` headers are supported via the fourth parameter:
1. `application/json`
1. `text/csv`
1. `application/xml`
1. `application/jsonl`
```python
url = 'https://example.com'
crawler_params = {
'limit': 1,
'proxy_enabled': True,
'store_data': False,
'metadata': False,
'request': 'http'
}
# Stream JSON lines back to the client
links = app.crawl_url(url, crawler_params, True, "application/jsonl")
```
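If you consume an `application/jsonl` response yourself, each line is an independent JSON object. A minimal stdlib-only sketch of parsing such a payload (the sample lines below are illustrative, not real API output):

```python
import json

# Illustrative JSONL payload; a real response carries full crawl results.
payload = (
    '{"url": "https://example.com", "status": 200}\n'
    '{"url": "https://example.com/about", "status": 200}\n'
)

# Parse one JSON object per non-empty line.
records = [json.loads(line) for line in payload.splitlines() if line.strip()]
for record in records:
    print(record["url"], record["status"])
```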
## Error Handling
The SDK handles errors returned by the Spider Cloud API and raises appropriate exceptions. If an error occurs during a request, an exception will be raised with a descriptive error message.
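As a defensive pattern around any SDK call, you can catch the raised exception and retry transient failures. The sketch below is generic and uses `RuntimeError` as a stand-in; check the SDK source for the actual exception classes it raises:

```python
import time

def call_with_retry(fn, retries=3, backoff=0.0):
    """Invoke fn(), retrying up to `retries` times on RuntimeError."""
    last_exc = None
    for attempt in range(retries):
        try:
            return fn()
        except RuntimeError as exc:
            last_exc = exc
            time.sleep(backoff * attempt)
    raise last_exc

# Demo with a flaky stand-in for an API call: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(call_with_retry(flaky))  # prints "ok" after two retries
```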
## Contributing
Contributions to the Spider Cloud Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
## License
The Spider Cloud Python SDK is open-source and released under the [MIT License](https://opensource.org/licenses/MIT).