proxycrawl 3.2.2

- Home page: https://github.com/proxycrawl/proxycrawl-python
- Summary: A Python class that acts as a wrapper for the ProxyCrawl scraping and crawling API
- Upload time: 2023-07-04 13:49:23
- Author: ProxyCrawl
- License: Apache-2.0
- Keywords: scraping, scraper, crawler, crawling, proxycrawl, api
- Requirements: none recorded
# DEPRECATION NOTICE

> :warning: **IMPORTANT:** This package is no longer maintained or supported. For the latest updates, please use our new package at [crawlbase-python](https://github.com/crawlbase-source/crawlbase-python).

---

# ProxyCrawl API Python class

A lightweight, dependency-free Python class that acts as a wrapper for the ProxyCrawl API.

## Installing

Choose a way of installing:

- Download the Python class from GitHub.
- Or use the [PyPI](https://pypi.org/project/proxycrawl/) package manager: `pip install proxycrawl`

Then import `CrawlingAPI`, `ScraperAPI`, etc. as needed.

```python
from proxycrawl import CrawlingAPI, ScraperAPI, LeadsAPI, ScreenshotsAPI, StorageAPI
```

### Upgrading to version 3

Version 3 deprecates `ProxyCrawlAPI` in favour of `CrawlingAPI` (although the old class is still usable). Please test the upgrade before deploying to production.
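
For example, the old and new initializations look like this (a minimal sketch; per the note above, `ProxyCrawlAPI` still works in version 3, but new code should use `CrawlingAPI`):

```python
# Before (deprecated in version 3):
from proxycrawl import ProxyCrawlAPI
api = ProxyCrawlAPI({ 'token': 'YOUR_PROXYCRAWL_TOKEN' })

# After (version 3):
from proxycrawl import CrawlingAPI
api = CrawlingAPI({ 'token': 'YOUR_PROXYCRAWL_TOKEN' })
```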

## Crawling API

First initialize the CrawlingAPI class.

```python
api = CrawlingAPI({ 'token': 'YOUR_PROXYCRAWL_TOKEN' })
```

### GET requests

Pass the URL that you want to scrape, plus any of the options available in the [API documentation](https://proxycrawl.com/docs).

```python
api.get(url, options = {})
```

Example:

```python
response = api.get('https://www.facebook.com/britneyspears')
if response['status_code'] == 200:
    print(response['body'])
```

You can pass any options from the ProxyCrawl API.

Example:

```python
response = api.get('https://www.reddit.com/r/pics/comments/5bx4bx/thanks_obama/', {
    'user_agent': 'Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/30.0',
    'format': 'json'
})
if response['status_code'] == 200:
    print(response['body'])
```

### POST requests

Pass the URL that you want to scrape and the data that you want to send (either a dictionary or a string), plus any of the options available in the [API documentation](https://proxycrawl.com/docs).

```python
api.post(url, data, options = {})  # data may be a dict or a string
```

Example:

```python
response = api.post('https://producthunt.com/search', { 'text': 'example search' })
if response['status_code'] == 200:
    print(response['body'])
```

You can send the data as `application/json` instead of `x-www-form-urlencoded` by setting the `post_content_type` option to `json`.

```python
import json
response = api.post('https://httpbin.org/post', json.dumps({ 'some_json': 'with some value' }), { 'post_content_type': 'json' })
if response['status_code'] == 200:
    print(response['body'])
```

### Javascript requests

If you need to scrape websites built with JavaScript (React, Angular, Vue, etc.), just pass your JavaScript token and use the same calls. Note that only `.get` is available for JavaScript requests, not `.post`.

```python
api = CrawlingAPI({ 'token': 'YOUR_JAVASCRIPT_TOKEN' })
```

```python
response = api.get('https://www.nfl.com')
if response['status_code'] == 200:
    print(response['body'])
```

In the same way, you can pass additional JavaScript options.

```python
response = api.get('https://www.freelancer.com', { 'page_wait': 5000 })
if response['status_code'] == 200:
    print(response['body'])
```

## Original status

You can always get the original status and the ProxyCrawl status from the response. Read the [ProxyCrawl documentation](https://proxycrawl.com/docs) to learn more about those statuses.

```python
response = api.get('https://craigslist.com')
print(response['headers']['original_status'])
print(response['headers']['pc_status'])
```
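
For instance, a minimal retry loop built on `status_code` might look like this (a sketch under the assumption that transient failures are worth retrying; `get_with_retries` is not part of the library):

```python
import time

def get_with_retries(api, url, retries=3, delay=2):
    # Hypothetical helper (not part of the library): re-request a URL
    # a few times when the crawl does not succeed on the first try.
    response = api.get(url)
    for _ in range(retries):
        if response['status_code'] == 200:
            break
        time.sleep(delay)  # simple fixed backoff; the policy is an assumption
        response = api.get(url)
    return response
```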

If you have questions or need help using the library, please open an issue or [contact us](https://proxycrawl.com/contact).

## Scraper API

The usage of the Scraper API is very similar; just change the class name when initializing.

```python
scraper_api = ScraperAPI({ 'token': 'YOUR_NORMAL_TOKEN' })

response = scraper_api.get('https://www.amazon.com/DualSense-Wireless-Controller-PlayStation-5/dp/B08FC6C75Y/')
if response['status_code'] == 200:
    print(response['json']['name']) # Will print the name of the Amazon product
```

## Leads API

To find email leads you can use the Leads API; check the full [API documentation](https://proxycrawl.com/docs/leads-api/) if needed.

```python
leads_api = LeadsAPI({ 'token': 'YOUR_NORMAL_TOKEN' })

response = leads_api.get_from_domain('microsoft.com')

if response['status_code'] == 200:
    print(response['json']['leads'])
```

## Screenshots API

Initialize with your Screenshots API token and call the `get` method.

```python
screenshots_api = ScreenshotsAPI({ 'token': 'YOUR_NORMAL_TOKEN' })
response = screenshots_api.get('https://www.apple.com')
if response['status_code'] == 200:
    print(response['headers']['success'])
    print(response['headers']['url'])
    print(response['headers']['remaining_requests'])
    print(response['file'])
```

Or specify a file path:

```python
screenshots_api = ScreenshotsAPI({ 'token': 'YOUR_NORMAL_TOKEN' })
response = screenshots_api.get('https://www.apple.com', { 'save_to_path': 'apple.jpg' })
if response['status_code'] == 200:
    print(response['headers']['success'])
    print(response['headers']['url'])
    print(response['headers']['remaining_requests'])
    print(response['file'])
```

Or, if you set `store=true`, then `screenshot_url` is set in the returned headers:

```python
screenshots_api = ScreenshotsAPI({ 'token': 'YOUR_NORMAL_TOKEN' })
response = screenshots_api.get('https://www.apple.com', { 'store': 'true' })
if response['status_code'] == 200:
    print(response['headers']['success'])
    print(response['headers']['url'])
    print(response['headers']['remaining_requests'])
    print(response['file'])
    print(response['headers']['screenshot_url'])
```

Note that the `screenshots_api.get(url, options)` method accepts an [options](https://proxycrawl.com/docs/screenshots-api/parameters) dictionary.

## Storage API

Initialize the Storage API using your private token.

```python
storage_api = StorageAPI({ 'token': 'YOUR_NORMAL_TOKEN' })
```

Pass the [URL](https://proxycrawl.com/docs/storage-api/parameters/#url) that you want to retrieve from [ProxyCrawl Storage](https://proxycrawl.com/dashboard/storage).

```python
response = storage_api.get('https://www.apple.com')
if response['status_code'] == 200:
    print(response['headers']['original_status'])
    print(response['headers']['pc_status'])
    print(response['headers']['url'])
    print(response['headers']['rid'])
    print(response['headers']['stored_at'])
    print(response['body'])
```

Or you can use the [RID](https://proxycrawl.com/docs/storage-api/parameters/#rid):

```python
response = storage_api.get('RID_REPLACE')
if response['status_code'] == 200:
    print(response['headers']['original_status'])
    print(response['headers']['pc_status'])
    print(response['headers']['url'])
    print(response['headers']['rid'])
    print(response['headers']['stored_at'])
    print(response['body'])
```

Note: either the RID or the URL must be sent; each is optional on its own, but you must send one of the two.
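
As a small illustration of that constraint, a hypothetical wrapper could accept either identifier and enforce that exactly one is given (`fetch_stored` is not part of the library):

```python
def fetch_stored(storage_api, url=None, rid=None):
    # Hypothetical helper (not part of the library): the Storage API
    # accepts either a URL or an RID, but exactly one must be given.
    if (url is None) == (rid is None):
        raise ValueError('pass exactly one of url or rid')
    return storage_api.get(url if url is not None else rid)
```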

### [Delete](https://proxycrawl.com/docs/storage-api/delete/) request

To delete a storage item from your storage area, pass its RID:

```python
if storage_api.delete('RID_REPLACE'):
  print('delete success')
else:
  print('Unable to delete')
```

### [Bulk](https://proxycrawl.com/docs/storage-api/bulk/) request

To make a bulk request with a list of RIDs, send the RIDs as a list:

```python
response = storage_api.bulk(['RID1', 'RID2', 'RID3', ...])
if response['status_code'] == 200:
    for item in response['json']:
        print(item['original_status'])
        print(item['pc_status'])
        print(item['url'])
        print(item['rid'])
        print(item['stored_at'])
        print(item['body'])
```

### [RIDs](https://proxycrawl.com/docs/storage-api/rids) request

To request a bulk list of RIDs from your storage area:

```python
rids = storage_api.rids()
print(rids)
```

You can also specify a limit as a parameter:

```python
storage_api.rids(100)
```

### [Total Count](https://proxycrawl.com/docs/storage-api/total_count)

To get the total number of documents in your storage area:

```python
total_count = storage_api.totalCount()
print(total_count)
```

## Custom timeout

If you need a custom timeout, pass it when creating the class instance:

```python
api = CrawlingAPI({ 'token': 'TOKEN', 'timeout': 120 })
```

The timeout is in seconds.
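
As a combined, purely illustrative example, a longer timeout pairs naturally with the `page_wait` option shown earlier for JavaScript-rendered pages, since the client must be willing to wait at least as long as the page takes to render (the values below are assumptions, not recommendations):

```python
# Illustrative values only: pair a longer HTTP timeout with a page_wait
# so the client does not give up while the page is still rendering.
api = CrawlingAPI({ 'token': 'YOUR_JAVASCRIPT_TOKEN', 'timeout': 120 })
response = api.get('https://www.freelancer.com', { 'page_wait': 10000 })
if response['status_code'] == 200:
    print(response['body'])
```

Since `page_wait` is expressed in milliseconds and `timeout` in seconds, make sure the timeout comfortably exceeds any wait you request.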

---

Copyright 2023 ProxyCrawl