async-scrape


- Name: async-scrape
- Version: 0.1.20
- Home page: https://github.com/cia05rf/async-scrape/
- Summary: A package designed to scrape webpages using aiohttp and asyncio. Has some error handling to overcome common issues such as sites blocking you after n requests over a short period.
- Upload time: 2024-12-08 21:29:56
- Maintainer: Robert Franklin
- Author: Robert Franklin
- Requires Python: `<4.0,>=3.11`
- License: MIT
- Keywords: scraping, async, requests
- Requirements: No requirements were recorded.

# Async-scrape
## _Perform web scraping asynchronously_

Async-scrape is a package which uses asyncio and aiohttp to scrape websites and has useful features built in.

## Features

- Breaks - pause scraping when a website blocks your requests consistently
- Rate limit - slow down scraping to prevent being blocked
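
The rate-limit idea can be illustrated independently of the package. The following is a minimal sketch in plain asyncio, not Async-scrape's own implementation: `MinIntervalLimiter` and `fetch_all` are hypothetical names, and the HTTP request is replaced by a timestamp so the example runs without network access.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Space calls at least `interval` seconds apart (illustrative only)."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0

    async def wait(self):
        # The lock serialises callers so each one gets its own time slot.
        async with self._lock:
            delay = self._next_allowed - time.monotonic()
            if delay > 0:
                await asyncio.sleep(delay)
            self._next_allowed = time.monotonic() + self.interval


async def fetch_all(urls, interval):
    # Create the limiter inside the running event loop.
    limiter = MinIntervalLimiter(interval)

    async def fetch(url):
        await limiter.wait()
        # A real scraper would issue the HTTP request here.
        return url, time.monotonic()

    return await asyncio.gather(*(fetch(u) for u in urls))


# Each entry is (url, time at which the request was allowed to start).
results = asyncio.run(fetch_all(["a", "b", "c"], interval=0.05))
```

All three coroutines are scheduled at once, but the limiter releases them roughly 0.05 seconds apart, which is the behaviour a rate limit is meant to give.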


## Installation

Async-scrape requires [C++ Build tools](https://go.microsoft.com/fwlink/?LinkId=691126) v15+ to run.


```
pip install async-scrape
```

## How to use it
Key input parameters:
- `post_process_func` - the callable used to process the returned response
- `post_process_kwargs` - any kwargs to be passed to the callable
- `use_proxy` - whether a proxy should be used (if `True`, provide either a `proxy` or a `pac_url`)
- `attempt_limit` - how many attempts each request is given before it is marked as failed
- `rest_wait` - how long the programme pauses between loops
- `call_rate_limit` - limits the rate of requests (useful to avoid being blocked by websites)
- `randomise_headers` - if set to `True`, a new set of headers is generated between requests
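
As an illustration of how `post_process_func` and `post_process_kwargs` might work together, here is a small sketch of a callable that extracts the page title. It assumes `html` is the response body as a string and that only `resp.status` is needed; the `default` keyword and the `extract_title` name are hypothetical, and the response object is replaced by a stand-in so the function can be tried without making a request.

```python
import re
from types import SimpleNamespace


def extract_title(html, resp, **kwargs):
    """Return the page title on success, or a fallback value otherwise."""
    # `default` is a hypothetical kwarg, supplied via post_process_kwargs.
    default = kwargs.get("default")
    if resp.status != 200:
        return default
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else default


# Stand-in for the response object; only `.status` is used here.
fake_resp = SimpleNamespace(status=200)
title = extract_title("<html><title> Example </title></html>", fake_resp, default="n/a")
print(title)  # prints "Example"
```

Under these assumptions the function would be passed as `post_process_func=extract_title` together with `post_process_kwargs={"default": "n/a"}`.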

### Get requests
```
# Create an instance
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://www.google.com",
    "https://www.bing.com",
]

resps = async_Scrape.scrape_all(urls)
```

### Post requests
```
# Create an instance
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://eos1jv6curljagq.m.pipedream.net",
    "https://eos1jv6curljagq.m.pipedream.net",
]
payloads = [
    {"value": 0},
    {"value": 1}
]

resps = async_Scrape.scrape_all(urls, payloads=payloads)
```

### Response
The response is a list of dicts, one per request, in the format:
```
{
    "url":url, # url of request
    "req":req, # combination of url and params
    "func_resp":func_resp, # response from post processing function
    "status":resp.status, # http status
    "error":None # any error encountered
}
```
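
Because each result carries both `status` and `error`, the list can be split into successes and failures after a run. The sketch below is one way to do that; `split_results` is a hypothetical helper, and the sample data is hand-written in the documented format rather than real output (in particular, the values stored in `error` and in `status` for a failed request are assumptions).

```python
def split_results(resps):
    """Separate successful results from failed ones."""
    succeeded, failed = [], []
    for r in resps:
        if r["error"] is None and r["status"] == 200:
            succeeded.append(r)
        else:
            failed.append(r)
    return succeeded, failed


# Hand-written sample in the documented format (not real output).
sample = [
    {"url": "https://www.google.com", "req": "https://www.google.com",
     "func_resp": "Request worked", "status": 200, "error": None},
    {"url": "https://www.bing.com", "req": "https://www.bing.com",
     "func_resp": None, "status": None, "error": "TimeoutError"},
]

succeeded, failed = split_results(sample)
# URLs that could be passed to another scrape_all call.
retry_urls = [r["url"] for r in failed]
```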


## License

MIT

**Free Software, Hell Yeah!**


Raw data

            {
    "_id": null,
    "home_page": "https://github.com/cia05rf/async-scrape/",
    "name": "async-scrape",
    "maintainer": "Robert Franklin",
    "docs_url": null,
    "requires_python": "<4.0,>=3.11",
    "maintainer_email": "cia05rf@gmail.com",
    "keywords": "scraping, async, requests",
    "author": "Robert Franklin",
    "author_email": "cia05rf@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/43/96/b29c75a6b5d0367a232a71661191aa8a93d086355ceec4778f85c864b599/async_scrape-0.1.20.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A package designed to scrape webpages using aiohttp and asyncio. Has some error handling to overcome common issues such as sites blocking you after n requests over a short period.",
    "version": "0.1.20",
    "project_urls": {
        "Documentation": "https://github.com/cia05rf/async-scrape/",
        "Homepage": "https://github.com/cia05rf/async-scrape/",
        "Repository": "https://github.com/cia05rf/async-scrape/"
    },
    "split_keywords": [
        "scraping",
        " async",
        " requests"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bbda3d93a7c1fd5211495dc05edc750d04e97bb8c21fc80a645fbb7f594825c4",
                "md5": "85ce80487a95338332c45c7759c73456",
                "sha256": "36321206ce61656b0ee3d678b084d68b50040de4e04d8d34acc7b1d17271628c"
            },
            "downloads": -1,
            "filename": "async_scrape-0.1.20-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "85ce80487a95338332c45c7759c73456",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.11",
            "size": 16649,
            "upload_time": "2024-12-08T21:29:54",
            "upload_time_iso_8601": "2024-12-08T21:29:54.459332Z",
            "url": "https://files.pythonhosted.org/packages/bb/da/3d93a7c1fd5211495dc05edc750d04e97bb8c21fc80a645fbb7f594825c4/async_scrape-0.1.20-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4396b29c75a6b5d0367a232a71661191aa8a93d086355ceec4778f85c864b599",
                "md5": "999b95590b23419646f672cb53dd6fc7",
                "sha256": "f46a478983a7edc6e49259336366721dc1516266d692e66ff8d7c387d19e16c9"
            },
            "downloads": -1,
            "filename": "async_scrape-0.1.20.tar.gz",
            "has_sig": false,
            "md5_digest": "999b95590b23419646f672cb53dd6fc7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.11",
            "size": 12828,
            "upload_time": "2024-12-08T21:29:56",
            "upload_time_iso_8601": "2024-12-08T21:29:56.168441Z",
            "url": "https://files.pythonhosted.org/packages/43/96/b29c75a6b5d0367a232a71661191aa8a93d086355ceec4778f85c864b599/async_scrape-0.1.20.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-08 21:29:56",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "cia05rf",
    "github_project": "async-scrape",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "async-scrape"
}
        