webrider-async


Name: webrider-async
Version: 0.0.7
Home page: https://github.com/bogdan-sikorsky/webrider
Summary: A simple manager for async requests
Upload time: 2024-09-20 09:10:56
Author: Bogdan Sikorsky
Requires Python: >=3.8
License: None
Keywords: python, async, scraping, requests, aiohttp, asyncio
# WebRiderAsync

**[PyPI](https://pypi.org/project/webrider-async/) |
[GitHub](https://github.com/bogdan-sikorsky/webrider_async) |
[Docs](https://github.com/bogdan-sikorsky/webrider_async/blob/main/README.md) |
[Examples](https://github.com/bogdan-sikorsky/webrider_async/blob/main/examples/scraper.py) |
[Contacts](https://bogdanko.live/Contacts)**

---

WebRiderAsync is an asynchronous utility designed for simple, highly tunable handling of large volumes of web requests.

It leverages Python's `aiohttp` for asynchronous HTTP requests, making it capable of achieving high performance by processing multiple requests in parallel. This utility could be useful both for working with APIs and web scraping.

### Key Features:

- **Simple Setup**: Unlike complex frameworks like Scrapy, WebRiderAsync requires no in-depth knowledge of asynchronous programming or framework-specific structures. All settings are handled via class initialization, offering flexibility with minimal overhead.

- **Asynchronous by Design**: Designed to process multiple requests in parallel, WebRiderAsync leverages Python’s `asyncio` and `aiohttp` to maximize performance without requiring users to write asynchronous code themselves.

- **User-Friendly**: There’s no need to understand `asyncio` or `aiohttp`. Simply pass a list of URLs to the `request()` function, and WebRiderAsync will handle the rest.

### Why WebRiderAsync?

Compared to frameworks like Scrapy, WebRiderAsync is straightforward and ideal for users who want the power of asynchronous requests without the need for a deep dive into project structures or complex configurations. It’s perfect for both beginners and advanced users who need rapid, customizable scraping or API requests.

### Capabilities
- Asynchronous requests for high performance
- Customizable user agents and proxies
- Retry policy for handling failed requests
- Logging support with customizable log levels and file output
- Configurable concurrency and delay settings
- Statistics tracking and reporting

## Installation

To use WebRiderAsync, you need Python 3.8 or higher installed. Install the package using `pip`:

```shell
pip install webrider-async
```

Check out the [PyPI page](https://pypi.org/project/webrider-async/) for the latest version and updates.

## Usage

> A full working usage example can be found in the [examples folder](https://github.com/bogdan-sikorsky/webrider_async/tree/main/examples).

Here's a basic example of how to use the WebRiderAsync class:

### Initialization

```python
from webrider_async import WebRiderAsync

# Create an instance of WebRiderAsync
webrider = WebRiderAsync(
    log_level='debug',                  # Logging level: 'debug', 'info', 'warning', 'error'
    file_output=True,                   # Save logs to a file
    random_user_agents=True,            # Use random user agents
    concurrent_requests=20,             # Number of concurrent requests
    max_retries=3,                      # Maximum number of retries per request
    delay_before_retry=2                # Delay before retrying a request (in seconds)
)
```

### Making Requests

```python
urls = ['https://example.com/page1', 'https://example.com/page2']

# Perform requests
responses = webrider.request(urls)

# Process responses
for response in responses:
    print(response.url, response.status_code)
    print(response.html[:100])  # Print the first 100 characters of the HTML
```

### Updating Settings

```python
webrider.update_settings(
    log_level='info',
    file_output=False,
    random_user_agents=False,
    custom_user_agent='MyCustomUserAgent',
    concurrent_requests=10,
    max_retries=5
)
```

> A full working usage example can be found in the [examples folder](https://github.com/bogdan-sikorsky/webrider_async/tree/main/examples).

### Tracking Statistics

```python
# Print current statistics
webrider.stats()

# Reset statistics
webrider.reset_stats()
```

## Best practices

### General scraping advice

Nowadays it is almost impossible to find a website that responds without a `User-Agent` header. Don't forget to specify your own using `custom_user_agent`, or simply set `random_user_agents=True`.

You can specify `custom_headers` if the request requires them.

Use proxies if the website is blocking you.

Remember that WebRiderAsync does not handle JavaScript.
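
As a rough sketch of how these settings fit together — the parameter names are the ones documented in the Parameters section below, but the user-agent string, header values and proxy URL format are illustrative assumptions, not values the library prescribes:

```python
from webrider_async import WebRiderAsync

# Sketch only: identify yourself and route traffic through a proxy.
# The proxy URL format here is an assumption; check the examples folder
# for the format the library actually expects.
webrider = WebRiderAsync(
    custom_user_agent='MyScraper/1.0 (+https://example.com/contact)',
    custom_headers={'Accept-Language': 'en-US,en;q=0.9'},
    custom_proxies='http://user:pass@proxy.example.com:8080',
)

responses = webrider.request(['https://example.com/page1'])
```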

### Speed and parallel requests

Nothing stops you from passing a list of 1,000 URLs to the `request()` function, but you are unlikely to be satisfied with the result.

The problem with such a call is that all 1,000 responses accumulate in memory, which will overload it at some point.

Use the `chunkify()` function to split your list of URLs into chunks of a reasonable size and feed those chunks to `request()`.

**Efficient, safe and predictable usage looks like this:**

```python
my_urls = ['https://example.com/page1', 'https://example.com/page2', ...]
my_urls_chunks = webrider.chunkify(my_urls, 10)  # 10 URLs per chunk
for urls_chunk in my_urls_chunks:
    responses = webrider.request(urls_chunk)  # process 10 pages concurrently
    for response in responses:
        parse_response(response.html)  # your own parsing function
```

You can set your own concurrency and per-chunk delay policies, but be aware that the out-of-the-box settings may behave unexpectedly. To maximise efficiency, tune the scraper to each website's capabilities.

### User-agents, headers and proxies

WebRiderAsync keeps the user-agent, header and proxy policies you specified during initialization in memory unless you change them with the `update_settings()` method.

However, you may need different headers for a specific chunk of requests, and you can pass them via the `request()` method.

This does not overwrite the settings passed to the class at initialization, but `request()` prioritises settings passed to the method over those passed to the class.
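
For illustration, here is a minimal sketch of such a per-call override, using the `request()` signature documented below; the header values themselves are placeholders:

```python
from webrider_async import WebRiderAsync

# Class-level headers apply to every call by default...
webrider = WebRiderAsync(custom_headers={'Accept-Language': 'en-US'})

# ...while headers passed to request() take priority for this call only.
responses = webrider.request(
    ['https://example.com/api/items?page=1'],
    headers={'Accept': 'application/json'},
)
```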

## Parameters

### `__init__` Parameters

- `log_level`: Specifies the log level. Options: 'debug', 'info', 'warning', 'error'.
- `file_output`: If True, logs will be saved to a file.
- `random_user_agents`: If True, a random user agent will be used for each request.
- `custom_user_agent`: A custom user agent string.
- `custom_headers`: A dictionary of custom headers.
- `custom_proxies`: A list or single string of proxies to be used.
- `concurrent_requests`: Number of concurrent requests allowed.
- `delay_per_chunk`: Delay between chunks of requests (in seconds).
- `max_retries`: Maximum number of retries per request.
- `delay_before_retry`: Delay before retrying a failed request (in seconds).
- `max_wait_for_resp`: Maximum time to wait for a response (in seconds).
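
A sketch combining the throttling-related parameters above; the values are illustrative, not recommendations:

```python
from webrider_async import WebRiderAsync

# Illustrative values only; tune them per target website.
webrider = WebRiderAsync(
    concurrent_requests=10,   # up to 10 requests in flight at once
    delay_per_chunk=1,        # pause (in seconds) between chunks of requests
    max_retries=3,            # retry each failed URL up to 3 times
    delay_before_retry=2,     # wait 2 seconds before retrying
    max_wait_for_resp=30,     # give up waiting for a response after 30 seconds
)
```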

### Methods

- `request(urls, headers=None, user_agent=None, proxies=None)`: Perform asynchronous requests to the specified URLs.
- `update_settings()`: Update settings for the WebRiderAsync instance.
- `stats()`: Print current scraping statistics.
- `reset_stats()`: Reset statistics to zero.
- `chunkify(initial_list, chunk_size=10)`: Split a list into chunks of the specified size.

## Logging

Logging can be configured to print to the console or save to a file. The log file is saved in a logs directory under the current working directory, with a timestamp in the filename.
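
A small sketch of enabling file logging via the initialization parameters documented above; the exact log directory and filename pattern are whatever the library chooses:

```python
from webrider_async import WebRiderAsync

# Logs print to the console; with file_output=True they are also written to a
# timestamped file in a logs/ directory under the current working directory.
webrider = WebRiderAsync(log_level='debug', file_output=True)
```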

## Error Handling

If a request fails after the maximum number of retries, it is logged as a failure. Errors during request processing are logged with traceback information.
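
As an illustrative, non-authoritative sketch of defensive handling on the caller's side — it assumes a URL that exhausted its retries still comes back as a response object with a non-2xx `status_code`, which may not match the library's actual failure representation:

```python
from webrider_async import WebRiderAsync

webrider = WebRiderAsync(max_retries=3, delay_before_retry=2)
responses = webrider.request(['https://example.com/page1',
                              'https://example.com/page2'])

# Assumption: failed requests surface as responses with a non-2xx status_code.
# Adjust this check to the library's real behaviour.
succeeded = [r for r in responses if 200 <= r.status_code < 300]
failed = [r for r in responses if not (200 <= r.status_code < 300)]
print(f'{len(succeeded)} succeeded, {len(failed)} failed')
```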

## License

This project is licensed under the MIT License - see the LICENSE file for details.

---

**May the 4th be with you!**

            
