Name | stealth-requests |
Version | 1.2.1 |
Summary | Make HTTP requests exactly like a browser. |
upload_time | 2024-10-22 00:49:45 |
home_page | None |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | MIT |
keywords | http, requests, scraping, browser |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
<p align="center">
<img src="https://github.com/jpjacobpadilla/Stealth-Requests/blob/0572cdf58d141239e945a1562490b1d00054379c/logo.png?raw=true">
</p>
<h1 align="center">Stay Undetected While Scraping the Web.</h1>
### The All-In-One Solution to Web Scraping:
- **Realistic HTTP Requests:**
- Mimics browser headers for undetected scraping, adapting to the requested file type
- Tracks dynamic headers such as `Referer` and `Host`
- Masks the TLS fingerprint of HTTP requests using the [curl_cffi](https://curl-cffi.readthedocs.io/en/latest/) package
- **Faster and Easier Parsing:**
- Automatically extracts metadata (title, description, author, etc.) from HTML-based responses
- Methods to extract all webpage and image URLs
- Seamlessly converts responses into [Lxml](https://lxml.de/apidoc/lxml.html) and [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) objects
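To give a feel for the first point (independent of this package), here is a minimal sketch of the kind of browser-like request headers a stealth client manages for you. The header values below are hypothetical examples, not the ones Stealth-Requests actually sends:

```python
from urllib.request import Request

# Illustration only (not Stealth-Requests code): the kind of browser-like
# headers a stealth client sends. These values are hypothetical examples.
browser_headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/118.0.0.0 Safari/537.36'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.google.com/',
}

# Build (but do not send) a request carrying these headers
req = Request('https://example.com', headers=browser_headers)
print(req.get_header('Referer'))
```

Stealth-Requests builds headers like these automatically and keeps the dynamic ones (e.g. `Referer`) up to date between requests.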
### Install
```
$ pip install stealth_requests
```
### Sending Requests
Stealth-Requests mimics the API of the [requests](https://requests.readthedocs.io/en/latest/) package, allowing you to use it in nearly the same way.
You can send one-off requests like this:
```python
import stealth_requests as requests
resp = requests.get('https://link-here.com')
```
Or you can use a `StealthSession` object, which keeps track of certain headers for you between requests, such as the `Referer` header.
```python
from stealth_requests import StealthSession

with StealthSession() as session:
    resp = session.get('https://link-here.com')
```
When sending a request or creating a `StealthSession`, you can choose which browser to mimic by setting the `impersonate` argument (in `requests.get` or when initializing a `StealthSession`) to either `chrome` (the default) or `safari`.
### Sending Requests With Asyncio
This package supports Asyncio in the same way as the `requests` package:
```python
from stealth_requests import AsyncStealthSession

async with AsyncStealthSession(impersonate='safari') as session:
    resp = await session.get('https://link-here.com')
```
Or, for a one-off request:
```python
import stealth_requests as requests
resp = await requests.get('https://link-here.com', impersonate='safari')
```
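Note that `await` at the top level only works inside a running event loop (for example, a Jupyter cell). In a plain script, wrap the call in a coroutine and run it with `asyncio.run`; a minimal sketch:

```python
import asyncio

async def main():
    # In a real script this would be the awaited request, e.g.:
    # resp = await requests.get('https://link-here.com', impersonate='safari')
    return 'done'  # placeholder so this sketch runs without network access

print(asyncio.run(main()))
```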
### Getting Response Metadata
The response returned from this package is a `StealthResponse`, which has all of the same methods and attributes as a standard [requests response object](https://requests.readthedocs.io/en/latest/api/#requests.Response), plus a few added features. One of these is automatic parsing of metadata from the `<head>` of HTML-based responses, available through the `meta` property, which exposes the following fields:
- title: `str | None`
- author: `str | None`
- description: `str | None`
- thumbnail: `str | None`
- canonical: `str | None`
- twitter_handle: `str | None`
- keywords: `tuple[str] | None`
- robots: `tuple[str] | None`
Here's an example of how to get the title of a page:
```python
import stealth_requests as requests
resp = requests.get('https://link-here.com')
print(resp.meta.title)
```
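For intuition, here is a package-independent sketch of this kind of `<head>` metadata extraction using only the standard library. This is not Stealth-Requests' actual implementation, just an illustration of the idea:

```python
from html.parser import HTMLParser

# Minimal, package-independent sketch of <head> metadata extraction.
class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'title':
            self._in_title = True
        elif tag == 'meta' and 'name' in attrs and 'content' in attrs:
            self.meta[attrs['name']] = attrs['content']

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta['title'] = data

html = '<html><head><title>Example</title><meta name="author" content="Jane"></head></html>'
parser = MetaExtractor()
parser.feed(html)
print(parser.meta)
```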
### Parsing Responses
To make parsing HTML faster, I've also integrated two popular parsing packages into Stealth-Requests: Lxml and BeautifulSoup4. To use these add-ons, install the `parsers` extra:
```
$ pip install stealth_requests[parsers]
```
To get an Lxml tree, use `resp.tree()`; to get a BeautifulSoup object, use the `resp.soup()` method.
For simple parsing, I've also added the following convenience methods from the Lxml package right into the `StealthResponse` object:
- `text_content()`: Get all of the text content in a response
- `xpath()`: Run XPath expressions directly, without building your own Lxml tree
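As a rough illustration of XPath-style querying, here is a sketch using the standard library's limited XPath support (the package's `xpath()` method uses Lxml's full XPath engine instead):

```python
import xml.etree.ElementTree as ET

# Illustration with the standard library's limited XPath support;
# Lxml-backed xpath() supports the full XPath language.
doc = ET.fromstring('<div><p>Hello</p><p>World</p></div>')
texts = [p.text for p in doc.findall('.//p')]
print(texts)  # ['Hello', 'World']
```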
### Get All Image and Page Links From a Response
To get all of the webpage URLs (`a` tags) from an HTML-based response, use the `links` property; to get all image URLs (`img` tags), use the `images` property of the response object.
```python
import stealth_requests as requests

resp = requests.get('https://link-here.com')

for image_url in resp.images:
    ...  # process each image URL
```
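Independent of Stealth-Requests, collecting `a` and `img` URLs from raw HTML can be sketched with the standard library alone, which is roughly what these properties do for you:

```python
from html.parser import HTMLParser

# Package-independent sketch: collect `a` hrefs and `img` srcs from HTML.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.images = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])
        elif tag == 'img' and 'src' in attrs:
            self.images.append(attrs['src'])

collector = LinkCollector()
collector.feed('<a href="/page">x</a><img src="/pic.png">')
print(collector.links, collector.images)
```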
### Getting HTML Responses in Markdown Format
In some cases, it's easier to work with a webpage in Markdown format rather than HTML. After making a GET request that returns HTML, you can use the `resp.markdown()` method to convert the response into a Markdown string, giving you a simplified, readable version of the page content.
`markdown()` has two optional parameters:
1. `content_xpath`: An XPath expression (as a string) that narrows down which part of the page is converted to Markdown. This is useful if you don't want a page's header and footer turned into Markdown.
2. `ignore_links`: A boolean that tells Html2Text whether to include links in the Markdown output.
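As a toy illustration of the idea (not the package's implementation, which delegates to Html2Text), a crude HTML-to-Markdown conversion handling only `h1`, `p`, and `a` tags might look like this:

```python
import re

# Toy sketch of HTML-to-Markdown conversion in the spirit of markdown().
# The real method uses Html2Text; this handles only h1, p, and a tags.
def to_markdown(html, ignore_links=False):
    html = re.sub(r'<h1>(.*?)</h1>', r'# \1\n\n', html)
    if ignore_links:
        html = re.sub(r'<a[^>]*>(.*?)</a>', r'\1', html)
    else:
        html = re.sub(r'<a href="([^"]*)"[^>]*>(.*?)</a>', r'[\2](\1)', html)
    html = html.replace('<p>', '').replace('</p>', '\n')
    return html.strip()

print(to_markdown('<h1>Title</h1><p>See <a href="/x">this</a></p>'))
```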
Raw data
{
"_id": null,
"home_page": null,
"name": "stealth-requests",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "HTTP, requests, scraping, browser",
"author": null,
"author_email": "Jacob Padilla <jp@jacobpadilla.com>",
"download_url": "https://files.pythonhosted.org/packages/e3/1b/22a556b133d7978634e89e6a52d6a33d2208f0ff4669b1e28a016f347f4c/stealth_requests-1.2.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Make HTTP requests exactly like a browser.",
"version": "1.2.1",
"project_urls": {
"Homepage": "https://github.com/jpjacobpadilla/Stealth-Requests"
},
"split_keywords": [
"http",
" requests",
" scraping",
" browser"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e5bb6fc6a17c3a37cc651ce09108508ef5cc581be8ec5b61be5d959ed4e0ff79",
"md5": "3a3a400cde174ec13be40f70265b6afd",
"sha256": "0a1c2b926d39c2dbd5074cb5a789973eead3c3ff3b604ead7b0ed739533a88f7"
},
"downloads": -1,
"filename": "stealth_requests-1.2.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3a3a400cde174ec13be40f70265b6afd",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 8255,
"upload_time": "2024-10-22T00:49:43",
"upload_time_iso_8601": "2024-10-22T00:49:43.251679Z",
"url": "https://files.pythonhosted.org/packages/e5/bb/6fc6a17c3a37cc651ce09108508ef5cc581be8ec5b61be5d959ed4e0ff79/stealth_requests-1.2.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e31b22a556b133d7978634e89e6a52d6a33d2208f0ff4669b1e28a016f347f4c",
"md5": "13fba6d8c22ed52835fcb1def321d5f9",
"sha256": "48cf22d32f56ee987852f7b48203d802ca8b6a1d268e6dae659400ea88770c87"
},
"downloads": -1,
"filename": "stealth_requests-1.2.1.tar.gz",
"has_sig": false,
"md5_digest": "13fba6d8c22ed52835fcb1def321d5f9",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 9886,
"upload_time": "2024-10-22T00:49:45",
"upload_time_iso_8601": "2024-10-22T00:49:45.543817Z",
"url": "https://files.pythonhosted.org/packages/e3/1b/22a556b133d7978634e89e6a52d6a33d2208f0ff4669b1e28a016f347f4c/stealth_requests-1.2.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-22 00:49:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "jpjacobpadilla",
"github_project": "Stealth-Requests",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "stealth-requests"
}