# Scrapy Manipulate Request Downloader Middleware
This is an async Scrapy downloader middleware that supports arbitrary request and response manipulation.
With it, you can change the request and response in any way you like: send requests with tls_client,
pyhttpx, requests-go, etc., or even drive Chrome with selenium, undetected_chromedriver, playwright, etc.,
without having to think about the async logic behind Scrapy.
## Installation
```shell
pip3 install scrapy-manipulate-request
```
## Usage
You need to enable `ManipulateRequestDownloaderMiddleware` in `DOWNLOADER_MIDDLEWARES` first:
```python
DOWNLOADER_MIDDLEWARES = {
'scrapy_manipulate_request.downloadermiddlewares.ManipulateRequestDownloaderMiddleware': 543,
}
```
Note that this middleware is async, which means it is affected by Scrapy's concurrency settings, such as:
```python
CONCURRENT_REQUESTS = 16
```
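Other standard Scrapy throttling settings also bound how many `manipulate_request` callbacks run in parallel; a sketch of common ones (the values below are illustrative, not recommendations):

```python
# Illustrative Scrapy settings that also bound concurrency:
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # parallel requests per domain
CONCURRENT_REQUESTS_PER_IP = 0       # 0 disables the per-IP limit
DOWNLOAD_DELAY = 0.5                 # seconds between requests to the same site
```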
Manipulating the request and response is simple: add a `manipulate_request` method to your spider and pass it
through the request's `meta`, much like a `parse` callback.
```python
import scrapy


class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)

    def manipulate_request(self, request, spider):
        # Return None and the request will be ignored.
        # Return a scrapy.http.HtmlResponse or scrapy.http.TextResponse object
        # and response processing will start.
        pass

    def parse(self, response):
        pass
```
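Under the hood, the contract is straightforward: the middleware looks up the callable stored in `request.meta` and interprets its return value. The following is an illustrative sketch of that dispatch logic (not the library's actual source; `process_request_sketch` is a made-up name):

```python
def process_request_sketch(request, spider):
    # Fetch the callable the spider stored in meta.
    handler = request.meta.get('manipulate_request')
    if handler is None:
        return None  # no handler: let Scrapy download the request normally
    result = handler(request, spider)
    # None means "ignore the request"; a Response object means
    # "skip the downloader and start response processing with this".
    return result
```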
## Useful Examples
### Send requests with tls_client to bypass JA3 verification
```python
import scrapy
import tls_client
from scrapy.http import TextResponse


class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)

    def manipulate_request(self, request, spider):
        url = request.url
        headers = request.headers.to_unicode_dict()
        tls_session = tls_client.Session(
            client_identifier='chrome_112',
            random_tls_extension_order=True
        )
        proxy = 'http://username:password@ip:port'
        raw_response = tls_session.get(url=url, headers=headers, proxy=proxy)
        # Return None and the request will be ignored.
        # Return a scrapy.http.HtmlResponse or scrapy.http.TextResponse object
        # and response processing will start.
        response = TextResponse(url=request.url, status=raw_response.status_code,
                                headers=raw_response.headers, body=raw_response.text,
                                request=request, encoding='utf-8')
        return response

    def parse(self, response):
        pass
```
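To make fingerprinting harder still, you can rotate the client identifier and proxy between requests. A minimal sketch (the identifier list and proxy strings are placeholders; check tls_client's README for the identifiers it actually supports):

```python
import random

CLIENT_IDS = ['chrome_110', 'chrome_111', 'chrome_112']  # placeholder identifiers
PROXIES = ['http://user:pass@host1:port', 'http://user:pass@host2:port']  # placeholders

def pick_session_config():
    # Return a (client_identifier, proxy) pair chosen at random so that
    # successive requests present different JA3 fingerprints and source IPs.
    return random.choice(CLIENT_IDS), random.choice(PROXIES)
```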
For more detailed tls_client usage, see [Python-Tls-Client](https://github.com/FlorianREGAZ/Python-Tls-Client).
### Use undetected Chrome to operate on a webpage
```python
import scrapy
from pprint import pformat
from scrapy.http import HtmlResponse
from seleniumwire import undetected_chromedriver as uc


class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)

    def manipulate_request(self, request, spider):
        chrome_options = uc.ChromeOptions()
        # Configure the browser as needed, e.g.:
        # chrome_options.add_argument('--no-sandbox')
        # chrome_options.add_extension('/path/to/extension.crx')
        seleniumwire_options = {
            'proxy': {
                'http': 'http://username:password@ip:port',
                'https': 'https://username:password@ip:port',
            }
        }
        browser = uc.Chrome(version_main=108, options=chrome_options,
                            seleniumwire_options=seleniumwire_options,
                            headless=True, enable_cdp_events=True)
        browser.set_page_load_timeout(10)
        browser.maximize_window()
        browser.add_cdp_listener('Network.requestWillBeSent', self.mylousyprintfunction)
        browser.request_interceptor = self.request_interceptor
        browser.get("https://tls.browserleaks.com/json")
        # Interact with the page as needed, e.g.:
        # browser.execute_script('return document.title')
        # elements = browser.find_elements(...)
        ...
        raw_response = browser.page_source
        browser.quit()
        # Return None and the request will be ignored.
        # Return a scrapy.http.HtmlResponse or scrapy.http.TextResponse object
        # and response processing will start.
        response = HtmlResponse(url=request.url, status=200, body=raw_response,
                                request=request, encoding='utf-8')
        return response

    def mylousyprintfunction(self, message):
        print(pformat(message))

    def request_interceptor(self, request):
        request.headers['New-Header'] = 'Some Value'
        del request.headers['Referer']
        request.headers['Referer'] = 'some_referer'
```
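The del-then-set pattern in `request_interceptor` exists because selenium-wire's headers object can hold duplicate entries, so assigning without deleting may append a second `Referer` instead of replacing it. A stand-in sketch of the interceptor's effect (the `FakeRequest` class and URLs are made up for illustration):

```python
class FakeRequest:
    # Minimal stand-in for selenium-wire's request object. The real
    # headers object allows duplicate keys, which is why the interceptor
    # deletes 'Referer' before setting the new value.
    def __init__(self):
        self.headers = {'Referer': 'https://old.example/'}

def request_interceptor(request):
    request.headers['New-Header'] = 'Some Value'
    request.headers.pop('Referer', None)  # remove any existing Referer first
    request.headers['Referer'] = 'https://new.example/'
```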
For more detailed Chrome operations, see [undetected-chromedriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver)
and [selenium-wire](https://github.com/wkeeling/selenium-wire).