scrapy-manipulate-request

Name: scrapy-manipulate-request
Version: 0.0.2
Home page: https://github.com/dylankeepon/ScrapyManipulateRequestMiddleware
Summary: An async scrapy request downloader middleware, support random request and response manipulation.
Upload time: 2023-06-21 23:27:26
Author: Dylan Chen
Requires Python: >=3.7.0
License: MIT
# Scrapy Manipulate Request Downloader Middleware

This is an async Scrapy downloader middleware that supports arbitrary request and response manipulation.

With it, you can change the request and response in any way you like: send requests with tls_client, pyhttpx, requests-go, etc., or even drive Chrome via selenium, undetected_chromedriver, playwright, etc., without having to think about the async logic behind Scrapy.

## Installation

```shell
pip3 install scrapy-manipulate-request
```

## Usage

You need to enable `ManipulateRequestDownloaderMiddleware` in `DOWNLOADER_MIDDLEWARES` first:

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_manipulate_request.downloadermiddlewares.ManipulateRequestDownloaderMiddleware': 543,
}
```

Note that this middleware is async, which means it is affected by Scrapy's concurrency settings, such as:

```python
CONCURRENT_REQUESTS = 16
```
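Other concurrency-related settings presumably interact with this middleware in the same way. A hedged sketch of a `settings.py` fragment (the values here are illustrative, not recommendations from the package):

```python
# settings.py -- illustrative values only
CONCURRENT_REQUESTS = 16            # upper bound on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap
DOWNLOAD_DELAY = 0.5                # delay between requests, in seconds
```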

Manipulating the request and response is simple and convenient: add a `manipulate_request` method to your spider and pass it through the request's `meta`, much like a `parse` callback.

```python
import scrapy

class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)
    
    def manipulate_request(self, request, spider):
        # Return None and the request will be ignored.
        # Return a scrapy.http.HtmlResponse or scrapy.http.TextResponse
        # object and response processing will start.
        pass
    
    def parse(self, response):
        pass
```
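The contract above can be illustrated with a minimal, framework-free stand-in for the middleware's dispatch step. This is a sketch of the presumed behavior, not the middleware's actual implementation; `FakeRequest` and `dispatch` are hypothetical names introduced here for illustration:

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request, carrying only meta."""
    def __init__(self, meta=None):
        self.meta = meta or {}

def dispatch(request, spider):
    # The middleware presumably looks up the 'manipulate_request'
    # callable in request.meta and acts on its return value.
    manipulate = request.meta.get('manipulate_request')
    if manipulate is None:
        return None  # no hook supplied: nothing for this middleware to do
    # Per the README: returning None means the request is ignored;
    # returning a Response object starts Scrapy's response handling.
    return manipulate(request, spider)

request = FakeRequest({'manipulate_request': lambda req, spider: 'stub-response'})
print(dispatch(request, None))  # prints "stub-response"
```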

## Useful Examples

### Send requests with tls_client to bypass JA3 fingerprint verification

```python
import scrapy
import tls_client
from scrapy.http import TextResponse

class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)
    
    def manipulate_request(self, request, spider):
        url = request.url
        headers = request.headers.to_unicode_dict()
        tls_session = tls_client.Session(
            client_identifier='chrome_112',
            random_tls_extension_order=True
        )
        proxy = 'http://username:password@ip:port'
        raw_response = tls_session.get(url=url, headers=headers, proxy=proxy)
        response = TextResponse(url=request.url, status=raw_response.status_code, headers=raw_response.headers,
                                body=raw_response.text, request=request, encoding='utf-8')
        return response
        
        # Returning None would cause the request to be ignored.
        # Returning a scrapy.http.HtmlResponse or scrapy.http.TextResponse
        # object starts response processing.
    
    def parse(self, response):
        pass
```

For more detailed tls_client usage, see [Python-Tls-Client](https://github.com/FlorianREGAZ/Python-Tls-Client).

### Use undetected_chromedriver to operate a webpage

```python
import scrapy
from pprint import pformat
from scrapy.http import HtmlResponse
from seleniumwire import undetected_chromedriver as uc

class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)
    
    def manipulate_request(self, request, spider):
        chrome_options = uc.ChromeOptions()
        chrome_options.add_argument('--disable-gpu')  # example argument
        # chrome_options.add_experimental_option(name, value)
        # chrome_options.add_extension(path_to_crx_file)
        seleniumwire_options = {
            'proxy': {
                'http': 'http://username:password@ip:port',
                'https': 'https://username:password@ip:port',
            }
        }
        browser = uc.Chrome(version_main=108, options=chrome_options,
                            seleniumwire_options=seleniumwire_options,
                            headless=True, enable_cdp_events=True)
        browser.set_page_load_timeout(10)
        browser.maximize_window()
        browser.add_cdp_listener('Network.requestWillBeSent', self.mylousyprintfunction)
        # browser.execute_script(javascript_snippet)
        # browser.execute_cdp_cmd(cmd, cmd_args)
        browser.request_interceptor = self.request_interceptor
        browser.get("https://tls.browserleaks.com/json")
        # elements = browser.find_elements(by, value)
        ...
        raw_response = browser.page_source
        browser.quit()
        response = HtmlResponse(url=request.url, status=200, body=raw_response,
                                request=request, encoding='utf-8')
        return response
        # Returning None would cause the request to be ignored.
        # Returning a scrapy.http.HtmlResponse or scrapy.http.TextResponse
        # object starts response processing.

    def mylousyprintfunction(self, message):
        print(pformat(message))

    def request_interceptor(self, request):
        request.headers['New-Header'] = 'Some Value'
        del request.headers['Referer']
        request.headers['Referer'] = 'some_referer'
```

For more detailed Chrome operations, see [undetected-chromedriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver)
and [selenium-wire](https://github.com/wkeeling/selenium-wire).

            
