zcbot-scrapy-redis
==================

:Name: zcbot-scrapy-redis
:Version: 0.7.3.2110.1
:Home page: https://github.com/rolando/scrapy-redis
:Summary: Redis-based components for Scrapy 2.11.0+.
:Upload time: 2023-11-15 01:00:42
:Author: zsodata
:License: MIT
:Keywords: scrapy-redis
:Requirements: No requirements were recorded.
============
Scrapy-Redis
============

.. image:: https://readthedocs.org/projects/scrapy-redis/badge/?version=latest
        :alt: Documentation Status
        :target: https://readthedocs.org/projects/scrapy-redis/?badge=latest

.. image:: https://img.shields.io/pypi/v/scrapy-redis.svg
        :target: https://pypi.python.org/pypi/scrapy-redis

.. image:: https://img.shields.io/pypi/pyversions/scrapy-redis.svg
        :target: https://pypi.python.org/pypi/scrapy-redis

.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/builds.yml/badge.svg
        :target: https://github.com/rmax/scrapy-redis/actions/workflows/builds.yml
        
.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/checks.yml/badge.svg
        :target: https://github.com/rmax/scrapy-redis/actions/workflows/checks.yml
        
.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/tests.yml/badge.svg
        :target: https://github.com/rmax/scrapy-redis/actions/workflows/tests.yml
        
.. image:: https://codecov.io/github/rmax/scrapy-redis/coverage.svg?branch=master
        :alt: Coverage Status
        :target: https://codecov.io/github/rmax/scrapy-redis

.. image:: https://img.shields.io/badge/security-bandit-green.svg
        :alt: Security Status
        :target: https://github.com/rmax/scrapy-redis
    
Redis-based components for Scrapy.

* Usage: https://github.com/rmax/scrapy-redis/wiki/Usage
* Documentation: https://github.com/rmax/scrapy-redis/wiki
* Release: https://github.com/rmax/scrapy-redis/wiki/History
* Contribution: https://github.com/rmax/scrapy-redis/wiki/Getting-Started
* License: MIT

Features
--------

* Distributed crawling/scraping

    You can start multiple spider instances that share a single Redis queue.
    Best suited for broad, multi-domain crawls (an example of feeding the
    shared queue appears below).

* Distributed post-processing

    Scraped items are pushed into a Redis queue, meaning you can start as many
    post-processing processes as needed, all sharing the same items queue.

* Scrapy plug-and-play components

    Scheduler + Duplication Filter, Item Pipeline, Base Spiders (see the
    configuration sketch below).

* In this forked version: added support for JSON-encoded data in Redis

    The data contains ``url``, ``meta`` and other optional parameters, where
    ``meta`` is a nested JSON object carrying sub-data. This feature extracts
    the data and sends a FormRequest with the given ``url``, ``meta`` and
    additional ``formdata``.

    For example:

    .. code-block:: json

        { "url": "https://exaple.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }

    This data can then be accessed in the Scrapy spider through the request,
    e.g. ``request.url``, ``request.meta``, ``request.cookies``; a sketch of
    consuming such payloads follows below.
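
Below is a minimal sketch of how such a payload might be consumed, using the
``make_request_from_data`` hook that upstream scrapy-redis exposes on its
spiders. The spider name and the treatment of extra keys such as
``url_cookie_key`` are illustrative assumptions, not the fork's confirmed
behavior.

.. code-block:: python

    import json

    from scrapy import FormRequest
    from scrapy_redis.spiders import RedisSpider


    class JsonQueueSpider(RedisSpider):
        """Hypothetical spider that reads JSON payloads from the Redis queue."""

        name = "json_queue"

        def make_request_from_data(self, data):
            # `data` arrives as raw bytes popped from the Redis key.
            payload = json.loads(data.decode("utf-8"))

            # Pass any keys besides `url` and `meta` along as form data;
            # this mirrors, but does not reproduce exactly, the fork's
            # described handling of the optional parameters.
            extras = {
                k: str(v)
                for k, v in payload.items()
                if k not in ("url", "meta")
            }

            return FormRequest(
                url=payload["url"],
                meta=payload.get("meta", {}),
                formdata=extras,
                callback=self.parse,
            )

        def parse(self, response):
            # The payload's `meta` travels with the request/response cycle.
            self.logger.info("crawled %s (meta=%s)", response.url, response.meta)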
    
.. note:: These features cover the basic case of distributing the workload across multiple workers. If you need more features, like URL expiration or advanced URL prioritization, we suggest you take a look at the Frontera_ project.
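
As a concrete illustration of the plug-and-play components, here is a hedged
``settings.py`` sketch using the component paths documented by upstream
scrapy-redis; the Redis URL and pipeline priority are placeholder values.

.. code-block:: python

    # settings.py -- wiring up the scrapy-redis components.

    # Redis-backed scheduler and duplicate filter.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Keep the request queue between runs instead of flushing it on start.
    SCHEDULER_PERSIST = True

    # Push scraped items into Redis for distributed post-processing.
    ITEM_PIPELINES = {
        "scrapy_redis.pipelines.RedisPipeline": 300,
    }

    # Redis connection string (see the ``REDIS_URL`` entry in the history).
    REDIS_URL = "redis://localhost:6379/0"

Feeding the shared queue for distributed crawling then amounts to pushing
entries onto the spider's start-URLs key (``<spider_name>:start_urls`` by
default), for example with redis-py:

.. code-block:: python

    import redis

    r = redis.Redis.from_url("redis://localhost:6379/0")
    # "json_queue" matches the hypothetical spider sketched above; in this
    # fork the entry may be a JSON payload rather than a bare URL.
    r.lpush("json_queue:start_urls", '{"url": "https://example.com", "meta": {}}')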

Requirements
------------

* Python 3.7+
* Redis >= 5.0
* ``Scrapy`` >=  2.0
* ``redis-py`` >= 4.0

Installation
------------

From PyPI

.. code-block:: bash

    pip install scrapy-redis

From GitHub

.. code-block:: bash

    git clone https://github.com/darkrho/scrapy-redis.git
    cd scrapy-redis
    python setup.py install

.. note:: To use the JSON data feature, make sure you have not installed the upstream scrapy-redis through pip (this fork is published on PyPI as ``zcbot-scrapy-redis``). If you already did, uninstall it first:
  
.. code-block:: bash

    pip uninstall scrapy-redis

Alternative Choice
------------------

Frontera_ is a web crawling framework consisting of a `crawl frontier`_ and distribution/scaling primitives, allowing you to build large-scale online web crawlers.

.. _Frontera: https://github.com/scrapinghub/frontera
.. _crawl frontier: http://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html

=======
History
=======


0.7.3 (2022-07-21)
------------------
* Move docs to GitHub Wiki
* Update tox and support dynamic tests
* Update support for json data
* Refactor max idle time
* Add support for Python 3.7 through 3.10
* Deprecate Python 2.x support

0.7.2 (2021-12-27)
------------------
* Fix ``RedisStatsCollector._get_key()``
* Fix redis-py dependency version
* Added maximum idle waiting time ``MAX_IDLE_TIME_BEFORE_CLOSE``

0.7.1 (2021-03-27)
------------------
* Fixes datetime parse error for redis-py 3.x.
* Add support for stats extensions.

0.7.1-rc1 (2021-03-27)
----------------------
* Fixes datetime parse error for redis-py 3.x.

0.7.1-b1 (2021-03-22)
---------------------
* Add support for stats extensions.

0.7.0-dev (unreleased)
----------------------
* Unreleased.

0.6.8 (2017-02-14)
------------------
* Fixed automated release due to not matching registered email.

0.6.7 (2016-12-27)
------------------
* Fixes bad formatting in logging message.

0.6.6 (2016-12-20)
------------------
* Fixes wrong message on dupefilter duplicates.

0.6.5 (2016-12-19)
------------------
* Fixed typo in default settings.

0.6.4 (2016-12-18)
------------------
* Fixed data decoding in Python 3.x.
* Added ``REDIS_ENCODING`` setting (default ``utf-8``).
* Default to ``CONCURRENT_REQUESTS`` value for ``REDIS_START_URLS_BATCH_SIZE``.
* Renamed queue classes to a proper naming convention (backwards compatible).

0.6.3 (2016-07-03)
------------------
* Added ``REDIS_START_URLS_KEY`` setting.
* Fixed spider method ``from_crawler`` signature.

0.6.2 (2016-06-26)
------------------
* Support ``redis_cls`` parameter in ``REDIS_PARAMS`` setting.
* Python 3.x compatibility fixed.
* Added ``SCHEDULER_SERIALIZER`` setting.

0.6.1 (2016-06-25)
------------------
* **Backwards incompatible change:** Require explicit ``DUPEFILTER_CLASS``
  setting.
* Added ``SCHEDULER_FLUSH_ON_START`` setting.
* Added ``REDIS_START_URLS_AS_SET`` setting.
* Added ``REDIS_ITEMS_KEY`` setting.
* Added ``REDIS_ITEMS_SERIALIZER`` setting.
* Added ``REDIS_PARAMS`` setting.
* Added ``REDIS_START_URLS_BATCH_SIZE`` spider attribute to read start urls
  in batches.
* Added ``RedisCrawlSpider``.

0.6.0 (2015-07-05)
------------------
* Updated code to be compatible with Scrapy 1.0.
* Added `-a domain=...` option for example spiders.

0.5.0 (2013-09-02)
------------------
* Added `REDIS_URL` setting to support Redis connection string.
* Added `SCHEDULER_IDLE_BEFORE_CLOSE` setting to prevent the spider closing too
  quickly when the queue is empty. Default value is zero keeping the previous
  behavior.
* Preemptively schedule requests when an item is scraped.
* This version is the latest release compatible with Scrapy 0.24.x.

0.4.0 (2013-04-19)
------------------
* Added `RedisSpider` and `RedisMixin` classes as building blocks for spiders
  to be fed through a redis queue.
* Added redis queue stats.
* Let the encoder handle the item as it comes instead of converting it to a dict.

0.3.0 (2013-02-18)
------------------
* Added support for different queue classes.
* Changed requests serialization from `marshal` to `cPickle`.

0.2.0 (2013-02-17)
------------------
* Improved backward compatibility.
* Added example project.

0.1.0 (2011-09-01)
------------------
* First release on PyPI.

            
