============
Scrapy-Redis
============
.. image:: https://readthedocs.org/projects/scrapy-redis/badge/?version=latest
:alt: Documentation Status
:target: https://readthedocs.org/projects/scrapy-redis/?badge=latest
.. image:: https://img.shields.io/pypi/v/scrapy-redis.svg
:target: https://pypi.python.org/pypi/scrapy-redis
.. image:: https://img.shields.io/pypi/pyversions/scrapy-redis.svg
:target: https://pypi.python.org/pypi/scrapy-redis
.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/builds.yml/badge.svg
:target: https://github.com/rmax/scrapy-redis/actions/workflows/builds.yml
.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/checks.yml/badge.svg
:target: https://github.com/rmax/scrapy-redis/actions/workflows/checks.yml
.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/tests.yml/badge.svg
:target: https://github.com/rmax/scrapy-redis/actions/workflows/tests.yml
.. image:: https://codecov.io/github/rmax/scrapy-redis/coverage.svg?branch=master
:alt: Coverage Status
:target: https://codecov.io/github/rmax/scrapy-redis
.. image:: https://img.shields.io/badge/security-bandit-green.svg
:alt: Security Status
:target: https://github.com/rmax/scrapy-redis
Redis-based components for Scrapy.
* Usage: https://github.com/rmax/scrapy-redis/wiki/Usage
* Documentation: https://github.com/rmax/scrapy-redis/wiki
* Release: https://github.com/rmax/scrapy-redis/wiki/History
* Contribution: https://github.com/rmax/scrapy-redis/wiki/Getting-Started
* LICENSE: MIT license
Features
--------
* Distributed crawling/scraping
  You can start multiple spider instances that share a single Redis queue.
  Best suited for broad multi-domain crawls.
* Distributed post-processing
  Scraped items are pushed into a Redis queue, meaning that you can start as
  many post-processing processes as needed, all sharing the same items queue.
* Scrapy plug-and-play components
  Scheduler + Duplication Filter, Item Pipeline, Base Spiders (a minimal settings sketch is shown under Installation).
* In this forked version: added support for JSON-encoded data in Redis

  The data contains ``url``, ``meta`` and other optional parameters. ``meta`` is a nested JSON object containing sub-data.
  This feature extracts the data and sends a ``FormRequest`` with the given ``url``, ``meta`` and additional ``formdata``.
For example:
.. code-block:: json
{ "url": "https://exaple.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }
  This data can then be accessed in the Scrapy spider through the request,
  e.g. ``request.url``, ``request.meta``, ``request.cookies``.
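  As an illustration, feeding such a payload is a plain Redis list push. The sketch below assumes a spider named ``myspider`` reading from the default start-URLs key ``myspider:start_urls`` (configurable via ``REDIS_START_URLS_KEY``); the spider name and payload values are illustrative.

  .. code-block:: python

      import json

      import redis

      r = redis.Redis(host="localhost", port=6379)
      payload = {
          "url": "https://example.com",
          "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"},
          "url_cookie_key": "fertxsas",
      }
      # The key follows the default "<spider name>:start_urls" convention.
      r.lpush("myspider:start_urls", json.dumps(payload))

  On the consuming side, a ``RedisSpider`` subclass picks the payload up; the parse logic here is only a sketch:

  .. code-block:: python

      from scrapy_redis.spiders import RedisSpider

      class MySpider(RedisSpider):
          name = "myspider"  # must match the key prefix used above

          def parse(self, response):
              # Values pushed as JSON travel on the originating request.
              yield {
                  "url": response.request.url,
                  "job_id": response.request.meta.get("job-id"),
              }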
.. note:: This feature covers the basic case of distributing the workload across multiple workers. If you need more features, such as URL expiration or advanced URL prioritization, we suggest you take a look at the Frontera_ project.
Requirements
------------
* Python 3.7+
* Redis >= 5.0
* ``Scrapy`` >= 2.0
* ``redis-py`` >= 4.0
Installation
------------
From pip
.. code-block:: bash
pip install scrapy-redis
From GitHub
.. code-block:: bash
git clone https://github.com/darkrho/scrapy-redis.git
cd scrapy-redis
python setup.py install
.. note:: To use the JSON data feature, make sure you have not installed scrapy-redis through pip. If you have, uninstall it first:
.. code-block:: bash
pip uninstall scrapy-redis
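Once installed, the plug-and-play components are enabled through your project settings. A minimal sketch using the documented setting names (the ``REDIS_URL`` value is an example; point it at your own Redis instance):

.. code-block:: python

    # settings.py
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    ITEM_PIPELINES = {
        "scrapy_redis.pipelines.RedisPipeline": 300,
    }
    REDIS_URL = "redis://localhost:6379"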
Alternative Choice
------------------
Frontera_ is a web crawling framework consisting of a `crawl frontier`_ and distribution/scaling primitives, allowing you to build large-scale online web crawlers.
.. _Frontera: https://github.com/scrapinghub/frontera
.. _crawl frontier: http://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html
=======
History
=======
0.7.3 (2022-07-21)
------------------
* Move docs to GitHub Wiki
* Update tox and support dynamic tests
* Update support for json data
* Refactor max idle time
* Add support for python3.7~python3.10
* Deprecate python2.x support
0.7.2 (2021-12-27)
------------------
* Fix ``RedisStatsCollector._get_key()``
* Fix redis-py dependency version
* Added maximum idle waiting time ``MAX_IDLE_TIME_BEFORE_CLOSE``
0.7.1 (2021-03-27)
------------------
* Fixes datetime parse error for redis-py 3.x.
* Add support for stats extensions.
0.7.1-rc1 (2021-03-27)
----------------------
* Fixes datetime parse error for redis-py 3.x.
0.7.1-b1 (2021-03-22)
---------------------
* Add support for stats extensions.
0.7.0-dev (unreleased)
----------------------
* Unreleased.
0.6.8 (2017-02-14)
------------------
* Fixed automated release due to not matching registered email.
0.6.7 (2016-12-27)
------------------
* Fixes bad formatting in logging message.
0.6.6 (2016-12-20)
------------------
* Fixes wrong message on dupefilter duplicates.
0.6.5 (2016-12-19)
------------------
* Fixed typo in default settings.
0.6.4 (2016-12-18)
------------------
* Fixed data decoding in Python 3.x.
* Added ``REDIS_ENCODING`` setting (default ``utf-8``).
* Default to ``CONCURRENT_REQUESTS`` value for ``REDIS_START_URLS_BATCH_SIZE``.
* Renamed queue classes to a proper naming convention (backwards compatible).
0.6.3 (2016-07-03)
------------------
* Added ``REDIS_START_URLS_KEY`` setting.
* Fixed spider method ``from_crawler`` signature.
0.6.2 (2016-06-26)
------------------
* Support ``redis_cls`` parameter in ``REDIS_PARAMS`` setting.
* Python 3.x compatibility fixed.
* Added ``SCHEDULER_SERIALIZER`` setting.
0.6.1 (2016-06-25)
------------------
* **Backwards incompatible change:** Require explicit ``DUPEFILTER_CLASS``
setting.
* Added ``SCHEDULER_FLUSH_ON_START`` setting.
* Added ``REDIS_START_URLS_AS_SET`` setting.
* Added ``REDIS_ITEMS_KEY`` setting.
* Added ``REDIS_ITEMS_SERIALIZER`` setting.
* Added ``REDIS_PARAMS`` setting.
* Added ``REDIS_START_URLS_BATCH_SIZE`` spider attribute to read start urls
in batches.
* Added ``RedisCrawlSpider``.
0.6.0 (2015-07-05)
------------------
* Updated code to be compatible with Scrapy 1.0.
* Added `-a domain=...` option for example spiders.
0.5.0 (2013-09-02)
------------------
* Added `REDIS_URL` setting to support Redis connection string.
* Added `SCHEDULER_IDLE_BEFORE_CLOSE` setting to prevent the spider from closing too
  quickly when the queue is empty. Default value is zero, keeping the previous
  behavior.
* Preemptively schedule requests when an item is scraped.
* This version is the latest release compatible with Scrapy 0.24.x.
0.4.0 (2013-04-19)
------------------
* Added `RedisSpider` and `RedisMixin` classes as building blocks for spiders
to be fed through a redis queue.
* Added redis queue stats.
* Let the encoder handle the item as it comes instead of converting it to a dict.
0.3.0 (2013-02-18)
------------------
* Added support for different queue classes.
* Changed requests serialization from `marshal` to `cPickle`.
0.2.0 (2013-02-17)
------------------
* Improved backward compatibility.
* Added example project.
0.1.0 (2011-09-01)
------------------
* First release on PyPI.