latest-scrapy-redis

:Name: latest-scrapy-redis
:Version: 0.7.3
:Home page: https://github.com/Ehsan-U/scrapy-redis
:Summary: Redis-based components for Scrapy.
:Upload time: 2024-02-27 02:36:16
:Author: Ehsan U.
:License: MIT
:Keywords: scrapy-redis
:Requirements: none recorded

============
Scrapy-Redis
============

.. image:: https://readthedocs.org/projects/scrapy-redis/badge/?version=latest
        :alt: Documentation Status
        :target: https://readthedocs.org/projects/scrapy-redis/?badge=latest

.. image:: https://img.shields.io/pypi/v/scrapy-redis.svg
        :target: https://pypi.python.org/pypi/scrapy-redis

.. image:: https://img.shields.io/pypi/pyversions/scrapy-redis.svg
        :target: https://pypi.python.org/pypi/scrapy-redis

.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/builds.yml/badge.svg
        :target: https://github.com/rmax/scrapy-redis/actions/workflows/builds.yml
        
.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/checks.yml/badge.svg
        :target: https://github.com/rmax/scrapy-redis/actions/workflows/checks.yml
        
.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/tests.yml/badge.svg
        :target: https://github.com/rmax/scrapy-redis/actions/workflows/tests.yml
        
.. image:: https://codecov.io/github/rmax/scrapy-redis/coverage.svg?branch=master
        :alt: Coverage Status
        :target: https://codecov.io/github/rmax/scrapy-redis

.. image:: https://img.shields.io/badge/security-bandit-green.svg
        :alt: Security Status
        :target: https://github.com/rmax/scrapy-redis
    
Redis-based components for Scrapy.

* Usage: https://github.com/rmax/scrapy-redis/wiki/Usage
* Documentation: https://github.com/rmax/scrapy-redis/wiki
* Release notes: https://github.com/rmax/scrapy-redis/wiki/History
* Contributing: https://github.com/rmax/scrapy-redis/wiki/Getting-Started
* License: MIT

Features
--------

* Distributed crawling/scraping

    You can start multiple spider instances that share a single Redis queue,
    which is best suited for broad multi-domain crawls.

* Distributed post-processing

    Scraped items get pushed into a Redis queue, so you can start as many
    post-processing workers as needed, all sharing the same items queue
    (see the consumer sketch after this list).

* Scrapy plug-and-play components

    Scheduler + Duplication Filter, Item Pipeline, Base Spiders
    (see the ``settings.py`` sketch after this list).

* In this forked version: support for JSON payloads in Redis

    A payload contains ``url``, ``meta`` and other optional parameters, where
    ``meta`` is a nested JSON object carrying sub-data. The spider extracts
    this data and sends a ``FormRequest`` built from the ``url``, ``meta`` and
    any additional ``formdata``.

    For example:

    .. code-block:: json

        { "url": "https://example.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }

    This data can then be accessed in the Scrapy spider through the request
    (``response.request``), e.g. ``request.url``, ``request.meta``,
    ``request.cookies``. A sketch of both sides follows this list.
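
Below is a hedged sketch of both sides: a producer pushing a JSON payload into
Redis with ``redis-py``, and a hypothetical ``RedisSpider`` subclass consuming
it. The key name ``myspider:start_urls``, the spider, and the field handling
are illustrative assumptions, not this fork's exact API.

.. code-block:: python

    import json

    import redis
    from scrapy_redis.spiders import RedisSpider


    class MySpider(RedisSpider):
        """Hypothetical spider fed JSON payloads through Redis (sketch)."""

        name = "myspider"
        redis_key = "myspider:start_urls"  # list this spider reads payloads from

        def parse(self, response):
            # Assuming the fork forwards ``meta`` from the JSON payload onto
            # the request, the producer's sub-data is available here.
            yield {
                "url": response.url,
                "job_id": response.meta.get("job-id"),
            }


    if __name__ == "__main__":
        # Producer side: enqueue one JSON payload for the spider to consume.
        payload = {
            "url": "https://example.com",
            "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"},
        }
        redis.Redis().lpush("myspider:start_urls", json.dumps(payload))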
    
.. note:: These features cover the basic case of distributing the workload
   across multiple workers. If you need more, such as URL expiration or
   advanced URL prioritization, take a look at the Frontera_ project.
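
For reference, wiring the plug-and-play components into a project typically
comes down to a few lines in ``settings.py``. This is a minimal sketch using
the standard scrapy-redis setting names; adjust ``REDIS_URL`` to your
deployment:

.. code-block:: python

    # settings.py -- minimal scrapy-redis wiring (sketch)

    # Replace Scrapy's scheduler and dupefilter with the Redis-backed ones so
    # every spider instance shares one request queue and one seen-requests set.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Keep the queue between runs instead of flushing it when a spider closes.
    SCHEDULER_PERSIST = True

    # Push scraped items into Redis for the post-processing workers.
    ITEM_PIPELINES = {
        "scrapy_redis.pipelines.RedisPipeline": 300,
    }

    # Where the shared Redis lives.
    REDIS_URL = "redis://localhost:6379/0"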
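
To consume what ``RedisPipeline`` pushes, post-processing workers can pop
serialized items off the shared items queue. A sketch with ``redis-py``; the
key ``myspider:items`` assumes scrapy-redis's default ``<spider>:items``
pattern, and the processing step is a placeholder:

.. code-block:: python

    import json

    import redis

    r = redis.Redis(decode_responses=True)

    # RedisPipeline serializes each item (JSON by default) into <spider>:items.
    # BLPOP blocks until an item arrives, so any number of workers can share
    # the queue without polling.
    while True:
        _key, raw = r.blpop("myspider:items")
        item = json.loads(raw)
        print("processing", item)  # stand-in for real post-processing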

Requirements
------------

* Python 3.7+
* Redis >= 5.0
* ``Scrapy`` >= 2.0
* ``redis-py`` >= 4.0

Installation
------------

From pip:

.. code-block:: bash

    pip install scrapy-redis

From GitHub:

.. code-block:: bash

    git clone https://github.com/darkrho/scrapy-redis.git
    cd scrapy-redis
    python setup.py install

.. note:: To use the JSON payload feature, make sure you have not installed
   scrapy-redis through pip. If you have, uninstall it first:
  
.. code-block:: bash

    pip uninstall scrapy-redis

Alternative Choice
------------------

Frontera_ is a web crawling framework consisting of a `crawl frontier`_ and distribution/scaling primitives, allowing you to build a large-scale online web crawler.

.. _Frontera: https://github.com/scrapinghub/frontera
.. _crawl frontier: http://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html

=======
History
=======


0.7.3 (2022-07-21)
------------------
* Move docs to GitHub Wiki
* Update tox and support dynamic tests
* Update support for json data
* Refactor max idle time
* Add support for python3.7~python3.10
* Deprecate python2.x support

0.7.2 (2021-12-27)
------------------
* Fix RedisStatsCollector._get_key()
* Fix redis-py dependency version
* Added maximum idle waiting time MAX_IDLE_TIME_BEFORE_CLOSE

0.7.1 (2021-03-27)
------------------
* Fixes datetime parse error for redis-py 3.x.
* Add support for stats extensions.

0.7.1-rc1 (2021-03-27)
----------------------
* Fixes datetime parse error for redis-py 3.x.

0.7.1-b1 (2021-03-22)
---------------------
* Add support for stats extensions.

0.7.0-dev (unreleased)
----------------------
* Unreleased.

0.6.8 (2017-02-14)
------------------
* Fixed automated release due to not matching registered email.

0.6.7 (2016-12-27)
------------------
* Fixes bad formatting in logging message.

0.6.6 (2016-12-20)
------------------
* Fixes wrong message on dupefilter duplicates.

0.6.5 (2016-12-19)
------------------
* Fixed typo in default settings.

0.6.4 (2016-12-18)
------------------
* Fixed data decoding in Python 3.x.
* Added ``REDIS_ENCODING`` setting (default ``utf-8``).
* Default to ``CONCURRENT_REQUESTS`` value for ``REDIS_START_URLS_BATCH_SIZE``.
* Renamed queue classes to a proper naming convention (backwards compatible).

0.6.3 (2016-07-03)
------------------
* Added ``REDIS_START_URLS_KEY`` setting.
* Fixed spider method ``from_crawler`` signature.

0.6.2 (2016-06-26)
------------------
* Support ``redis_cls`` parameter in ``REDIS_PARAMS`` setting.
* Python 3.x compatibility fixed.
* Added ``SCHEDULER_SERIALIZER`` setting.

0.6.1 (2016-06-25)
------------------
* **Backwards incompatible change:** Require explicit ``DUPEFILTER_CLASS``
  setting.
* Added ``SCHEDULER_FLUSH_ON_START`` setting.
* Added ``REDIS_START_URLS_AS_SET`` setting.
* Added ``REDIS_ITEMS_KEY`` setting.
* Added ``REDIS_ITEMS_SERIALIZER`` setting.
* Added ``REDIS_PARAMS`` setting.
* Added ``REDIS_START_URLS_BATCH_SIZE`` spider attribute to read start URLs
  in batches.
* Added ``RedisCrawlSpider``.

0.6.0 (2015-07-05)
------------------
* Updated code to be compatible with Scrapy 1.0.
* Added `-a domain=...` option for example spiders.

0.5.0 (2013-09-02)
------------------
* Added `REDIS_URL` setting to support Redis connection string.
* Added `SCHEDULER_IDLE_BEFORE_CLOSE` setting to prevent the spider from
  closing too quickly when the queue is empty. The default value is zero,
  keeping the previous behavior.
* Preemptively schedule requests when an item is scraped.
* This version is the latest release compatible with Scrapy 0.24.x.

0.4.0 (2013-04-19)
------------------
* Added `RedisSpider` and `RedisMixin` classes as building blocks for spiders
  to be fed through a redis queue.
* Added redis queue stats.
* Let the encoder handle the item as it comes instead of converting it to a dict.

0.3.0 (2013-02-18)
------------------
* Added support for different queue classes.
* Changed requests serialization from `marshal` to `cPickle`.

0.2.0 (2013-02-17)
------------------
* Improved backward compatibility.
* Added example project.

0.1.0 (2011-09-01)
------------------
* First release on PyPI.

            
