gerapy-redis


Name: gerapy-redis
Version: 0.1.1
Home page: https://github.com/Gerapy/GerapyRedis
Summary: Distribution Support for Scrapy & Gerapy using Redis
Upload time: 2021-03-16 16:36:33
Docs URL: None
Author: Germey
Requires Python: >=3.5.0
License: MIT
Requirements: No requirements were recorded.

# Gerapy Redis

This package adds Redis-based distribution support to Scrapy; it is also used
as a module in [Gerapy](https://github.com/Gerapy/Gerapy).

It is largely derived from [https://github.com/rmax/scrapy-redis](https://github.com/rmax/scrapy-redis).

## Change

Removed `RedisSpider` and moved its logic into the Scheduler, which now
pre-enqueues all start requests to the Redis queue instead of adding one
start request each time the crawler goes idle.

Setting: `SCHEDULER_PRE_ENQUEUE_ALL_START_REQUESTS`, defaults to `True`.
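
With `RedisSpider` gone, a plain `scrapy.Spider` is all you need; the scheduler
takes care of pushing its start requests into Redis. A minimal sketch (the
spider name and URLs are illustrative, not part of this package):

```python
import scrapy


# A plain Scrapy spider: with SCHEDULER_PRE_ENQUEUE_ALL_START_REQUESTS left at
# its default (True), the gerapy_redis scheduler pushes every start request
# into the shared Redis queue up front, so any number of workers can drain it.
class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = [
        'https://example.com/page/1',
        'https://example.com/page/2',
    ]

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```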

## Installation

```shell
pip3 install gerapy-redis
```

## Usage

```python
# Enable scheduling and store the requests queue in Redis.
SCHEDULER = "gerapy_redis.scheduler.Scheduler"

# Ensure all spiders share the same duplicates filter via Redis.
DUPEFILTER_CLASS = "gerapy_redis.dupefilter.RFPDupeFilter"

# The default requests serializer is pickle, but it can be changed to any
# module with loads and dumps functions. Note that pickle is not compatible
# between Python versions.
# Caveat: in Python 3.x, the serializer must return string keys and support
# bytes as values. For this reason the json and msgpack modules will not
# work by default. In Python 2.x there is no such issue, so you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "gerapy_redis.picklecompat"

# Don't clean up Redis queues; this allows pausing/resuming crawls.
#SCHEDULER_PERSIST = True

# Pre-enqueue all start requests to the queue (default: True).
#SCHEDULER_PRE_ENQUEUE_ALL_START_REQUESTS = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.LifoQueue'

# Max idle time (in seconds) before the spider is closed during distributed
# crawling. This only works if the queue class is SpiderQueue or SpiderStack,
# and it may also block for the same amount of time when the spider starts for
# the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped items in Redis for post-processing.
ITEM_PIPELINES = {
    'gerapy_redis.pipelines.RedisPipeline': 300
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (e.g. socket timeout).
#REDIS_PARAMS = {}
# Use a custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``SPOP`` operation. You have to use the ``SADD``
# command to add URLs to the redis queue. This could be useful if you
# want to avoid duplicates in your start URLs list and the order of
# processing does not matter.
#REDIS_START_URLS_AS_SET = False

# If True, it uses redis' ``ZREVRANGE`` and ``ZREMRANGEBYRANK`` operations. You
# have to use the ``ZADD`` command to add URLs and scores to the redis queue.
# This could be useful if you want to use priority and avoid duplicates in
# your start URLs list.
#REDIS_START_URLS_AS_ZSET = False

# Default start URLs key.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use an encoding other than UTF-8 for Redis.
#REDIS_ENCODING = 'latin1'
```
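
As noted in the comments above, `SCHEDULER_SERIALIZER` accepts any importable
module exposing `loads` and `dumps` functions. A minimal sketch of such a
module (an illustration of the interface, not the source of
`gerapy_redis.picklecompat`; the module path is hypothetical):

```python
# myproject/serializer.py -- set SCHEDULER_SERIALIZER = 'myproject.serializer'
import pickle


def loads(s):
    # Deserialize a request payload read from Redis (bytes in, object out).
    return pickle.loads(s)


def dumps(obj):
    # Serialize a request payload for Redis. The highest pickle protocol is
    # compact, but as noted above it is not portable across Python versions.
    return pickle.dumps(obj, protocol=-1)
```

The start-URL settings are inherited from scrapy-redis: to feed a crawl from
Redis, push URLs into the key configured by `REDIS_START_URLS_KEY` (by default
`%(name)s:start_urls`). A sketch using redis-py, assuming a spider named
`example` and a local Redis; which command applies depends on the
`REDIS_START_URLS_AS_SET`/`REDIS_START_URLS_AS_ZSET` settings above:

```python
import redis

# Connection details are assumptions for a local Redis instance.
r = redis.Redis(host='localhost', port=6379)

# Default (list-backed key): push URLs with LPUSH.
r.lpush('example:start_urls', 'https://example.com/page/1')

# With REDIS_START_URLS_AS_SET = True, use SADD (deduplicated, unordered):
# r.sadd('example:start_urls', 'https://example.com/page/1')

# With REDIS_START_URLS_AS_ZSET = True, use ZADD with a score per URL:
# r.zadd('example:start_urls', {'https://example.com/page/1': 10})
```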

For more information, please refer to [https://github.com/rmax/scrapy-redis](https://github.com/rmax/scrapy-redis).


            
