# Gerapy Redis
This is a package that adds distributed crawling support to Scrapy using Redis; it is also
a module of [Gerapy](https://github.com/Gerapy/Gerapy).
This package is largely derived from [https://github.com/rmax/scrapy-redis](https://github.com/rmax/scrapy-redis).
## Change
Removed `RedisSpider` and moved its logic into the Scheduler. The Scheduler pre-enqueues
all start requests into the Redis queue instead of adding one start request at a time
when the crawler is idle.
Setting: `SCHEDULER_PRE_ENQUEUE_ALL_START_REQUESTS`, defaults to `True`.
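From the spider side nothing Redis-specific is required. A minimal sketch (the spider name and URLs below are hypothetical): with `gerapy_redis.scheduler.Scheduler` enabled, all requests yielded from `start_requests()` are pushed into the shared Redis queue at startup, so any number of workers pointed at the same Redis instance can consume them.
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider; it is a plain Scrapy spider with no Redis-specific code.
    # With the gerapy_redis Scheduler enabled, every start request is pre-enqueued
    # into the Redis queue instead of being fed one by one when the crawler is idle.
    name = 'example'
    start_urls = [
        'https://example.com/page/1',
        'https://example.com/page/2',
    ]

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```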
## Installation
```shell
pip3 install gerapy-redis
```
## Usage
```python
# Enable scheduling and store the requests queue in Redis.
SCHEDULER = "gerapy_redis.scheduler.Scheduler"
# Ensure all spiders share the same duplicates filter through Redis.
DUPEFILTER_CLASS = "gerapy_redis.dupefilter.RFPDupeFilter"
# The default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return string keys and support
# bytes as values. For this reason, the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "gerapy_redis.picklecompat"
# Don't clean up Redis queues; this allows pausing/resuming crawls.
#SCHEDULER_PERSIST = True
# Pre-enqueue all start requests to the queue (default: True).
#SCHEDULER_PRE_ENQUEUE_ALL_START_REQUESTS = True
# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.PriorityQueue'
# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.LifoQueue'
# Max idle time to prevent the spider from being closed during distributed crawling.
# This only works if the queue class is SpiderQueue or SpiderStack,
# and it may also block for the same amount of time when the spider starts for the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10
# Store scraped items in Redis for post-processing.
ITEM_PIPELINES = {
    'gerapy_redis.pipelines.RedisPipeline': 300,
}
# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'
# The items serializer is ScrapyJSONEncoder by default. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'
# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379
# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'
# Custom Redis client parameters (e.g. socket timeout, etc.)
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'
# If True, it uses redis' ``SPOP`` operation. You have to use the ``SADD``
# command to add URLs to the redis queue. This could be useful if you
# want to avoid duplicates in your start urls list and the order of
# processing does not matter.
#REDIS_START_URLS_AS_SET = False
# If True, it uses the redis ``zrevrange`` and ``zremrangebyrank`` operations. You have to use the ``zadd``
# command to add URLs and scores to the redis queue. This could be useful if you
# want to use priority and avoid duplicates in your start urls list.
#REDIS_START_URLS_AS_ZSET = False
# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'
# Use an encoding other than utf-8 for Redis.
#REDIS_ENCODING = 'latin1'
```
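If `RedisPipeline` is enabled, scraped items are written to the Redis key configured by `REDIS_ITEMS_KEY`. Below is a minimal sketch of reading them back for post-processing, assuming the list-based storage and JSON serialization that scrapy-redis uses by default; the key name, spider name, and connection details are example values.
```python
import json

import redis

# Connect with the same parameters as the crawler (adjust host/port as needed).
client = redis.Redis(host='localhost', port=6379)

# The default key pattern is '%(spider)s:items'; 'example' is a hypothetical spider name.
items_key = 'example:items'

# The pipeline appends serialized items to a Redis list; read and decode them.
for raw in client.lrange(items_key, 0, -1):
    item = json.loads(raw)
    print(item)
```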
For more information, please refer to [https://github.com/rmax/scrapy-redis](https://github.com/rmax/scrapy-redis).