Name | scrapy-mongo |
Version | 1.1.0 |
Summary | MongoDB plugins for Scrapy |
Author | Fabien Vauchelles |
Maintainer | None |
Home page | None |
License | None |
Requires Python | >=3.8 |
Keywords | None |
Requirements | No requirements were recorded. |
Travis-CI | No Travis. |
Coveralls | No coveralls. |
Upload time | 2025-02-14 08:38:11 |
# MongoDB plugins for Scrapy
## Installation
```shell
pip install scrapy-mongo
```
## Pipeline
This pipeline stores scraped items into a MongoDB collection.
Each item must have a unique `id` field to avoid duplicates.
This field is automatically mapped to MongoDB’s `_id` field.
Each item must include a `collection` field that specifies the name of the target MongoDB collection.
Items are upserted in batches of `100` by default.
The batch size can be adjusted using the `PIPELINE_MONGO_BATCH_SIZE` setting.
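For illustration, an item yielded by a spider might look like the snippet below; the `id` and `collection` fields are required as described above, while the remaining field names are hypothetical.

```python
# In a spider callback: yield a dict carrying the fields the pipeline expects.
# `id` is mapped to MongoDB's `_id`; `collection` names the target collection.
def parse(self, response):
    yield {
        "id": response.url,                        # unique identifier (becomes `_id`)
        "collection": "products",                  # target collection (hypothetical name)
        "title": response.css("h1::text").get(),   # hypothetical scraped field
    }
```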
To enable the pipeline, include the following lines in `settings.py`:
```python
ITEM_PIPELINES = {
    'scrapy_mongo.MongoPipeline': 300,
}
PIPELINE_MONGO_URL = "mongodb://localhost:27017"
PIPELINE_MONGO_DATABASE = "mydatabase"
```
**Note:** Update `PIPELINE_MONGO_URL` and `PIPELINE_MONGO_DATABASE`
with the appropriate values for the specific environment.
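For context, batched upserts of such items with pymongo could look roughly like the sketch below; this is an illustrative assumption, not the package's actual implementation.

```python
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
db = client["scraping"]  # hypothetical database name

def flush_batch(items):
    """Upsert one batch of scraped items, grouped by target collection."""
    ops_by_collection = {}
    for item in items:
        doc = dict(item)
        doc["_id"] = doc.pop("id")        # map `id` onto MongoDB's `_id`
        name = doc.pop("collection")      # pick the target collection
        ops_by_collection.setdefault(name, []).append(
            UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
        )
    for name, ops in ops_by_collection.items():
        db[name].bulk_write(ops, ordered=False)
```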
## Cache
The cache component stores scraped responses in a MongoDB collection to avoid downloading the same pages multiple times.
It uses [Scrapy's request fingerprinting](https://docs.scrapy.org/en/2.10/topics/request-response.html#request-fingerprints)
mechanism to identify responses.
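For reference, a request fingerprint can be computed directly with Scrapy's utilities (assuming Scrapy ≥ 2.7, where `scrapy.utils.request.fingerprint` is available):

```python
from scrapy import Request
from scrapy.utils.request import fingerprint

# The fingerprint is a stable hash of the request (method, URL, body, ...),
# so identical requests map to the same cache entry.
fp = fingerprint(Request("https://example.com/page"))
print(fp.hex())
```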
To enable caching, include the following lines in `settings.py`:
```python
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy_mongo.MongoCacheStorage'
HTTPCACHE_MONGO_URL = "mongodb://localhost:27017"
HTTPCACHE_MONGO_DATABASE = "scraping"
HTTPCACHE_EXPIRATION_SECS = 604800 # Default is 1 week
```
**Note:** Update `HTTPCACHE_MONGO_URL` and `HTTPCACHE_MONGO_DATABASE`
with the appropriate values for the specific environment.
The default expiration time is set to **1 week** (604800 seconds).
This value can be modified via `HTTPCACHE_EXPIRATION_SECS`.
## Cache policy
An advanced cache policy mechanism with whitelist support is available.
It lets you define which HTTP response codes are cached,
using an explicit **list**, a **regular expression**, or both.
To enable the cache policy, add the following lines to `settings.py`:
```python
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy_mongo.CacheOnlyPolicy'
HTTPCACHE_ACCEPT_HTTP_CODES = [302]
HTTPCACHE_ACCEPT_HTTP_CODES_REGEX = r'2\d\d'
```
This configuration accepts all `2xx` HTTP status codes as well as `302` redirects.
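As an illustration of the intended logic (an assumption, not necessarily the package's exact implementation), a status code is accepted for caching if it appears in the explicit list or matches the regular expression:

```python
import re

ACCEPT_HTTP_CODES = [302]            # mirrors HTTPCACHE_ACCEPT_HTTP_CODES
ACCEPT_HTTP_CODES_REGEX = r'2\d\d'   # mirrors HTTPCACHE_ACCEPT_HTTP_CODES_REGEX

def should_cache(status: int) -> bool:
    """Return True if the response status code is whitelisted for caching."""
    if status in ACCEPT_HTTP_CODES:
        return True
    return re.fullmatch(ACCEPT_HTTP_CODES_REGEX, str(status)) is not None

assert should_cache(200)       # any 2xx matches the regex
assert should_cache(302)       # explicitly whitelisted
assert not should_cache(404)   # neither listed nor matching
```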
## Error
The error component stores error logs in a MongoDB collection.
It catches errors raised in both the downloader middleware chain and the spider middleware chain.
To enable error logging, include the following lines in `settings.py`:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_mongo.TraceErrorDownloaderMiddleware': 1000,
}

SPIDER_MIDDLEWARES = {
    'scrapy_mongo.TraceErrorSpiderMiddleware': 1000,
}

ERROR_MONGO_URL = "mongodb://localhost:27017"
ERROR_MONGO_DATABASE = 'scraping'
ERROR_MONGO_COLLECTION = 'errors'
```
**Note:** Update `ERROR_MONGO_URL`, `ERROR_MONGO_DATABASE` and `ERROR_MONGO_COLLECTION`
with the appropriate values for the specific environment.
The same MongoDB connection can be shared by the pipeline, cache, and error components
by replacing `PIPELINE_MONGO_URL`, `HTTPCACHE_MONGO_URL` and `ERROR_MONGO_URL` with a single `MONGO_URL` setting, as shown below.
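For example, a consolidated `settings.py` might look like this (database and collection names are placeholders):

```python
# One connection URL shared by the pipeline, cache, and error components.
MONGO_URL = "mongodb://localhost:27017"

PIPELINE_MONGO_DATABASE = "scraping"
HTTPCACHE_MONGO_DATABASE = "scraping"
ERROR_MONGO_DATABASE = "scraping"
ERROR_MONGO_COLLECTION = "errors"
```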
## Build for publish
Install dependencies:
```shell
pip install build twine
```
Build the package:
```shell
python -m build --outdir dist
```
Then publish to PyPI:
```shell
python -m twine upload dist/*
```