Name | scrapy-mongo |
Version | 1.1.0 |
Summary | MongoDB plugins for Scrapy |
Author | Fabien Vauchelles |
Maintainer | None |
Home page | None |
License | None |
Requires Python | >=3.8 |
Keywords | None |
Requirements | No requirements were recorded. |
Travis-CI | No Travis. |
Coveralls | No coveralls. |
Upload time | 2025-02-14 08:38:11 |
# MongoDB plugins for Scrapy
## Installation
```shell
pip install scrapy-mongo
```
## Pipeline
This pipeline stores scraped items into a MongoDB collection.
Each item must have a unique `id` field to avoid duplicates.
This field is automatically mapped to MongoDB’s `_id` field.
Each item must include a `collection` field that specifies the name of the target MongoDB collection.
Items are upserted in batches of `100` by default.
The batch size can be adjusted using the `PIPELINE_MONGO_BATCH_SIZE` setting.
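For illustration, an item yielded by a spider might look like the snippet below; the `id` and `collection` fields are required as described above, while the remaining field names are hypothetical.

```python
# In a spider callback: yield a dict carrying the fields the pipeline expects.
# `id` is mapped to MongoDB's `_id`; `collection` names the target collection.
def parse(self, response):
    yield {
        "id": response.url,                        # unique identifier (becomes `_id`)
        "collection": "products",                  # target collection (hypothetical name)
        "title": response.css("h1::text").get(),   # hypothetical scraped field
    }
```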
To enable the pipeline, include the following lines in `settings.py`:
```python
ITEM_PIPELINES = {
    'scrapy_mongo.MongoPipeline': 300,
}
PIPELINE_MONGO_URL = "mongodb://localhost:27017"
PIPELINE_MONGO_DATABASE = "mydatabase"
```
**Note:** Update `PIPELINE_MONGO_URL` and `PIPELINE_MONGO_DATABASE`
with the appropriate values for the specific environment.
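For context, batched upserts of such items with pymongo could look roughly like the sketch below; this is an illustrative assumption, not the package's actual implementation.

```python
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
db = client["scraping"]  # hypothetical database name

def flush_batch(items):
    """Upsert one batch of scraped items, grouped by target collection."""
    ops_by_collection = {}
    for item in items:
        doc = dict(item)
        doc["_id"] = doc.pop("id")        # map `id` onto MongoDB's `_id`
        name = doc.pop("collection")      # pick the target collection
        ops_by_collection.setdefault(name, []).append(
            UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
        )
    for name, ops in ops_by_collection.items():
        db[name].bulk_write(ops, ordered=False)
```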
## Cache
The cache component stores scraped responses in a MongoDB collection to avoid downloading the same pages multiple times.
It uses [Scrapy's request fingerprinting](https://docs.scrapy.org/en/2.10/topics/request-response.html#request-fingerprints)
mechanism to identify responses.
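For reference, a request fingerprint can be computed directly with Scrapy's utilities (assuming Scrapy ≥ 2.7, where `scrapy.utils.request.fingerprint` is available):

```python
from scrapy import Request
from scrapy.utils.request import fingerprint

# The fingerprint is a stable hash of the request (method, URL, body, ...),
# so identical requests map to the same cache entry.
fp = fingerprint(Request("https://example.com/page"))
print(fp.hex())
```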
To enable caching, include the following lines in `settings.py`:
```python
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy_mongo.MongoCacheStorage'
HTTPCACHE_MONGO_URL = "mongodb://localhost:27017"
HTTPCACHE_MONGO_DATABASE = "scraping"
HTTPCACHE_EXPIRATION_SECS = 604800 # Default is 1 week
```
**Note:** Update `HTTPCACHE_MONGO_URL` and `HTTPCACHE_MONGO_DATABASE`
with the appropriate values for the specific environment.
The default expiration time is set to **1 week** (604800 seconds).
This value can be modified via `HTTPCACHE_EXPIRATION_SECS`.
## Cache policy
An advanced cache policy mechanism with whitelist support is available.
It lets you define which HTTP response codes are cached,
using an explicit **list**, a **regular expression**, or both.
To enable the cache policy, add the following lines to `settings.py`:
```python
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy_mongo.CacheOnlyPolicy'
HTTPCACHE_ACCEPT_HTTP_CODES = [302]
HTTPCACHE_ACCEPT_HTTP_CODES_REGEX = r'2\d\d'
```
This configuration accepts all `2xx` HTTP status codes as well as `302` redirects.
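As an illustration of the intended logic (an assumption, not necessarily the package's exact implementation), a status code is accepted for caching if it appears in the explicit list or matches the regular expression:

```python
import re

ACCEPT_HTTP_CODES = [302]            # mirrors HTTPCACHE_ACCEPT_HTTP_CODES
ACCEPT_HTTP_CODES_REGEX = r'2\d\d'   # mirrors HTTPCACHE_ACCEPT_HTTP_CODES_REGEX

def should_cache(status: int) -> bool:
    """Return True if the response status code is whitelisted for caching."""
    if status in ACCEPT_HTTP_CODES:
        return True
    return re.fullmatch(ACCEPT_HTTP_CODES_REGEX, str(status)) is not None

assert should_cache(200)       # any 2xx matches the regex
assert should_cache(302)       # explicitly whitelisted
assert not should_cache(404)   # neither listed nor matching
```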
## Error
The error component stores error logs in a MongoDB collection.
It catches errors raised in both the downloader middleware chain and the spider middleware chain.
To enable error logging, include the following lines in `settings.py`:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_mongo.TraceErrorDownloaderMiddleware': 1000,
}

SPIDER_MIDDLEWARES = {
    'scrapy_mongo.TraceErrorSpiderMiddleware': 1000,
}

ERROR_MONGO_URL = "mongodb://localhost:27017"
ERROR_MONGO_DATABASE = 'scraping'
ERROR_MONGO_COLLECTION = 'errors'
```
**Note:** Update `ERROR_MONGO_URL`, `ERROR_MONGO_DATABASE` and `ERROR_MONGO_COLLECTION`
with the appropriate values for the specific environment.
The same MongoDB connection can be shared by the pipeline, cache, and error components
by replacing `PIPELINE_MONGO_URL`, `HTTPCACHE_MONGO_URL` and `ERROR_MONGO_URL` with a single `MONGO_URL` setting, as shown below.
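For example, a consolidated `settings.py` might look like this (database and collection names are placeholders):

```python
# One connection URL shared by the pipeline, cache, and error components.
MONGO_URL = "mongodb://localhost:27017"

PIPELINE_MONGO_DATABASE = "scraping"
HTTPCACHE_MONGO_DATABASE = "scraping"
ERROR_MONGO_DATABASE = "scraping"
ERROR_MONGO_COLLECTION = "errors"
```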
## Build for publish
Install dependencies:
```shell
pip install build twine
```
Build the package:
```shell
python -m build --outdir dist
```
Then publish to PyPI:
```shell
python -m twine upload dist/*
```