scrachy

Name: scrachy
Version: 0.3.1
Home page: https://bitbucket.org/reidswanson/scrachy
Summary: Enhanced caching modules for scrapy.
Upload time: 2020-09-19 22:21:49
Author/Maintainer: Reid Swanson
License: LGPL-v3
Requirements: No requirements were recorded.
# Scrachy
Scrachy is a flexible cache storage backend for [Scrapy](https://scrapy.org/) that stores its data in a relational database using [SQLAlchemy](https://www.sqlalchemy.org/).
It also comes with a downloader middleware that will optionally ignore requests that are already in the cache.

# Install
You can install the latest version from git:

```
pip install git+https://bitbucket.org/reidswanson/scrachy.git
``` 

or from PyPI:

```
pip install scrachy
```

# Documentation
## Storage Backend
To use the storage backend, you minimally need to enable caching by adding the following to your `settings.py` file:
```python
# Enable caching
HTTPCACHE_ENABLED = True

# Set the storage backend to the one provided by Scrachy.
HTTPCACHE_STORAGE = 'scrachy.middleware.httpcache.AlchemyCacheStorage'

# You do not need any other settings for the engine to work, but by default
# it will use an in-memory sqlite database, which is probably not what you
# want. To use another database, configure the following settings.

# Set the type of database to use. It must be one of the supported dialects
# provided by SQLAlchemy.
SCRACHY_DIALECT = <database-dialect>

# Set the driver to use. This must also be recognized by SQLAlchemy, but can
# be None (the default) to use the SQLAlchemy default driver.
SCRACHY_DRIVER = <database-driver>

# Set the hostname of the database server. This should be None for sqlite 
# databases. If it is None for other databases (the default) it is assumed to be
# localhost.
SCRACHY_HOST = <database-hostname>

# Set the port the database server is listening on. If set, this must be an
# integer. If it is None (the default), the default database port is used.
SCRACHY_PORT = <database-port>

# The schema where the database will be created. This is only valid for
# databases that support schemas (e.g., postgresql).
SCRACHY_SCHEMA = <database-schema>

# Set the name of the database. For sqlite this should be the path to the
# database file, which will be created if it does not exist. For other
# databases this should be the name of the database. Note, the database must
# exist prior to crawling, but all necessary tables will be created on the
# first run. This can only be None for in-memory sqlite databases.
SCRACHY_DATABASE = <database-name>

# A locally accessible file that stores the username and password used to
# connect to the database. Each line contains an '=' separated key/value pair
# where the only valid keys are `username` and `password`. Both are optional,
# but you will not be able to connect if they do not match the security
# settings of your database.
SCRACHY_CREDENTIALS = <credentials-file>

# There is some contention between this storage engine and the
# HttpCompressionMiddleware. I believe this is because of the way encoding
# inference is done in HtmlResponses. The header for (at least some) pages
# that have been compressed specify the encoding as the compression type.
# The first time the page is downloaded the response is compressed and the
# compression middleware recognizes it as such and successfully deflates the
# text. However, the body then gets cached in the database as plain text.
# So, when the page is retrieved from the cache the encoding inference still
# says that it is compressed, but when the compression middleware tries to
# deflate the content, it raises an exception because it is not actually
# compressed. I may try to solve this in a more robust way in the future, but
# for now the easiest solution is to change the priority of the 
# HttpCompressionMiddleware so that it comes after the HttpCacheMiddleware.
# By default the HttpCacheMiddleware has a priority of 900 (higher means lower
# priority).
DOWNLOADER_MIDDLEWARES = {
    # Other middlewares excluded but would go here.

    # It's not exactly clear to me where this best belongs, but for now it
    # must come after the HttpCacheMiddleware.
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 1000,
}
```
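For concreteness, here is a hypothetical `settings.py` fragment for a PostgreSQL backend; all host, schema, database, and file names below are illustrative, not defaults shipped with Scrachy:

```python
# A minimal sketch of a PostgreSQL configuration (illustrative values only).
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrachy.middleware.httpcache.AlchemyCacheStorage'

SCRACHY_DIALECT = 'postgresql'
SCRACHY_DRIVER = 'psycopg2'
SCRACHY_HOST = 'db.example.com'
SCRACHY_PORT = 5432
SCRACHY_SCHEMA = 'scrachy'
SCRACHY_DATABASE = 'scrapy_cache'

# Points at a file of '='-separated key/value pairs, e.g.:
#   username=scrapy_user
#   password=s3cret
SCRACHY_CREDENTIALS = '/etc/scrachy/credentials'
```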
The storage backend also has several other features that are useful for specialized crawling and scraping tasks.
For other configuration options and features please see the full documentation on [Read the Docs](https://scrachy.readthedocs.io/en/latest).

## Ignoring Cached Items
This project also comes with a middleware to ignore any requests that are already cached.
For example, sometimes you want to scrape a domain periodically looking for new pages.
In this scenario it often doesn't make sense, and wastes resources, to reparse the pages you have already cached (unless you have changed your parsing rules of course).
Enabling this middleware will allow you to specify which pages from your cache you would like to ignore.

To activate this middleware simply add the class ``IgnoreCachedResponse`` to the set of ``DOWNLOADER_MIDDLEWARES`` as follows:

```python
DOWNLOADER_MIDDLEWARES = {
    # You probably want this early in the pipeline, because there's no point
    # running the other middlewares if the response is in the cache and we are
    # going to ignore it anyway.
    'scrachy.middleware.ignorecached.IgnoreCachedResponse': 50,

    # Other middlewares excluded but would go here.
}
``` 

You can prevent a request from being ignored in three ways:

  1. Any request that is excluded from caching by setting the request meta variable ``dont_cache`` to ``True`` is never ignored.
  2. A request can be excluded by setting the new request meta variable ``dont_ignore`` to ``True``.
  3. A request is excluded if its URL matches a regular expression in the settings variable ``SCRACHY_IGNORE_EXCLUDE_URL_PATTERNS``, which takes a list of strings that will be compiled into regular expressions.
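As an illustration of the opt-out logic described above (a sketch only, not the library's actual implementation; the patterns and function name are hypothetical):

```python
import re

# Hypothetical patterns; in practice these would live in settings.py as
# SCRACHY_IGNORE_EXCLUDE_URL_PATTERNS.
SCRACHY_IGNORE_EXCLUDE_URL_PATTERNS = [r'/new-articles/', r'\?page=\d+$']
_exclude = [re.compile(p) for p in SCRACHY_IGNORE_EXCLUDE_URL_PATTERNS]

def should_ignore(url: str, meta: dict) -> bool:
    """Return True if a cached response for this request should be skipped."""
    if meta.get('dont_cache'):    # method 1: never cached, so never ignored
        return False
    if meta.get('dont_ignore'):   # method 2: explicit per-request opt-out
        return False
    if any(p.search(url) for p in _exclude):  # method 3: URL pattern opt-out
        return False
    return True

should_ignore('https://example.com/new-articles/1', {})  # False (pattern match)
should_ignore('https://example.com/old/1', {})           # True (ignored)
```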

# License
Scrachy is released under the GNU Lesser General Public License.
See the [LICENSE](LICENSE.md) file for more details.
Files that are adapted or use code from other sources are indicated either at the top of the file or at the location of the code snippet.
Some of these files were adapted from code released under a 3-clause BSD license.
Those files should indicate the original copyright in a comment at the top of the file.
See the [BSD_LICENSE](BSD_LICENSE.md) file for details of this license.



            
