bhfutils
========

:Name: bhfutils
:Version: 0.1.18
:Home page: https://behoof.app/
:Summary: Utilities that are used by any spider of the Behoof project
:Author: Teplygin Vladimir <vvteplygin@gmail.com>
:License: MIT
:Keywords: behoof, scrapy-cluster, utilities
:Upload time: 2024-03-06 18:40:30

******************************
Behoof Scrapy Cluster Template
******************************

Overview
--------

The ``bhfutils`` package is a collection of utilities used by the spiders of the Behoof project.

Requirements
------------

- Unix-based machine (Linux or OS X)
- Python 2.7 or 3.6

Installation
------------

Inside a virtualenv, run ``pip install -U bhfutils``. This installs the latest version of the Behoof Scrapy Cluster spider utilities. After that, you can use a special ``settings.py`` compatible with Scrapy Cluster (a template is provided in ``crawler/setting_template.py``), as sketched below.
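
A minimal sketch of what that settings module might contain, assuming Scrapy Cluster's standard setting names (the dotted paths are guesses; ``crawler/setting_template.py`` in the installed package is authoritative)::

    # settings.py -- illustrative only; copy crawler/setting_template.py
    # from the installed package for the real values.
    SCHEDULER = "bhfutils.crawler.distributed_scheduler.DistributedScheduler"  # assumed path
    DUPEFILTER_CLASS = "bhfutils.crawler.redis_dupefilter.RFPDupeFilter"       # assumed path

    # Redis connection shared by the scheduler and the filters.
    REDIS_HOST = "localhost"
    REDIS_PORT = 6379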

Documentation
-------------

Full documentation for the ``bhfutils`` package does not yet exist; the module summaries below are the current reference.

custom_cookies.py
==================

The ``custom_cookies`` module provides a custom cookies middleware that passes the required cookies along without persisting them between calls.
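
Wiring such a middleware into Scrapy follows the usual pattern; the dotted path below is an assumption, not the package's confirmed location::

    DOWNLOADER_MIDDLEWARES = {
        # Disable Scrapy's built-in cookie persistence ...
        "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": None,
        # ... and let the non-persisting middleware take its default slot.
        "bhfutils.crawler.custom_cookies.CustomCookiesMiddleware": 700,  # assumed path
    }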

distributed_scheduler.py
========================

The ``distributed_scheduler`` module is a Scrapy request scheduler that uses Redis throttled priority queues to moderate scrape requests across domains within a distributed Scrapy cluster.
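
The throttled priority queue idea can be sketched with a Redis sorted set whose score is the request priority; this is an illustration of the mechanism, not the module's actual API::

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def push(queue_key, serialized_request, priority):
        # ZADD keeps members ordered by score; higher score pops first here.
        r.zadd(queue_key, {serialized_request: priority})

    def pop(queue_key):
        # Atomically remove and return the highest-priority request, if any.
        items = r.zpopmax(queue_key)
        return items[0][0] if items else None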

redis_domain_max_page_filter.py
===============================

The ``redis_domain_max_page_filter`` module is a Redis-based max-page filter applied per domain: it bounds the number of pages crawled for any particular domain.
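
The per-domain bound is plausibly a Redis counter; a toy version (not the module's real interface) might look like::

    def allow_page(r, domain, max_pages):
        # INCR is atomic, so concurrent spiders cannot double-count a page.
        count = r.incr("page_count:%s" % domain)
        return count <= max_pages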

redis_dupefilter.py
===================

The ``redis_dupefilter`` module is a Redis-based request duplication filter.
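
Conceptually this pairs Scrapy request fingerprints with a Redis set; a minimal illustration (the real module implements Scrapy's dupefilter interface, and the key name here is made up) is::

    from scrapy.utils.request import request_fingerprint

    def seen_before(r, request, key="dupefilter"):
        fp = request_fingerprint(request)
        # SADD returns 0 when the member was already present -> duplicate.
        return r.sadd(key, fp) == 0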

redis_global_page_per_domain_filter.py
======================================

The ``redis_global_page_per_domain_filter`` module is a Redis-based request-count filter. When it is enabled, every crawl job treats ``GLOBAL_PAGE_PER_DOMAIN_LIMIT`` as a hard limit on the number of pages it may crawl for each individual spiderid+domain+crawlid combination.
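
Since ``GLOBAL_PAGE_PER_DOMAIN_LIMIT`` is referenced as a setting, enabling the filter presumably amounts to something like the following in ``settings.py`` (check ``crawler/setting_template.py`` for the authoritative spelling and default)::

    # Hard cap on pages per spiderid+domain+crawlid combination (value is
    # an example, not a recommended default).
    GLOBAL_PAGE_PER_DOMAIN_LIMIT = 1000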

            
