******************************
Behoof Scrapy Cluster Template
******************************
Overview
--------
The ``bhfutils`` package is a collection of utilities used by the spiders of the Behoof project.
Requirements
------------
- Unix-based machine (Linux or OS X)
- Python 2.7 or 3.6
Installation
------------
Inside a virtualenv, run ``pip install -U bhfutils``. This installs the latest version of the Behoof Scrapy Cluster spider utilities. After that, you can use a special ``settings.py`` compatible with Scrapy Cluster (a template is provided in ``crawler/setting_template.py``).
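
A minimal sketch of how a project ``settings.py`` might reuse that template; the import path ``bhfutils.crawler.setting_template`` is an assumption based on the template location above, so check the installed package for the real module name::

    # settings.py -- sketch only; the import path below is assumed from
    # the template living at crawler/setting_template.py
    from bhfutils.crawler.setting_template import *  # noqa: F401,F403

    # Override only what differs for this particular spider project
    BOT_NAME = 'my_behoof_spider'
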
Documentation
-------------
Full documentation for the ``bhfutils`` package does not exist yet; the module summaries below describe its main components.
custom_cookies.py
==================
The ``custom_cookies`` module is a custom cookies middleware that passes the required cookies along with each request but does not persist them between calls.
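
For illustration, a middleware of this kind is usually wired into ``DOWNLOADER_MIDDLEWARES`` in place of Scrapy's built-in cookies middleware; the ``bhfutils`` class path below is an assumption, so take the real value from ``crawler/setting_template.py``::

    # settings.py -- sketch; the CustomCookiesMiddleware path is assumed
    DOWNLOADER_MIDDLEWARES = {
        # disable Scrapy's default cookie handling
        'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
        # enable the cluster-aware replacement shipped with bhfutils
        'bhfutils.crawler.custom_cookies.CustomCookiesMiddleware': 700,
    }
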
distributed_scheduler.py
========================
The ``distributed_scheduler`` module is a Scrapy request scheduler that uses Redis throttled priority queues to moderate scrape requests across domains within a distributed Scrapy cluster.
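
A scheduler like this is normally selected through Scrapy's ``SCHEDULER`` setting together with the Redis connection details; the class path and setting names below are assumptions for illustration, with the authoritative values in ``crawler/setting_template.py``::

    # settings.py -- sketch; class path and Redis settings are assumed
    SCHEDULER = 'bhfutils.crawler.distributed_scheduler.DistributedScheduler'
    REDIS_HOST = 'localhost'  # Redis instance shared by the cluster
    REDIS_PORT = 6379
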
redis_domain_max_page_filter.py
===============================
The ``redis_domain_max_page_filter`` module is a Redis-based max-page filter applied per domain; it bounds the maximum number of pages crawled for any particular domain.
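
As a purely hypothetical sketch of such a bound (the setting name below is invented for illustration and may not match the one actually used by ``bhfutils``)::

    # settings.py -- sketch; DOMAIN_MAX_PAGES is a hypothetical name for
    # the per-domain bound described above
    DOMAIN_MAX_PAGES = 500  # stop scheduling a domain after ~500 pages
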
redis_dupefilter.py
===================
The ``redis_dupefilter`` module is a Redis-based request deduplication filter.
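
In a plain Scrapy project, a Redis-backed dupefilter is enabled via the standard ``DUPEFILTER_CLASS`` setting; the class path here is an assumption about where ``bhfutils`` exposes it::

    # settings.py -- sketch; the dupefilter class path is assumed
    DUPEFILTER_CLASS = 'bhfutils.crawler.redis_dupefilter.RFPDupeFilter'
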
redis_global_page_per_domain_filter.py
======================================
The ``redis_global_page_per_domain_filter`` module is a Redis-based request-count filter. When this filter is enabled, every crawl job has ``GLOBAL_PAGE_PER_DOMAIN_LIMIT`` as a hard limit on the number of pages it may crawl for each individual spiderid+domain+crawlid combination.
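
``GLOBAL_PAGE_PER_DOMAIN_LIMIT`` is named in the description above; the value and the surrounding sketch are only an illustration of how such a hard cap might be configured::

    # settings.py -- sketch; only the setting name comes from this README,
    # the value is illustrative
    GLOBAL_PAGE_PER_DOMAIN_LIMIT = 1000  # max pages per spiderid+domain+crawlid
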
Raw data
{
"_id": null,
"home_page": "https://behoof.app/",
"name": "bhfutils",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "behoof, scrapy-cluster, utilities",
"author": "Teplygin Vladimir",
"author_email": "vvteplygin@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/76/8b/65a3d317be5992429b681e877fc4298ba48fc7d16d8a0f6e1947a8b51e82/bhfutils-0.1.21.tar.gz",
"platform": null,
"description": "******************************\nBehoof Scrapy Cluster Template\n******************************\n\nOverview\n--------\n\nThe ``bhfutils`` package is a collection of utilities that are used by any spider of Behoof project.\n\nRequirements\n------------\n\n- Unix based machine (Linux or OS X)\n- Python 2.7 or 3.6\n\nInstallation\n------------\n\nInside a virtualenv, run ``pip install -U bhfutils``. This will install the latest version of the Behoof Scrapy Cluster Spider utilities. After that you can use special settings.py compatibal with scrapy cluster (template placed in crawler/setting_template.py)\n\nDocumentation\n-------------\n\nFull documentation for the ``bhfutils`` package does not exist\n\ncustom_cookies.py\n==================\n\nThe ``custom_cookies`` module is custom Cookies Middleware to pass our required cookies along but not persist between calls\n\ndistributed_scheduler.py\n========================\n\nThe ``distributed_scheduler`` module is scrapy request scheduler that utilizes Redis Throttled Priority Queues to moderate different domain scrape requests within a distributed scrapy cluster\n\nredis_domain_max_page_filter.py\n===============================\n\nThe ``redis_domain_max_page_filter`` module is redis-based max page filter. This filter is applied per domain. Using this filter the maximum number of pages crawled for a particular domain is bounded \n\nredis_dupefilter.py\n===================\n\nThe ``redis_dupefilter`` module is redis-based request duplication filter\n\nredis_global_page_per_domain_filter.py\n======================================\n\nThe ``redis_global_page_per_domain_filter`` module is redis-based request number filter When this filter is enabled, all crawl jobs have GLOBAL_PAGE_PER_DOMAIN_LIMIT as a hard limit of the max pages they are allowed to crawl for each individual spiderid+domain+crawlid combination.\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Utilities that are used by any spider of Behoof project",
"version": "0.1.21",
"project_urls": {
"Homepage": "https://behoof.app/"
},
"split_keywords": [
"behoof",
" scrapy-cluster",
" utilities"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "eccb1a7ac0319ef4aab06dd6607307fa67bedf5f0bcb92d29a4cbe54760b61fe",
"md5": "683ed3196b3343a355031a2d2093d332",
"sha256": "982798cdf984f137fa13115e562a7ee6814185c974538ccbf6de6d30134d11b0"
},
"downloads": -1,
"filename": "bhfutils-0.1.21-py3-none-any.whl",
"has_sig": false,
"md5_digest": "683ed3196b3343a355031a2d2093d332",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 124880,
"upload_time": "2025-08-30T10:42:00",
"upload_time_iso_8601": "2025-08-30T10:42:00.587406Z",
"url": "https://files.pythonhosted.org/packages/ec/cb/1a7ac0319ef4aab06dd6607307fa67bedf5f0bcb92d29a4cbe54760b61fe/bhfutils-0.1.21-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "768b65a3d317be5992429b681e877fc4298ba48fc7d16d8a0f6e1947a8b51e82",
"md5": "71a38797d14e458239c822a7274e9cfc",
"sha256": "f50682bd1c21cb469e622fcae966cf83f914f91b26ebf35ad874c7fcc775e097"
},
"downloads": -1,
"filename": "bhfutils-0.1.21.tar.gz",
"has_sig": false,
"md5_digest": "71a38797d14e458239c822a7274e9cfc",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 91365,
"upload_time": "2025-08-30T10:42:02",
"upload_time_iso_8601": "2025-08-30T10:42:02.598752Z",
"url": "https://files.pythonhosted.org/packages/76/8b/65a3d317be5992429b681e877fc4298ba48fc7d16d8a0f6e1947a8b51e82/bhfutils-0.1.21.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-30 10:42:02",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "bhfutils"
}