Name | webchecks |
Version | 0.1.3 |
home_page | None |
Summary | WebChecks is a BSD-licensed web search and research tool in Python traversing a given set of domains. |
upload_time | 2024-07-15 12:13:35 |
maintainer | None |
docs_url | None |
author | CopperEagle |
requires_python | >=3.9 |
license | Copyright (c) WebChecks developers. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of WebChecks nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
keywords |
websearch
analysis
automation
web
|
VCS |
|
bugtrack_url |
|
requirements |
beautifulsoup4
bs4
python-dotenv
requests
selenium
selenium-wire
typing_extensions
urllib3
webdriver-manager
|
Travis-CI |
No Travis.
|
coveralls test coverage |
No coveralls.
|
# WebChecks
[![Python](https://img.shields.io/badge/Python-3.10-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
![Version](https://img.shields.io/badge/Webchecks_version-0.1.3-darkgreen)
[![License](https://img.shields.io/badge/License-BSD_3--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
[![Linting: Pylint](https://img.shields.io/badge/linting-pylint-green)](https://github.com/pylint-dev/pylint)
![Linting Score](https://img.shields.io/badge/Linting_score-9.52/10.0-green)
![Test Coverage](https://img.shields.io/badge/Test_coverage-87%25-green)
## The project - goals
WebChecks is a BSD-licensed web search and research tool that searches a given set of domains, starting from one or more initial web addresses. It has three goals:
- Easy steerability, to ensure ethical scraping practices and to achieve the goals you set effectively. This is achieved in part through an extensible per-(sub)domain profile system that lets you precisely define crawling behaviour, including access patterns and access frequency. *It mandatorily respects robots.txt.* (No, there is no option to disable that.)
- Scaling over time: a given project can merge the results of different scraping runs in a structured manner, and the rules can easily be adapted between runs.
- Ample options to ensure safety: not only can the user opt out of JavaScript entirely, they can also set a policy governing **which sources JavaScript is allowed to run from**. The user can specify which domains may be visited and which may not.
The tool was written about one and a half years ago as a side project.
## Intended uses
**This tool is NOT intended to scrape many websites to create large datasets** (e.g. for AI/LLMs). That is simply not its purpose: it is not designed to scale across several machines or to handle large data grabs.
Its main use is to support research and decision-making by collecting information on clearly defined topics of interest, websites, and so on.
One example is regularly checking a set of news sites and filtering for reports on local disease outbreaks, which may help projects like proMEDmail. Such (*enter your topic*) awareness projects usually have a small staff that also needs to do human networking and similar work. Given that we live in a world where change accelerates (including content generation), keeping track of it all is becoming both more challenging and more important.
## Features and Todos
Features:
- Allows scraping many different domains in one go.
- Optional JavaScript support. If enabled, it requires Selenium Wire (which allows intercepting requests). You can optionally specify, per domain, whether JavaScript from that domain may run. See Limitations.
- Per-(sub)domain profiles that let you specify, among other things: access frequency, allowed links (beyond robots.txt), and how the result (fetched content) is handled. A profile may also modify the headers sent to the server. Websites without a user-defined profile get a reasonable default profile. (A hypothetical sketch of such a profile follows the lists below.)
Limitations:
- Currently, inline JavaScript on the HTML page itself is executed *even if the domain is disallowed from executing JS*. This is because blocking it would require editing HTML pages in flight, which may break them.
Todos:
- Currently the scraper remembers which pages it has already visited and will not revisit them. Sometimes, however, it may be useful to allow revisiting a page after some time has passed.
- Currently the JavaScript feature only supports the Firefox browser. This should be an easy fix; there is simply little time to do it at the moment.
- More tests
- The keywords feature is not yet usable.
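
To make the profile idea above concrete, here is a purely hypothetical sketch of what a per-(sub)domain profile could express. The class and method names below are illustrative only, not the actual WebChecks API; see the tutorial for the real interface.

```python
import re

class ExampleNewsProfile:
    """Hypothetical per-domain profile (illustrative names only, not the WebChecks API)."""
    domain = "news.example.com"
    min_wait = 20   # seconds between two requests to this domain
    avg_wait = 45   # average wait; access timing is randomized

    def allowed_link(self, url: str) -> bool:
        # Restrict crawling to article pages, on top of robots.txt.
        return re.search(r"/articles?/", url) is not None

    def handle_result(self, url: str, content: bytes) -> None:
        # Decide what happens to fetched content, e.g. keep only
        # pages that mention a topic of interest.
        if b"outbreak" in content.lower():
            print(f"relevant: {url}")
```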
## Installing it
Requires Python 3.10+. Optionally, you may create a virtual environment using
```bash
python3 -m venv path/to/new/venv
cd path/to/new/venv
source bin/activate
```
Then do
```bash
pip3 install webchecks
```
or download the code, navigate into that directory and run
```bash
pip3 install -r requirements.txt
pip3 install .
```
Note: If you require JavaScript, make sure you have Firefox installed, because it is the only browser this project currently uses.
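
If you want to verify the Firefox setup independently of WebChecks, a short smoke test using selenium and webdriver-manager (both already in the requirements) can confirm that a driver starts. This is a generic Selenium snippet, not part of the WebChecks API:

```python
# Smoke test: confirm Selenium can drive a local Firefox installation.
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager

service = Service(GeckoDriverManager().install())  # downloads geckodriver if needed
driver = webdriver.Firefox(service=service)
driver.get("https://example.org")
print(driver.title)  # expected: "Example Domain"
driver.quit()
```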
## How to use it
For the tutorial, please go to the [wiki](https://github.com/CopperEagle/WebChecks/wiki) or visit the [tutorial page](TUTORIAL.md).
Basic use is fairly straightforward. *Almost all* calls the user performs go through the Project class.
For more examples, including how to programmatically access the results, enable compression, use profiles, and more, check out [the tutorial](TUTORIAL.md).
```python
from webchecks import Project
# Give name and starting address. The latter may be a list of URLs.
proj = Project("project_name", "mywebsite.com/coolsite.html")
# Allow visits to mywebsite.com and to all Wikipedia sites regardless of language,
# e.g. en.wikipedia.org. Note that you can use regular expressions here.
proj.set_allowed_websites((r"(.*\.)?wikipedia\.org", "mywebsite.com"))
# Enable JavaScript? The default is False.
proj.enable_javascript(False)

# Whether links in retrieved HTML sources should be visited.
# The default is True. If False, only the initially given addresses are visited.
proj.enable_crawl(True)

# Default minimum wait in seconds between two requests to the same domain.
# Applies only to domains that have no dedicated profile. (See the tutorial.)
proj.set_min_wait(10)

# Default average wait time in seconds between two requests to the same domain.
# The access timing pattern is randomized.
proj.set_avg_wait(10)

# Run duration in seconds. (Roughly: the last access will be
# finished before shutting down.)
proj.run(1000)
```
This sets the initial root of your search to mywebsite.com/coolsite.html; the tool will follow any links on that site, provided they are allowed by the security policy you define. The policy above is fairly simple: you specify which websites the tool may visit, and you disallow JavaScript.
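
To see why the first pattern above matches every language edition of Wikipedia, here is a standalone illustration using Python's `re` module. It assumes full-match semantics against the host name; that is an assumption for illustration, not a statement about WebChecks internals:

```python
import re

# The same patterns as in the example above.
allowed = (r"(.*\.)?wikipedia\.org", "mywebsite.com")

def host_allowed(host: str) -> bool:
    return any(re.fullmatch(pattern, host) for pattern in allowed)

print(host_allowed("en.wikipedia.org"))    # True: "en." matches the optional (.*\.) prefix
print(host_allowed("wikipedia.org"))       # True: the (.*\.)? prefix may be absent
print(host_allowed("mywebsite.com"))       # True: literal match
print(host_allowed("evil-wikipedia.org"))  # False: no dot directly before "wikipedia"
```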
Once finished, a REPORT.txt file summarizing the scraping process will be written to the project directory. Its contents are also printed to the terminal.
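
As a side note on the wait settings above: the exact distribution behind the randomized access timing is not documented here. One plausible model (an assumption, not necessarily what WebChecks implements) is exponential noise on top of the minimum wait:

```python
import random

def next_wait(min_wait: float, avg_wait: float) -> float:
    """Randomized delay that never drops below min_wait and whose
    long-run mean is avg_wait (hypothetical model, see above)."""
    extra_mean = max(avg_wait - min_wait, 0.0)
    if extra_mean == 0.0:
        return min_wait  # avg == min: the delay is effectively constant
    # Exponential noise keeps the mean at min_wait + extra_mean == avg_wait.
    return min_wait + random.expovariate(1.0 / extra_mean)

# With set_min_wait(10) and set_avg_wait(10) as in the example, every
# delay is 10 s; try next_wait(10, 25) to see the randomized spread.
```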
## Questions and Contributing
For any questions or issues, just open an issue.
## Notes
It is your responsibility to make sure you comply with each website's TOS when using this tool. There is no warranty whatsoever. The code is covered by the BSD-3-Clause license, which is included in this repository.
Raw data
{
"_id": null,
"home_page": null,
"name": "webchecks",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "websearch, analysis, automation, web",
"author": "CopperEagle",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/42/aa/d54ad8ec778164a4c77ac0b8648667baaa95af32ef1cd6cae2b42ae2acb1/webchecks-0.1.3.tar.gz",
"platform": null,
"description": "# WebChecks\n\n[![Python](https://img.shields.io/badge/Python-3.10-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)\n![Version](https://img.shields.io/badge/Webchecks_version-0.1.3-darkgreen)\n[![License](https://img.shields.io/badge/License-BSD_3--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)\n[![Linting: Pylint](https://img.shields.io/badge/linting-pylint-green)](https://github.com/pylint-dev/pylint)\n\n![Linting Score](https://img.shields.io/badge/Linting_score-9.52/10.0-green)\n![Test Coverage](https://img.shields.io/badge/Test_coverage-87%25-green)\n\n## The project - goals\n\nWebCheck is a BSD-licensed web search and research tool for searching on a given set of domains given some starting web addresses. It has three goals:\n\n- Easy steerability to ensure ethical scraping practices and effectively achieving set goals. This is in part achieved by offering an extensible per (sub)domain profile system allowing to precisely define digital behaviour, access pattern / access frequency amongst other things. *It mandatorely respects robots.txt.* (No, there is no option to disable that.)\n\n- Scaling over time: A given Project can merge the results of diffrent scraping runs in a structured manner and allows easy adaption of the rules in between.\n\n- Ample options to ensure safety: Not only does it provide the user to opt out Javascript, it also allows to set a policy **from which sources Javascript is allowed to run**. The user can specify which domains can be visited, which ones not.\n\nThe tool was written about one and a half year ago as a side project.\n\n## Intended uses\n\n**This tool is NOT intended to scrape many websites to create large datasets** (e.g. for AI/LLMs). This is simply not the purpose of this tool: It is not intended and does not scale over several machines or large datagrabs.\n\nIt's main use is to help research and decisionmaking by collecting some information given some clearly defined interest topics or websites, etc.\n\nExamples may be to regularely check a set of news sites and filter for reports on local disease outbreaks, which may help projects like proMEDmail. Such (*enter your topic*) awareness projects usually have a small staff which also needs to do human networking etc. Given we live in the world where change accelerates (including content generation), keeping track of all of it may become more challenging and important.\n\n## Features and Todos\n\nFeatures:\n- Allows scraping many diffrent domains in one go.\n- Optional Javascript. If enabled, it will require Seleniumwire (allows intercepting requests). Can optionally specify per domain whether to allow running Javascript from that domain. See Limitations.\n- Adding per-(sub)domain profiles that allows to specify some of the fallowing: Access frequency, allowed links (besides robots.txt), managing the result (fetched content). It may also modify the header sent to the server. Websites without user defined profile will have a reasonable default Profile.\n\nLimitations:\n- Currently the inline javascript on the html page itself will be executed, *even if this domain is disallowed to execute JS*. This is because disallowing it requires editing HTML pages midflight and may break it. \n\nTodos:\n- Currently the Scraper will remember which websites it has already visited and will not revisit them again. 
Sometimes, however, it may be interesting to allow revisits to this page after some time has passed.\n- Currently the Javascript feature only allows using the Firefox browser. This should be an easy fix. Currently there is just little time to do it.\n- More tests\n- Keywords feature not yet usable.\n\n\n## Installing it\n\nRequires Python 3.10+. Optionally, you may create a virtual environment using\n\n```bash\npython3 -m venv path/to/new/venv\ncd path/to/new/venv\nsource bin/activate\n```\n\nThen do\n\n```bash\npip3 install webchecks\n```\n\nor download the code, navigate into that directory and run\n\n```bash\npip3 install -r requirements.txt\npip3 install .\n```\n\nNote: If you require Javascript then make sure you have Firefox because this is what is being used currently by this project.\n\n## How to use it\n\nFor the tutorial, please go to the [wiki](https://github.com/CopperEagle/WebChecks/wiki) or visit the [tutorial page](TUTORIAL.md).\n\nBasic use is fairly straight forward. *ALMOST ALL* calls the user performs is using the Project class.\nFor more examples and how to programmatically access the results, enable compression, using profiles, etc. check out [the tutorial](TUTORIAL.md).\n\n```python\nfrom webchecks import Project\n\n# Give name and starting address. The latter may be a list of URLs.\nproj = Project(\"project_name\", \"mywebsite.com/coolsite.html\")\n\n# Allowing to visit mywebsite.com and all wikipedia sites, regardless of language, \n# like en.wikipedia.org. Note that you can use regular expressions here.\nproj.set_allowed_websites((r\"(.*\\.)?wikipedia\\.org\", \"mywebsite.com\")) \n\n# Enabling Javascript? Default value is False.\nproj.enable_javascript(False) \n\n# Whether links in retrieved HTML sources should be visited.\n# The default value is true. If False only visits the initially given addresses.\nproj.enable_crawl(True)\n\n# Default minimum wait in seconds between two requests to the same domain.\n# Applies only to domains that have no dedicated profile. (See below.)\nproj.set_min_wait(10)\n\n# Default average wait time in seconds between two requests to the same domain.\n# The access timing pattern is randomized.\nproj.set_avg_wait(10)\t\t\t\t\n\n# Translates into seconds. (roughly, will finish last \n# access before shutting down)\nproj.run(1000)\t\t\t\t\t\t\n```\nThis will set the initial roots of your search at mywebsite.com/coolsite.html and it will follow any links on that website provided that they are allowed by the security policy you define. The above security policy is fairly simple: You specify just what websites you allow the tool to visit and you disallow Javascript.\n\nOnce finished, there will be a REPORT.txt file in the project directory.\nIt will summarize the scraping process. The file is also printed on the terminal.\n\n\n## Questions and Contibuting\n\nFor any questions or issues, just open an issue.\n\n\n## Notes\n\nIt is in your responsibility to make sure you comply with the website's TOS when using this tool. There is no warranty whatsoever. The code is covered by the BSD-3-Clause license, and the license is included in this repository.\n\n",
"bugtrack_url": null,
"license": "Copyright (c) WebChecks developers. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of WebChecks nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ",
"summary": "WebChecks is a BSD-licensed web search and research tool in Python traversing a given set of domains.",
"version": "0.1.3",
"project_urls": {
"Homepage": "https://github.com/CopperEagle/WebChecks"
},
"split_keywords": [
"websearch",
" analysis",
" automation",
" web"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "122eea09dbe1c18a80e22158cfc9c518ce547f3865b71ff5bfbe939352072d8e",
"md5": "69e98d146cc33ec9e52e9276bbaa88f1",
"sha256": "745f7758e05733ff9c7e7de46998ca924140646b7c96f3f47d0ddcd9cce9fcd1"
},
"downloads": -1,
"filename": "webchecks-0.1.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "69e98d146cc33ec9e52e9276bbaa88f1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 58466,
"upload_time": "2024-07-15T12:13:33",
"upload_time_iso_8601": "2024-07-15T12:13:33.590984Z",
"url": "https://files.pythonhosted.org/packages/12/2e/ea09dbe1c18a80e22158cfc9c518ce547f3865b71ff5bfbe939352072d8e/webchecks-0.1.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "42aad54ad8ec778164a4c77ac0b8648667baaa95af32ef1cd6cae2b42ae2acb1",
"md5": "b4368da4782f182f2f3d18cedc584b76",
"sha256": "db7a58c1c1c1d89291b54831fd336746ed11f79099ce5c2d5f1b4a3651f419d6"
},
"downloads": -1,
"filename": "webchecks-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "b4368da4782f182f2f3d18cedc584b76",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 50358,
"upload_time": "2024-07-15T12:13:35",
"upload_time_iso_8601": "2024-07-15T12:13:35.047096Z",
"url": "https://files.pythonhosted.org/packages/42/aa/d54ad8ec778164a4c77ac0b8648667baaa95af32ef1cd6cae2b42ae2acb1/webchecks-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-15 12:13:35",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "CopperEagle",
"github_project": "WebChecks",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "beautifulsoup4",
"specs": [
[
"==",
"4.12.2"
]
]
},
{
"name": "bs4",
"specs": [
[
"==",
"0.0.1"
]
]
},
{
"name": "python-dotenv",
"specs": [
[
"==",
"1.0.0"
]
]
},
{
"name": "requests",
"specs": [
[
"==",
"2.32.0"
]
]
},
{
"name": "selenium",
"specs": [
[
"==",
"4.12.0"
]
]
},
{
"name": "selenium-wire",
"specs": [
[
"==",
"5.1.0"
]
]
},
{
"name": "typing_extensions",
"specs": [
[
"==",
"4.7.1"
]
]
},
{
"name": "urllib3",
"specs": [
[
"==",
"2.2.2"
]
]
},
{
"name": "webdriver-manager",
"specs": [
[
"==",
"4.0.0"
]
]
}
],
"lcname": "webchecks"
}