# Description
This is a slight parser that helps you pre-parse data from a specified website URL or API. It gets rid of the duplicate coding needed to fetch requests from the specified URLs, speeds the process up with a threading pool, and lets you focus on your business-logic coding once you have the response from the specified webpage or API URLs.
# Attention
This slight pre-parser in the old version 1.0.0 could only pre-parse `static html` or `api` content, but since 2.0.0 I have added a new `html_dynamic` mode, which retrieves everything, even content generated by `JS` code; a minimal sketch follows the version note below.
```bash
python version >= 3.9
```
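For instance, here is a minimal sketch of the `html_dynamic` mode (the URL is just a placeholder, and the constructor parameters are the ones documented in the `parameters` section below):

```python
# sketch: pre-parse a JS-rendered page with the `html_dynamic` mode
from preparser import PreParser, BeautifulSoup

def grab_title(url: str, soup: BeautifulSoup):
    # `html_dynamic` hands the callback a BeautifulSoup object built from
    # the rendered page, so JS-generated content is searchable too
    print(url, '->', soup.title.string if soup.title else 'no <title>')
    return soup  # any non-None return value marks the task as successful

parser = PreParser(
    url_list=['https://example.com/some-js-page'],  # placeholder URL
    request_call_back_func=grab_title,
    parser_mode='html_dynamic',
)
parser.start_parse()
```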
# How to use
## install
```bash
$ pip install preparser
```
> GitHub resource ➡️ [Github Repos](https://github.com/BertramYe/preparser)
> Feel free to fork and modify this code. If you like the current project, please star ⭐ it, uwu.
> PyPI ➡️ [PyPI Publish](https://pypi.org/project/preparser/)
## parameters
Here below are the parameters you can use to initialize the `PreParser` object from the `preparser` package:
| Parameters | Type | Description |
| --------------------- | ----------------- |-------------------------------------------------------- |
| url_list | list | The list of URLs to parse from. Default is an empty list. |
| request_call_back_func | Callable or None | A callback function that, depending on the `parser_mode`, receives the `BeautifulSoup` object or the request's `json` object. If you want to signal that your business processing failed, return `None`; otherwise return a non-`None` object. |
| parser_mode | `'html'`, `'api'` or `'html_dynamic'` | The pre-parsing data mode, default is `'html'`.<br/> `html`: parse the content of a static html page and return a `BeautifulSoup` object. <br/> `api`: parse the data from an api and return the `json` object. <br/> `html_dynamic`: parse the whole webpage html content, even content generated by dynamic js code, and return a `BeautifulSoup` object. <br/> **You get each of these objects in your `request_call_back_func` if you defined one; otherwise read them from `PreParser(...).cached_request_datas`.** |
| cached_data | bool | Whether to cache the parsed data. Default is `False`. |
| start_threading | bool | Whether to use threading pool for parsing the data. Default is `False`.|
| threading_mode | `'map'` or `'single'` | The task-distribution mode, default is `'single'`. <br/> `map`: use the `map` func of the threading pool to distribute tasks. <br/> `single`: use the `submit` func to distribute tasks one by one into the threading pool. |
| stop_when_task_failed | bool | Whether to stop when a request to a URL fails. Default is `True`. |
| threading_numbers | int | The maximum number of threads in the threading pool. Default is `3`. |
| checked_same_site | bool | Whether to add extra headers to pretend the request comes from the same site, to get around `CORS` blocking. Default is `True`. |
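To make the return contract of `request_call_back_func` concrete, here is a hedged sketch of an `api`-mode callback that reports failure by returning `None`; the `'data'` key check is a made-up business rule for illustration, not part of the package:

```python
# sketch: a callback that flags failures, assuming a hypothetical
# response shape with a top-level 'data' key
from preparser import PreParser, Json_Data

def handle_api_result(url: str, payload: Json_Data):
    if not payload or 'data' not in payload:
        # returning None marks this URL as failed; together with
        # stop_when_task_failed=True the whole run stops here
        return None
    return payload['data']

parser = PreParser(
    url_list=['https://example.com/api/items'],  # placeholder URL
    request_call_back_func=handle_api_result,
    parser_mode='api',
    stop_when_task_failed=True,
)
parser.start_parse()
```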
## example
```python
# test.py
from preparser import PreParser, BeautifulSoup, Json_Data, Filer

def handle_preparser_result(url: str, preparser_object: BeautifulSoup | Json_Data) -> BeautifulSoup | Json_Data | None:
    # here you can write whatever business logic you want
    # attention:
    # the preparser_object type depends on the `parser_mode` of the `PreParser`:
    # 'api' : preparser_object is a Json_Data object
    # 'html' or 'html_dynamic' : preparser_object is a BeautifulSoup object
    ...
    # for the final return:
    # return None to signal that the current result failed,
    # otherwise return any object which is not None.
    return preparser_object

if __name__ == "__main__":
    # start the parser
    url_list = [
        'https://example.com/api/1',
        'https://example.com/api/2',
        # ...
    ]
    parser = PreParser(
        url_list=url_list,
        request_call_back_func=handle_preparser_result,
        parser_mode='api',  # set this to 'api', 'html' or 'html_dynamic' depending on what you parse
        start_threading=True,
        threading_mode='single',
        cached_data=True,
        stop_when_task_failed=False,
        threading_numbers=3,
        checked_same_site=True
    )
    # start parsing
    parser.start_parse()
    # when all tasks have finished, you can get all the task results like below:
    all_result = parser.cached_request_datas
    # if you want to terminate, just execute the function below
    # parser.stop_parse()
    # you can also use the Filer to save the final result above
    # and then find the data in `result/test.json`
    filer = Filer('json')
    filer.write_data_into_file('result/test', [all_result])
```
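> Note: the result collection above relies on `cached_data=True`; with caching disabled, `parser.cached_request_datas` presumably stays empty, so keep that flag on whenever you want to read results back after `start_parse()` returns.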
# Get Help
Get help ➡️ [Github issue](https://github.com/BertramYe/preparser/issues)
# Update logs
* `version 2.0.4 `: test the installation command.
* `version 2.0.3 `: optimise the `error` alert for `html_dynamic`.
* `version 2.0.2 `: correct the README Doc of `parser_mode`.
* `version 2.0.1 `: update the README Doc.
* `version 2.0.0 `: add the new `parser_mode` `html_dynamic`, which lets `preparser` fetch all of the content of an `html` page, even content generated by `JS` code.
* `version 1.0.0 `: basic version, which only pre-parses static `html` and `api` content.