preparser

Name: preparser
Version: 2.0.4
Home page: https://github.com/BertramYe/preparser
Summary: A slight preparser to help parse webpage content or fetch requests from URLs; supports Windows, macOS, and Unix.
Upload time: 2025-01-12 12:20:06
Author: BertramYe
Requires Python: >=3.9.0
License: MIT
Keywords: preparser, parser, parse, crawl, webpage, html, api, requests, beautifulsoup4, python3, windows, mac, linux
Requirements: none recorded
            
# Description

This is a slight parser that helps you pre-parse data from a specified website URL or API. It takes care of the duplicated code needed to fetch responses from the specified URLs and speeds things up with a threading pool, so you only need to focus on your business logic once you have the response from the specified webpage or API URLs.

# Attention

The old 1.0.0 version of this slight pre-parser could only handle `static html` or `api` content, but since 2.0.0 there is a new `html_dynamic` mode that captures everything on the page, including content generated by `JS` code (see the sketch below).

```bash
# requires Python >= 3.9
```
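For instance, here is a minimal sketch of the `html_dynamic` mode (the callback name `handle_page` and the URL are illustrative, and the constructor defaults are assumed to be the ones listed in the parameters table below):

```python
from preparser import PreParser, BeautifulSoup

def handle_page(url: str, soup: BeautifulSoup):
    # in html_dynamic mode the callback receives a BeautifulSoup object
    # built from the fully rendered page, including JS-generated content
    print(url, soup.title)
    return soup  # a non-None return marks the task as succeeded

parser = PreParser(
    url_list=['https://example.com'],        # illustrative URL
    request_call_back_func=handle_page,
    parser_mode='html_dynamic',
)
parser.start_parse()
```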

# How to use

## install

```bash
$ pip install preparser
```
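A quick sanity check that the install picked up a compatible version (these are plain `pip`/`python` commands, nothing specific to this package):

```bash
$ pip show preparser
$ python -c "import preparser"
```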



> GitHub Resource ➡️ [GitHub Repos](https://github.com/BertramYe/preparser) 

> Feel free to fork and modify this code. If you like the project, please star ⭐ it, uwu.

> PyPI: ➡️ [PyPI Publish](https://pypi.org/project/preparser/)  

## parameters

Here are the parameters you can use to initialize the `PreParser` object from the `preparser` package (a callback-free sketch follows the table):


| Parameter | Type | Description |
| --------- | ---- | ----------- |
| url_list | list | The list of URLs to parse. Default is an empty list. |
| request_call_back_func | Callable or None | A callback that, depending on `parser_mode`, receives the `BeautifulSoup` object or the request's `json` object. Return `None` to mark your business processing as failed; otherwise return any non-`None` object. |
| parser_mode | `'html'`, `'api'` or `'html_dynamic'` | The pre-parsing mode; default is `'html'`.<br/> `html`: parse static html content and return a `BeautifulSoup` object. <br/> `api`: parse data from an API and return the `json` object. <br/> `html_dynamic`: parse the whole rendered webpage and return a `BeautifulSoup` object, including content generated by dynamic JS code. <br/> **You receive these objects in `request_call_back_func` if you defined one; otherwise get them via `PreParser(...).cached_request_datas`.** |
| cached_data | bool | Whether to cache the parsed data. Default is `False`. |
| start_threading | bool | Whether to use a threading pool for parsing. Default is `False`. |
| threading_mode | `'map'` or `'single'` | How tasks are distributed; default is `'single'`. <br/> `map`: use the threading pool's `map` function to distribute tasks. <br/> `single`: use `submit` to hand tasks to the threading pool one by one. |
| stop_when_task_failed | bool | Whether to stop when a request to a URL fails. Default is `True`. |
| threading_numbers | int | The maximum number of threads in the threading pool. Default is `3`. |
| checked_same_site | bool | Whether to add extra headers to pretend the request comes from the same site, to get around `CORS` blocking. Default is `True`. |
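As the `parser_mode` row notes, the callback is optional. A minimal callback-free sketch (the URL is illustrative; presumably `cached_data` must be `True` for the cache to be filled):

```python
from preparser import PreParser

# without request_call_back_func, the results are only reachable
# through cached_request_datas, so caching is turned on
parser = PreParser(
    url_list=['https://example.com/api/1'],  # illustrative URL
    parser_mode='api',
    cached_data=True,
)
parser.start_parse()
print(parser.cached_request_datas)
```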


## example

```python
#  test.py
from preparser import PreParser, BeautifulSoup, Json_Data, Filer


def handle_preparser_result(url: str, preparser_object: BeautifulSoup | Json_Data) -> BeautifulSoup | Json_Data | None:
    # write whatever business logic you want here

    # attention:
    # the type of preparser_object depends on the `parser_mode` of the `PreParser`:
    #               'api'  : preparser_object is a Json_Data
    #               'html' : preparser_object is a BeautifulSoup

    # ... your business logic ...

    # for the final return:
    # return None to mark the current result as failed,
    # otherwise return any object that is not None.
    return preparser_object


if __name__ == "__main__":

    # set up the parser
    url_list = [
        'https://example.com/api/1',
        'https://example.com/api/2',
        # ...
    ]

    parser = PreParser(
        url_list=url_list,
        request_call_back_func=handle_preparser_result,
        parser_mode='api',    # depends on your source: 'api', 'html' or 'html_dynamic'
        start_threading=True,
        threading_mode='single',
        cached_data=True,
        stop_when_task_failed=False,
        threading_numbers=3,
        checked_same_site=True
    )

    # start parsing
    parser.start_parse()

    # when all tasks have finished, you can get all the results like below:
    all_result = parser.cached_request_datas

    # if you want to terminate early, call the function below
    # parser.stop_parse()

    # you can also use the Filer to save the final result above
    # and then find the data in `result/test.json`
    filer = Filer('json')
    filer.write_data_into_file('result/test', [all_result])

```
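To double-check the saved output, the standard library's `json` module is enough; this assumes `write_data_into_file` produced `result/test.json` as the comment above says:

```python
import json

# read back the file written by Filer above
with open('result/test.json', encoding='utf-8') as f:
    saved = json.load(f)
print(type(saved), len(saved))
```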


# Get Help

Get help ➡️ [GitHub issues](https://github.com/BertramYe/preparser/issues)


# Update logs

* `version 2.0.4`: tested the install command.

* `version 2.0.3`: optimised the `error` alert for `html_dynamic`.

* `version 2.0.2`: corrected the README doc for `parser_mode`.

* `version 2.0.1`: updated the README doc.

* `version 2.0.0`: added the new `html_dynamic` `parser_mode`, which lets `preparser` capture all `html` content, even content generated by `JS` code.

* `version 1.0.0`: basic version; only pre-parses static `html` and `api` content.

            
