bs4multiproc

Name: bs4multiproc
Version: 0.10
Home page: https://github.com/hansalemaos/bs4multiproc
Summary: Manages Android packages on a device through DataFrames
Author: Johannes Fischer
License: MIT
Keywords: bs4, dataframe, webscraping, html, parsing
Upload time: 2023-10-14 03:28:16
Requirements: No requirements were recorded.

# BeautifulSoup multiprocessing parsing to pandas DataFrame 

## Tested against Windows / Python 3.11 / Anaconda

## pip install bs4multiproc


A Python library for parsing HTML content. It combines BeautifulSoup, pandas, multiprocessing, and subprocesses to efficiently extract structured information from HTML documents.

## What does it do?

### HTML Parsing: 

The library's primary purpose is to parse HTML content. 
It can handle both local HTML files and web-based 
HTML content retrieved via URLs.
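The idea of accepting either a local file or a URL can be sketched as below. This is a minimal, hypothetical stand-in for the library's own input dispatch (which is not public API), using only the standard library:

```python
# Hedged sketch: dispatch on whether the source looks like a URL.
# The real library's dispatch logic is an assumption here.
from urllib.parse import urlparse


def read_html(source: str) -> str:
    """Return HTML text from either a local file path or an http(s) URL."""
    if urlparse(source).scheme in ("http", "https"):
        from urllib.request import urlopen
        with urlopen(source) as resp:
            return resp.read().decode("utf-8", errors="replace")
    # Otherwise treat the source as a local file path.
    with open(source, encoding="utf-8", errors="replace") as f:
        return f.read()
```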

### Parallel Processing: 

The library offers two main functions, parse_html_subprocess and parse_html_multiproc, to process HTML content in parallel. 
This parallelism can significantly speed up the parsing 
of multiple HTML documents.

### DataFrame Output: 

The library returns structured data in the form of pandas DataFrames. These DataFrames contain detailed information about HTML 
elements within the parsed content, such as tag names, 
attributes, text, and more.
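A toy illustration of what "one row per element attribute" output looks like, built with the standard-library `html.parser` instead of the library's internals. The column names here are hypothetical stand-ins (the library's real frames use prefixed columns such as `aa_tag`, `aa_attr`, `aa_value`, as seen in the examples below):

```python
# Sketch only: flatten parsed elements into DataFrame rows.
import pandas as pd
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect one row per (tag, attribute, value) triple."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        for attr, value in attrs or [(None, None)]:
            self.rows.append({"tag": tag, "attr": attr, "value": value})


p = RowCollector()
p.feed('<div class="box"><a href="x">link</a></div>')
df = pd.DataFrame(p.rows)
print(df)
```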

### Caching: 

The library uses functools.cache for memoization, 
which improves performance by avoiding 
recomputation of previously processed data.
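The memoization idea can be shown in miniature with `functools.cache`. This is illustrative only; the library's internally cached functions are not public API, and `parse_once` below is a made-up stand-in:

```python
from functools import cache

calls = []  # records how often the body actually runs


@cache
def parse_once(html: str) -> int:
    """Stand-in for an expensive parse: count '<' characters."""
    calls.append(html)
    return html.count("<")


doc = "<html><body><p>hi</p></body></html>"
parse_once(doc)
parse_once(doc)  # served from the cache; the body does not run again
print(len(calls))  # 1
```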

## Advantages of the Library:

### Efficiency: 

Parallel processing is a key advantage of this library. 
It can distribute the parsing of multiple HTML documents 
across multiple CPU cores, making it significantly faster 
when dealing with a large number of documents.

### Structured Data: 

The library doesn't just parse HTML; it structures the data 
in a tabular format using DataFrames. 
This structured data can be easily analyzed, 
transformed, and used for various purposes.

### Flexibility: 

The library is flexible and can handle various input 
sources, including local files, web URLs, 
and multipart messages (e.g., email content).

### Subprocess Support: 

The parse_html_subprocess function allows you to offload the HTML 
parsing task to a separate subprocess. This can be useful when 
dealing with potentially untrusted or resource-intensive 
HTML content, as it isolates the parsing process.
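The isolation concept can be sketched with the standard library: run the parse in a child Python process so a crash or memory blow-up cannot take down the caller. This mirrors the concept only; the library's own inter-process format is an assumption, and the child program here is a minimal tag counter:

```python
# Sketch of subprocess isolation: parse HTML in a child interpreter.
import subprocess
import sys

html = "<html><body><p>hello</p></body></html>"

# Child program: count opening tags read from stdin.
code = (
    "import sys, html.parser\n"
    "class Counter(html.parser.HTMLParser):\n"
    "    n = 0\n"
    "    def handle_starttag(self, tag, attrs): self.n += 1\n"
    "p = Counter(); p.feed(sys.stdin.read()); print(p.n)\n"
)

result = subprocess.run(
    [sys.executable, "-c", code], input=html, capture_output=True, text=True
)
print(result.stdout.strip())  # tag count computed in the child process
```

If the child crashes, only `result.returncode` is affected; the parent keeps running.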

### Parallelism Control: 

You can control the level of parallelism by specifying 
the number of processes and chunks. 
This flexibility allows you to fine-tune the 
performance based on your system's capabilities and specific requirements.
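One plausible reading of how `chunks` and `processes` interact: the input documents are split into roughly equal chunks, and each chunk is handed to a worker process. The split itself can be sketched as follows (the library's internal partitioning is an assumption):

```python
def split_into_chunks(items, chunks):
    """Split items into `chunks` slices whose sizes differ by at most one."""
    size, rem = divmod(len(items), chunks)
    out, start = [], 0
    for i in range(chunks):
        # The first `rem` chunks absorb one extra item each.
        end = start + size + (1 if i < rem else 0)
        out.append(items[start:end])
        start = end
    return out


docs = [f"<p>doc {i}</p>" for i in range(7)]
print(split_into_chunks(docs, 3))  # chunk sizes: 3, 2, 2
```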

### Caching: 

The caching mechanism helps save time by reusing previously 
parsed results, especially when working with the same 
content repeatedly.

### Cross-Platform: 

The library supports both Windows and non-Windows environments, 
ensuring compatibility across different operating systems.


## parse_html_subprocess

```python

def parse_html_subprocess(html, chunks=2, processes=None):
    r"""
    Parse HTML Content Using a Subprocess

    This function takes HTML content as input, processes it in a separate
    subprocess, and returns a structured DataFrame containing information
    about HTML elements. Because the work happens outside the calling
    process, it does not require an ``if __name__ == "__main__":`` guard.

    Parameters:
    - html (str, bytes, or list): HTML content to be processed. It can be provided as a string, bytes, a file path, or a list of these.
    - chunks (int, optional): The number of chunks to divide the HTML processing into. This can help optimize processing for large datasets. Default is 2.
    - processes (int, optional): The number of parallel processes to use for parsing. If not specified, it defaults to (number of CPU cores - 1).

    Returns:
    - pandas.DataFrame: A DataFrame containing information about HTML elements, such as tag names, attributes, text, and more.
    """
```


## parse_html_multiproc

```python

def parse_html_multiproc(htmls, chunks=2, processes=5):
    r"""
    Parse HTML Content Using Multiprocessing

    This function takes a list of HTML content, processes it in parallel using the multiprocessing library,
    and returns a structured DataFrame containing information about HTML elements. It is suitable for
    parsing multiple HTML documents simultaneously.

    Parameters:
    - htmls (list): A list of HTML content to be processed. Each item in the list should represent HTML content, typically as strings or bytes.
    - chunks (int, optional): The number of chunks to divide the HTML processing into. This can help optimize processing for large datasets. Default is 2.
    - processes (int, optional): The number of parallel processes to use for parsing. Default is 5.

    Returns:
    - pandas.DataFrame: A DataFrame containing information about HTML elements, such as tag names, attributes, text, and more.
    """
```

## Examples 

```python
import re
import sys

import bs4
from PrettyColorPrinter import add_printer
from bs4multiproc import parse_html_subprocess, parse_html_multiproc
import pandas as pd

add_printer(1)
from time import perf_counter

sys.setrecursionlimit(10000)
import numpy as np

if __name__ == "__main__":
    execute_examples = False
    if execute_examples:
        start = perf_counter()
        df1 = parse_html_multiproc(  # needs if __name__ == "__main__": !!!!
            htmls=[
                r"C:\Users\hansc\Downloads\bet365 - Apostas Desportivas Online.mhtml",
                "https://docs.python.org/3/library/multiprocessing.html",
                r"C:\Users\hansc\Downloads\Your Repositories.mhtml",
            ],
            chunks=1,
            processes=4,
        )
        end = perf_counter() - start

        start1 = perf_counter()
        df2 = parse_html_subprocess(  # doesn't need if __name__ == "__main__":
            html=[
                r"C:\Users\hansc\Downloads\bet365 - Apostas Desportivas Online.mhtml",
                "https://docs.python.org/3/library/multiprocessing.html",
                r"C:\Users\hansc\Downloads\Your Repositories.mhtml",
            ],
            chunks=1,
            processes=4,
        )
        end1 = perf_counter() - start1

        print(df1)
        print(df2)
        print(end, end1)
        df1.drop_duplicates(subset=["aa_h0", "aa_h1", "aa_h2", "aa_h3"]).aa_soup.apply(
            lambda x: g if (g := x.find_all("a")) else pd.NA
        ).dropna().ds_color_print_all()

    df = parse_html_multiproc(
        r"C:\Users\hansc\Downloads\bet365 - Apostas Desportivas Online2.mhtml",
        chunks=3,
        processes=4,
    )

    results = (
        df.loc[
            np.all(
                df[["aa_tag", "aa_value", "aa_attr"]].__array__()
                == np.array([["div", "ovm-Fixture-media", "class"]]),
                axis=1,
            )
        ]
        .aa_html.apply(
            lambda x: [
                [y.text]
                for y in bs4.BeautifulSoup(x).find_all(
                    re.compile(r"\b(?:span|div)\b"),
                    class_=re.compile(
                        "(?:ovm-ParticipantOddsOnly_Odds)|(?:ovm-FixtureDetailsTwoWay_TeamName)"
                    ),
                )
            ]
        )
        .apply(lambda q: [t[0] for t in q])
        .apply(pd.Series)
    ).reset_index(drop=True)
    print(results.to_string())
sys.setrecursionlimit(3000)

# Example - Odds - Live Games from bet365.com
#                               0                              1      2     3      4
# 0                      Criciúma                    Chapecoense   4.00  3.00   2.05
# 1                      Barbados             Dominican Republic    NaN   NaN    NaN
# 2             Trindade e Tobago                      Guatemala  11.00  4.75   1.33
# 3         CA Union Villa Krause              San Lorenzo Ullum   3.20  2.62   2.50
# 4                   Once Caldas            Jaguares de Córdoba   1.66  3.20   6.50
# 5             New Mexico United                 Memphis 901 FC   1.11  7.50  13.00
# 6                   FC Santiago           Huracanes Izcalli FC   2.75  3.60   2.20
# 7                Grupo Sherwood       Club Leones Huixquilucan   4.75  3.75   1.66
# 8                 Auckland City             Cashmere Technical   1.44  4.00   7.00
# 9                     Petone FC             Auckland United FC  12.00  8.00   1.11
# 10                  Árabe Unido         Sporting San Miguelito   3.50  2.30   2.75
# 11                    Udelas FC                    Union Cocle   3.50  1.61   5.50
# 12          Deportivo Maldonado                        Peñarol  29.00  5.00   1.18
# 13              Central Espanol                        Basanez   3.75  1.66   4.33
# 14     Argentina (JKey) Esports       Portugal (RuBIX) Esports   2.10  3.75   2.87
# 15  Eintracht (Aleksis) Esports  Dortmund (Kalibrikon) Esports   4.50  1.90   2.87
# 16   Germany (lowheels) Esports   France (DangerDim77) Esports   1.83  3.75   3.50
# 17          Lazio (Nio) Esports        Arsenal (Panic) Esports   1.80  5.00   3.00
# 18       Lens (General) Esports      Sevilla (Chemist) Esports   1.83  5.00   2.87
```

            
