# BeautifulSoup multiprocessing parsing to pandas DataFrame
## Tested against Windows / Python 3.11 / Anaconda
## pip install bs4multiproc
A Python library for parsing HTML content into pandas DataFrames. It leverages BeautifulSoup, pandas, multiprocessing, and subprocesses to efficiently extract structured information from HTML documents.
## What does it do?
### HTML Parsing:
The library's primary purpose is to parse HTML content.
It can handle both local HTML files and web-based
HTML content retrieved via URLs.
### Parallel Processing:
The library offers two main functions, `parse_html_subprocess` and `parse_html_multiproc`, to process HTML content in parallel.
This parallelism can significantly speed up the parsing
of multiple HTML documents.
### DataFrame Output:
The library returns structured data in the form of pandas DataFrames. These DataFrames contain detailed information about HTML
elements within the parsed content, such as tag names,
attributes, text, and more.
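As a rough illustration (not the library's internals), the kind of per-element table such a DataFrame represents can be sketched with the stdlib parser. The `aa_`-prefixed column names follow the examples further below, but the exact schema here is illustrative:

```python
from html.parser import HTMLParser
import pandas as pd

class ElementCollector(HTMLParser):
    """Collect one row per (tag, attribute) pair - a simplified stand-in
    for the per-element table the library returns."""
    def __init__(self):
        super().__init__()
        self.rows = []
    def handle_starttag(self, tag, attrs):
        if not attrs:
            self.rows.append({"aa_tag": tag, "aa_attr": None, "aa_value": None})
        for attr, value in attrs:
            self.rows.append({"aa_tag": tag, "aa_attr": attr, "aa_value": value})

collector = ElementCollector()
collector.feed('<div class="odds"><span id="x">1.50</span></div>')
df = pd.DataFrame(collector.rows)
print(df)
#   aa_tag aa_attr aa_value
# 0    div   class     odds
# 1   span      id        x
```

Having one row per element/attribute is what makes the output easy to filter with ordinary pandas indexing, as the bet365 example below demonstrates.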
### Caching:
The library uses functools.cache for memoization,
which can improve performance by avoiding
unnecessary recomputation of previously processed data.
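The memoization idea can be sketched with `functools.cache`; the library's cached functions are internal, so `parse_once` here is a hypothetical stand-in:

```python
from functools import cache

calls = []  # records each time the function body actually runs

@cache
def parse_once(html: str) -> int:
    calls.append(html)       # side effect to show recomputation is skipped
    return html.count("<")   # stand-in for real parsing work

parse_once("<div><span></span></div>")
parse_once("<div><span></span></div>")  # same input: served from the cache
print(len(calls))  # → 1
```

The second call returns instantly from the cache, which is why reprocessing identical HTML content repeatedly costs almost nothing.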
## Advantages of the Library:
### Efficiency:
Parallel processing is a key advantage of this library.
It can distribute the parsing of multiple HTML documents
across multiple CPU cores, making it significantly faster
when dealing with a large number of documents.
### Structured Data:
The library doesn't just parse HTML; it structures the data
in a tabular format using DataFrames.
This structured data can be easily analyzed,
transformed, and used for various purposes.
### Flexibility:
The library is flexible and can handle various input
sources, including local files, web URLs,
and multipart messages (e.g., email content).
### Subprocess Support:
The `parse_html_subprocess` function allows you to offload the HTML
parsing task to a separate subprocess. This can be useful when
dealing with potentially untrusted or resource-intensive
HTML content, as it isolates the parsing process.
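The isolation idea can be sketched with the stdlib `subprocess` module (illustrative only, not the library's internal protocol): the parsing work runs in a fresh interpreter, so a crash or memory blow-up there cannot take down the parent process:

```python
import subprocess
import sys

html = "<p>hello</p>"
# Trivial stand-in for parsing: count opening angle brackets on stdin.
code = "import sys; sys.stdout.write(str(sys.stdin.read().count('<')))"
result = subprocess.run(
    [sys.executable, "-c", code],  # fresh, isolated Python process
    input=html, capture_output=True, text=True, check=True,
)
print(result.stdout)  # → 2
```

In the real library the child process does the BeautifulSoup work and ships a DataFrame back; the point here is only the process boundary.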
### Parallelism Control:
You can control the level of parallelism by specifying
the number of processes and chunks.
This flexibility allows you to fine-tune the
performance based on your system's capabilities and specific requirements.
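A minimal sketch of what chunking means (the `chunked` helper is hypothetical, not part of the library's API): the documents are split into near-equal batches, which can then be handed out to worker processes:

```python
def chunked(items, chunks):
    """Split items into at most `chunks` near-equal consecutive batches."""
    chunks = max(1, min(chunks, len(items)))
    size, rem = divmod(len(items), chunks)
    out, start = [], 0
    for i in range(chunks):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        out.append(items[start:end])
        start = end
    return out

print(chunked(["a", "b", "c", "d", "e"], 2))  # → [['a', 'b', 'c'], ['d', 'e']]
```

More chunks mean smaller batches and better load balancing; fewer chunks mean less inter-process overhead. Tuning `chunks` together with `processes` lets you find the sweet spot for your workload.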
### Caching:
The caching mechanism helps save time by reusing previously
parsed results, especially when working with the same
content repeatedly.
### Cross-Platform:
The library supports both Windows and non-Windows environments,
ensuring compatibility across different operating systems.
## parse_html_subprocess
```python
def parse_html_subprocess(html, chunks=2, processes=None):
    r"""
    Parse HTML Content Using Subprocess

    This function takes HTML content as input, processes it in a subprocess,
    and returns a structured DataFrame containing information about HTML elements.
    It is suitable for parsing HTML documents with subprocess-based isolation.

    Parameters:
    - html (str or bytes): HTML content to be processed. It can be provided as a
      string, bytes, a file path, or a URL - or a list of such inputs, as in the
      examples below.
    - chunks (int, optional): The number of chunks to divide the HTML processing into.
      This can help optimize processing for large datasets. Default is 2.
    - processes (int, optional): The number of parallel processes to use for parsing.
      If not specified, it defaults to (number of CPU cores - 1).

    Returns:
    - pandas.DataFrame: A DataFrame containing information about HTML elements,
      such as tag names, attributes, text, and more.
    """
```
## parse_html_multiproc
```python
def parse_html_multiproc(htmls, chunks=2, processes=5):
r"""
Parse HTML Content Using Multiprocessing
This function takes a list of HTML content, processes it in parallel using the multiprocessing library,
and returns a structured DataFrame containing information about HTML elements. It is suitable for
parsing multiple HTML documents simultaneously.
Parameters:
- htmls (list): A list of HTML content to be processed. Each item in the list should represent HTML content, typically as strings or bytes.
- chunks (int, optional): The number of chunks to divide the HTML processing into. This can help optimize processing for large datasets. Default is 2.
- processes (int, optional): The number of parallel processes to use for parsing. Default is 5.
Returns:
- pandas.DataFrame: A DataFrame containing information about HTML elements, such as tag names, attributes, text, and more.
"""
```
## Examples
```python
import re
import sys
import bs4
from PrettyColorPrinter import add_printer
from bs4multiproc import parse_html_subprocess, parse_html_multiproc
import pandas as pd
add_printer(1)
from time import perf_counter
sys.setrecursionlimit(10000)
import numpy as np
if __name__ == "__main__":
execute_examples = False
if execute_examples:
start = perf_counter()
df1 = parse_html_multiproc( # needs if __name__ == "__main__": !!!!
htmls=[
r"C:\Users\hansc\Downloads\bet365 - Apostas Desportivas Online.mhtml",
"https://docs.python.org/3/library/multiprocessing.html",
r"C:\Users\hansc\Downloads\Your Repositories.mhtml",
],
chunks=1,
processes=4,
)
end = perf_counter() - start
start1 = perf_counter()
df2 = parse_html_subprocess( # doesn't need if __name__ == "__main__":
html=[
r"C:\Users\hansc\Downloads\bet365 - Apostas Desportivas Online.mhtml",
"https://docs.python.org/3/library/multiprocessing.html",
r"C:\Users\hansc\Downloads\Your Repositories.mhtml",
],
chunks=1,
processes=4,
)
end1 = perf_counter() - start1
print(df1)
print(df2)
print(end, end1)
df1.drop_duplicates(subset=["aa_h0", "aa_h1", "aa_h2", "aa_h3"]).aa_soup.apply(
lambda x: g if (g := x.find_all("a")) else pd.NA
).dropna().ds_color_print_all()
df = parse_html_multiproc(
r"C:\Users\hansc\Downloads\bet365 - Apostas Desportivas Online2.mhtml",
chunks=3,
processes=4,
)
results = (
df.loc[
np.all(
df[["aa_tag", "aa_value", "aa_attr"]].__array__()
== np.array([["div", "ovm-Fixture-media", "class"]]),
axis=1,
)
]
.aa_html.apply(
lambda x: [
[y.text]
for y in bs4.BeautifulSoup(x).find_all(
re.compile(r"\b(?:span|div)\b"),
class_=re.compile(
"(?:ovm-ParticipantOddsOnly_Odds)|(?:ovm-FixtureDetailsTwoWay_TeamName)"
),
)
]
)
.apply(lambda q: [t[0] for t in q])
.apply(pd.Series)
).reset_index(drop=True)
print(results.to_string())
sys.setrecursionlimit(3000)
# Example - Odds - Live Games from bet365.com
# 0 1 2 3 4
# 0 Criciúma Chapecoense 4.00 3.00 2.05
# 1 Barbados Dominican Republic NaN NaN NaN
# 2 Trindade e Tobago Guatemala 11.00 4.75 1.33
# 3 CA Union Villa Krause San Lorenzo Ullum 3.20 2.62 2.50
# 4 Once Caldas Jaguares de Córdoba 1.66 3.20 6.50
# 5 New Mexico United Memphis 901 FC 1.11 7.50 13.00
# 6 FC Santiago Huracanes Izcalli FC 2.75 3.60 2.20
# 7 Grupo Sherwood Club Leones Huixquilucan 4.75 3.75 1.66
# 8 Auckland City Cashmere Technical 1.44 4.00 7.00
# 9 Petone FC Auckland United FC 12.00 8.00 1.11
# 10 Árabe Unido Sporting San Miguelito 3.50 2.30 2.75
# 11 Udelas FC Union Cocle 3.50 1.61 5.50
# 12 Deportivo Maldonado Peñarol 29.00 5.00 1.18
# 13 Central Espanol Basanez 3.75 1.66 4.33
# 14 Argentina (JKey) Esports Portugal (RuBIX) Esports 2.10 3.75 2.87
# 15 Eintracht (Aleksis) Esports Dortmund (Kalibrikon) Esports 4.50 1.90 2.87
# 16 Germany (lowheels) Esports France (DangerDim77) Esports 1.83 3.75 3.50
# 17 Lazio (Nio) Esports Arsenal (Panic) Esports 1.80 5.00 3.00
# 18 Lens (General) Esports Sevilla (Chemist) Esports 1.83 5.00 2.87
```