### One line web scraping by combining pandas and BeautifulSoup4
##### Check out the video
<div align="left">
<a href="https://www.youtube.com/watch?v=pvnODvnMyrg">
<img src="https://img.youtube.com/vi/pvnODvnMyrg/0.jpg" style="width:100%;">
</a>
</div>
##### Code from the video
```bash
pip install a-pandas-ex-bs4df
```
```python
from a_pandas_ex_bs4df import pd_add_bs4_to_df
import pandas as pd
pd_add_bs4_to_df()  # makes pd.Q_bs4_to_df available
from PrettyColorPrinter import add_printer  # optional
add_printer(True)  # optional

# Download the page and turn the parsed elements into a DataFrame
df = pd.Q_bs4_to_df(r'https://github.com/search?l=Python&q=python&type=Repositories')
# Rows that have an href and whose attribute values contain 'middle'
df.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle', regex=False, na=False)]
# Same rows, calling the object stored in ff_fetchParents for each one
df.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle', regex=False, na=False)].ff_fetchParents.apply(lambda x: x())
# Rows that have a src not ending in .png
df.loc[(~df.bb_src.isna()) & (~df.bb_src.str.contains(r'\.png$', regex=True, na=False))]
# Rows that have a src ending in .png
df.loc[(~df.bb_src.isna()) & (df.bb_src.str.contains(r'\.png$', regex=True, na=False))]
```
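The filters above all follow the same pandas pattern: a "not missing" mask combined with a `str.contains` mask. The sketch below shows that pattern on a small hand-made DataFrame, so it runs without network access. The column names `bb_href`, `bb_src`, and `aa_attrs_values` are taken from the code above; the rows are invented for illustration and are not real output of `pd.Q_bs4_to_df`.

```python
import pandas as pd

# Hand-made stand-in for a scraped frame; all values are invented.
df = pd.DataFrame(
    {
        "bb_href": ["/user/repo", None, "/about", None],
        "bb_src": [None, "https://example.com/logo.png", None, "https://example.com/app.js"],
        "aa_attrs_values": ["v-align-middle", "avatar", "footer-link", None],
    }
)

# Rows with an href whose attribute values contain the literal text 'middle'.
has_href = df.bb_href.notna()
is_middle = df.aa_attrs_values.str.contains("middle", regex=False, na=False)
links = df.loc[has_href & is_middle]

# Rows with a src, split by whether the src ends in .png.
has_src = df.bb_src.notna()
is_png = df.bb_src.str.contains(r"\.png$", regex=True, na=False)
png_sources = df.loc[has_src & is_png]
other_sources = df.loc[has_src & ~is_png]

print(links)
print(png_sources)
print(other_sources)
```

Passing `na=False` makes missing values count as "no match", which keeps each mask strictly boolean so it can be combined with `&` and negated with `~`.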
```python
Parameters:
    htmlcode: Union[str, bytes]
        file path, url or html source code
        urls will be downloaded with requests
    dontuse: tuple
        bs4 attributes to exclude from the dataframe
        default = (
            "element_classes",
            "builder",
            "is_xml",
            "known_xml",
            "_namespaces",
            "parse_only",
            "markup",
            "contains_replacement_characters",
            "original_encoding",
            "declared_html_encoding",
            "parser_class",
            "namespace",
            "prefix",
            "cdata_list_attributes",
            "preserve_whitespace_tag_stack",
            "open_tag_counter",
            "preserve_whitespace_tags",
            "interesting_string_types",
            "current_data",
            "string_container_stack",
            "_most_recent_element",
            "currentTag",
        )
    parser: str
        Have a look at the bs4 documentation
        (default='lxml')
    tags_to_find: Union[bool, str] = True
        will be passed to soup.find_all()
        Have a look at the bs4 documentation
        (default=True)  # everything
Returns:
    df: pd.DataFrame
```
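Since `tags_to_find` is described as being passed to `soup.find_all()`, it may help to see what that call returns in plain BeautifulSoup. This is a minimal sketch of the bs4 behaviour only, not of the package's internals. The HTML snippet is made up, and `html.parser` is used here instead of the package default `lxml` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Invented HTML snippet for illustration.
html = """
<html><body>
  <a href="/one">one</a>
  <img src="logo.png">
  <a href="/two">two</a>
</body></html>
"""

# html.parser ships with Python; the package documents 'lxml' as its default.
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag, which corresponds to tags_to_find=True.
all_tags = soup.find_all(True)

# find_all("a") matches only <a> tags, which corresponds to tags_to_find="a".
only_links = soup.find_all("a")

print([tag.name for tag in all_tags])
print([tag.get("href") for tag in only_links])
```

Restricting the search to a tag name is presumably the way to keep the resulting DataFrame small when only one kind of element is of interest.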
##### Raw data
{
"_id": null,
"home_page": "https://github.com/hansalemaos/a_pandas_ex_bs4df",
"name": "a-pandas-ex-bs4df",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "BeautifulSoup4,bs4,pandas,web scraping",
"author": "Johannes Fischer",
"author_email": "<aulasparticularesdealemaosp@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/ce/04/72c94c4c717af32875e28964dce7b9ea04824eb56a11518f6e2f24f7ed6c/a_pandas_ex_bs4df-0.10.tar.gz",
"platform": null,
"description": "\n### One line web scraping by combining pandas and BeautifulSoup4\n\n\n\n##### Check out the video\n\n\n\n<div align=\"left\">\n\n <a href=\"https://www.youtube.com/watch?v=pvnODvnMyrg\">\n\n <img src=\"https://img.youtube.com/vi/pvnODvnMyrg/0.jpg\" style=\"width:100%;\">\n\n </a>\n\n</div>\n\n\n\n##### Code from the video\n\n\n\n```python\n\npip install a-pandas-ex-bs4df \n\n```\n\n\n\n```python\n\nfrom a_pandas_ex_bs4df import pd_add_bs4_to_df\n\nimport pandas as pd\n\npd_add_bs4_to_df() \n\n\n\nfrom PrettyColorPrinter import add_printer #optional\n\nadd_printer(True) #optional\n\n\n\ndf=pd.Q_bs4_to_df(r'https://github.com/search?l=Python&q=python&type=Repositories')\n\ndf.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle',regex=False, na=False)]\n\ndf.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle',regex=False, na=False)].ff_fetchParents.apply(lambda x: x())\n\ndf.loc[(~df.bb_src.isna()) & (~df.bb_src.str.contains(r'\\.png$',regex=True,na=False))]\n\ndf.loc[(~df.bb_src.isna()) & (df.bb_src.str.contains(r'\\.png$',regex=True,na=False))]\n\n```\n\n\n\n```python\n\nParameters:\n\n htmlcode:Union[str,bytes]\n\n file path, url or html source code\n\n urls will be downloaded with requests\n\n dontuse:tuple\n\n bs4 attributes to exclude from the dataframe\n\n default = (\n\n \"element_classes\",\n\n \"builder\",\n\n \"is_xml\",\n\n \"known_xml\",\n\n \"_namespaces\",\n\n \"parse_only\",\n\n \"markup\",\n\n \"contains_replacement_characters\",\n\n \"original_encoding\",\n\n \"declared_html_encoding\",\n\n \"parser_class\",\n\n \"namespace\",\n\n \"prefix\",\n\n \"cdata_list_attributes\",\n\n \"preserve_whitespace_tag_stack\",\n\n \"open_tag_counter\",\n\n \"preserve_whitespace_tags\",\n\n \"interesting_string_types\",\n\n \"current_data\",\n\n \"string_container_stack\",\n\n \"_most_recent_element\",\n\n \"currentTag\",\n\n )\n\n parser: str\n\n Have a look at the bs4 documentation\n\n (default='lxml')\n\n 
tags_to_find:Union[bool,str]=True\n\n will be passed to soup.find_all()\n\n Have a look at the bs4 documentation\n\n (default=True) #everything\n\nReturns:\n\n df: pd.DataFrame\n\n```\n\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "One line web scraping by combining pandas and BeautifulSoup4",
"version": "0.10",
"split_keywords": [
"beautifulsoup4",
"bs4",
"pandas",
"web scraping"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0321d85dcef2301023e46cf66aeed325a0fce1492c89d90d84295561840ee67d",
"md5": "eb457682b329a9b7d96ab8ce71a4e177",
"sha256": "58383acd844ccdac85b7f22a2e865bc077e944bcfc02f615d23a563168ccdebf"
},
"downloads": -1,
"filename": "a_pandas_ex_bs4df-0.10-py3-none-any.whl",
"has_sig": false,
"md5_digest": "eb457682b329a9b7d96ab8ce71a4e177",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 7742,
"upload_time": "2022-10-29T21:42:52",
"upload_time_iso_8601": "2022-10-29T21:42:52.969587Z",
"url": "https://files.pythonhosted.org/packages/03/21/d85dcef2301023e46cf66aeed325a0fce1492c89d90d84295561840ee67d/a_pandas_ex_bs4df-0.10-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ce0472c94c4c717af32875e28964dce7b9ea04824eb56a11518f6e2f24f7ed6c",
"md5": "c90939dab6c03bc332d2f8b019acafe0",
"sha256": "2b22ace100590415338716a259c3adcbb939042ec08998b5abb975ecdf73a845"
},
"downloads": -1,
"filename": "a_pandas_ex_bs4df-0.10.tar.gz",
"has_sig": false,
"md5_digest": "c90939dab6c03bc332d2f8b019acafe0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 5352,
"upload_time": "2022-10-29T21:42:54",
"upload_time_iso_8601": "2022-10-29T21:42:54.881414Z",
"url": "https://files.pythonhosted.org/packages/ce/04/72c94c4c717af32875e28964dce7b9ea04824eb56a11518f6e2f24f7ed6c/a_pandas_ex_bs4df-0.10.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-10-29 21:42:54",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "hansalemaos",
"github_project": "a_pandas_ex_bs4df",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "a-pandas-ex-bs4df"
}