<img src="boxfish.svg" width="100%" alt="">
# boxfish: lightweight table extraction from HTML
[![PyPI](https://img.shields.io/pypi/v/boxfish)](https://img.shields.io/pypi/v/boxfish)
[![PyPI - Status](https://img.shields.io/pypi/status/boxfish)](https://img.shields.io/pypi/status/boxfish)
[![PyPI - License](https://img.shields.io/pypi/l/boxfish)](https://img.shields.io/pypi/l/boxfish)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/boxfish)](https://img.shields.io/pypi/pyversions/boxfish)
[![GitHub top language](https://img.shields.io/github/languages/top/peterkorteweg/boxfish)](https://img.shields.io/github/languages/top/peterkorteweg/boxfish)
[![Code Style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
### What is it?
Boxfish is a lightweight tool for table extraction from HTML pages.
### Main features
- Easy configuration. No knowledge of CSS selectors or XPath required.
- Fast table extraction to CSV files.
- Integration of `requests` and `selenium`.
### Quick start
``` python
import boxfish as bf
import pandas as pd
# Define the table layout of a URL with strings from two rows.
aurl = ""
row1 = ""
row2 = ""
# Build a configuration
aconfig = bf.build(url=aurl, rows=[row1, row2])
# Extract a table
data = bf.extract(aconfig, url=aurl)
# View results
df = pd.DataFrame(data)
df.head()
```
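The quick start passes `data` to `pd.DataFrame`, and the feature list mentions extraction to CSV files. As a minimal sketch of that last step, assuming `data` is a list of rows (its exact shape is an assumption, not confirmed here), the rows can be serialized to CSV with the standard library alone. The helper `rows_to_csv`, the sample rows, and the column names are hypothetical and not part of boxfish.

``` python
import csv
import io


def rows_to_csv(rows, header=None):
    """Serialize an iterable of rows to CSV text (hypothetical helper, not a boxfish function)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for the `data` returned by bf.extract; the list-of-rows shape is an assumption.
sample_rows = [["Alice", "30"], ["Bob", "25"]]
csv_text = rows_to_csv(sample_rows, header=["name", "age"])
print(csv_text)
```

To write to disk instead, the same `csv.writer` can wrap a file opened with `open(path, "w", newline="")`.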
### Where to get it?
Boxfish is available on [PyPI](https://pypi.org/project/boxfish/) and [GitHub](https://github.com/peterkorteweg/boxfish/).
```
pip install boxfish
```
### Dependencies
The main dependencies are:
- [**BeautifulSoup4**](https://pypi.org/project/beautifulsoup4/), a Python library for pulling data out of HTML and XML files.
- [**lxml**](https://pypi.org/project/lxml/), a powerful and Pythonic XML processing library.
- [**Requests**](https://pypi.org/project/requests/), a simple, yet elegant, HTTP library.
- [**Selenium**](https://pypi.org/project/selenium/), automated web browser interaction from Python.
### License
Boxfish is available under the [MIT license](https://github.com/peterkorteweg/boxfish/blob/main/LICENSE).
### Limitations
Boxfish extracts text from HTML. To check whether the HTML of a page contains the
text of interest, open the page in a browser and inspect the HTML in the developer tools, which most browsers open with
<kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>I</kbd>.
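The same check can be done in code on HTML you have already saved or downloaded. The sketch below uses only the standard library and is not part of boxfish; `html_contains_text` and the sample pages are hypothetical. It collects the text nodes of a document, skips `<script>` and `<style>` content, and reports whether a given string appears.

``` python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_contains_text(html, needle):
    """Return True if `needle` occurs in the text nodes of `html` (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return needle in " ".join(collector.chunks)


# A static page carries the text in its markup.
static_page = "<html><body><table><tr><td>Alice</td><td>30</td></tr></table></body></html>"
# A script-rendered page only mentions the text inside JavaScript.
dynamic_page = '<html><body><div id="app"></div><script>render("Alice")</script></body></html>'

print(html_contains_text(static_page, "Alice"))
print(html_contains_text(dynamic_page, "Alice"))
```

If the text only shows up after the browser has run JavaScript, it will be missing from the raw HTML, which is the situation this section describes.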
### Documentation
Full documentation is available [here](https://github.com/peterkorteweg/boxfish/blob/main/Documentation.md).
Raw data
{
"_id": null,
"home_page": "https://github.com/peterkorteweg/boxfish/",
"name": "boxfish",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "beautifulsoup html pandas scraping tables",
"author": "Peter Korteweg",
"author_email": "boxfish@peterkorteweg.com",
"download_url": "https://files.pythonhosted.org/packages/e6/1e/df51a537cca5bb1facda648e836f029ee923cfa743734cc6da6fa9cdc9d5/boxfish-0.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A lightweight tool for table extraction from HTML pages.",
"version": "0.1.2",
"project_urls": {
"Homepage": "https://github.com/peterkorteweg/boxfish/"
},
"split_keywords": [
"beautifulsoup",
"html",
"pandas",
"scraping",
"tables"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6da95ad8e613959e0cfa335dcefb6dbf810606eb2751d9f34bc3486975e2da25",
"md5": "4079d37d1c28ee525f2bef1cfd1f5607",
"sha256": "0a6958437343290d653f3bb07b595d55d2e4b75f4770b562f643806f8270618b"
},
"downloads": -1,
"filename": "boxfish-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4079d37d1c28ee525f2bef1cfd1f5607",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 30736,
"upload_time": "2023-06-25T13:14:21",
"upload_time_iso_8601": "2023-06-25T13:14:21.655026Z",
"url": "https://files.pythonhosted.org/packages/6d/a9/5ad8e613959e0cfa335dcefb6dbf810606eb2751d9f34bc3486975e2da25/boxfish-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e61edf51a537cca5bb1facda648e836f029ee923cfa743734cc6da6fa9cdc9d5",
"md5": "49a2e27fb32bb9060003509c9a620d93",
"sha256": "623cd4d507f255e9299b80ae0a3ff8d8b52388245b86cab102c55b08968c1152"
},
"downloads": -1,
"filename": "boxfish-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "49a2e27fb32bb9060003509c9a620d93",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 27552,
"upload_time": "2023-06-25T13:14:23",
"upload_time_iso_8601": "2023-06-25T13:14:23.403466Z",
"url": "https://files.pythonhosted.org/packages/e6/1e/df51a537cca5bb1facda648e836f029ee923cfa743734cc6da6fa9cdc9d5/boxfish-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-25 13:14:23",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "peterkorteweg",
"github_project": "boxfish",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "requests",
"specs": [
[
">=",
"2.24.0"
]
]
},
{
"name": "beautifulsoup4",
"specs": [
[
">=",
"4.9.3"
]
]
},
{
"name": "selenium",
"specs": [
[
"~=",
"3.141.0"
]
]
},
{
"name": "lxml",
"specs": [
[
">=",
"4.6.3"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"1.3.5"
]
]
}
],
"lcname": "boxfish"
}