boxfish


Name: boxfish
Version: 0.1.2
Home page: https://github.com/peterkorteweg/boxfish/
Summary: A lightweight tool for table extraction from HTML pages.
Upload time: 2023-06-25 13:14:23
Author: Peter Korteweg
Requires Python: >=3.6
License: MIT
Keywords: beautifulsoup, html, pandas, scraping, tables
Requirements: requests, beautifulsoup4, selenium, lxml, pandas
            <img src="boxfish.svg" width="100%" alt="">

# boxfish: lightweight table extraction from HTML

[![PyPI](https://img.shields.io/pypi/v/boxfish)](https://img.shields.io/pypi/v/boxfish)
[![PyPI - Status](https://img.shields.io/pypi/status/boxfish)](https://img.shields.io/pypi/status/boxfish)
[![PyPI - License](https://img.shields.io/pypi/l/boxfish)](https://img.shields.io/pypi/l/boxfish)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/boxfish)](https://img.shields.io/pypi/pyversions/boxfish)

[![GitHub top language](https://img.shields.io/github/languages/top/peterkorteweg/boxfish)](https://img.shields.io/github/languages/top/peterkorteweg/boxfish)
[![Code Style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

### What is it?
Boxfish is a lightweight tool for table extraction from HTML pages. 

### Main features

- Easy configuration. No knowledge of CSS selectors or XPath required.
- Fast table extraction to CSV files.
- Integration of `requests` and `selenium`.

### Quick start


``` python
import boxfish as bf
import pandas as pd

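# Describe the table layout with the page URL and one string
# taken from each of two rows of the table.
aurl = ""  # URL of the page that contains the table
row1 = ""  # text that appears in one table row
row2 = ""  # text that appears in another table row

# Build a configuration from the URL and the two row strings
aconfig = bf.build(url=aurl, rows=[row1, row2])

# Extract the table
data = bf.extract(aconfig, url=aurl)

# View the results
df = pd.DataFrame(data)
df.head()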
```
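
The Quick start ends with viewing the table. The sketch below shows one way to write such a table to CSV text with the standard `csv` module. It assumes the extracted data is a list of rows, each a list of strings; the README does not state the exact structure returned by `bf.extract`, so treat that as an assumption. The helper `rows_to_csv` and the sample rows are hypothetical and not part of boxfish.

```python
import csv
import io


def rows_to_csv(rows, header=None):
    """Serialize a list of rows to CSV text (hypothetical helper, not part of boxfish)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for the output of bf.extract(); the real structure may differ.
sample_rows = [["Alice", "42"], ["Bob", "17"]]
csv_text = rows_to_csv(sample_rows, header=["name", "score"])
print(csv_text)
```

If the data is already in a pandas DataFrame, as in the Quick start, `df.to_csv("table.csv", index=False)` writes it to a file directly.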

### Where to get it?
Boxfish is available on [PyPI](https://pypi.org/project/boxfish/) and [GitHub](https://github.com/peterkorteweg/boxfish/).

```
pip install boxfish
```

### Dependencies

The main dependencies are:
- [**BeautifulSoup4**](https://pypi.org/project/beautifulsoup4/), a Python library for pulling data out of HTML and XML files.
- [**lxml**](https://pypi.org/project/lxml/), a powerful and Pythonic XML processing library.
- [**Requests**](https://pypi.org/project/requests/), a simple, yet elegant, HTTP library.
- [**Selenium**](https://pypi.org/project/selenium/), automated web browser interaction from Python.
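
To check which of these dependencies are present in the current environment, a small script along the following lines may help. The version specifiers are copied from the package metadata for boxfish 0.1.2; the helper `installed_versions` is a hypothetical name, and the script only reports versions rather than validating them against the specifiers. It relies on `importlib.metadata`, which needs Python 3.8 or later.

```python
from importlib import metadata

# Version specifiers as listed in the package metadata for boxfish 0.1.2.
REQUIREMENTS = {
    "requests": ">=2.24.0",
    "beautifulsoup4": ">=4.9.3",
    "selenium": "~=3.141.0",
    "lxml": ">=4.6.3",
    "pandas": ">=1.3.5",
}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(REQUIREMENTS)
for name, spec in REQUIREMENTS.items():
    print(f"{name}: required {spec}, installed {report[name] or 'not installed'}")
```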


### License
Boxfish is available under the [MIT license](https://github.com/peterkorteweg/boxfish/blob/main/LICENSE).

### Limitations

Boxfish extracts text from HTML. To check whether the HTML of a page contains the
text of interest, open the page in a browser and inspect the HTML in the developer tools via
<kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>I</kbd>.
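
The same check can be done programmatically. The following is a minimal sketch using only the standard library; the class `TextCollector`, the function `html_contains_text`, and the sample HTML are hypothetical and not part of boxfish. It collects the text nodes of an HTML string, skipping `<script>` and `<style>` content, and tests whether a given string occurs in them.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_contains_text(html, needle):
    """Return True if `needle` occurs in the collected text of `html`."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return needle in " ".join(parser.chunks)


# Hypothetical page source; in practice this would be the HTML of the target page.
sample_html = """
<html><body>
  <table><tr><td>Alice</td><td>42</td></tr></table>
  <script>var hidden = "Bob";</script>
</body></html>
"""
print(html_contains_text(sample_html, "Alice"))
print(html_contains_text(sample_html, "Bob"))
```

In this sample, "Alice" sits in a table cell and is found, while "Bob" only appears inside a script and is not counted as page text.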

### Documentation

Full documentation is available in [Documentation.md](https://github.com/peterkorteweg/boxfish/blob/main/Documentation.md).





            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/peterkorteweg/boxfish/",
    "name": "boxfish",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "beautifulsoup html pandas scraping tables",
    "author": "Peter Korteweg",
    "author_email": "boxfish@peterkorteweg.com",
    "download_url": "https://files.pythonhosted.org/packages/e6/1e/df51a537cca5bb1facda648e836f029ee923cfa743734cc6da6fa9cdc9d5/boxfish-0.1.2.tar.gz",
    "platform": null,
    "description": "<img src=\"boxfish.svg\" width=\"100%\" alt=\"\">\n\n# boxfish: lightweight table extraction from HTML\n\n[![PyPI](https://img.shields.io/pypi/v/boxfish)](https://img.shields.io/pypi/v/boxfish)\n[![PyPI - Status](https://img.shields.io/pypi/status/boxfish)](https://img.shields.io/pypi/status/boxfish)\n[![PyPI - License](https://img.shields.io/pypi/l/boxfish)](https://img.shields.io/pypi/l/boxfish)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/boxfish)](https://img.shields.io/pypi/pyversions/boxfish)\n\n[![GitHub top language](https://img.shields.io/github/languages/top/peterkorteweg/boxfish)](https://img.shields.io/github/languages/top/peterkorteweg/boxfish)\n[![Code Style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n### What is it?\nBoxfish is a lightweight tool for table extraction from HTML pages. \n\n### Main features\n\n- Easy configuration. No knowledge of CSS or Xpaths required.\n- Fast table extraction to CSV files.\n- Integration of `requests` and `selenium`.\n\n### Quick start\n\n\n``` python\nimport boxfish as bf\nimport pandas as pd\n\n# Define table layout of an url with strings from two rows.\naurl = \"\"\nrow1 = \"\"\nrow2 = \"\"\n\n# Build a configuration \naconfig = bf.build(url=aurl, rows = [row1, row2])\n\n# Extract a table\ndata = bf.extract(aconfig, url=aurl)\n\n# View results\ndf = pd.DataFrame(data)\ndf.head() \n```\n\n### Where to get it?\nBoxfish is available on [Pypi](https://pypi.org/project/boxfish/) and [Github](https://github.com/peterkorteweg/boxfish/).\n\n```\npip install boxfish\n```\n\n### Dependencies\n\nThe main dependencies are:\n- [**BeautifulSoup4**](https://pypi.org/project/beautifulsoup4/), a Python library for pulling data out of HTML and XML files.\n- [**lxml**](https://pypi.org/project/lxml/), a powerful and Pythonic XML processing library.\n- [**Requests**](https://pypi.org/project/requests/), a simple, yet elegant, HTTP library.\n- 
[**Selenium**](https://pypi.org/project/selenium/), automated web browser interaction from Python.\n\n\n### License\nBoxfish is available with an [MIT license](https://github.com/peterkorteweg/boxfish/blob/main/LICENSE).\n\n### Limitations\n\nBoxfish extracts text from HTML. To see if the HTML file contains the\ntext of interest, open the page in a browser, then access the HTML in the developer tools via \n<kbd>Cntrl</kbd>+<kbd>Shift</kbd>+ <kbd>I</kbd>.\n\n### Documentation\n\nFull documentation is available [here](https://github.com/peterkorteweg/boxfish/blob/main/Documentation.md).\n\n\n\n\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A lightweight tool for table extraction from HTML pages.",
    "version": "0.1.2",
    "project_urls": {
        "Homepage": "https://github.com/peterkorteweg/boxfish/"
    },
    "split_keywords": [
        "beautifulsoup",
        "html",
        "pandas",
        "scraping",
        "tables"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6da95ad8e613959e0cfa335dcefb6dbf810606eb2751d9f34bc3486975e2da25",
                "md5": "4079d37d1c28ee525f2bef1cfd1f5607",
                "sha256": "0a6958437343290d653f3bb07b595d55d2e4b75f4770b562f643806f8270618b"
            },
            "downloads": -1,
            "filename": "boxfish-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4079d37d1c28ee525f2bef1cfd1f5607",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 30736,
            "upload_time": "2023-06-25T13:14:21",
            "upload_time_iso_8601": "2023-06-25T13:14:21.655026Z",
            "url": "https://files.pythonhosted.org/packages/6d/a9/5ad8e613959e0cfa335dcefb6dbf810606eb2751d9f34bc3486975e2da25/boxfish-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e61edf51a537cca5bb1facda648e836f029ee923cfa743734cc6da6fa9cdc9d5",
                "md5": "49a2e27fb32bb9060003509c9a620d93",
                "sha256": "623cd4d507f255e9299b80ae0a3ff8d8b52388245b86cab102c55b08968c1152"
            },
            "downloads": -1,
            "filename": "boxfish-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "49a2e27fb32bb9060003509c9a620d93",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 27552,
            "upload_time": "2023-06-25T13:14:23",
            "upload_time_iso_8601": "2023-06-25T13:14:23.403466Z",
            "url": "https://files.pythonhosted.org/packages/e6/1e/df51a537cca5bb1facda648e836f029ee923cfa743734cc6da6fa9cdc9d5/boxfish-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-25 13:14:23",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "peterkorteweg",
    "github_project": "boxfish",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.24.0"
                ]
            ]
        },
        {
            "name": "beautifulsoup4",
            "specs": [
                [
                    ">=",
                    "4.9.3"
                ]
            ]
        },
        {
            "name": "selenium",
            "specs": [
                [
                    "~=",
                    "3.141.0"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    ">=",
                    "4.6.3"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.3.5"
                ]
            ]
        }
    ],
    "lcname": "boxfish"
}
        