html-table-extractor


| Field | Value |
|---|---|
| Name | html-table-extractor |
| Version | 1.4.1 |
| Home page | https://github.com/yuanxu-li/html-table-extractor |
| Summary | A python library for extracting data from html table |
| Upload time | 2020-05-01 06:56:54 |
| Author | Justin Li |
| License | MIT |
| Keywords | html, table, beautifulsoup, crawler, scrape |
| Requirements | No requirements were recorded. |

# HTML Table Extractor
[![Build Status](https://travis-ci.org/yuanxu-li/html-table-extractor.svg?branch=master)](https://travis-ci.org/yuanxu-li/html-table-extractor)

_HTML Table Extractor is a Python library that uses [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) to extract data from complicated and messy HTML tables._

## Important links
* Repository: https://github.com/yuanxu-li/html-table-extractor
* Issues: https://github.com/yuanxu-li/html-table-extractor/issues

## Installation

```bash
pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor
```

## Usage

### Example 1 - Simple

<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>

```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```
`return_list()` returns the following (the `u` prefixes are Python 2 notation for unicode strings):
```python
[[u'1', u'2'], [u'3', u'4']]
```

### Example 2 - Transformer

<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>

```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()
```
`return_list()` returns:
```python
[[1, 2], [3, 4]]
```
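
The example above passes the built-in `int` as `transformer`. Assuming the argument accepts any one-argument callable that is applied to each cell's text (an inference from that example, not documented behaviour), a more forgiving converter might look like the sketch below. `to_number` is a hypothetical helper and is not part of html-table-extractor.

```python
def to_number(text):
    """Convert cell text to int or float when possible, else return stripped text.

    Hypothetical helper for illustration; not part of html-table-extractor.
    """
    # Drop surrounding whitespace and thousands separators before converting.
    cleaned = text.strip().replace(",", "")
    for convert in (int, float):
        try:
            return convert(cleaned)
        except ValueError:
            continue
    # Neither conversion worked, so keep the text (minus outer whitespace).
    return text.strip()


cells = ["1", " 2 ", "3.5", "1,200", "n/a"]
converted = [to_number(c) for c in cells]
print(converted)
```

Under that assumption it would be passed as `Extractor(table_doc, transformer=to_number)`, so that a single non-numeric cell does not raise the way `int` would.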

### Example 3 - Pass BS4 Tag

<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>

```python
from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup
table_doc = """
<html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table id='unwanted'><tr><td>not wanted</td></tr></table></html>
"""
soup = BeautifulSoup(table_doc, 'html.parser')
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()
```
`return_list()` returns:
```python
[[u'1', u'2'], [u'3', u'4']]
```

### Example 4 - Complex

<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=3>5</td>
    </tr>
</table>

```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```
`return_list()` returns:
```python
[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]
```
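
The result repeats each spanned value in every grid slot it covers. The sketch below illustrates that idea on cells that have already been read as `(text, rowspan, colspan)` tuples. It is not the library's implementation; `expand_spans` and the tuple layout are assumptions made for the illustration.

```python
def expand_spans(rows):
    """Expand (text, rowspan, colspan) cells into a rectangular grid of text.

    `rows` is a list of table rows; each row is a list of
    (text, rowspan, colspan) tuples in document order. Illustrative only.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip columns already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the text into every slot the cell spans.
            for dr in range(rowspan):
                for dc in range(colspan):
                    # setdefault leaves a slot untouched if it is already filled.
                    grid.setdefault((r + dr, c + dc), text)
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# The table from Example 4, written as (text, rowspan, colspan) tuples.
example_4 = [
    [("1", 2, 1), ("2", 1, 1), ("3", 1, 1)],
    [("4", 1, 2)],
    [("5", 1, 3)],
]
print(expand_spans(example_4))
# [['1', '2', '3'], ['1', '4', '4'], ['5', '5', '5']]
```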

### Example 5 - Conflicted

<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>

```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```
`return_list()` returns:
```python
[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]
```
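
In this table the cell `4` asks for two columns, but the last slot of its row is already covered by the `rowspan=3` cell `3`, and the output above keeps the earlier value. The sketch below lists such overlaps for cells given as `(text, rowspan, colspan)` tuples. `find_conflicts` is a hypothetical helper, not part of the library, and the "first cell wins" rule is inferred from the output shown rather than from the library's source.

```python
def find_conflicts(rows):
    """Return (row, col, kept_text, dropped_text) for every slot claimed twice.

    `rows` holds (text, rowspan, colspan) tuples per table row. The first
    cell to claim a slot is treated as the one that is kept. Illustrative only.
    """
    owner = {}      # (row, col) -> text of the first cell to claim the slot
    conflicts = []
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # A cell starts at the first free column of its row.
            while (r, c) in owner:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    slot = (r + dr, c + dc)
                    if slot in owner:
                        # Slot already taken: record the clash, keep the owner.
                        conflicts.append((slot[0], slot[1], owner[slot], text))
                    else:
                        owner[slot] = text
            c += colspan
    return conflicts


# The table from Example 5, written as (text, rowspan, colspan) tuples.
example_5 = [
    [("1", 2, 1), ("2", 1, 1), ("3", 3, 1)],
    [("4", 1, 2)],
    [("5", 1, 2)],
]
print(find_conflicts(example_5))
# [(1, 2, '3', '4')]
```

The single entry says that in row 1, column 2 (zero-based), the value `3` is kept and `4` is dropped, which matches the middle row `[u'1', u'4', u'3']` above.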

### Example 6 - Write to file

<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>

```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')
```
It creates a new CSV file named `output.csv` in the directory given by `path`:
```
1,2
3,4

```
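
When a different file name or an explicit encoding is needed, the rows can also be written with the standard library's `csv` module. The following is a minimal Python 3 sketch; `write_rows` is a hypothetical helper, not part of html-table-extractor, and the hard-coded `rows` stand in for the output of `return_list()`.

```python
import csv
import os
import tempfile


def write_rows(rows, directory, filename="output.csv"):
    """Write a list of rows to a CSV file and return the file's path."""
    target = os.path.join(directory, filename)
    # newline="" lets the csv module control line endings itself.
    with open(target, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return target


# Stand-in for extractor.return_list().
rows = [["1", "2"], ["3", "4"]]

with tempfile.TemporaryDirectory() as tmp:
    path = write_rows(rows, tmp, filename="table.csv")
    # Read the file back to confirm the contents survive a round trip.
    with open(path, newline="", encoding="utf-8") as handle:
        round_trip = list(csv.reader(handle))

print(round_trip)
```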

## Team

* [@yuanxu-li](https://github.com/yuanxu-li)

## Errors/ Bugs

If something is not working correctly, or if you have suggestions for improvement, [report it here](https://github.com/yuanxu-li/html-table-extractor/issues).

## Copyright

Copyright (c) 2017 Justin Li. Released under the [MIT License](https://github.com/yuanxu-li/html-table-extractor/blob/master/README.md)

Third-party copyright in this distribution is noted where applicable.



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/yuanxu-li/html-table-extractor",
    "name": "html-table-extractor",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "html table beautifulsoup crawler scrape",
    "author": "Justin Li",
    "author_email": "yuanxu.lee@gmail.com",
    "download_url": "",
    "platform": "",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A python library for extracting data from html table",
    "version": "1.4.1",
    "project_urls": {
        "Homepage": "https://github.com/yuanxu-li/html-table-extractor"
    },
    "split_keywords": [
        "html",
        "table",
        "beautifulsoup",
        "crawler",
        "scrape"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3b8a12d04fa841340b818b0ff9c1439789301c5abe7f9389c9584f912c1e95b4",
                "md5": "3df0563aa7d7b34a9a3c54028a508859",
                "sha256": "5f3ef41aee2f2bf46400c46227b2a1b553165fb7dea00c9c41ec82c27da28a48"
            },
            "downloads": -1,
            "filename": "html_table_extractor-1.4.1-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3df0563aa7d7b34a9a3c54028a508859",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 4782,
            "upload_time": "2020-05-01T06:56:54",
            "upload_time_iso_8601": "2020-05-01T06:56:54.152497Z",
            "url": "https://files.pythonhosted.org/packages/3b/8a/12d04fa841340b818b0ff9c1439789301c5abe7f9389c9584f912c1e95b4/html_table_extractor-1.4.1-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2020-05-01 06:56:54",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "yuanxu-li",
    "github_project": "html-table-extractor",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "tox": true,
    "lcname": "html-table-extractor"
}
        