# HTML Table Extractor
[![Build Status](https://travis-ci.org/yuanxu-li/html-table-extractor.svg?branch=master)](https://travis-ci.org/yuanxu-li/html-table-extractor)
_HTML Table Extractor is a Python library that uses [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) to extract data from complicated and messy HTML tables._
## Important links
* Repository: https://github.com/yuanxu-li/html-table-extractor
* Issues: https://github.com/yuanxu-li/html-table-extractor/issues
## Installation
```bash
pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor
```
## Usage
### Example 1 - Simple
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```
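In an interactive session, the last call displays the parsed rows (under Python 3 the strings appear without the `u` prefix):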
```python
[[u'1', u'2'], [u'3', u'4']]
```
### Example 2 - Transformer
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()
```
It will print out:
```python
[[1, 2], [3, 4]]
```
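`int` works here because every cell holds a clean integer. For messier tables, a more forgiving callable may help. The sketch below assumes, as Example 2 suggests, that the transformer is called once per cell with that cell's text; `to_number` is a hypothetical helper and not part of the library.

```python
def to_number(text):
    """Convert cell text to int or float; fall back to the stripped string."""
    # Drop surrounding whitespace and thousands separators such as "1,234".
    cleaned = text.strip().replace(",", "")
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    # Not numeric: keep the text, minus surrounding whitespace.
    return text.strip()
```

Under that assumption it would be passed the same way as `int`, for example `Extractor(table_doc, transformer=to_number)`.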
### Example 3 - Pass BS4 Tag
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
```python
from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup
table_doc = """
<html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table id='unwanted'><tr><td>not wanted</td></tr></table></html>
"""
soup = BeautifulSoup(table_doc, 'html.parser')
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()
```
It will print out:
```python
[[u'1', u'2'], [u'3', u'4']]
```
### Example 4 - Complex
<table>
<tr>
<td rowspan=2>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td colspan=2>4</td>
</tr>
<tr>
<td colspan=3>5</td>
</tr>
</table>
```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table>
<tr>
<td rowspan=2>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td colspan=2>4</td>
</tr>
<tr>
<td colspan=3>5</td>
</tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```
It will print out:
```python
[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]
```
### Example 5 - Conflicted
<table>
<tr>
<td rowspan=2>1</td>
<td>2</td>
<td rowspan=3>3</td>
</tr>
<tr>
<td colspan=2>4</td>
</tr>
<tr>
<td colspan=2>5</td>
</tr>
</table>
```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table>
<tr>
<td rowspan=2>1</td>
<td>2</td>
<td rowspan=3>3</td>
</tr>
<tr>
<td colspan=2>4</td>
</tr>
<tr>
<td colspan=2>5</td>
</tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```
It will print out:
```python
[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]
```
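The outputs of Examples 4 and 5 are consistent with a simple rule: each cell is copied into every grid slot it spans, and a slot that is already occupied is never overwritten. The sketch below reproduces those outputs using only the standard library. It is an illustration of that rule, not the library's actual implementation, and `CellCollector`, `expand_spans`, and `table_to_list` are hypothetical names.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect (text, rowspan, colspan) tuples for every row of one table."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell tuples per <tr>
        self._cell = None  # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            attrs = dict(attrs)
            self._cell = [[], int(attrs.get("rowspan", 1)), int(attrs.get("colspan", 1))]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def expand_spans(rows):
    """Fill a rectangular grid, copying each cell into every slot it spans."""
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(rows):
        col = 0
        for text, rowspan, colspan in cells:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, col) in grid:
                col += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    # First writer wins: an occupied slot is left untouched.
                    grid.setdefault((r + dr, col + dc), text)
            col += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def table_to_list(html):
    collector = CellCollector()
    collector.feed(html)
    return expand_spans(collector.rows)


example_5 = """
<table>
<tr><td rowspan=2>1</td><td>2</td><td rowspan=3>3</td></tr>
<tr><td colspan=2>4</td></tr>
<tr><td colspan=2>5</td></tr>
</table>
"""
print(table_to_list(example_5))
# [['1', '2', '3'], ['1', '4', '3'], ['5', '5', '3']]
```

In the conflicted table, the `colspan=2` cell `4` would reach into the third column, but that slot already belongs to the `rowspan=3` cell `3`, so the earlier cell keeps it.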
### Example 6 - Write to file
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')
```
This creates a new CSV file named `output.csv` in the given directory, with the following contents:
```
1,2
3,4
```
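If a different delimiter or file name is needed, the rows returned by `return_list()` can also be written with Python's standard `csv` module. This is a minimal sketch; `write_rows` and the default name `table.csv` are hypothetical and not part of the library.

```python
import csv
import os


def write_rows(rows, path=".", filename="table.csv", delimiter=","):
    """Write a list of rows to path/filename and return the full file path."""
    target = os.path.join(path, filename)
    # newline="" lets the csv module control line endings itself.
    with open(target, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(rows)
    return target
```

For example, `write_rows(extractor.return_list(), delimiter=";")` would write the parsed table as a semicolon-separated file in the current directory.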
## Team
* [@yuanxu-li](https://github.com/yuanxu-li)
## Errors/ Bugs
If something is not working correctly, or if you have any suggestions for improvement, [report it here](https://github.com/yuanxu-li/html-table-extractor/issues).
## Copyright
Copyright (c) 2017 Justin Li. Released under the [MIT License](https://github.com/yuanxu-li/html-table-extractor/blob/master/README.md)
Third-party copyright in this distribution is noted where applicable.