| Field | Value |
| ----------------------- | ------------------------------ |
| Name | html-table-takeout |
| Version | 1.1.2 |
| home_page | None |
| Summary | HTML table parser that supports rowspan, colspan, links and nested tables. Fast, lightweight with no external dependencies. |
| upload_time | 2025-07-19 00:53:17 |
| maintainer | Calvin Law |
| docs_url | None |
| author | Calvin Law |
| requires_python | >=3.10 |
| license | MIT |
| keywords | html, table, parse, scrape |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# HTML Table Takeout
<img src="https://github.com/lawcal/html-table-takeout/raw/main/images/html_table_takeout_logo.png" alt="HTML Table Takeout project logo" width="300">
A fast, lightweight HTML table parser that supports rowspan, colspan, links and nested tables. No external dependencies are needed.
The input may be text, a URL or local file `Path`.
<sup><sub>HTML5 logo by <a href='https://www.w3.org/'>W3C</a>.</sub></sup>
## Quick Start
Install the package:
```
pip install html-table-takeout
```
Pass in a URL and print out the parsed `Table` as CSV:
```
from html_table_takeout import parse_html
# start with http:// or https:// to source from a URL
tables = parse_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
print(tables[0].to_csv())
# output:
# Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
# MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,0000066740,1902
# ...
```
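Because `to_csv()` returns plain CSV text, the result can be read back with the standard library's `csv` module. The following is a minimal sketch: `csv_text` is an illustrative stand-in that copies the two sample output lines above rather than calling the library, so it runs without network access.

```python
import csv
import io

# Stand-in for tables[0].to_csv(); copied from the sample output above.
csv_text = (
    'Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded\n'
    'MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,0000066740,1902\n'
)

# DictReader treats the first line as the header row.
records = list(csv.DictReader(io.StringIO(csv_text)))

first = records[0]
print(first['Symbol'])                 # MMM
print(first['Headquarters Location'])  # Saint Paul, Minnesota
print(first['CIK'])                    # 0000066740
```

Every value stays a string, so the quoted comma in the headquarters field and the leading zeros in the CIK are preserved.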
Pass in HTML text and print out the parsed `Table` as valid HTML:
```
from html_table_takeout import parse_html

tables = parse_html("""
<table>
    <tr>
        <td rowspan='2'>1</td> <!-- rowspan will be expanded -->
        <td>2</td>
    </tr>
    <tr>
        <td>3</td>
    </tr>
</table>""")

print(tables[0].to_html(indent=4))

# output:
# <table data-table-id='0'>
#     <tbody>
#         <tr>
#             <td>1</td>
#             <td>2</td>
#         </tr>
#         <tr>
#             <td>1</td>
#             <td>3</td>
#         </tr>
#     </tbody>
# </table>
```
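To make the expansion in the example above concrete, here is a rough sketch of one way rowspan and colspan can be expanded into a plain grid. It is not the library's implementation: the function name `expand_spans` and the `(value, rowspan, colspan)` tuple input are assumptions chosen for illustration, and overlapping spans are not handled.

```python
def expand_spans(rows):
    """Expand spans into a grid of repeated values.

    `rows` is a list of rows; each row is a list of
    (value, rowspan, colspan) tuples in source order.
    """
    grid = []
    pending = {}  # column index -> (value, number of further rows to fill)

    for row in rows:
        out, col = [], 0
        cells = iter(row)
        cell = next(cells, None)
        # Continue while source cells remain or a span from an earlier
        # row still has to be filled at this column or further right.
        while cell is not None or any(c >= col for c in pending):
            if col in pending:
                # This column is covered by a rowspan from an earlier row.
                value, remaining = pending.pop(col)
                if remaining > 1:
                    pending[col] = (value, remaining - 1)
                out.append(value)
                col += 1
            elif cell is None:
                # Gap before a later pending column: fill with None.
                out.append(None)
                col += 1
            else:
                value, rowspan, colspan = cell
                for _ in range(colspan):
                    if rowspan > 1:
                        pending[col] = (value, rowspan - 1)
                    out.append(value)
                    col += 1
                cell = next(cells, None)
        grid.append(out)
    return grid


# Same shape as the HTML example above: the first cell spans two rows.
print(expand_spans([[('1', 2, 1), ('2', 1, 1)], [('3', 1, 1)]]))
# [['1', '2'], ['1', '3']]
```

The dictionary of pending columns is what carries a spanned value down into later rows, which is why the second row gains a copy of `1`.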
## Usage
The core `parse_html()` function returns a list of zero or more top-level `Table` objects. A `Table` is guaranteed to have this structure:
- **rows**: List of one or more `TRow`
  - **cells**: List of zero or more `TCell` resulting from rowspan and colspan expansion
    - **elements**: List of zero or more `TText`, `TLink`, `TRef`

| Type | Description |
| -------- | ------------------------------------------------ |
| `Table` | Each parsed table has an auto-assigned unique id |
| `TRow` | Equal to each `<tr>` in the original table |
| `TCell` | Expanded `<td>` or `<th>` cells from row/colspan |
| `TText` | HTML-decoded text inside `<td>` or `<th>` |
| `TLink` | Equal to each `<a>` inside `<td>` or `<th>` |
| `TRef` | Reference to the child `Table` |
All tables are guaranteed to have at least one `TRow` containing one `TCell`.
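The nesting described above can be walked with three loops. The sketch below assumes the attributes are named `rows`, `cells` and `elements`, matching the bolded names in the structure list; check the class definitions before relying on that. The helper `count_elements` is hypothetical, and the demo uses `SimpleNamespace` stand-ins instead of real parsed tables so it runs without the package installed.

```python
from collections import Counter
from types import SimpleNamespace as NS


def count_elements(tables):
    """Count element types by walking tables -> rows -> cells -> elements."""
    counts = Counter()
    for table in tables:
        for row in table.rows:              # assumed attribute name
            for cell in row.cells:          # assumed attribute name
                for element in cell.elements:  # assumed attribute name
                    counts[type(element).__name__] += 1
    return counts


# Stand-in objects with the same shape; plain strings play the elements.
demo = [NS(rows=[NS(cells=[NS(elements=['a', 'b']), NS(elements=[])])])]
print(count_elements(demo))  # Counter({'str': 2})
```

With real input, the same function would take the list returned by `parse_html()` and, under the assumption above, report counts keyed by `TText`, `TLink` and `TRef`.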
The `parse_html()` function also provides filtering by text or attributes to target the tables you want. Check out its docstring for all options.
## Why did you make this?
Most HTML table parsers require extra DOM and data processing libraries that aren't needed for my application. I need a parser that handles nesting and gives me the flexibility to process the parsed result however I want.
Now you too can take out tables to go.
## Developing
Install development dependencies:
```
pip install build mypy pytest
```
Run tests:
```
pytest
```
Build the package:
```
python -m build
```
Raw data
{
"_id": null,
"home_page": null,
"name": "html-table-takeout",
"maintainer": "Calvin Law",
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "html, table, parse, scrape",
"author": "Calvin Law",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/e3/02/e2e9ef488b6400ea8bcd3ce95e5ec67e8be3368f6014ad014017072dcec9/html_table_takeout-1.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "HTML table parser that supports rowspan, colspan, links and nested tables. Fast, lightweight with no external dependencies.",
"version": "1.1.2",
"project_urls": {
"Homepage": "https://github.com/lawcal/html-table-takeout"
},
"split_keywords": [
"html",
" table",
" parse",
" scrape"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "eec803180b73e9928014c7ebc4f9f0b89eac5f1278813155e22f98d5b3b6f081",
"md5": "6f9c72d6f70c29603f28d43e133546da",
"sha256": "e590c8a41b455722b8eac89ecadb3a1560a563994659b9186f50839ef2ba3238"
},
"downloads": -1,
"filename": "html_table_takeout-1.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "6f9c72d6f70c29603f28d43e133546da",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 9741,
"upload_time": "2025-07-19T00:53:16",
"upload_time_iso_8601": "2025-07-19T00:53:16.500537Z",
"url": "https://files.pythonhosted.org/packages/ee/c8/03180b73e9928014c7ebc4f9f0b89eac5f1278813155e22f98d5b3b6f081/html_table_takeout-1.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e302e2e9ef488b6400ea8bcd3ce95e5ec67e8be3368f6014ad014017072dcec9",
"md5": "11b38be691d6f0739db091561494c543",
"sha256": "e2021e7c93271d2c08c1c2fb09f8408f6aac36d353e37cea0a196f3fe54e7f30"
},
"downloads": -1,
"filename": "html_table_takeout-1.1.2.tar.gz",
"has_sig": false,
"md5_digest": "11b38be691d6f0739db091561494c543",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 10324,
"upload_time": "2025-07-19T00:53:17",
"upload_time_iso_8601": "2025-07-19T00:53:17.584670Z",
"url": "https://files.pythonhosted.org/packages/e3/02/e2e9ef488b6400ea8bcd3ce95e5ec67e8be3368f6014ad014017072dcec9/html_table_takeout-1.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-19 00:53:17",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "lawcal",
"github_project": "html-table-takeout",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "html-table-takeout"
}