html-table-takeout


| Field           | Value                                            |
| --------------- | ------------------------------------------------ |
| Name            | html-table-takeout                               |
| Version         | 1.1.2                                            |
| Home page       | None                                             |
| Summary         | HTML table parser that supports rowspan, colspan, links and nested tables. Fast, lightweight with no external dependencies. |
| Upload time     | 2025-07-19 00:53:17                              |
| Maintainer      | Calvin Law                                       |
| Docs URL        | None                                             |
| Author          | Calvin Law                                       |
| Requires Python | >=3.10                                           |
| License         | MIT                                              |
| Keywords        | html, table, parse, scrape                       |
| Requirements    | No requirements were recorded.                   |
| Travis-CI       | No Travis.                                       |
| Coveralls       | No coveralls.                                    |

# HTML Table Takeout

[![Test](https://github.com/lawcal/html-table-takeout/actions/workflows/test.yml/badge.svg?branch=main)](https://github.com/lawcal/html-table-takeout/actions/workflows/test.yml)

<img src="https://github.com/lawcal/html-table-takeout/raw/main/images/html_table_takeout_logo.png" alt="HTML Table Takeout project logo" width="300">

A fast, lightweight HTML table parser that supports rowspan, colspan, links and nested tables. No external dependencies are needed.

The input may be text, a URL or local file `Path`.

<sup><sub>HTML5 logo by <a href='https://www.w3.org/'>W3C</a>.</sub></sup>

## Quick Start

Install the package:
```
pip install html-table-takeout
```

Pass in a URL and print out the parsed `Table` as CSV:
```
from html_table_takeout import parse_html

# start with http:// or https:// to source from a URL
tables = parse_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

print(tables[0].to_csv())

# output:
# Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
# MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,0000066740,1902
# ...
```
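
The CSV text can be read back with Python's built-in `csv` module. The sketch below assumes `to_csv()` returns a string, as the `print` call suggests. It does not call `parse_html`; it reuses the two sample output lines above as a literal. It shows that the quoted `"Saint Paul, Minnesota"` field stays in one column and that the zero-padded CIK is kept as a string:

```python
import csv
import io

# Sample text copied from the output above.
# In practice this would be the result of tables[0].to_csv() (assumed to be a str).
csv_text = (
    'Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded\n'
    'MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,0000066740,1902\n'
)

# DictReader treats the first line as the header row
records = list(csv.DictReader(io.StringIO(csv_text)))

first = records[0]
print(first['Headquarters Location'])  # Saint Paul, Minnesota
print(first['CIK'])                    # 0000066740
```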

Pass in HTML text and print out the parsed `Table` as valid HTML:
```
from html_table_takeout import parse_html

tables = parse_html("""
<table>
    <tr>
        <td rowspan='2'>1</td> <!-- rowspan will be expanded -->
        <td>2</td>
    </tr>
    <tr>
        <td>3</td>
    </tr>
</table>""")

print(tables[0].to_html(indent=4))

# output:
# <table data-table-id='0'>
# <tbody>
#     <tr>
#         <td>1</td>
#         <td>2</td>
#     </tr>
#     <tr>
#         <td>1</td>
#         <td>3</td>
#     </tr>
# </tbody>
# </table>
```
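
To make the expansion concrete, here is a minimal standalone sketch of how rowspan and colspan can be expanded into a rectangular grid. It is an illustration only, not the library's implementation. The `expand_spans` function and its `(text, rowspan, colspan)` tuple input are hypothetical names chosen for this example, and copying the text into every spanned position mirrors the rowspan output shown above:

```python
def expand_spans(rows):
    """Expand (text, rowspan, colspan) cells into a grid of plain text.

    Hypothetical helper for illustration; not part of html-table-takeout.
    """
    filled = {}  # (row index, column index) -> text
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip positions already claimed by a rowspan from an earlier row
            while (r, c) in filled:
                c += 1
            # Copy the cell into every position it spans
            for dr in range(rowspan):
                for dc in range(colspan):
                    filled[(r + dr, c + dc)] = text
            c += colspan

    n_cols = max((col for _, col in filled), default=-1) + 1
    # Spans that run past the last row are ignored; gaps become empty strings
    return [[filled.get((r, c), '') for c in range(n_cols)] for r in range(len(rows))]


# Same shape as the HTML example above: "1" spans two rows
print(expand_spans([[('1', 2, 1), ('2', 1, 1)], [('3', 1, 1)]]))
# [['1', '2'], ['1', '3']]
```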

## Usage

The core `parse_html()` function returns a list of zero or more top-level `Table` objects. Each `Table` is guaranteed to have this structure:
- **rows**: List of one or more `TRow`
  - **cells**: List of zero or more `TCell` resulting from rowspan and colspan expansion
    - **elements**: List of zero or more `TText`, `TLink`, `TRef`

| Type     | Description                                      |
| -------- | ------------------------------------------------ |
| `Table`  | Each parsed table has an auto-assigned unique id |
| `TRow`   | Equal to each `<tr>` in the original table       |
| `TCell`  | Expanded `<td>` or `<th>` cells from row/colspan |
| `TText`  | HTML-decoded text inside `<td>` or `<th>`        |
| `TLink`  | Equal to each `<a>` inside `<td>` or `<th>`      |
| `TRef`   | Reference to the child `Table`                   |

All tables are guaranteed to have at least one `TRow` containing one `TCell`.
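
The nesting described above can be walked with plain loops. The sketch below uses stand-in dataclasses rather than the library's real classes, so it runs without the package installed. Only the `rows`, `cells` and `elements` attribute names come from the structure above; the `Stub*` class names, the `text` and `href` fields, and the `cell_text` helper are assumptions made for illustration. Check the library's docstrings for the actual fields.

```python
from dataclasses import dataclass, field


# Stand-ins that mimic the documented shape; NOT the library's classes.
@dataclass
class StubText:
    text: str  # assumed field name


@dataclass
class StubLink:
    href: str  # assumed field name
    text: str = ''  # assumed field name


@dataclass
class StubCell:
    elements: list = field(default_factory=list)


@dataclass
class StubRow:
    cells: list = field(default_factory=list)


@dataclass
class StubTable:
    rows: list = field(default_factory=list)


def cell_text(cell):
    """Join the text of every element in a cell (hypothetical helper)."""
    return ' '.join(el.text for el in cell.elements)


table = StubTable(rows=[
    StubRow(cells=[StubCell([StubText('Symbol')]), StubCell([StubText('Security')])]),
    StubRow(cells=[StubCell([StubLink('https://example.com/mmm', 'MMM')]), StubCell([StubText('3M')])]),
])

# Walk rows -> cells -> elements and collect the text into a grid
grid = [[cell_text(cell) for cell in row.cells] for row in table.rows]
print(grid)  # [['Symbol', 'Security'], ['MMM', '3M']]
```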

The `parse_html()` function also provides filtering by text or attributes to target the tables you want. Check out its docstring for all options.

## Why did you make this?

Most HTML table parsers require extra DOM and data processing libraries that aren't needed for my application. I need a parser that handles nesting and gives me the flexibility to process the parsed result however I want.

Now you too can take out tables to go.

## Developing

Install development dependencies:
```
pip install build mypy pytest
```

Run tests:
```
pytest
```

Build the package:
```
python -m build
```

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "html-table-takeout",
    "maintainer": "Calvin Law",
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "html, table, parse, scrape",
    "author": "Calvin Law",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/e3/02/e2e9ef488b6400ea8bcd3ce95e5ec67e8be3368f6014ad014017072dcec9/html_table_takeout-1.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "HTML table parser that supports rowspan, colspan, links and nested tables. Fast, lightweight with no external dependencies.",
    "version": "1.1.2",
    "project_urls": {
        "Homepage": "https://github.com/lawcal/html-table-takeout"
    },
    "split_keywords": [
        "html",
        " table",
        " parse",
        " scrape"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "eec803180b73e9928014c7ebc4f9f0b89eac5f1278813155e22f98d5b3b6f081",
                "md5": "6f9c72d6f70c29603f28d43e133546da",
                "sha256": "e590c8a41b455722b8eac89ecadb3a1560a563994659b9186f50839ef2ba3238"
            },
            "downloads": -1,
            "filename": "html_table_takeout-1.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6f9c72d6f70c29603f28d43e133546da",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 9741,
            "upload_time": "2025-07-19T00:53:16",
            "upload_time_iso_8601": "2025-07-19T00:53:16.500537Z",
            "url": "https://files.pythonhosted.org/packages/ee/c8/03180b73e9928014c7ebc4f9f0b89eac5f1278813155e22f98d5b3b6f081/html_table_takeout-1.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e302e2e9ef488b6400ea8bcd3ce95e5ec67e8be3368f6014ad014017072dcec9",
                "md5": "11b38be691d6f0739db091561494c543",
                "sha256": "e2021e7c93271d2c08c1c2fb09f8408f6aac36d353e37cea0a196f3fe54e7f30"
            },
            "downloads": -1,
            "filename": "html_table_takeout-1.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "11b38be691d6f0739db091561494c543",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 10324,
            "upload_time": "2025-07-19T00:53:17",
            "upload_time_iso_8601": "2025-07-19T00:53:17.584670Z",
            "url": "https://files.pythonhosted.org/packages/e3/02/e2e9ef488b6400ea8bcd3ce95e5ec67e8be3368f6014ad014017072dcec9/html_table_takeout-1.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-19 00:53:17",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "lawcal",
    "github_project": "html-table-takeout",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "html-table-takeout"
}
        