[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Github Latest Pre-Release](https://img.shields.io/github/release/PresidioVantage/html-header-chunking?include_prereleases&label=pre-release&logo=github)](https://github.com/PresidioVantage/html-header-chunking/releases)
[![Github Latest Release](https://img.shields.io/github/release/PresidioVantage/html-header-chunking?logo=github)](https://github.com/PresidioVantage/html-header-chunking/releases)
[![Continuous Integration](https://github.com/PresidioVantage/html-header-chunking/actions/workflows/html_header_chunking_CI.yml/badge.svg)](https://github.com/PresidioVantage/html-header-chunking/actions)
# HTML Chunking with Headers
A lightweight (SAX) HTML parser that "chunks" a page into a sequence of contiguous text segments, each carrying all of its "relevant" headers.
Chunks are always delimited by relevant headers, and additionally by a (configurable) set of tags at which to split.
This is a "tree-capitator," if you will. 🪓🌳🔗
The API entry-point is in `src/html_header_chunking/chunker`.
The logical algorithm and data-structures are in `src/html_header_chunking/handler`.
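
To make the idea of "relevant" headers concrete, here is a minimal sketch built on the standard-library `html.parser`. It is not the package's handler: the names `HeaderChunker` and `chunk_html`, the default tag sets, and the rule that a new header replaces any header of equal or lower rank are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware chunker (hypothetical, not the package's handler)."""

    def __init__(self, header_tags, delimiter_tags):
        super().__init__()
        self.header_tags = tuple(header_tags)      # most to least significant
        self.delimiter_tags = set(delimiter_tags)  # tags that split body text
        self.headers = {}                          # tag -> header text in scope
        self.in_header = None                      # header tag being read, if any
        self.buffer = []                           # text collected since last flush
        self.chunks = []                           # (headers in scope, text) pairs

    def _take_text(self):
        # Collapse whitespace and reset the buffer.
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        return text

    def _flush(self):
        # Emit the buffered body text together with the headers in scope.
        text = self._take_text()
        if text:
            scope = [self.headers[t] for t in self.header_tags if t in self.headers]
            self.chunks.append((scope, text))

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self._flush()
            self.in_header = tag
        elif tag in self.delimiter_tags and self.in_header is None:
            self._flush()

    def handle_endtag(self, tag):
        if tag == self.in_header:
            # Assumption: a new header drops headers of equal or lower rank.
            rank = self.header_tags.index(tag)
            for stale in self.header_tags[rank:]:
                self.headers.pop(stale, None)
            self.headers[tag] = self._take_text()
            self.in_header = None
        elif tag in self.delimiter_tags and self.in_header is None:
            self._flush()

    def handle_data(self, data):
        self.buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def chunk_html(html,
               header_tags=("h1", "h2", "h3", "h4", "h5", "h6"),
               delimiter_tags=("p", "li", "div")):
    """Return a list of (headers, text) chunks for an HTML string."""
    parser = HeaderChunker(header_tags, delimiter_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>Intro</p><h2>A</h2><p>one</p><h2>B</h2><p>two</p>"
    for headers, text in chunk_html(sample):
        print(headers, text)
```

Each chunk is paired with the headers still in scope when its text was read, so the paragraph under "B" keeps "Title" but no longer carries "A". Passing a narrower `header_tags` tuple to this sketch mimics restricting which headers count as relevant.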
### Installation:
`pip install html-header-chunking`
### Example usage:
```python
from html_header_chunking.chunker import get_chunker

# default chunker is "lxml" for loose html
with get_chunker() as chunker:

    # example of favorable structure yielding high-quality chunks
    # prints chunk-events directly
    chunker.parse_events(
        ["https://plato.stanford.edu/entries/goedel/"],
        print)

    # example of moderate structure yielding medium-quality chunks
    # gets collection of chunks and loops through them
    q = chunker.parse_queue(
        ["https://en.wikipedia.org/wiki/Kurt_G%C3%B6del"])
    while q:
        print(q.popleft())

    # examples of challenging structure yielding poor-quality chunks
    urls = [
        "https://www.gutenberg.org/cache/epub/56852/pg56852-images.html",
        "https://www.cnn.com/2023/09/25/opinions/opinion-vincent-doumeizel-seaweed-scn-climate-c2e-spc-intl"]
    for c in chunker.parse_chunk_sequence(urls):
        print(c)

# example of mitigating/improving challenging structure by focusing only on html 'h4' and 'h5'
# (each `with` block binds its own chunker via `as chunker`)
with get_chunker("lxml", ["h4", "h5"]) as chunker:
    chunker.parse_events(
        ["https://www.gutenberg.org/cache/epub/56852/pg56852-images.html"],
        print)

# example of using selenium on a page which requires javascript to load contents
print("using default lxml produces very few chunks:")
with get_chunker() as chunker:
    chunker.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)
print("using selenium produces many more chunks:")
with get_chunker("selenium") as chunker:
    chunker.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)
```
Raw data
{
"_id": null,
"home_page": "https://github.com/PresidioVantage/html-header-chunking",
"name": "html-header-chunking",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8,<4.0",
"maintainer_email": "",
"keywords": "xml,html",
"author": "Presidio Vantage",
"author_email": "presidiovantage@github.com",
"download_url": "https://files.pythonhosted.org/packages/31/5e/66482b10f015e430f9eb4a1adb6321ccdeb0d59a78d22d2c07d18c159af1/html_header_chunking-0.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "chunk html with relevant headers",
"version": "0.1.2",
"project_urls": {
"Homepage": "https://github.com/PresidioVantage/html-header-chunking",
"Repository": "https://github.com/PresidioVantage/html-header-chunking"
},
"split_keywords": [
"xml",
"html"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "d764208b4ebfd22fb812a19e9e0db5c0c014d8f630b4637dc78754f20bf4f31a",
"md5": "7515935e3201f264dfcb97fba53dc33c",
"sha256": "5c919ba2effdd6fe8d36b751c01444a543f6ecd255b8ac000b023166b8beecf7"
},
"downloads": -1,
"filename": "html_header_chunking-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7515935e3201f264dfcb97fba53dc33c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8,<4.0",
"size": 13344,
"upload_time": "2023-11-18T01:44:09",
"upload_time_iso_8601": "2023-11-18T01:44:09.948037Z",
"url": "https://files.pythonhosted.org/packages/d7/64/208b4ebfd22fb812a19e9e0db5c0c014d8f630b4637dc78754f20bf4f31a/html_header_chunking-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "315e66482b10f015e430f9eb4a1adb6321ccdeb0d59a78d22d2c07d18c159af1",
"md5": "b372c11ff826cff9569ab05cf9d87994",
"sha256": "2bd919893d67956ea23a16d4923b124051736499438302e6de9653f057d74ff5"
},
"downloads": -1,
"filename": "html_header_chunking-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "b372c11ff826cff9569ab05cf9d87994",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8,<4.0",
"size": 18362,
"upload_time": "2023-11-18T01:44:13",
"upload_time_iso_8601": "2023-11-18T01:44:13.787757Z",
"url": "https://files.pythonhosted.org/packages/31/5e/66482b10f015e430f9eb4a1adb6321ccdeb0d59a78d22d2c07d18c159af1/html_header_chunking-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-18 01:44:13",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "PresidioVantage",
"github_project": "html-header-chunking",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "html-header-chunking"
}