html-header-chunking

| Field | Value |
| --- | --- |
| Name | html-header-chunking |
| Version | 0.1.2 |
| Home page | https://github.com/PresidioVantage/html-header-chunking |
| Summary | chunk html with relevant headers |
| Upload time | 2023-11-18 01:44:13 |
| Docs URL | None |
| Author | Presidio Vantage |
| Requires Python | >=3.8,<4.0 |
| License | MIT |
| Keywords | xml, html |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Github Latest Pre-Release](https://img.shields.io/github/release/PresidioVantage/html-header-chunking?include_prereleases&label=pre-release&logo=github)](https://github.com/PresidioVantage/html-header-chunking/releases)
[![Github Latest Release](https://img.shields.io/github/release/PresidioVantage/html-header-chunking?logo=github)](https://github.com/PresidioVantage/html-header-chunking/releases)
[![Continuous Integration](https://github.com/PresidioVantage/html-header-chunking/actions/workflows/html_header_chunking_CI.yml/badge.svg)](https://github.com/PresidioVantage/html-header-chunking/actions)

# HTML Chunking with Headers


A lightweight (SAX) HTML parser that "chunks" a document into a sequence of contiguous text segments, each carrying all of its "relevant" headers.
Chunks are always delimited by relevant headers, and additionally by a (configurable) set of tags between which to chunk.
This is a "tree-capitator," if you will. 🪓🌳🔗

The API entry-point is in `src/html_header_chunking/chunker`.
The logical algorithm and data-structures are in `src/html_header_chunking/handler`.
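
To make the idea of "relevant" headers concrete, here is a minimal sketch built on the standard library's event-driven `html.parser`. It is not the package's implementation: the class `HeaderChunkSketch`, the function `chunk_html`, the default break tags, and the rule that a new header supersedes headers at the same or deeper level are all assumptions made for illustration.

```python
from html.parser import HTMLParser

HEADER_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class HeaderChunkSketch(HTMLParser):
    """Toy illustration only; not the html_header_chunking implementation."""

    def __init__(self, break_tags=("p", "li", "div")):
        super().__init__()
        self.break_tags = set(break_tags)  # hypothetical default set
        self.headers = {}       # level -> header text currently in effect
        self.in_header = None   # level of the header being read, if any
        self.text = []          # text fragments of the chunk in progress
        self.chunks = []        # finished (headers, text) pairs

    def _flush(self):
        # Emit the pending text, if any, with the headers in effect.
        body = " ".join(" ".join(self.text).split())
        if body:
            relevant = [self.headers[k].strip() for k in sorted(self.headers)]
            self.chunks.append((relevant, body))
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._flush()
            level = int(tag[1])
            # Assumption: a new header replaces headers at the same or deeper level.
            self.headers = {k: v for k, v in self.headers.items() if k < level}
            self.headers[level] = ""
            self.in_header = level
        elif tag in self.break_tags:
            self._flush()

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS:
            self.in_header = None
        elif tag in self.break_tags:
            self._flush()

    def handle_data(self, data):
        if self.in_header is not None:
            self.headers[self.in_header] += data
        else:
            self.text.append(data)

    def close(self):
        super().close()
        self._flush()


def chunk_html(html, break_tags=("p", "li", "div")):
    """Return a list of (relevant_headers, text) pairs for an HTML string."""
    parser = HeaderChunkSketch(break_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks
```

For example, `chunk_html("<h1>A</h1><p>one</p><h2>B</h2><p>two</p>")` returns `[(["A"], "one"), (["A", "B"], "two")]`: the second chunk keeps both the `h1` and the `h2` above it.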

### Installation:
`pip install html-header-chunking`
### Example usage:
```python

from html_header_chunking.chunker import get_chunker

# default chunker is "lxml" for loose html
with get_chunker() as chunker:
    
    # example of favorable structure yielding high-quality chunks
    # prints chunk-events directly
    chunker.parse_events(
        ["https://plato.stanford.edu/entries/goedel/"],
        print)
    
    # example of moderate structure yielding medium-quality chunks
    # gets collection of chunks and loops through them
    q = chunker.parse_queue(
        ["https://en.wikipedia.org/wiki/Kurt_G%C3%B6del"])
    while q:
        print(q.popleft())
    
    # examples of challenging structure yielding poor-quality chunks
    urls = [
        "https://www.gutenberg.org/cache/epub/56852/pg56852-images.html",
        "https://www.cnn.com/2023/09/25/opinions/opinion-vincent-doumeizel-seaweed-scn-climate-c2e-spc-intl"]
    for c in chunker.parse_chunk_sequence(urls):
        print(c)

# example of mitigating/improving challenging structure by focusing only on html 'h4' and 'h5'
# (each `with` block binds its own chunker; the one from the first block is already closed)
with get_chunker("lxml", ["h4", "h5"]) as chunker:
    chunker.parse_events(
        ["https://www.gutenberg.org/cache/epub/56852/pg56852-images.html"],
        print)

# example of using selenium on a page which requires javascript to load contents
print("using default lxml produces very few chunks:")
with get_chunker() as chunker:
    chunker.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)
print("using selenium produces many more chunks:")
with get_chunker("selenium") as chunker:
    chunker.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)
```
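
The usage example consumes chunks in three styles: a callback, a queue drained with `popleft()`, and an iterable. As a rough sketch of how a callback-style producer can feed a queue, the snippet below uses a hypothetical stand-in, `fake_parse_events`, instead of the real chunker, so it runs without network access. Whether the package links its own methods this way is an assumption, not something the README states.

```python
from collections import deque


def fake_parse_events(sources, callback):
    """Hypothetical stand-in for a callback-style chunk producer."""
    for source in sources:
        for n in range(2):
            callback(f"{source}: chunk {n}")


def collect_to_queue(sources):
    """Gather every emitted chunk into a deque, preserving order."""
    q = deque()
    fake_parse_events(sources, q.append)
    return q


q = collect_to_queue(["doc-a", "doc-b"])
while q:
    print(q.popleft())
```

Draining with `popleft()` yields chunks in the order they were emitted and releases each one as soon as it has been handled.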


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/PresidioVantage/html-header-chunking",
    "name": "html-header-chunking",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "xml,html",
    "author": "Presidio Vantage",
    "author_email": "presidiovantage@github.com",
    "download_url": "https://files.pythonhosted.org/packages/31/5e/66482b10f015e430f9eb4a1adb6321ccdeb0d59a78d22d2c07d18c159af1/html_header_chunking-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "chunk html with relevant headers",
    "version": "0.1.2",
    "project_urls": {
        "Homepage": "https://github.com/PresidioVantage/html-header-chunking",
        "Repository": "https://github.com/PresidioVantage/html-header-chunking"
    },
    "split_keywords": [
        "xml",
        "html"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d764208b4ebfd22fb812a19e9e0db5c0c014d8f630b4637dc78754f20bf4f31a",
                "md5": "7515935e3201f264dfcb97fba53dc33c",
                "sha256": "5c919ba2effdd6fe8d36b751c01444a543f6ecd255b8ac000b023166b8beecf7"
            },
            "downloads": -1,
            "filename": "html_header_chunking-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7515935e3201f264dfcb97fba53dc33c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 13344,
            "upload_time": "2023-11-18T01:44:09",
            "upload_time_iso_8601": "2023-11-18T01:44:09.948037Z",
            "url": "https://files.pythonhosted.org/packages/d7/64/208b4ebfd22fb812a19e9e0db5c0c014d8f630b4637dc78754f20bf4f31a/html_header_chunking-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "315e66482b10f015e430f9eb4a1adb6321ccdeb0d59a78d22d2c07d18c159af1",
                "md5": "b372c11ff826cff9569ab05cf9d87994",
                "sha256": "2bd919893d67956ea23a16d4923b124051736499438302e6de9653f057d74ff5"
            },
            "downloads": -1,
            "filename": "html_header_chunking-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "b372c11ff826cff9569ab05cf9d87994",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 18362,
            "upload_time": "2023-11-18T01:44:13",
            "upload_time_iso_8601": "2023-11-18T01:44:13.787757Z",
            "url": "https://files.pythonhosted.org/packages/31/5e/66482b10f015e430f9eb4a1adb6321ccdeb0d59a78d22d2c07d18c159af1/html_header_chunking-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-18 01:44:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "PresidioVantage",
    "github_project": "html-header-chunking",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "html-header-chunking"
}
        