html-cleaver


- Name: html-cleaver
- Version: 0.3.0
- Home page: https://github.com/PresidioVantage/html-cleaver
- Summary: cleave html headers and text
- Upload time: 2023-11-29 03:40:09
- Author: Presidio Vantage
- Requires Python: >=3.8,<4.0
- License: MIT
- Keywords: xml, html
- Requirements: No requirements were recorded.

[![License: MIT](https://img.shields.io/badge/License-MIT-blue)](https://raw.githubusercontent.com/PresidioVantage/html-cleaver/main/LICENSE.txt)
[![GitHub Latest Release](https://img.shields.io/github/release/PresidioVantage/html-cleaver?logo=github)](https://github.com/PresidioVantage/html-cleaver/releases)

[![GitHub Latest Pre-Release](https://img.shields.io/github/release/PresidioVantage/html-cleaver?logo=github&include_prereleases&label=pre-release)](https://github.com/PresidioVantage/html-cleaver/releases)
[![GitHub Continuous Integration](https://github.com/PresidioVantage/html-cleaver/actions/workflows/html_cleaver_CI.yml/badge.svg)](https://github.com/PresidioVantage/html-cleaver/actions)

# HTML Cleaver 🍀🦫

Tool for parsing HTML into a chain of chunks with relevant headers.  

The API entry-point is in `src/html_cleaver/cleaver`.  
The logical algorithm and data-structures are in `src/html_cleaver/handler`.

This is a "tree-capitator" if you will,  
cleaving headers together while cleaving text apart.
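
To make the idea concrete, here is a minimal standard-library sketch of header-aware chunking. It is not the algorithm in `src/html_cleaver/handler`; the names `HeaderChunker`, `cleave`, and `SAMPLE` are hypothetical and exist only for this illustration. Each run of body text is paired with the chain of headers above it, and a new header replaces any earlier header of equal or deeper rank.

```python
from html.parser import HTMLParser

HEADER_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class HeaderChunker(HTMLParser):
    """Toy chunker (hypothetical): pairs each run of text with the headers above it."""

    def __init__(self, header_tags=HEADER_TAGS):
        super().__init__()
        self.header_tags = tuple(header_tags)
        self.chain = []          # (rank, header text) for the sections currently open
        self.open_header = None  # tag name while the parser is inside a header element
        self.header_text = []    # text pieces collected inside the current header
        self.body_text = []      # text pieces collected since the last header
        self.chunks = []         # finished (header chain, body text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags and self.open_header is None:
            self._flush()  # a header ends the chunk before it
            self.open_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.open_header:
            rank = self.header_tags.index(tag)
            # keep only shallower headers, then append the new one
            self.chain = [(r, t) for r, t in self.chain if r < rank]
            self.chain.append((rank, " ".join(self.header_text)))
            self.open_header = None

    def handle_data(self, data):
        piece = " ".join(data.split())  # normalize whitespace
        if piece:
            target = self.header_text if self.open_header else self.body_text
            target.append(piece)

    def close(self):
        super().close()
        self._flush()  # emit whatever text follows the last header

    def _flush(self):
        if self.body_text:
            headers = [text for _, text in self.chain]
            self.chunks.append((headers, " ".join(self.body_text)))
            self.body_text = []


def cleave(html, header_tags=HEADER_TAGS):
    """Return a list of (header chain, text) pairs for an HTML string."""
    parser = HeaderChunker(header_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks


SAMPLE = (
    "<h1>Guide</h1><p>Intro.</p>"
    "<h2>Setup</h2><p>Install it.</p>"
    "<h2>Usage</h2><p>Run it.</p>"
)
```

With the default tags, `cleave(SAMPLE)` yields three chunks whose header chains are `["Guide"]`, `["Guide", "Setup"]`, and `["Guide", "Usage"]`. Restricting the tags, as in `cleave(SAMPLE, ["h2"])`, treats the `h1` as ordinary text, so "Guide" moves into the body of the first chunk.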

## Quickstart:
`pip install html-cleaver`

Optionally, if you're working with HTML that requires JavaScript rendering:  
`pip install selenium`

Testing an example on the command-line:  
`python -m html_cleaver.cleaver https://plato.stanford.edu/entries/goedel/`

### Example usage:
Cleaving pages of varying difficulty:

```python
from html_cleaver.cleaver import get_cleaver

# default parser is "lxml" for loose html
with get_cleaver() as cleaver:
    
    # handle chunk-events directly
    # (example of favorable structure yielding high-quality chunks)
    cleaver.parse_events(
        ["https://plato.stanford.edu/entries/goedel/"],
        print)
    
    # get collection of chunks
    # (example of moderate structure yielding medium-quality chunks)
    for c in cleaver.parse_chunk_sequence(
            ["https://en.wikipedia.org/wiki/Kurt_G%C3%B6del"]):
        print(c)
    
    # sequence of chunks from sequence of pages
    # (examples of challenging structure yielding poor-quality chunks)
    urls = [
        "https://www.gutenberg.org/cache/epub/56852/pg56852-images.html",
        "https://www.cnn.com/2023/09/25/opinions/opinion-vincent-doumeizel-seaweed-scn-climate-c2e-spc-intl"]
    for c in cleaver.parse_chunk_sequence(urls):
        print(c)

# example of mitigating/improving challenging structure by focusing on certain headers
with get_cleaver("lxml", ["h4", "h5"]) as cleaver:
    cleaver.parse_events(
        ["https://www.gutenberg.org/cache/epub/56852/pg56852-images.html"],
        print)
```

### Example usage with Selenium:
Using Selenium on a page that requires JavaScript to load its contents:

```python
from html_cleaver.cleaver import get_cleaver

print("using default lxml produces very few chunks:")
with get_cleaver() as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)

print("using selenium produces many more chunks:")
with get_cleaver("selenium") as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)
```
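
To compare the two parsers by number rather than by eye, a small helper can count the chunks each one yields. This is a sketch that relies only on the `get_cleaver` and `parse_chunk_sequence` calls shown above; `count_chunks` and its `cleaver_factory` parameter are hypothetical names, not part of html-cleaver. The factory hook exists so the helper can be exercised offline with a stand-in.

```python
def count_chunks(url, parser=None, cleaver_factory=None):
    """Count the chunks one page yields with the given parser name.

    cleaver_factory is a hypothetical hook (not part of html-cleaver) for
    substituting a stand-in for get_cleaver, for example in offline tests.
    """
    if cleaver_factory is None:
        # imported lazily so the helper can be defined without the package installed
        from html_cleaver.cleaver import get_cleaver as cleaver_factory
    # no parser name means "use the library default"
    args = () if parser is None else (parser,)
    with cleaver_factory(*args) as cleaver:
        # consume the chunk sequence and count its items
        return sum(1 for _ in cleaver.parse_chunk_sequence([url]))
```

Calling `count_chunks(url)` and then `count_chunks(url, "selenium")` on the same page gives two numbers to compare; the second call needs Selenium installed.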


## Development:
### Testing:
Testing without Poetry:  
`pip install lxml`  
`pip install selenium`  
`python -m unittest discover -s src`

Testing with Poetry:  
`poetry install`  
`poetry run pytest`

### Build:
Building from source:  
`rm dist/*`  
`python -m build`

Installing from the build:  
`pip install dist/*.whl`

Publishing from the build:  
`python -m twine upload --skip-existing -u __token__ -p $TESTPYPI_TOKEN --repository testpypi dist/*`  
`python -m twine upload --skip-existing -u __token__ -p $PYPI_TOKEN dist/*`


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/PresidioVantage/html-cleaver",
    "name": "html-cleaver",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "xml,html",
    "author": "Presidio Vantage",
    "author_email": "presidiovantage@github.com",
    "download_url": "https://files.pythonhosted.org/packages/c2/0b/6292ea062765c6f660edb84d355e39c65e29d8f0f04207dae337ad3e1f05/html_cleaver-0.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "cleave html headers and text",
    "version": "0.3.0",
    "project_urls": {
        "Homepage": "https://github.com/PresidioVantage/html-cleaver",
        "Repository": "https://github.com/PresidioVantage/html-cleaver"
    },
    "split_keywords": [
        "xml",
        "html"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9cc194291576bb6d92b8bf2d2d82dc9f85c1c76ba30df8f1e27f84c04d9a013a",
                "md5": "98501c231d07c0269970a61ac70f8765",
                "sha256": "f3640fcd887796578f8b7bd4017cb81f27729017020d0dff7ff00d64eae0119a"
            },
            "downloads": -1,
            "filename": "html_cleaver-0.3.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "98501c231d07c0269970a61ac70f8765",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 13513,
            "upload_time": "2023-11-29T03:40:08",
            "upload_time_iso_8601": "2023-11-29T03:40:08.111479Z",
            "url": "https://files.pythonhosted.org/packages/9c/c1/94291576bb6d92b8bf2d2d82dc9f85c1c76ba30df8f1e27f84c04d9a013a/html_cleaver-0.3.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c20b6292ea062765c6f660edb84d355e39c65e29d8f0f04207dae337ad3e1f05",
                "md5": "5b17920daf103e4824e8218d379afe48",
                "sha256": "d7901934f083e6f36682645e314cea599a2cb2e8139e1a3a5ab581235f0e3839"
            },
            "downloads": -1,
            "filename": "html_cleaver-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "5b17920daf103e4824e8218d379afe48",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 15931,
            "upload_time": "2023-11-29T03:40:09",
            "upload_time_iso_8601": "2023-11-29T03:40:09.546252Z",
            "url": "https://files.pythonhosted.org/packages/c2/0b/6292ea062765c6f660edb84d355e39c65e29d8f0f04207dae337ad3e1f05/html_cleaver-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-29 03:40:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "PresidioVantage",
    "github_project": "html-cleaver",
    "github_not_found": true,
    "lcname": "html-cleaver"
}
        