[![License: MIT](https://img.shields.io/badge/License-MIT-blue)](https://raw.githubusercontent.com/PresidioVantage/html-cleaver/main/LICENSE.txt)
[![GitHub Latest Release](https://img.shields.io/github/release/PresidioVantage/html-cleaver?logo=github)](https://github.com/PresidioVantage/html-cleaver/releases)
[![GitHub Latest Pre-Release](https://img.shields.io/github/release/PresidioVantage/html-cleaver?logo=github&include_prereleases&label=pre-release)](https://github.com/PresidioVantage/html-cleaver/releases)
[![GitHub Continuous Integration](https://github.com/PresidioVantage/html-cleaver/actions/workflows/html_cleaver_CI.yml/badge.svg)](https://github.com/PresidioVantage/html-cleaver/actions)
# HTML Cleaver 🍀🦫
A tool for parsing HTML into a chain of text chunks, each carrying its relevant headers.
The API entry-point is in `src/html_cleaver/cleaver`.
The logical algorithm and data-structures are in `src/html_cleaver/handler`.
This is a "tree-capitator" if you will,
cleaving headers together while cleaving text apart.
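As a rough illustration of the idea (the chunk representation below is only a sketch, not the library's actual output format; see `cleaver` and `handler` for the real data structures):

```python
# given HTML with nested headers, e.g.:
html = """
<h1>Kurt Gödel</h1>
  <h2>Life</h2>  <p>Gödel was born in 1906 ...</p>
  <h2>Work</h2>  <p>The incompleteness theorems ...</p>
"""
# the text is cleaved apart into one chunk per block, while each chunk
# keeps the chain of headers above it, conceptually:
#   ["Kurt Gödel", "Life"] -> "Gödel was born in 1906 ..."
#   ["Kurt Gödel", "Work"] -> "The incompleteness theorems ..."
```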
## Quickstart:
`pip install html-cleaver`
Optionally, if you're working with HTML that requires JavaScript rendering:
`pip install selenium`
Testing an example on the command-line:
`python -m html_cleaver.cleaver https://plato.stanford.edu/entries/goedel/`
### Example usage:
Cleaving pages of varying difficulty:
```python
from html_cleaver.cleaver import get_cleaver
# default parser is "lxml" for loose html
with get_cleaver() as cleaver:
    # handle chunk-events directly
    # (example of favorable structure yielding high-quality chunks)
    cleaver.parse_events(
        ["https://plato.stanford.edu/entries/goedel/"],
        print)

    # get collection of chunks
    # (example of moderate structure yielding medium-quality chunks)
    for c in cleaver.parse_chunk_sequence(
            ["https://en.wikipedia.org/wiki/Kurt_G%C3%B6del"]):
        print(c)

    # sequence of chunks from sequence of pages
    # (examples of challenging structure yielding poor-quality chunks)
    urls = [
        "https://www.gutenberg.org/cache/epub/56852/pg56852-images.html",
        "https://www.cnn.com/2023/09/25/opinions/opinion-vincent-doumeizel-seaweed-scn-climate-c2e-spc-intl"]
    for c in cleaver.parse_chunk_sequence(urls):
        print(c)

# example of mitigating/improving challenging structure by focusing on certain headers
with get_cleaver("lxml", ["h4", "h5"]) as cleaver:
    cleaver.parse_events(
        ["https://www.gutenberg.org/cache/epub/56852/pg56852-images.html"],
        print)
```
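The examples above pass `print` as the chunk-event handler, but any callable works. A minimal sketch (using only the calls shown above; the list here stands in for whatever downstream step you have, e.g. embedding or indexing):

```python
from html_cleaver.cleaver import get_cleaver

# collect chunks into a plain list instead of printing them
collected = []
with get_cleaver() as cleaver:
    cleaver.parse_events(
        ["https://plato.stanford.edu/entries/goedel/"],
        collected.append)
print(f"collected {len(collected)} chunks")
```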
### Example usage with Selenium:
Using Selenium on a page that requires JavaScript to load its contents:
```python
from html_cleaver.cleaver import get_cleaver
print("using default lxml produces very few chunks:")
with get_cleaver() as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)

print("using selenium produces many more chunks:")
with get_cleaver("selenium") as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)
```
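To quantify the difference rather than eyeball the printed output, a small sketch that compares chunk counts (it reuses only the calls shown above; the `"selenium"` parser needs the optional install from the Quickstart and a working browser/driver on your machine):

```python
from html_cleaver.cleaver import get_cleaver

# count chunks for the same URL under both parsers
url = "https://www.youtube.com/watch?v=rfscVS0vtbw"
for parser in ("lxml", "selenium"):
    with get_cleaver(parser) as cleaver:
        count = sum(1 for _ in cleaver.parse_chunk_sequence([url]))
    print(f"{parser}: {count} chunks")
```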
## Development:
### Testing:
Testing without Poetry:
`pip install lxml`
`pip install selenium`
`python -m unittest discover -s src`
Testing with Poetry:
`poetry install`
`poetry run pytest`
### Build:
Building from source:
`rm dist/*`
`python -m build`
Installing from the build:
`pip install dist/*.whl`
Publishing from the build:
`python -m twine upload --skip-existing -u __token__ -p $TESTPYPI_TOKEN --repository testpypi dist/*`
`python -m twine upload --skip-existing -u __token__ -p $PYPI_TOKEN dist/*`