Name | webleaf |
Version | 0.2.2 |
home_page | None |
Summary | HTML DOM Tree Leaf Structure Identification Package |
upload_time | 2024-06-14 16:45:51 |
maintainer | None |
docs_url | None |
author | None |
requires_python | None |
license | MIT License Copyright (c) 2024 Matt Thomson Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords | dom, web, webscraping, leaf, beautifulsoup, html, tree, structure, embedding |
VCS | https://github.com/thomsn/WebLeaf |
bugtrack_url | None |
requirements | lxml, cssselect |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
<p align="center">
<img src="https://github.com/thomsn/WebLeaf/raw/main/docs/logo.webp" alt="WebLeaf Logo" style="width: 100%;">
</p>
# WebLeaf Package
#### HTML DOM Tree Leaf Structure Identification Python Package
Websites are generally built as a composition of components. If you understand the structure of a given website, you
can better understand the data within it. WebLeaf helps you classify elements within the DOM tree by creating a
dict representation of an element's neighbors. This dict can then be used to develop robust data scraping logic. WebLeaf
is an alternative to CSS selectors and XPaths, which tend to break when a page's markup changes.
### Install
To install the current release:
```bash
pip install webleaf
```
### Basic
Here we compute the Leaf for the link "a" element on example.com:
```python
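# requests is not a WebLeaf dependency; install it separately to run this example.
import requests
from lxml import etree
from webleaf import Leaf


def get_html():
    """Download the example page and return its HTML as text."""
    return requests.get("https://example.com/").text


html = get_html()
root = etree.HTML(html)
tree = etree.ElementTree(root)

# Compute the Leaf of the element matched by the XPath, with depth=3.
leaf = Leaf().from_xpath(tree, xpath=".//a", depth=3)
print(leaf)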
```
Output:
```json
{"./../../h1": "Example Domain", "./../../p[1]": "This domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission."}
```
### Comparing Leaves
Leaves can be compared with each other, so you can find similar elements within the document.
```python
from webleaf import Leaf
leaf_one = Leaf({"./../../h1": "Example Domain", "./../../p[1]": "example description"})
leaf_two = Leaf({"./../../h1": "Example Domain", "./../../p[1]": "example description modified"})
leaf_three = Leaf({"./../h1": "Example Domain", "./../../p[3]": "example description"})
print("compare leaf one and two", leaf_one.compare(leaf_two))
print("compare leaf one and three", leaf_one.compare(leaf_three))
```
Output:
```bash
compare leaf one and two 0.9375
compare leaf one and three 0.50244140625
```
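To make the idea of a similarity score concrete, here is a minimal sketch that compares two leaves stored as plain dicts. It is not WebLeaf's scoring algorithm, which this README does not describe, so it will not reproduce the values above. `leaf_similarity` is a hypothetical helper that relies on `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher


def leaf_similarity(a: dict, b: dict) -> float:
    """Illustrative score in [0, 1]; NOT the algorithm behind Leaf.compare."""
    paths = set(a) | set(b)
    if not paths:
        return 1.0
    total = 0.0
    for path in paths:
        # A path present in only one leaf contributes 0 to the score.
        if path in a and path in b:
            total += SequenceMatcher(None, a[path], b[path]).ratio()
    return total / len(paths)


leaf_one = {"./../../h1": "Example Domain", "./../../p[1]": "example description"}
leaf_two = {"./../../h1": "Example Domain", "./../../p[1]": "example description modified"}
leaf_three = {"./../h1": "Example Domain", "./../../p[3]": "example description"}

print("one vs two", leaf_similarity(leaf_one, leaf_two))
print("one vs three", leaf_similarity(leaf_one, leaf_three))
```

In this sketch, leaves that share paths and differ only slightly in text score close to 1, while leaves with no paths in common score 0.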
### How it works
Here we will walk through the creation of a Leaf. The Leaf of the link "a" element at depth=3 has two neighbors, "./../../h1" and
"./../../p[1]". WebLeaf starts from the element and performs a breadth-first search for neighboring elements that contain text. When it finds a
neighbor, it creates a relative XPath to it.
```html
<!doctype html>
<html>
  <body>
    <div>
      <h1>Example Domain</h1> <!-- ./../../h1 -->
      <p>This domain is for use in illustrative examples in documents.... </p> <!-- ./../../p[1] -->
      <p>
        <a href="https://www.iana.org/domains/example">More information...</a> <!-- start -->
      </p>
    </div>
  </body>
</html>
```
<p align="center">
<img src="https://github.com/thomsn/WebLeaf/raw/main/docs/WebLeaf.png" alt="WebLeaf How it Works" style="width: 80%;">
</p>
In the above DOM tree you can see how WebLeaf encoded the tree structure around the chosen element "a". This Leaf can
then be used to locate the link.
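
The walk described above can be sketched with only the standard library. This is an illustration under stated assumptions, not WebLeaf's implementation: `neighbor_paths`, `child_step`, and `locate` are hypothetical helpers, the page is a simplified, well-formed copy of the example, `depth` is assumed to count steps through the tree, and a positional index is added only when a tag repeats among siblings (modeled on the output shown earlier).

```python
from collections import deque
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the example page (assumption).
PAGE = """
<html>
  <body>
    <div>
      <h1>Example Domain</h1>
      <p>This domain is for use in illustrative examples in documents.</p>
      <p><a href="https://www.iana.org/domains/example">More information...</a></p>
    </div>
  </body>
</html>
"""


def child_step(parent, child):
    """Return a path step for child, indexed only when its tag repeats."""
    same_tag = [c for c in parent if c.tag == child.tag]
    if len(same_tag) == 1:
        return child.tag
    return f"{child.tag}[{same_tag.index(child) + 1}]"


def neighbor_paths(root, start, depth):
    """Breadth-first search from start; map relative paths to neighbor text."""
    parents = {child: parent for parent in root.iter() for child in parent}
    found = {}
    seen = {start}
    queue = deque([(start, ".", 0)])
    while queue:
        node, path, dist = queue.popleft()
        text = (node.text or "").strip()
        if node is not start and text:
            found[path] = text
        if dist == depth:
            continue  # do not walk further than `depth` steps
        steps = []
        if node in parents:
            steps.append((parents[node], ".."))
        steps.extend((c, child_step(node, c)) for c in node)
        for nxt, step in steps:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, f"{path}/{step}", dist + 1))
    return found


def locate(root, leaf, depth):
    """Return every element whose neighbor dict equals leaf exactly."""
    return [el for el in root.iter() if neighbor_paths(root, el, depth) == leaf]


root = ET.fromstring(PAGE)
start = root.find(".//a")
leaf = neighbor_paths(root, start, depth=3)
print(leaf)
print([el.tag for el in locate(root, leaf, depth=3)])
```

The exact-match test in `locate` is a deliberate simplification; a similarity score such as the one returned by `compare` would tolerate small changes in the surrounding text.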
<em>"You become who you surround yourself with."</em> src: Someone Important
Raw data
{
"_id": null,
"home_page": null,
"name": "webleaf",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "dom, web, webscraping, leaf, beautifulsoup, html, tree, structure, embedding",
"author": null,
"author_email": "Matthew Thomson <m7homson@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/fe/57/cf5b44fb75590428592b393fe41e6b106614073898ecf4df0557a9e4c8bc/webleaf-0.2.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License Copyright (c) 2024 Matt Thomson Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
"summary": "HTML DOM Tree Leaf Structure Identification Package",
"version": "0.2.2",
"project_urls": {
"documentation": "https://thomsn.github.io/WebLeaf/webleaf.html",
"homepage": "https://thomsn.github.io/WebLeaf/webleaf.html",
"repository": "https://github.com/thomsn/WebLeaf"
},
"split_keywords": [
"dom",
" web",
" webscraping",
" leaf",
" beautifulsoup",
" html",
" tree",
" structure",
" embedding"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "4de5b9ebe06b0effd5781d5e29f965085f6a23c042b0f62674ca900164d2ce1c",
"md5": "b5cf7e7b02f4e56fcca734ed69034ef8",
"sha256": "7173e14349562f8edcd3788f8b3e65fde74eaf78092f84af6a616602dc16694f"
},
"downloads": -1,
"filename": "webleaf-0.2.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b5cf7e7b02f4e56fcca734ed69034ef8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 5689,
"upload_time": "2024-06-14T16:45:50",
"upload_time_iso_8601": "2024-06-14T16:45:50.029701Z",
"url": "https://files.pythonhosted.org/packages/4d/e5/b9ebe06b0effd5781d5e29f965085f6a23c042b0f62674ca900164d2ce1c/webleaf-0.2.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "fe57cf5b44fb75590428592b393fe41e6b106614073898ecf4df0557a9e4c8bc",
"md5": "1644490a381d85861250b36029dc671f",
"sha256": "e40cf9d00cdc42ff9472c9778427a2107509f1ef7048eeac5f8845e4eea5b857"
},
"downloads": -1,
"filename": "webleaf-0.2.2.tar.gz",
"has_sig": false,
"md5_digest": "1644490a381d85861250b36029dc671f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 6411,
"upload_time": "2024-06-14T16:45:51",
"upload_time_iso_8601": "2024-06-14T16:45:51.866054Z",
"url": "https://files.pythonhosted.org/packages/fe/57/cf5b44fb75590428592b393fe41e6b106614073898ecf4df0557a9e4c8bc/webleaf-0.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-06-14 16:45:51",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "thomsn",
"github_project": "WebLeaf",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "lxml",
"specs": [
[
"==",
"5.2.2"
]
]
},
{
"name": "cssselect",
"specs": [
[
"==",
"1.2.0"
]
]
}
],
"lcname": "webleaf"
}