webleaf


| Field | Value |
|---|---|
| Name | webleaf |
| Version | 0.2.2 |
| home_page | None |
| Summary | HTML DOM Tree Leaf Structure Identification Package |
| upload_time | 2024-06-14 16:45:51 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | None |
| license | MIT License Copyright (c) 2024 Matt Thomson Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
| keywords | dom, web, webscraping, leaf, beautifulsoup, html, tree, structure, embedding |
| requirements | lxml, cssselect |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

<p align="center">
  <img src="https://github.com/thomsn/WebLeaf/raw/main/docs/logo.webp" alt="WebLeaf Logo" style="width: 100%;">
</p>

# WebLeaf Package
#### HTML DOM Tree Leaf Structure Identification Python Package 
Websites are generally built as a composition of components. If you understand the structure of a given website, you
can better understand the data within it. WebLeaf helps you classify elements within the DOM tree by creating a
dict representation of an element's neighbours. This dict can then be used to develop robust data scraping logic. WebLeaf
is an alternative to CSS selectors and XPaths, which often fail when a page's markup changes.
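
To make the fragility concrete, here is a minimal sketch that does not use WebLeaf at all. It uses only the standard library, and the two HTML snippets and the path are made up for illustration. A fixed path finds the link in the first version of a page and misses it once a wrapper element is added.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page; the second adds a <main> wrapper.
BEFORE = "<html><body><div><h1>Title</h1><p><a href='/more'>More</a></p></div></body></html>"
AFTER = "<html><body><main><div><h1>Title</h1><p><a href='/more'>More</a></p></div></main></body></html>"

# A fixed path from the document root to the link.
PATH = "./body/div/p/a"

before_hit = ET.fromstring(BEFORE).find(PATH)
after_hit = ET.fromstring(AFTER).find(PATH)

print(before_hit is not None)  # True: the path matches the original layout
print(after_hit is not None)   # False: the extra wrapper breaks the path
```

The link and its surrounding heading are unchanged in both versions, which is the kind of local structure a Leaf describes.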

### Install
To install the current release:
```bash
pip install webleaf
```
### Basic
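Here we compute the Leaf for the link ("a") element on example.com. The example uses `requests` to download the page; it is not among WebLeaf's listed requirements, so install it separately if needed.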
```python
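import requests
from lxml import etree
from webleaf import Leaf


def get_html(url="https://example.com/"):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors
    return response.text


html = get_html()
root = etree.HTML(html)         # parse the HTML string into an element
tree = etree.ElementTree(root)  # wrap the root element in a tree

# Build the Leaf for the <a> element, searching neighbours with depth=3.
leaf = Leaf().from_xpath(tree, xpath=".//a", depth=3)
print(leaf)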
```
Output:
```json
{"./../../h1": "Example Domain", "./../../p[1]": "This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission."}
```
### Comparing Leaves
Leaves can be compared with each other, so you can find similar elements within the document. 
```python
from webleaf import Leaf

leaf_one = Leaf({"./../../h1": "Example Domain", "./../../p[1]": "example description"})
leaf_two = Leaf({"./../../h1": "Example Domain", "./../../p[1]": "example description modified"})
leaf_three = Leaf({"./../h1": "Example Domain", "./../../p[3]": "example description"})

print("compare leaf one and two", leaf_one.compare(leaf_two))
print("compare leaf one and three", leaf_one.compare(leaf_three))
```
Output:
```text
compare leaf one and two 0.9375
compare leaf one and three 0.50244140625
```
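
How `compare` arrives at these scores is not described here. As a rough intuition only, the sketch below scores two plain dicts with a different, assumed scheme: paths present in both dicts contribute the similarity of their texts, and paths present in only one contribute nothing. The function name `leaf_similarity` is hypothetical, and the numbers it produces will not match WebLeaf's.

```python
from difflib import SequenceMatcher


def leaf_similarity(a: dict, b: dict) -> float:
    """Average text similarity over every path seen in either dict.

    This is an illustrative stand-in, not WebLeaf's scoring.
    """
    paths = set(a) | set(b)
    if not paths:
        return 1.0  # two empty dicts are treated as identical
    total = 0.0
    for path in paths:
        if path in a and path in b:
            # ratio() is 1.0 for identical strings and lower otherwise
            total += SequenceMatcher(None, a[path], b[path]).ratio()
    return total / len(paths)


one = {"./../../h1": "Example Domain", "./../../p[1]": "example description"}
two = {"./../../h1": "Example Domain", "./../../p[1]": "example description modified"}
three = {"./../h1": "Example Domain", "./../../p[3]": "example description"}

print(leaf_similarity(one, one))    # 1.0
print(leaf_similarity(one, three))  # 0.0, no path is shared
print(leaf_similarity(one, two) > leaf_similarity(one, three))  # True
```

Unlike this sketch, WebLeaf gives leaf one and leaf three a score of about 0.5 even though none of their paths match exactly, so its comparison is evidently more forgiving of structural differences.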

### How it works
Here we walk through the creation of a Leaf. With depth=3, the Leaf of the link "a" element has two neighbours: "./../../h1" and
"./../../p[1]". WebLeaf starts from the element and performs a breadth-first search for neighbouring elements that contain text. For each
neighbour it finds, it records a relative XPath from the starting element to that neighbour, together with the neighbour's text.
```html
<!doctype html>
<html>
    <body>
        <div>
            <h1>Example Domain</h1>                                                             <!--  ./../../h1  -->
            <p>This domain is for use in illustrative examples in documents....     </p>        <!-- ./../../p[1] -->
            <p>
                <a href="https://www.iana.org/domains/example">More information...</a>          <!--    start     -->
            </p>
        </div>
    </body>
</html>
```

<p align="center">
  <img src="https://github.com/thomsn/WebLeaf/raw/main/docs/WebLeaf.png" alt="WebLeaf How it Works" style="width: 80%;">
</p>

In the above DOM tree you can see how WebLeaf encoded the tree structure around the chosen element "a". This Leaf can 
then be used to locate the link.
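
The walk-through above can be sketched with the standard library alone. This is not WebLeaf's implementation: the helper names `child_step` and `neighbour_paths` are hypothetical, the HTML is a trimmed copy of the example, and the rule of indexing a step only when a tag repeats among siblings is an assumption inferred from the paths shown above.

```python
from collections import deque
import xml.etree.ElementTree as ET

HTML = """<html><body><div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div></body></html>"""


def child_step(parent, child):
    """Return the path step for child, indexed only if its tag repeats."""
    same_tag = [c for c in parent if c.tag == child.tag]
    if len(same_tag) == 1:
        return child.tag
    return f"{child.tag}[{same_tag.index(child) + 1}]"


def neighbour_paths(root, start, depth):
    """Breadth-first search outward from start, at most depth steps away."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    found = {}
    seen = {start}
    queue = deque([(start, ".", 0)])
    while queue:
        node, path, dist = queue.popleft()
        text = (node.text or "").strip()
        if node is not start and text:
            found[path] = text  # a neighbour with text: record its path
        if dist == depth:
            continue  # do not travel further than the requested depth
        steps = []
        parent = parent_of.get(node)
        if parent is not None:
            steps.append((parent, f"{path}/.."))
        steps.extend((c, f"{path}/{child_step(node, c)}") for c in node)
        for neighbour, neighbour_path in steps:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, neighbour_path, dist + 1))
    return found


root = ET.fromstring(HTML)
start = root.find(".//a")
leaf = neighbour_paths(root, start, depth=3)
print(leaf)
```

On this snippet the sketch yields the same two paths as the example, "./../../h1" and "./../../p[1]", because the enclosing "p" and "div" carry no text of their own and the heading and first paragraph are exactly three steps from the link.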

<em>"You become who you surround yourself with."</em> src: Someone Important

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "webleaf",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "dom, web, webscraping, leaf, beautifulsoup, html, tree, structure, embedding",
    "author": null,
    "author_email": "Matthew Thomson <m7homson@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/fe/57/cf5b44fb75590428592b393fe41e6b106614073898ecf4df0557a9e4c8bc/webleaf-0.2.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 Matt Thomson  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "HTML DOM Tree Leaf Structure Identification Package",
    "version": "0.2.2",
    "project_urls": {
        "documentation": "https://thomsn.github.io/WebLeaf/webleaf.html",
        "homepage": "https://thomsn.github.io/WebLeaf/webleaf.html",
        "repository": "https://github.com/thomsn/WebLeaf"
    },
    "split_keywords": [
        "dom",
        " web",
        " webscraping",
        " leaf",
        " beautifulsoup",
        " html",
        " tree",
        " structure",
        " embedding"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4de5b9ebe06b0effd5781d5e29f965085f6a23c042b0f62674ca900164d2ce1c",
                "md5": "b5cf7e7b02f4e56fcca734ed69034ef8",
                "sha256": "7173e14349562f8edcd3788f8b3e65fde74eaf78092f84af6a616602dc16694f"
            },
            "downloads": -1,
            "filename": "webleaf-0.2.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b5cf7e7b02f4e56fcca734ed69034ef8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 5689,
            "upload_time": "2024-06-14T16:45:50",
            "upload_time_iso_8601": "2024-06-14T16:45:50.029701Z",
            "url": "https://files.pythonhosted.org/packages/4d/e5/b9ebe06b0effd5781d5e29f965085f6a23c042b0f62674ca900164d2ce1c/webleaf-0.2.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fe57cf5b44fb75590428592b393fe41e6b106614073898ecf4df0557a9e4c8bc",
                "md5": "1644490a381d85861250b36029dc671f",
                "sha256": "e40cf9d00cdc42ff9472c9778427a2107509f1ef7048eeac5f8845e4eea5b857"
            },
            "downloads": -1,
            "filename": "webleaf-0.2.2.tar.gz",
            "has_sig": false,
            "md5_digest": "1644490a381d85861250b36029dc671f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 6411,
            "upload_time": "2024-06-14T16:45:51",
            "upload_time_iso_8601": "2024-06-14T16:45:51.866054Z",
            "url": "https://files.pythonhosted.org/packages/fe/57/cf5b44fb75590428592b393fe41e6b106614073898ecf4df0557a9e4c8bc/webleaf-0.2.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-14 16:45:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "thomsn",
    "github_project": "WebLeaf",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "lxml",
            "specs": [
                [
                    "==",
                    "5.2.2"
                ]
            ]
        },
        {
            "name": "cssselect",
            "specs": [
                [
                    "==",
                    "1.2.0"
                ]
            ]
        }
    ],
    "lcname": "webleaf"
}
        