webleaf


Name: webleaf
Version: 0.3.12
home_page: None
Summary: HTML DOM Tree Leaf Structure Identification Package
upload_time: 2024-09-12 20:40:35
maintainer: None
docs_url: None
author: None
requires_python: None
license: MIT License Copyright (c) 2024 Matt Thomson Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords: dom web webscraping leaf beautifulsoup html tree structure embedding
requirements: No requirements were recorded.
Travis-CI: No Travis.
coveralls test coverage: No coveralls.
<p align="center">
  <img src="https://github.com/thomsn/WebLeaf/raw/main/docs/logo.webp" alt="WebLeaf Logo" style="width: 62%;">
</p>

# 🌿 WebLeaf - A Graph-Based HTML Parsing and Comparison Tool

[![PyPI version](https://badge.fury.io/py/webleaf.svg)](https://badge.fury.io/py/webleaf)  
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**WebLeaf** is a Python package that applies **graph neural networks (GNNs)** to HTML parsing and element comparison. It encodes HTML elements into graph embeddings that support element extraction, structural comparison, and distance measurement between elements. WebLeaf is suited to web scraping, semantic HTML analysis, and automated web page comparison.

## Key Features

- 🌟 **Graph-Based HTML Representation**: Treats the HTML structure as a graph, encoding elements as nodes and relationships as edges.
- 📄 **Tag and Text Embeddings**: Leverages embeddings for both HTML tags and textual content to capture meaningful semantic and structural representations.
- 🔍 **Element Extraction**: Retrieve elements using XPath or CSS selectors.
- 🛠️ **Element Comparison**: Measure similarity between elements based on their content and structure using graph embeddings.
- 📈 **Pretrained GCN Model**: Built on top of a pretrained **Graph Convolutional Network (GCN)**, enabling rich semantic and structural analysis out of the box.

## Installation

You can install WebLeaf using pip:

```bash
pip install webleaf
```

## How It Works

WebLeaf represents an HTML document as a **graph**, where each HTML element is a node, and the parent-child relationships between elements form the edges of the graph. The graph is then processed by a **GCN (Graph Convolutional Network)** that creates embeddings for each HTML element. These embeddings capture both the semantic content and structural relationships of the elements, allowing for tasks like element comparison, similarity measurement, and extraction.
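The graph construction described above can be sketched with the standard library alone. This is a minimal illustration, not WebLeaf's own parser: the names `GraphBuilder`, `html_to_graph`, and `VOID_TAGS` are hypothetical, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class GraphBuilder(HTMLParser):
    """Collects one node per element and one edge per parent-child pair."""

    def __init__(self):
        super().__init__()
        self.tags = []    # node index -> tag name
        self.texts = []   # node index -> text directly inside the element
        self.edges = []   # (parent index, child index)
        self._stack = []  # indices of the currently open elements

    def handle_starttag(self, tag, attrs):
        node = len(self.tags)
        self.tags.append(tag)
        self.texts.append("")
        if self._stack:
            self.edges.append((self._stack[-1], node))
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating unclosed children.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.tags[self._stack[i]] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if self._stack and data.strip():
            self.texts[self._stack[-1]] += data.strip()


def html_to_graph(html):
    """Return (tags, texts, edges) for an HTML string."""
    builder = GraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.tags, builder.texts, builder.edges


tags, texts, edges = html_to_graph(
    "<html><body><div><p>Hello</p><img src='a.png'><p>World</p></div></body></html>"
)
print(tags)   # ['html', 'body', 'div', 'p', 'img', 'p']
print(edges)  # [(0, 1), (1, 2), (2, 3), (2, 4), (2, 5)]
```

Each node index doubles as a row index for a feature matrix, which is the shape a graph neural network typically expects.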

The model also combines **tag embeddings** (representing HTML tags) and **text embeddings** (representing the textual content of elements), creating a powerful representation of the HTML page.
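To make the idea concrete, here is a rough sketch of a single graph convolution step over combined tag and text features. Everything in it is an assumption for illustration: the one-hot tag encoding, the "has text" flag standing in for a real text embedding, the identity weight matrix, and the function names. WebLeaf's pretrained model uses learned weights and its own feature encoders.

```python
import math


def node_features(tags, texts, vocab):
    """Concatenate a one-hot tag vector with a crude text feature per node."""
    rows = []
    for tag, text in zip(tags, texts):
        one_hot = [1.0 if tag == v else 0.0 for v in vocab]
        rows.append(one_hot + [1.0 if text else 0.0])
    return rows


def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 for the undirected parent-child graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_nodes)] for i in range(num_nodes)]
    for parent, child in edges:
        a[parent][child] = 1.0
        a[child][parent] = 1.0
    degree = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def matmul(x, y):
    """Dense matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def gcn_layer(a_hat, features, weights):
    """One graph convolution: ReLU(A_hat @ X @ W)."""
    mixed = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, value) for value in row] for row in mixed]


# A <div> with two <p> children that both contain text.
tags = ["div", "p", "p"]
texts = ["", "first", "second"]
edges = [(0, 1), (0, 2)]

x = node_features(tags, texts, vocab=["div", "p"])
a_hat = normalized_adjacency(len(tags), edges)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # stand-in for learned weights
embeddings = gcn_layer(a_hat, x, identity)

for tag, row in zip(tags, embeddings):
    print(tag, [round(value, 3) for value in row])
```

After one step, each node's vector mixes its own features with those of its parent and children, so the two `<p>` nodes, which sit in the same structural position, end up with the same embedding.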

## Basic Usage

Here's a quick example of how to use WebLeaf:

```python
from webleaf import Web

# Load your HTML content
with open('example.html') as f:
    html_content = f.read()

# Create a Web object
web = Web(html_content)

# Extract an element using XPath
leaf = web.leaf(xpath=".//p")

# Extract an element using CSS selectors
leaf_css = web.leaf(css_select="div.card:nth-child(1) > div:nth-child(2) > p:nth-child(1)")

# Compare two elements
similarity = leaf.similarity(leaf_css)
print(f"Similarity: {similarity}")
# Output: Similarity: 1.0

# Find the closest match for an element
path = web.find(leaf)
print(f"Element found at: {path}")
# Output: Element found at: /html/body/div/div/div[1]/div[1]/p
```

### Advanced Features

- **Find Similar Elements**: You can also find the top `n` most similar elements to a given one:

    ```python
    similar_paths = web.find_n(leaf, n=3)
    print(f"Top 3 similar elements: {similar_paths}")
    # Output: Top 3 similar elements: ['/html/body/div/div/div[1]/div[1]/p', '/html/body/div/div/div[2]/div[1]/p', '/html/body/div/div/div[3]/div[1]/span']
    ```
  

- **Distance Measurement**: Measure how different two elements are using `mdist()`:

    ```python
    distance = leaf.mdist(leaf_css)
    print(f"Distance: {distance}")
    # Output: Distance: 0.0
    ```

## API Documentation

### `Web(html)`
- **Description**: Initializes the WebLeaf model with the HTML content, parses the document, and encodes it into a graph representation.
- **Arguments**:
  - `html` (str): The HTML content as a string.

### `Web.leaf(xpath=None, css_select=None)`
- **Description**: Retrieves an HTML element as a `Leaf` object using either an XPath expression or a CSS selector.
- **Arguments**:
  - `xpath` (str): The XPath of the desired element.
  - `css_select` (str): The CSS selector for the desired element.

### `Leaf.similarity(leaf)`
- **Description**: Computes the similarity score between this `Leaf` and another `Leaf`, based on their embeddings.
- **Returns**: A similarity score between 0 and 1.
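
The README does not say which similarity measure is used. As an illustration only, cosine similarity is one common way to compare two embedding vectors; the function below is a hypothetical stand-in, not WebLeaf's implementation. For vectors with non-negative components, its result falls between 0 and 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```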

### `Leaf.mdist(leaf)`
- **Description**: Measures the distance between this `Leaf` and another `Leaf`, representing how different they are.

### `Web.find(leaf)`
- **Description**: Finds the closest match for a given `Leaf` object within the HTML structure.
- **Returns**: The XPath of the closest matching element.

### `Web.find_n(leaf, n)`
- **Description**: Finds the top `n` most similar elements to a given `Leaf` object, sorted by similarity.
- **Returns**: A list of XPaths for the top `n` most similar elements.
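
Conceptually, a top-`n` lookup like this amounts to ranking every element's embedding against the query. The sketch below shows that ranking with `heapq.nlargest`, under the assumption that embeddings are plain vectors keyed by XPath and compared by cosine similarity. The `rank_paths` helper and the sample vectors are hypothetical and say nothing about how WebLeaf stores or scores its embeddings.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_paths(query, embeddings, n):
    """Return the n XPaths whose embeddings are most similar to `query`, best first."""
    return heapq.nlargest(
        n, embeddings, key=lambda path: cosine_similarity(query, embeddings[path])
    )


# Made-up two-dimensional embeddings keyed by XPath.
embeddings = {
    "/html/body/p[1]": [3.0, 4.0],
    "/html/body/p[2]": [4.0, 3.0],
    "/html/body/span": [0.0, 1.0],
}

print(rank_paths([3.0, 4.0], embeddings, n=2))
# ['/html/body/p[1]', '/html/body/p[2]']
```

With `n=1` the same ranking returns only the single closest path, which mirrors the relationship between `find` and `find_n`.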

## Running Tests

WebLeaf comes with a suite of unit tests to ensure everything works as expected. These tests cover basic operations like element extraction, similarity comparisons, and graph encoding. To run the tests:

1. Clone this repository.
2. Install the required dependencies using `pip install -r requirements.txt`.
3. Run the tests using `pytest`:

```bash
pytest
```

## Example Test

```python
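from pathlib import Path

from webleaf import Web

# Assumption: the tests reuse the example.html file from Basic Usage.
example_html = Path("example.html").read_text()

def test_leaf_extraction():
    web = Web(example_html)
    leaf = web.leaf(xpath=".//p")
    assert leaf

def test_element_comparison():
    web = Web(example_html)
    leaf1 = web.leaf(xpath=".//p")
    leaf2 = web.leaf(css_select="div.card:nth-child(1) > div:nth-child(2) > p:nth-child(1)")
    # Both selectors are expected to pick matching elements in example.html.
    assert leaf1.similarity(leaf2) > 0.9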
```

## Pretrained Model

The WebLeaf model uses a pretrained **Graph Convolutional Network (GCN)** that has been trained on a diverse set of web pages to learn the structure and semantic relationships within HTML. The model is loaded from `product_page_model_4_80.torch` and is used to encode HTML elements into embeddings.

## Performance
The t-SNE (t-distributed Stochastic Neighbor Embedding) plot below is a 2D visualization of WebLeaf-encoded web elements. t-SNE projects high-dimensional data, such as the embeddings generated by WebLeaf, into two dimensions, which makes relationships and groupings among different types of web elements easier to see.

<p align="center">
  <img src="https://github.com/thomsn/WebLeaf/raw/main/docs/tsne2.png" alt="WebLeaf Performance" style="width: 62%;">
</p>

## Contributing

We welcome contributions! Feel free to submit issues, feature requests, or pull requests. Here's how you can contribute:

1. Fork the repository.
2. Create your feature branch: `git checkout -b feature/new-feature`.
3. Commit your changes: `git commit -m 'Add new feature'`.
4. Push to the branch: `git push origin feature/new-feature`.
5. Open a pull request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

🌿 **WebLeaf** is a powerful and flexible tool for working with HTML as structured graph data. Give it a try and start leveraging the power of graph neural networks for your web scraping and analysis needs!

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "webleaf",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "dom, web, webscraping, leaf, beautifulsoup, html, tree, structure, embedding",
    "author": null,
    "author_email": "Matthew Thomson <m7homson@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/bd/3e/376965ecbbc3a73a975aea35cec2be8a0db2fb273c4ae1075cf508f5be77/webleaf-0.3.12.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 Matt Thomson  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "HTML DOM Tree Leaf Structure Identification Package",
    "version": "0.3.12",
    "project_urls": {
        "documentation": "https://thomsn.github.io/WebLeaf/webleaf.html",
        "homepage": "https://thomsn.github.io/WebLeaf/webleaf.html",
        "repository": "https://github.com/thomsn/WebLeaf"
    },
    "split_keywords": [
        "dom",
        " web",
        " webscraping",
        " leaf",
        " beautifulsoup",
        " html",
        " tree",
        " structure",
        " embedding"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1711dcf6827eca730b2dc1e270c42057c41e8e9b8e32a5ec1440fc448f41c9e8",
                "md5": "d64d9e49619b5651afcb362463c4b647",
                "sha256": "74b4ad0caf2a311272dc9d30581494df406a71c7a6605c95cc536594232d56e7"
            },
            "downloads": -1,
            "filename": "webleaf-0.3.12-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d64d9e49619b5651afcb362463c4b647",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 586434,
            "upload_time": "2024-09-12T20:40:34",
            "upload_time_iso_8601": "2024-09-12T20:40:34.507231Z",
            "url": "https://files.pythonhosted.org/packages/17/11/dcf6827eca730b2dc1e270c42057c41e8e9b8e32a5ec1440fc448f41c9e8/webleaf-0.3.12-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bd3e376965ecbbc3a73a975aea35cec2be8a0db2fb273c4ae1075cf508f5be77",
                "md5": "facae9bfb94467b646c855eb7c6076cc",
                "sha256": "f6170e099c4dcfe25054185ef88a907a66a46d73a6dcf5ef3cd9164975283fc1"
            },
            "downloads": -1,
            "filename": "webleaf-0.3.12.tar.gz",
            "has_sig": false,
            "md5_digest": "facae9bfb94467b646c855eb7c6076cc",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 589089,
            "upload_time": "2024-09-12T20:40:35",
            "upload_time_iso_8601": "2024-09-12T20:40:35.878942Z",
            "url": "https://files.pythonhosted.org/packages/bd/3e/376965ecbbc3a73a975aea35cec2be8a0db2fb273c4ae1075cf508f5be77/webleaf-0.3.12.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-12 20:40:35",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "thomsn",
    "github_project": "WebLeaf",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "webleaf"
}
        