| Field | Value |
|-------|-------|
| Name | deboiler |
| Version | 2023.46.150 |
| Home page | https://github.com/globality-corp/deboiler |
| Summary | Deboiler is an open-source package to clean HTML pages across an entire domain |
| Upload time | 2023-11-17 22:52:36 |
| Author | Globality AI |
| License | MIT |
| Keywords | deboiler, python, html cleaning |
| Requirements | No requirements were recorded. |

# Deboiler

[DISCLAIMER](./DISCLAIMER.md)

<img src="resources/logo.jpg" width="140" height="140">

The deboiler is a Python module for webpage cleaning, distributed under the Apache License. It implements a simple, yet novel, domain cleaning algorithm: given all pages from a website, it identifies and removes boilerplate elements.

Benefits of the Deboiler approach to webpage cleaning include:

* It is entirely unsupervised and does not need any human annotations.
* It preserves the HTML structure during cleaning and can return cleaned HTML as well as cleaned text.

## Approach

At a high level, `deboiler` detects boilerplate elements by identifying near-identical subtrees (from the HTML DOM tree) that are shared between two pages in the domain. The following provides more details about the underlying approach:

* <b>Candidate subtrees:</b> A candidate subtree in a page is a node whose HTML tag is from a limited list (such as `<div>`, `<nav>`, `<navigation>`, `<footer>`, `<header>`, etc.).

* <b>Subtree comparison:</b> Each subtree is represented by a plain-text string that is created recursively by concatenating the representations of all of its constituent elements. In this process, HTML tag attributes are ignored. For instance, in the node `<a href="https://www.linkedin.com/foo">Linkedin</a>`, the attribute `href="https://www.linkedin.com/foo"` is ignored. As a result, subtrees with similar structure and similar text, but potentially different tag attributes, will have the same representation.

* <b>Boilerplate elements from a pair of pages:</b> Given a pair of pages from the same domain, candidate subtrees that are shared (have the same representation) between the two pages are considered boilerplate.

* <b>All domain’s boilerplate elements:</b> Boilerplate elements identified from each pair are added to the set of all boilerplate elements for the domain. Rather than comparing every possible pair, we sort pages by URL and compare each page only with the next one, which requires just `O(n)` page comparisons. This is based on the observation that most modern domains take advantage of folder structures in URL paths, and hence, pages with similar parent directories are usually more similar than random pairs. As a result, more boilerplate elements can be identified with less computation.

* <b>Safeguard against identical pages:</b> To guard against comparing identical pages (and inadvertently marking all elements as boilerplate), we skip pairs whose intersection-over-union (i.e. the ratio of shared elements to all elements) is above a certain threshold.

* <b>Cleaning a page:</b> To clean a page from the domain, any subtree in the page that is among the domain’s boilerplate elements is removed.
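The steps above can be illustrated with a minimal, self-contained sketch. It is not the package's implementation: it uses the standard-library `xml.etree.ElementTree` instead of LXML so that it runs without extra dependencies, and the tag list, the function names, and the `iou_threshold` value of `0.9` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Assumption: a small illustrative tag list; the package's actual list may differ.
CANDIDATE_TAGS = {"div", "nav", "footer", "header"}


def representation(node):
    """Recursively build a plain-text representation of a subtree, ignoring attributes."""
    parts = [f"<{node.tag}>", (node.text or "").strip()]
    for child in node:
        parts.append(representation(child))
        parts.append((child.tail or "").strip())
    parts.append(f"</{node.tag}>")
    return "".join(parts)


def candidate_representations(root):
    """Representations of all candidate subtrees in a page."""
    return {representation(n) for n in root.iter() if n.tag in CANDIDATE_TAGS}


def fit(pages, iou_threshold=0.9):
    """Sort pages by URL and compare each page with the next one (O(n) comparisons)."""
    boilerplate = set()
    urls = sorted(pages)
    for url_a, url_b in zip(urls, urls[1:]):
        reps_a = candidate_representations(pages[url_a])
        reps_b = candidate_representations(pages[url_b])
        shared, union = reps_a & reps_b, reps_a | reps_b
        # Safeguard: skip pairs that look like the same page.
        if union and len(shared) / len(union) > iou_threshold:
            continue
        boilerplate |= shared
    return boilerplate


def clean(root, boilerplate):
    """Remove every candidate subtree whose representation is known boilerplate."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in CANDIDATE_TAGS and representation(child) in boilerplate:
                parent.remove(child)
    return root


# Two toy pages whose navigation differs only in the `href` attribute.
pages = {
    "https://example.com/blog/a": ET.fromstring(
        '<html><body><nav><a href="/x">Home</a></nav><div><p>Post A</p></div></body></html>'
    ),
    "https://example.com/blog/b": ET.fromstring(
        '<html><body><nav><a href="/y">Home</a></nav><div><p>Post B</p></div></body></html>'
    ),
}
boilerplate = fit(pages)
cleaned = clean(pages["https://example.com/blog/a"], boilerplate)
print("".join(cleaned.itertext()))  # prints "Post A"
```

Because attributes are dropped from the representation, the two `<nav>` subtrees compare as equal and are treated as boilerplate, while the page-specific `<div>` subtrees are kept.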

## Installation

`pip install deboiler`

## How to Use

This package contains an LXML-based, memory-efficient, and fast implementation of this boilerplate detection algorithm, with a simple `scikit-learn`-like API.

```python
from deboiler.dataset import JsonDataset
from deboiler import Deboiler


dataset = JsonDataset("path-to-json-lines-file")
deboiler = Deboiler(
    n_processes=1,  # number of processes
    operation_mode="memory",  # operation mode: `memory` or `performance`
    domain="globality",  # domain name (used for logging only)
)

# call the fit method to identify boilerplate elements
deboiler.fit(dataset)

output_pages = []
# call the transform method to yield cleaned pages
for output_page in deboiler.transform(dataset):
    # do something with the output_page
    output_pages.append(output_page)

```

## Modes of Operation

`deboiler` supports two modes of operation:

* low-memory mode: This mode offers the lowest memory footprint. It also supports multi-processing.

* high-performance mode: In this mode, parsed pages are kept in memory during `fit`, to be reused during `transform`, resulting in faster processing at the cost of higher memory footprint. This mode does _not_ support multi-processing.

|                       | single-processing | multi-processing |
|-----------------------|-------------------|------------------|
| low memory mode       | :heavy_check_mark:                 | :heavy_check_mark:                |
| high performance mode | :heavy_check_mark:                 | :x:               |

The following plot compares `deboiler` performance for different modes of operation and numbers of processes. In this benchmark, `deboiler` cleans pages from ~140 domains with 10-10k pages. With a single process, the `performance` mode completes the task faster than the `memory` mode, i.e. `(memory, 1)` (38 mins vs. 54 mins). However, the `memory` mode can outperform the `performance` mode if multi-processing is enabled (e.g. 5 or 10 processes in this example).

It is worth noting that the difference between modes of operation and multi-processing becomes more pronounced as the domain size increases.

![Performance Plot](resources/performance_plot.png)
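The support matrix above can be encoded in a small helper that builds the constructor arguments and rejects the unsupported combination early. This is only a sketch: `deboiler_kwargs` is a hypothetical helper, not part of the package, and it reuses the `operation_mode` and `n_processes` parameter names from the usage example. How the package itself reacts to an unsupported combination is not covered here.

```python
def deboiler_kwargs(mode, n_processes=1):
    """Return keyword arguments for `Deboiler`, rejecting the unsupported combination."""
    if mode not in ("memory", "performance"):
        raise ValueError(f"unknown operation mode: {mode!r}")
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    if mode == "performance" and n_processes > 1:
        raise ValueError("performance mode does not support multi-processing")
    return {"operation_mode": mode, "n_processes": n_processes}


# Low-memory mode with five worker processes.
print(deboiler_kwargs("memory", n_processes=5))
# High-performance mode is limited to a single process.
print(deboiler_kwargs("performance"))
```

The resulting dictionary can then be unpacked into the constructor, for example `Deboiler(domain="globality", **deboiler_kwargs("memory", n_processes=5))`.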

## Creating Custom Datasets

The package includes a `JsonDataset` class. It expects a JSON Lines file and has optional arguments to filter for HTML pages that were crawled successfully.

If the dataset needs to be more nuanced, you can create a custom dataset by subclassing `DeboilerDataset` and implementing the `__getitem__` and `__len__` methods, as well as the `urls` property.
It is usually beneficial to build an index of the data during class instantiation, so that `__getitem__` has random access to the records.
You can refer to [`deboiler/dataset/json_dataset.py`](deboiler/dataset/json_dataset.py) as an example.
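The indexing idea can be sketched as follows. To stay self-contained, the class below does not import `deboiler`; it is a hypothetical stand-in that only mirrors the three members named above. The field names `url` and `html`, and the choice to return the parsed record from `__getitem__`, are assumptions; a real subclass should follow whatever `DeboilerDataset` and `json_dataset.py` expect.

```python
import json
import os
import tempfile


class IndexedJsonLines:
    """Stand-in for a `DeboilerDataset` subclass (hypothetical; not the package's code)."""

    def __init__(self, path):
        self.path = path
        self._offsets = []  # byte offset of each record's line
        self._urls = []
        # Build the index once, at instantiation time.
        with open(path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                self._offsets.append(offset)
                self._urls.append(record["url"])  # assumed field name

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, index):
        # Seek straight to the record instead of scanning the file.
        with open(self.path, "rb") as f:
            f.seek(self._offsets[index])
            return json.loads(f.readline())

    @property
    def urls(self):
        return list(self._urls)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pages.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"url": "https://example.com/a", "html": "<html>A</html>"}) + "\n")
        f.write(json.dumps({"url": "https://example.com/b", "html": "<html>B</html>"}) + "\n")
    dataset = IndexedJsonLines(path)
    second = dataset[1]  # random access through the byte-offset index
    print(len(dataset), dataset.urls, second["html"])
```

Only the offsets and URLs are held in memory, so page contents are read from disk on demand.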

### Tests

Run the tests with

```
bash entrypoint.sh test
```
or simply
```
pytest .
```

Some options:
* `-s` to show printed output, which helps with debugging
* `--pdb` to drop into the debugger when an exception is raised
* `pytest route_to_test` to run a specific test file
* `pytest route_to_test::test_function` to run a specific test function
* `pytest route_to_test::test_function[test_case]` to run a specific test case of a test function
* `--cov-report term` to show coverage

You might find other code inspectors in `entrypoint.sh`. Note that these are run
against your code when you open a pull request.


# Contributing

All contributions, bug reports, security issues, bug fixes, documentation improvements, enhancements, and ideas are welcome. This section is adapted and simplified
from the [pandas contribution guide](https://pandas.pydata.org/docs/development/contributing.html).

## Submit an issue

Bug reports, security issues, and enhancement requests are an important part of making open-source software more stable and are curated through GitHub issues. When reporting an issue or request, please fill out the issue form fully to ensure others and the core development team can fully understand the scope of the issue.

The issue will then show up to the community and be open to comments/ideas from others.

## Submit a pull request

`deboiler` is hosted on GitHub, and to contribute, you will need to sign up for a free GitHub account. We use Git for version control to allow many people to work together on the project. If you are new to Git, you can reference some of the resources in the pandas contribution guide cited above.

Also, the project follows a standard forking workflow whereby contributors fork the repository, create a feature branch, make changes, push them, and then create a pull request. To avoid redundancy, please follow all the instructions in the pandas contribution guide cited above.

## Code of Conduct

As contributors to and maintainers of this project, you are expected to abide by the code of conduct. More information can be found at the [Contributor Code of Conduct](https://github.com/globality-corp/deboiler/.github/blob/master/CODE_OF_CONDUCT.md).

            
