mwparserfromhtml


- Name: mwparserfromhtml
- Version: 1.0.2
- Home page: https://gitlab.wikimedia.org/repos/research/html-dumps
- Summary: Wikipedia HTML Dump Parsing
- Upload time: 2024-02-14 13:43:33
- Author: Appledora & Isaac Johnson & Martin Gerlach
- License: MIT License
- Keywords: python, wikipedia, html
- Requirements: No requirements were recorded.

# mwparserfromhtml

`mwparserfromhtml` is a Python library for parsing and mining metadata from the Enterprise HTML Dumps that have recently been made available by [Wikimedia Enterprise](https://enterprise.wikimedia.com/). The six most recent Enterprise HTML dumps can be accessed from [*this location*](https://dumps.wikimedia.org/other/enterprise_html/runs/). The aim of this library is to provide an interface for working with these HTML dumps and extracting the most relevant features from an article.

Besides using the HTML dumps, users can also use the [Wikipedia API](https://en.wikipedia.org/api/rest_v1/#/Page%20content/get_page_html__title_) to obtain the HTML of a particular article from its title and parse the HTML string with this library.

## Motivation
When rendering content, MediaWiki converts wikitext to HTML, expanding macros to include more material. As a result, the HTML version of a Wikipedia page generally contains more information than the original source wikitext. Anyone who wants to analyze Wikipedia's content as it appears to its readers would therefore reasonably prefer to work with HTML rather than wikitext. Traditionally, only the wikitext version has been available in the [XML-dumps](https://dumps.wikimedia.org/backup-index.html). With the introduction of the Enterprise HTML dumps in 2021, anyone can now easily access and use HTML dumps (and they should).

However, parsing HTML to extract the necessary information is not a simple process. An unsuspecting user may know how to work with HTML but might not be used to the specific format of the dump files. In addition, the HTML that the MediaWiki API produces from wikitext has many edge cases, and getting a grasp of its structure requires careful study of the documentation. Identifying the features in this HTML is no trivial task! Because of these hassles, it is likely that individuals would continue working with wikitext, as there are already excellent ready-to-use parsers for it (such as [mwparserfromhell](https://github.com/earwig/mwparserfromhell)).
Therefore, we wanted to write a Python library that can efficiently parse the HTML code of an article from the Wikimedia Enterprise dumps to extract relevant elements such as text, links, templates, etc. This will hopefully lower the technical barriers to working with the HTML dumps and empower researchers and others to take advantage of this beneficial resource.

## Features
* Iterate over large tarballs of HTML dumps without extracting them to memory (memory efficient, but not subscriptable unless converted to a list)
* Extract major article metadata such as Categories, Templates, Wikilinks, External Links, Media, References, etc., with their respective type and status information
* Easily extract the content of an article from the HTML dump and customize the level of detail
* Generate summary statistics for the articles in the dump
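
The first feature, iterating over a tarball without unpacking it, can be illustrated with the standard library alone. The following is a minimal sketch of the general technique, not the library's own implementation. It assumes each tar member holds newline-delimited JSON records with a `name` field, and both `build_sample_tarball` and `iter_tar_records` are hypothetical helpers written for this illustration.

```python
import io
import json
import tarfile


def build_sample_tarball(records):
    """Build a small in-memory .tar.gz holding one NDJSON file (demo data only)."""
    payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="sample_0.ndjson")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def iter_tar_records(fileobj):
    """Yield one parsed JSON record at a time from a gzipped tarball."""
    # Members are read lazily; nothing is unpacked to disk.
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            handle = tar.extractfile(member)
            if handle is None:  # directories and other non-file members
                continue
            for line in handle:
                if line.strip():
                    yield json.loads(line)


sample = build_sample_tarball([{"name": "Earth"}, {"name": "Moon"}])
records = iter_tar_records(sample)  # a generator: iterable, but not subscriptable
first_titles = [record["name"] for record in records]
print(first_titles)
```

Because the records come from a generator, only one record is held in memory at a time, which is also why indexing such as `records[0]` is not available without first converting to a list.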


## Installation

You can install ``mwparserfromhtml`` with ``pip``:

```bash
$ pip install mwparserfromhtml
```

## Basic Usage
Check out [`example_notebook.ipynb`](docs/tutorials/example_notebook.ipynb) for a runnable example.

* Import the dump module from the library and load the dump:

```python
from mwparserfromhtml import HTMLDump

html_file_path = "TARGZ_FILE_PATH"
html_dump = HTMLDump(html_file_path)
```

* Iterate over the articles in the dump:

```python
for article in html_dump:
    print(article.get_title())
```
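
Since the dump is iterable but not subscriptable, previewing only a few articles is easier with `itertools.islice` than with indexing. The sketch below uses a plain generator as a stand-in for the dump so that it runs without a dump file; `take_first` is a hypothetical helper, not part of the library.

```python
from itertools import islice


def take_first(iterable, n):
    """Return the first n items of any iterable as a list, consuming no more than n."""
    return list(islice(iterable, n))


# Stand-in for an HTMLDump: a generator that, like the dump, cannot be indexed.
fake_dump = (f"Article {i}" for i in range(1_000_000))
preview = take_first(fake_dump, 3)
print(preview)
```

With a real dump, `take_first(html_dump, 3)` should behave the same way, assuming `HTMLDump` is a plain iterable as the feature list describes.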

* Extract the plain text of an article from the dump, i.e. remove anything that is not text such as infoboxes,
citation footnotes, or categories and replace links with their [anchor text](https://en.wikipedia.org/wiki/Anchor_text):

```python
for article in html_dump:
    print(article.get_title())
    prev_heading = "_Lead"
    for heading, paragraph in article.html.wikistew.get_plaintext(exclude_transcluded_paragraphs=True,
                                                                  exclude_para_context=None,  # set to {"pre-first-para", "between-paras", "post-last-para"} for more conservative approach
                                                                  exclude_elements={"Heading", "Math", "Citation", "List", "Wikitable", "Reference"}):
        if heading != prev_heading:
            print(f"\n{heading}:")
            prev_heading = heading
        print(paragraph)
```
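
If you would rather keep the text organized by section than print it, the `(heading, paragraph)` pairs can be collected into a dictionary. This is a small sketch on hand-written sample pairs, not real article data; `group_by_heading` is a hypothetical helper and assumes the pairs arrive in document order, as in the loop above.

```python
from collections import defaultdict


def group_by_heading(pairs):
    """Collect (heading, paragraph) pairs into {heading: [paragraphs]}, keeping order."""
    sections = defaultdict(list)
    for heading, paragraph in pairs:
        sections[heading].append(paragraph)
    return dict(sections)


# Hand-written stand-in for the output of get_plaintext(); not real article data.
sample_pairs = [
    ("_Lead", "First lead paragraph."),
    ("_Lead", "Second lead paragraph."),
    ("History", "A paragraph about history."),
]
sections = group_by_heading(sample_pairs)
print(sections["_Lead"])
```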

* Extract Templates, Categories, Wikilinks, External Links, Media, References etc. from the dump:

```python
for article in html_dump:
    print(article.html.wikistew.get_templates())
    print(article.html.wikistew.get_categories())
    print(article.html.wikistew.get_wikilinks())
    print(article.html.wikistew.get_externallinks())
    print(article.html.wikistew.get_images())
    print(article.html.wikistew.get_references())
```
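
To turn the extracted elements into simple per-dump counts, one option is to tally their lengths across articles. The sketch below assumes each getter returns a list-like collection (which is not confirmed here) and uses placeholder lists in place of real results; `tally_features` is a hypothetical helper, not part of the library.

```python
from collections import Counter


def tally_features(per_article_features):
    """Sum the number of extracted elements per feature name across articles."""
    totals = Counter()
    for features in per_article_features:
        for name, elements in features.items():
            totals[name] += len(elements)
    return totals


# Placeholder data: with the library, each list would be the return value of a
# getter such as get_wikilinks() or get_categories() for one article.
per_article_features = [
    {"wikilinks": ["A", "B"], "categories": ["X"]},
    {"wikilinks": ["C"], "categories": []},
]
totals = tally_features(per_article_features)
print(dict(totals))
```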

* Alternatively, you can process stand-alone Parsoid HTML (e.g., from the APIs) by converting it to an `Article` object and then extracting the features:

```python
from mwparserfromhtml import Article
import requests

lang = "en"
title = "Both Sides, Now"
r = requests.get(f'https://{lang}.wikipedia.org/api/rest_v1/page/html/{title}')
r.raise_for_status()  # stop early if the request failed
article = Article(r.text)
print(f"Article Name: {article.get_title()}")
print(f"Abstract: {article.wikistew.get_first_paragraph()}")
```
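
Titles that contain characters such as `/` are safer to percent-encode before they are placed in the request path. The following sketch builds the same REST URL as above using only the standard library; `rest_html_url` is a hypothetical helper, and replacing spaces with underscores follows the usual Wikipedia title convention rather than anything this library requires.

```python
from urllib.parse import quote


def rest_html_url(lang, title):
    """Build the REST URL for a page's HTML, percent-encoding the title."""
    # Wikipedia titles conventionally use underscores instead of spaces.
    normalized = title.replace(" ", "_")
    # safe="" also encodes "/", which would otherwise be read as a path separator.
    encoded = quote(normalized, safe="")
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/html/{encoded}"


url_plain = rest_html_url("en", "Both Sides, Now")
url_slash = rest_html_url("en", "AC/DC")
print(url_plain)
print(url_slash)
```

The resulting string can be passed to `requests.get` in place of the f-string used in the example above.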

## Project Information
- [Licensing](https://gitlab.wikimedia.org/repos/research/html-dumps/-/blob/main/LICENSE)
- [Repository](https://gitlab.wikimedia.org/repos/research/html-dumps)
- [Issue Tracker](https://gitlab.wikimedia.org/repos/research/html-dumps/-/issues)
- [Contribution Guidelines](CONTRIBUTION.md)
- [Tutorials](docs/tutorials)

## Acknowledgements

This project was started as part of an [Outreachy](https://www.outreachy.org/) internship from May to August 2022. It has benefited greatly from the work of Earwig ([mwparserfromhell](https://github.com/earwig/mwparserfromhell)) and Slavina Stefanova ([mwsql](https://github.com/mediawiki-utilities/python-mwsql)).

            

Raw data

            {
    "_id": null,
    "home_page": "https://gitlab.wikimedia.org/repos/research/html-dumps",
    "name": "mwparserfromhtml",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "python,wikipedia,html",
    "author": "Appledora & Isaac Johnson & Martin Gerlach",
    "author_email": "<isaac@wikimedia.org>",
    "download_url": "https://files.pythonhosted.org/packages/5a/7c/a1d8bfd4ed7eb4006809915a369e09576083ada84bbe8091778c4541b00f/mwparserfromhtml-1.0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "Wikipedia HTML Dump Parsing",
    "version": "1.0.2",
    "project_urls": {
        "Homepage": "https://gitlab.wikimedia.org/repos/research/html-dumps"
    },
    "split_keywords": [
        "python",
        "wikipedia",
        "html"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f9bc00f231b5590595a2b45217a65d90b075978c7de27466726c0926d05f25f2",
                "md5": "dc8c53a8a52b49aef6affc73eabafb5e",
                "sha256": "82f668ce605198b945c307d0eb2e47c65d8c537451a233a55575259f22ed6d0f"
            },
            "downloads": -1,
            "filename": "mwparserfromhtml-1.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "dc8c53a8a52b49aef6affc73eabafb5e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 56277,
            "upload_time": "2024-02-14T13:43:31",
            "upload_time_iso_8601": "2024-02-14T13:43:31.650395Z",
            "url": "https://files.pythonhosted.org/packages/f9/bc/00f231b5590595a2b45217a65d90b075978c7de27466726c0926d05f25f2/mwparserfromhtml-1.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5a7ca1d8bfd4ed7eb4006809915a369e09576083ada84bbe8091778c4541b00f",
                "md5": "16b5db7ea7b2e2eaabeeabe007de48e5",
                "sha256": "2583fbf178ad6a69cfa5d3baba2b2e97ae07e7308e8a791d715c9b845912c858"
            },
            "downloads": -1,
            "filename": "mwparserfromhtml-1.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "16b5db7ea7b2e2eaabeeabe007de48e5",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 59513,
            "upload_time": "2024-02-14T13:43:33",
            "upload_time_iso_8601": "2024-02-14T13:43:33.144660Z",
            "url": "https://files.pythonhosted.org/packages/5a/7c/a1d8bfd4ed7eb4006809915a369e09576083ada84bbe8091778c4541b00f/mwparserfromhtml-1.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-14 13:43:33",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "mwparserfromhtml"
}
        