htmldate

-   Name: htmldate
-   Version: 1.9.2
-   Summary: Fast and robust extraction of original and updated publication dates from URLs and web pages.
-   Upload time: 2024-11-12 12:31:51
-   Requires Python: >=3.8
-   License: Apache 2.0
-   Keywords: datetime, date-parser, entity-extraction, html-extraction, html-parsing, metadata-extraction, webarchives, web-scraping

# Htmldate: Find the Publication Date of Web Pages

[![Python package](https://img.shields.io/pypi/v/htmldate.svg)](https://pypi.python.org/pypi/htmldate)
[![Python versions](https://img.shields.io/pypi/pyversions/htmldate.svg)](https://pypi.python.org/pypi/htmldate)
[![Documentation Status](https://readthedocs.org/projects/htmldate/badge/?version=latest)](https://htmldate.readthedocs.org/en/latest/?badge=latest)
[![Code Coverage](https://img.shields.io/codecov/c/github/adbar/htmldate.svg)](https://codecov.io/gh/adbar/htmldate)
[![Downloads](https://img.shields.io/pypi/dm/htmldate?color=informational)](https://pepy.tech/project/htmldate)
[![JOSS article reference DOI: 10.21105/joss.02439](https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen)](https://doi.org/10.21105/joss.02439)

<br/>

<img src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-logo.png" alt="Logo as PNG image" width="60%"/>

<br/>

Find **original and updated publication dates** of any web page. **On
the command line or with Python**, all the steps needed, from web page
download to HTML parsing, scraping, and text analysis, are included. The
package is used in production on millions of documents and integrated by
[multiple
libraries](https://github.com/adbar/htmldate/network/dependents).


## In a nutshell

<br/>

<img src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-demo.gif" alt="Demo as GIF image" width="80%"/>

<br/>

### With Python

``` python
>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'2016-12-23'
```

### On the command line

``` bash
$ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
'2016-12-23'
```

## Features

-   Flexible input: URLs, HTML files, or HTML trees can be used as input
    (including batch processing).
-   Customizable output: Any date format (defaults to [ISO 8601
    YMD](https://en.wikipedia.org/wiki/ISO_8601)).
-   Detection of both original and updated dates.
-   Multilingual.
-   Compatible with all recent versions of Python.

### How it works

Htmldate operates by sifting through HTML markup and, if necessary, text
elements. It features the following heuristics:

1.  **Markup in header**: Common patterns are used to identify relevant
    elements (e.g. `link` and `meta` elements) including [Open Graph
    protocol](http://ogp.me/) attributes.
2.  **HTML code**: The whole document is searched for structural markers
    like `abbr` or `time` elements and a series of attributes (e.g.
    `postmetadata`).
3.  **Bare HTML content**: Heuristics are run on text and markup:
    -   In `fast` mode the HTML page is cleaned and precise patterns are
        targeted.
    -   In `extensive` mode all potential dates are collected and a
        disambiguation algorithm determines the best one.

Finally, the output is validated and converted to the chosen format.
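The cascade described above can be illustrated with a much simplified, standard-library-only sketch. It is not htmldate's implementation: the attribute names in `META_KEYS`, the single `YYYY-MM-DD` pattern, the 1995 lower bound, and the names `DateCollector`, `validate`, and `guess_date` are all assumptions chosen for illustration, and the sketch takes the first valid candidate instead of running a disambiguation algorithm.

``` python
import re
from datetime import date
from html.parser import HTMLParser
from typing import List, Optional

# Assumed attribute names; a real extractor checks a much longer list.
META_KEYS = {"article:published_time", "og:published_time", "date"}
# Only ISO-like dates are recognized in this sketch.
DATE_PATTERN = re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)")


class DateCollector(HTMLParser):
    """Collect date candidates from meta tags, time elements, and text."""

    def __init__(self) -> None:
        super().__init__()
        self.header: List[str] = []  # step 1: markup in header
        self.markup: List[str] = []  # step 2: structural markers
        self.text: List[str] = []    # step 3: bare content

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            key = attributes.get("property") or attributes.get("name") or ""
            if key.lower() in META_KEYS and attributes.get("content"):
                self.header.append(attributes["content"])
        elif tag == "time" and attributes.get("datetime"):
            self.markup.append(attributes["datetime"])

    def handle_data(self, data):
        self.text.extend(m.group(0) for m in DATE_PATTERN.finditer(data))


def validate(candidate: str, earliest: date, latest: date) -> Optional[date]:
    """Parse a YYYY-MM-DD date and reject impossible or out-of-range values."""
    match = DATE_PATTERN.search(candidate)
    if match is None:
        return None
    try:
        parsed = date(*map(int, match.groups()))
    except ValueError:  # e.g. 30 February
        return None
    return parsed if earliest <= parsed <= latest else None


def guess_date(html: str, outputformat: str = "%Y-%m-%d",
               earliest: date = date(1995, 1, 1),
               latest: Optional[date] = None) -> Optional[str]:
    """Try header, markup, then text; return the first valid date, formatted."""
    latest = latest or date.today()
    collector = DateCollector()
    collector.feed(html)
    for candidates in (collector.header, collector.markup, collector.text):
        for candidate in candidates:
            parsed = validate(candidate, earliest, latest)
            if parsed is not None:
                return parsed.strftime(outputformat)
    return None


print(guess_date('<time datetime="2019-07-04">4 July</time>', "%d/%m/%Y"))
```

Ordering the sources from most to least structured means a well-formed `meta` element wins over a date that merely appears in the body text, which mirrors the priority of the three heuristics.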

## Performance

Evaluation on 1000 web pages containing identifiable dates (as of 2023-11-13, on Python 3.10):

| Python Package | Precision | Recall | Accuracy | F-Score | Time |
| -------------- | --------- | ------ | -------- | ------- | ---- |
| articleDateExtractor 0.20 | 0.803 | 0.734 | 0.622 | 0.767 | 5x |
| date_guesser 2.1.4 | 0.781 | 0.600 | 0.514 | 0.679 | 18x |
| goose3 3.1.17 | 0.869 | 0.532 | 0.493 | 0.660 | 15x |
| htmldate\[all\] 1.6.0 (fast) | **0.883** | 0.924 | 0.823 | 0.903 | **1x** |
| htmldate\[all\] 1.6.0 (extensive) | 0.870 | **0.993** | **0.865** | **0.928** | 1.7x |
| newspaper3k 0.2.8 | 0.769 | 0.667 | 0.556 | 0.715 | 15x |
| news-please 1.5.35 | 0.801 | 0.768 | 0.645 | 0.784 | 34x |

For the complete results and explanations see [evaluation
page](https://htmldate.readthedocs.io/en/latest/evaluation.html).
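As a quick consistency check on the table, the F-Score column can be recomputed from precision and recall. This sketch assumes the column reports the usual F1 score (the harmonic mean of precision and recall), which the evaluation page should confirm; the figures are copied from the table above.

``` python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) copied from the table.
ROWS = {
    "htmldate (fast)": (0.883, 0.924, 0.903),
    "htmldate (extensive)": (0.870, 0.993, 0.928),
    "news-please": (0.801, 0.768, 0.784),
}

for name, (precision, recall, reported) in ROWS.items():
    computed = f_score(precision, recall)
    print(f"{name}: computed {computed:.3f}, reported {reported:.3f}")
```

Differences in the third decimal place are expected, since the precision and recall values in the table are themselves rounded.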

## Installation

Htmldate is tested on Linux, macOS and Windows systems and is compatible
with Python 3.8 upwards. It can notably be installed with `pip` (`pip3`
where applicable) from the PyPI package repository:

-   `pip install htmldate`
-   (optionally) `pip install htmldate[speed]`

The last version to support Python 3.6 and 3.7 is `htmldate==1.8.1`.
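The version constraints above can be expressed as a small helper that picks a requirement string for the running interpreter. This is only a sketch: the function name `pick_requirement` is hypothetical, and the lower bound of Python 3.6 for `htmldate==1.8.1` is taken from the sentence above.

``` python
import sys
from typing import Optional, Sequence


def pick_requirement(version_info: Sequence[int] = sys.version_info) -> Optional[str]:
    """Return a pip requirement string suited to the given Python version."""
    major_minor = tuple(version_info[:2])
    if major_minor >= (3, 8):
        return "htmldate"          # current releases
    if major_minor >= (3, 6):
        return "htmldate==1.8.1"   # last release supporting 3.6 and 3.7
    return None                    # older interpreters are not covered here


print(pick_requirement())
```

The returned string can be passed to `pip install` or written to a requirements file.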

## Documentation

For more details on installation, Python & CLI usage, **please refer to
the documentation**:
[htmldate.readthedocs.io](https://htmldate.readthedocs.io/)

## License

This package is distributed under the [Apache 2.0
license](https://www.apache.org/licenses/LICENSE-2.0.html).

Versions prior to v1.8.0 are distributed under the GPLv3+ license.

## Author

This project is part of a set of methods for deriving information from
web documents in order to build [text databases for
research](https://www.dwds.de/d/k-web) (chiefly linguistic analysis and
natural language processing).

Extracting and pre-processing web texts to meet exacting standards
is a significant challenge. It is often not possible to reliably
determine the date of publication or modification using either the URL
or the server response. For more information:

[![JOSS article reference DOI: 10.21105/joss.02439](https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen)](https://doi.org/10.21105/joss.02439)
[![Zenodo archive DOI: 10.5281/zenodo.3459599](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue)](https://doi.org/10.5281/zenodo.3459599)


``` bibtex
@article{barbaresi-2020-htmldate,
  title = {{htmldate: A Python package to extract publication dates from web pages}},
  author = "Barbaresi, Adrien",
  journal = "Journal of Open Source Software",
  volume = 5,
  number = 51,
  pages = 2439,
  url = {https://doi.org/10.21105/joss.02439},
  publisher = {The Open Journal},
  year = 2020,
}
```

-   Barbaresi, A. "[htmldate: A Python package to extract publication
    dates from web pages](https://doi.org/10.21105/joss.02439)",
    Journal of Open Source Software, 5(51), 2439, 2020. DOI:
    10.21105/joss.02439
-   Barbaresi, A. "[Generic Web Content Extraction with Open-Source
    Software](https://hal.archives-ouvertes.fr/hal-02447264/document)",
    Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
-   Barbaresi, A. "[Efficient construction of metadata-enhanced web
    corpora](https://hal.archives-ouvertes.fr/hal-01371704v2/document)",
    Proceedings of the [10th Web as Corpus Workshop
    (WAC-X)](https://www.sigwac.org.uk/wiki/WAC-X), 2016.

You can contact me via my [contact page](https://adrien.barbaresi.eu/)
or [GitHub](https://github.com/adbar).

## Contributing

[Contributions](https://github.com/adbar/htmldate/blob/master/CONTRIBUTING.md)
are welcome, as are issues filed on the [dedicated
page](https://github.com/adbar/htmldate/issues).

Special thanks to the
[contributors](https://github.com/adbar/htmldate/graphs/contributors)
who have submitted features and bugfixes!

## Acknowledgements

Kudos to the following software libraries:

-   [lxml](http://lxml.de/),
    [dateparser](https://github.com/scrapinghub/dateparser)
-   A few patterns are derived from the
    [python-goose](https://github.com/grangier/python-goose),
    [metascraper](https://github.com/ianstormtaylor/metascraper),
    [newspaper](https://github.com/codelucas/newspaper) and
    [articleDateExtractor](https://github.com/Webhose/article-date-extractor)
    libraries. This module extends their coverage and robustness
    significantly.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "htmldate",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "datetime, date-parser, entity-extraction, html-extraction, html-parsing, metadata-extraction, webarchives, web-scraping",
    "author": null,
    "author_email": "Adrien Barbaresi <barbaresi@bbaw.de>",
    "download_url": "https://files.pythonhosted.org/packages/7d/d9/2aa3b95ef02b60c5953031faba2e966155ef6c57aeac1a6d61d95acf9b4f/htmldate-1.9.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "Fast and robust extraction of original and updated publication dates from URLs and web pages.",
    "version": "1.9.2",
    "project_urls": {
        "Blog": "https://adrien.barbaresi.eu/blog/",
        "Homepage": "https://htmldate.readthedocs.io",
        "Source": "https://github.com/adbar/htmldate",
        "Tracker": "https://github.com/adbar/htmldate/issues"
    },
    "split_keywords": [
        "datetime",
        " date-parser",
        " entity-extraction",
        " html-extraction",
        " html-parsing",
        " metadata-extraction",
        " webarchives",
        " web-scraping"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9f7c78c8129eb3e32aceb47e20dc10900adfbf306d3185b9f4247399c6983ce9",
                "md5": "ad3d2b5343e62ac0d85f149180f77aba",
                "sha256": "a63240e0107f6389e0d80007b838ca1b15aa4ea8486783e40027eecdc5ba58d0"
            },
            "downloads": -1,
            "filename": "htmldate-1.9.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ad3d2b5343e62ac0d85f149180f77aba",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 31496,
            "upload_time": "2024-11-12T12:31:50",
            "upload_time_iso_8601": "2024-11-12T12:31:50.442131Z",
            "url": "https://files.pythonhosted.org/packages/9f/7c/78c8129eb3e32aceb47e20dc10900adfbf306d3185b9f4247399c6983ce9/htmldate-1.9.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7dd92aa3b95ef02b60c5953031faba2e966155ef6c57aeac1a6d61d95acf9b4f",
                "md5": "2b795f0bb9ec80a00fc27c4e6075c30f",
                "sha256": "89553fb6e0942a18951a623e28ce3ce4a2e8543b3908e951eea356ec0346cbe4"
            },
            "downloads": -1,
            "filename": "htmldate-1.9.2.tar.gz",
            "has_sig": false,
            "md5_digest": "2b795f0bb9ec80a00fc27c4e6075c30f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 44965,
            "upload_time": "2024-11-12T12:31:51",
            "upload_time_iso_8601": "2024-11-12T12:31:51.825940Z",
            "url": "https://files.pythonhosted.org/packages/7d/d9/2aa3b95ef02b60c5953031faba2e966155ef6c57aeac1a6d61d95acf9b4f/htmldate-1.9.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-12 12:31:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "adbar",
    "github_project": "htmldate",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "lcname": "htmldate"
}
        