# Htmldate: Find the Publication Date of Web Pages
[![Python package](https://img.shields.io/pypi/v/htmldate.svg)](https://pypi.python.org/pypi/htmldate)
[![Python versions](https://img.shields.io/pypi/pyversions/htmldate.svg)](https://pypi.python.org/pypi/htmldate)
[![Documentation Status](https://readthedocs.org/projects/htmldate/badge/?version=latest)](https://htmldate.readthedocs.org/en/latest/?badge=latest)
[![Code Coverage](https://img.shields.io/codecov/c/github/adbar/htmldate.svg)](https://codecov.io/gh/adbar/htmldate)
[![Downloads](https://img.shields.io/pypi/dm/htmldate?color=informational)](https://pepy.tech/project/htmldate)
[![JOSS article reference DOI: 10.21105/joss.02439](https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen)](https://doi.org/10.21105/joss.02439)
<br/>
<img src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-logo.png" alt="Htmldate Logo" align="center" width="60%"/>
<br/>
Find **original and updated publication dates** of any web page.
The URL or the server response alone is often not enough to determine them.
**On the command-line or with Python**, all the steps needed from web page
download to HTML parsing, scraping, and text analysis are included.
The package is used in production on millions of documents and integrated into
[thousands of projects](https://github.com/adbar/htmldate/network/dependents).
## In a nutshell
<br/>
<img src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-demo.gif" alt="Demo as GIF image" align="center" width="80%"/>
<br/>
### With Python
``` python
>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'2016-12-23'
```
### On the command-line
``` bash
$ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
'2016-12-23'
```
## Features
- Flexible input: URLs, HTML files, or HTML trees can be used as input
(including batch processing).
- Customizable output: Any date format (defaults to [ISO 8601
YMD](https://en.wikipedia.org/wiki/ISO_8601)).
- Detection of both original and updated dates.
- Multilingual.
- Compatible with all recent versions of Python.
### How it works
Htmldate operates by sifting through HTML markup and, if necessary, text
elements. It relies on the following heuristics, applied in order:
1. **Markup in header**: Common patterns are used to identify relevant
elements (e.g. `link` and `meta` elements) including [Open Graph
protocol](http://ogp.me/) attributes.
2. **HTML code**: The whole document is searched for structural markers
like `abbr` or `time` elements and a series of attributes (e.g.
`postmetadata`).
3. **Bare HTML content**: Heuristics are run on text and markup:
   - In `fast` mode the HTML page is cleaned and precise patterns are
     targeted.
   - In `extensive` mode all potential dates are collected and a
     disambiguation algorithm determines the best one.
Finally, the output is validated and converted to the chosen format.
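The cascade described above can be illustrated with the standard library alone. This is a simplified sketch and not htmldate's actual implementation: the names `DateCollector`, `validate` and `sketch_find_date`, the set of `meta` keys, the ISO-only date pattern, the lower date bound and the frequency-based disambiguation are all assumptions made for the example.

```python
import re
from datetime import date
from html.parser import HTMLParser
from typing import List, Optional

# Illustrative subset of meta keys that may carry a date (assumption).
META_KEYS = {"article:published_time", "og:published_time", "date"}
# Only ISO-like dates are recognized in this sketch.
ISO_DATE = re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)")


class DateCollector(HTMLParser):
    """Collect date candidates for each of the three steps."""

    def __init__(self) -> None:
        super().__init__()
        self.header: List[str] = []  # step 1: meta elements
        self.markup: List[str] = []  # step 2: time elements
        self.text: List[str] = []    # step 3: bare text content

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            key = (attributes.get("property") or attributes.get("name") or "").lower()
            if key in META_KEYS and attributes.get("content"):
                self.header.append(attributes["content"])
        elif tag == "time" and attributes.get("datetime"):
            self.markup.append(attributes["datetime"])

    def handle_data(self, data):
        self.text.extend(match.group(0) for match in ISO_DATE.finditer(data))


def validate(candidate: str, earliest: date, latest: date) -> Optional[date]:
    """Return a date if the candidate is well-formed and within bounds."""
    match = ISO_DATE.search(candidate)
    if not match:
        return None
    try:
        found = date(*map(int, match.groups()))
    except ValueError:  # impossible dates such as month 13
        return None
    return found if earliest <= found <= latest else None


def sketch_find_date(html: str, outputformat: str = "%Y-%m-%d",
                     earliest: date = date(1995, 1, 1),
                     latest: Optional[date] = None) -> Optional[str]:
    """Try header, markup and text in turn; the first stage with a valid date wins."""
    latest = latest or date.today()
    collector = DateCollector()
    collector.feed(html)
    collector.close()
    for stage in (collector.header, collector.markup, collector.text):
        valid = [d for d in (validate(c, earliest, latest) for c in stage) if d]
        if valid:
            # Naive disambiguation: most frequent date, earliest one on ties.
            best = max(sorted(set(valid)), key=valid.count)
            return best.strftime(outputformat)
    return None


page = ('<html><head><meta property="article:published_time" '
        'content="2017-09-01T08:00:00"/></head>'
        '<body><p>Updated 2020-01-02</p></body></html>')
print(sketch_find_date(page))
```

Because the stages are tried in order, the date found in the header takes precedence over the one mentioned in the body text.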
## Performance
Evaluation on 1000 web pages containing identifiable dates (as of 2023-11-13, on Python 3.10):
| Python Package | Precision | Recall | Accuracy | F-Score | Time |
| -------------- | --------- | ------ | -------- | ------- | ---- |
| articleDateExtractor 0.20 | 0.803 | 0.734 | 0.622 | 0.767 | 5x |
| date_guesser 2.1.4 | 0.781 | 0.600 | 0.514 | 0.679 | 18x |
| goose3 3.1.17 | 0.869 | 0.532 | 0.493 | 0.660 | 15x |
| htmldate\[all\] 1.6.0 (fast) | **0.883** | 0.924 | 0.823 | 0.903 | **1x** |
| htmldate\[all\] 1.6.0 (extensive) | 0.870 | **0.993** | **0.865** | **0.928** | 1.7x |
| newspaper3k 0.2.8 | 0.769 | 0.667 | 0.556 | 0.715 | 15x |
| news-please 1.5.35 | 0.801 | 0.768 | 0.645 | 0.784 | 34x |
For the complete results and explanations see the [evaluation
page](https://htmldate.readthedocs.io/en/latest/evaluation.html).
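To relate the columns to each other, the F-score can be recomputed from precision and recall. The sketch below assumes that the F-Score column is the usual F1 measure (harmonic mean of precision and recall), which the README does not state explicitly. Since the table values are rounded to three decimals, the last digit may differ slightly.

```python
# (precision, recall, reported F-score) copied from the table above
ROWS = {
    "articleDateExtractor 0.20": (0.803, 0.734, 0.767),
    "date_guesser 2.1.4": (0.781, 0.600, 0.679),
    "goose3 3.1.17": (0.869, 0.532, 0.660),
    "htmldate[all] 1.6.0 (fast)": (0.883, 0.924, 0.903),
    "htmldate[all] 1.6.0 (extensive)": (0.870, 0.993, 0.928),
    "newspaper3k 0.2.8": (0.769, 0.667, 0.715),
    "news-please 1.5.35": (0.801, 0.768, 0.784),
}


def f_score(precision: float, recall: float) -> float:
    """F1 measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (precision, recall, reported) in ROWS.items():
    recomputed = f_score(precision, recall)
    print(f"{name:<32} recomputed={recomputed:.3f} reported={reported:.3f}")
```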
## Installation
Htmldate is tested on Linux, macOS and Windows systems and is compatible
with Python 3.8 upwards. It can be installed with `pip` (`pip3`
where applicable) from the PyPI package repository:
- `pip install htmldate`
- (optionally) `pip install htmldate[speed]`
The last version to support Python 3.6 and 3.7 is `htmldate==1.8.1`.
## Documentation
For more details on installation, Python & CLI usage, **please refer to
the documentation**:
[htmldate.readthedocs.io](https://htmldate.readthedocs.io/)
## License
This package is distributed under the [Apache 2.0
license](https://www.apache.org/licenses/LICENSE-2.0.html).
Versions prior to v1.8.0 are distributed under the GPLv3+ license.
## Context and contributions
Initially launched to create text databases for research purposes
at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
this project continues to be maintained but its future development
depends on community support.
**If you value this software or depend on it for your product, consider
sponsoring it and contributing to its codebase**. Your support
will help maintain and enhance this package.
Visit the [Contributing page](https://github.com/adbar/htmldate/blob/master/CONTRIBUTING.md)
for more information.
Reach out via the software repository or the [contact page](https://adrien.barbaresi.eu/)
for inquiries, collaborations, or feedback.
[![JOSS article reference DOI: 10.21105/joss.02439](https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen)](https://doi.org/10.21105/joss.02439)
[![Zenodo archive DOI: 10.5281/zenodo.3459599](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue)](https://doi.org/10.5281/zenodo.3459599)
``` bibtex
@article{barbaresi-2020-htmldate,
title = {{htmldate: A Python package to extract publication dates from web pages}},
author = "Barbaresi, Adrien",
journal = "Journal of Open Source Software",
volume = 5,
number = 51,
pages = 2439,
url = {https://doi.org/10.21105/joss.02439},
publisher = {The Open Journal},
year = 2020,
}
```
- Barbaresi, A. "[htmldate: A Python package to extract publication
dates from web pages](https://doi.org/10.21105/joss.02439)",
Journal of Open Source Software, 5(51), 2439, 2020. DOI:
10.21105/joss.02439
- Barbaresi, A. "[Generic Web Content Extraction with Open-Source
Software](https://hal.archives-ouvertes.fr/hal-02447264/document)",
Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
- Barbaresi, A. "[Efficient construction of metadata-enhanced web
corpora](https://hal.archives-ouvertes.fr/hal-01371704v2/document)",
Proceedings of the [10th Web as Corpus Workshop
(WAC-X)](https://www.sigwac.org.uk/wiki/WAC-X), 2016.
## Acknowledgements
Kudos to the following software libraries:
- [lxml](http://lxml.de/),
[dateparser](https://github.com/scrapinghub/dateparser)
- A few patterns are derived from the
[python-goose](https://github.com/grangier/python-goose),
[metascraper](https://github.com/ianstormtaylor/metascraper),
[newspaper](https://github.com/codelucas/newspaper) and
[articleDateExtractor](https://github.com/Webhose/article-date-extractor)
libraries. This module extends their coverage and robustness
significantly.
Raw data
{
"_id": null,
"home_page": null,
"name": "htmldate",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "datetime, date-parser, entity-extraction, html-extraction, html-parsing, metadata-extraction, webarchives, web-scraping",
"author": null,
"author_email": "Adrien Barbaresi <barbaresi@bbaw.de>",
"download_url": "https://files.pythonhosted.org/packages/a5/26/aaae4cab984f0b7dd0f5f1b823fa2ed2fd4a2bb50acd5bd2f0d217562678/htmldate-1.9.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "Fast and robust extraction of original and updated publication dates from URLs and web pages.",
"version": "1.9.3",
"project_urls": {
"Blog": "https://adrien.barbaresi.eu/blog/",
"Homepage": "https://htmldate.readthedocs.io",
"Source": "https://github.com/adbar/htmldate",
"Tracker": "https://github.com/adbar/htmldate/issues"
},
"split_keywords": [
"datetime",
" date-parser",
" entity-extraction",
" html-extraction",
" html-parsing",
" metadata-extraction",
" webarchives",
" web-scraping"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "05498872130016209c20436ce0c1067de1cf630755d0443d068a5bc17fa95015",
"md5": "217cd4ad5c04d5bd6ccf52eafa322250",
"sha256": "3fadc422cf3c10a5cdb5e1b914daf37ec7270400a80a1b37e2673ff84faaaff8"
},
"downloads": -1,
"filename": "htmldate-1.9.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "217cd4ad5c04d5bd6ccf52eafa322250",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 31565,
"upload_time": "2024-12-30T12:52:32",
"upload_time_iso_8601": "2024-12-30T12:52:32.145747Z",
"url": "https://files.pythonhosted.org/packages/05/49/8872130016209c20436ce0c1067de1cf630755d0443d068a5bc17fa95015/htmldate-1.9.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a526aaae4cab984f0b7dd0f5f1b823fa2ed2fd4a2bb50acd5bd2f0d217562678",
"md5": "54db775ffb68354de55cf62a7ad948bf",
"sha256": "ac0caf4628c3ded4042011e2d60dc68dfb314c77b106587dd307a80d77e708e9"
},
"downloads": -1,
"filename": "htmldate-1.9.3.tar.gz",
"has_sig": false,
"md5_digest": "54db775ffb68354de55cf62a7ad948bf",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 44913,
"upload_time": "2024-12-30T12:52:35",
"upload_time_iso_8601": "2024-12-30T12:52:35.206284Z",
"url": "https://files.pythonhosted.org/packages/a5/26/aaae4cab984f0b7dd0f5f1b823fa2ed2fd4a2bb50acd5bd2f0d217562678/htmldate-1.9.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-30 12:52:35",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "adbar",
"github_project": "htmldate",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"lcname": "htmldate"
}