# Htmldate: Find the Publication Date of Web Pages
[![Python package](https://img.shields.io/pypi/v/htmldate.svg)](https://pypi.python.org/pypi/htmldate)
[![Python versions](https://img.shields.io/pypi/pyversions/htmldate.svg)](https://pypi.python.org/pypi/htmldate)
[![Documentation Status](https://readthedocs.org/projects/htmldate/badge/?version=latest)](https://htmldate.readthedocs.org/en/latest/?badge=latest)
[![Code Coverage](https://img.shields.io/codecov/c/github/adbar/htmldate.svg)](https://codecov.io/gh/adbar/htmldate)
[![Downloads](https://img.shields.io/pypi/dm/htmldate?color=informational)](https://pepy.tech/project/htmldate)
[![JOSS article reference DOI: 10.21105/joss.02439](https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen)](https://doi.org/10.21105/joss.02439)
<br/>
<img src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-logo.png" alt="Logo as PNG image" width="60%"/>
<br/>
Find **original and updated publication dates** of any web page. **On
the command-line or with Python**, all the steps needed from web page
download to HTML parsing, scraping, and text analysis are included. The
package is used in production on millions of documents and integrated by
[multiple
libraries](https://github.com/adbar/htmldate/network/dependents).
## In a nutshell
<br/>
<img src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-demo.gif" alt="Demo as GIF image" width="80%"/>
<br/>
### With Python
``` python
>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'2016-12-23'
```
### On the command-line
``` bash
$ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
'2016-12-23'
```
## Features
- Flexible input: URLs, HTML files, or HTML trees can be used as input
(including batch processing).
- Customizable output: Any date format (defaults to [ISO 8601
YMD](https://en.wikipedia.org/wiki/ISO_8601)).
- Detection of both original and updated dates.
- Multilingual.
- Compatible with all recent versions of Python.
### How it works
Htmldate operates by sifting through HTML markup and, if necessary, text
elements. It applies the following heuristics in order:
1. **Markup in header**: Common patterns are used to identify relevant
elements (e.g. `link` and `meta` elements) including [Open Graph
protocol](http://ogp.me/) attributes.
2. **HTML code**: The whole document is searched for structural markers
like `abbr` or `time` elements and a series of attributes (e.g.
`postmetadata`).
3. **Bare HTML content**: Heuristics are run on text and markup:
   - In `fast` mode the HTML page is cleaned and precise patterns are
     targeted.
   - In `extensive` mode all potential dates are collected and a
     disambiguation algorithm determines the best one.
Finally, the output is validated and converted to the chosen format.
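The cascade described above can be illustrated with a minimal, standard-library-only sketch. It is not htmldate's actual implementation: the names `DateCollector`, `validate`, `sketch_find_date` and `DATE_META_KEYS` are hypothetical, only ISO-style dates are recognized, and the "most frequent candidate" rule is a stand-in assumption for the real disambiguation algorithm.

``` python
import re
from collections import Counter
from datetime import date
from html.parser import HTMLParser

# Hypothetical, deliberately short list of attribute values (assumption for this sketch).
DATE_META_KEYS = {"article:published_time", "og:published_time", "date", "dc.date"}
# Matches YYYY-MM-DD, also when followed by a time part such as "T10:00:00Z".
ISO_DATE = re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)")


class DateCollector(HTMLParser):
    """Collect date candidates from <meta>, <time> and bare text."""

    def __init__(self):
        super().__init__()
        self.meta_dates = []  # step 1: markup in header
        self.time_dates = []  # step 2: structural markers
        self.text_dates = []  # step 3: bare content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            if key in DATE_META_KEYS and attrs.get("content"):
                self.meta_dates.append(attrs["content"])
        elif tag == "time" and attrs.get("datetime"):
            self.time_dates.append(attrs["datetime"])

    def handle_data(self, data):
        self.text_dates.extend(m.group(0) for m in ISO_DATE.finditer(data))


def validate(candidate, min_date=date(1995, 1, 1), max_date=None):
    """Return a date object if the candidate parses and lies in a plausible range."""
    match = ISO_DATE.search(candidate)
    if not match:
        return None
    try:
        found = date(*map(int, match.groups()))
    except ValueError:  # e.g. month 13 or day 32
        return None
    max_date = max_date or date.today()
    return found if min_date <= found <= max_date else None


def sketch_find_date(html, outputformat="%Y-%m-%d"):
    """Try explicit markup first, then fall back to dates found in the text."""
    collector = DateCollector()
    collector.feed(html)
    # Steps 1 and 2: the first valid date from explicit markup wins.
    for candidate in collector.meta_dates + collector.time_dates:
        found = validate(candidate)
        if found:
            return found.strftime(outputformat)
    # Step 3: collect every valid candidate and keep the most frequent one.
    valid = [d for d in map(validate, collector.text_dates) if d]
    if not valid:
        return None
    best, _ = Counter(valid).most_common(1)[0]
    return best.strftime(outputformat)


page = (
    '<html><head><meta property="article:published_time" '
    'content="2016-12-23T10:00:00Z"/></head>'
    "<body><p>Updated 2017-01-05</p></body></html>"
)
print(sketch_find_date(page))  # 2016-12-23
```

Trying explicit markup before free text keeps the common case cheap and precise; the text fallback trades precision for recall, which is the same trade-off the `fast` and `extensive` modes expose.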
## Performance
1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10)
| Python Package | Precision | Recall | Accuracy | F-Score | Time |
| -------------- | --------- | ------ | -------- | ------- | ---- |
| articleDateExtractor 0.20 | 0.803 | 0.734 | 0.622 | 0.767 | 5x |
| date_guesser 2.1.4 | 0.781 | 0.600 | 0.514 | 0.679 | 18x |
| goose3 3.1.17 | 0.869 | 0.532 | 0.493 | 0.660 | 15x |
| htmldate\[all\] 1.6.0 (fast) | **0.883** | 0.924 | 0.823 | 0.903 | **1x** |
| htmldate\[all\] 1.6.0 (extensive) | 0.870 | **0.993** | **0.865** | **0.928** | 1.7x |
| newspaper3k 0.2.8 | 0.769 | 0.667 | 0.556 | 0.715 | 15x |
| news-please 1.5.35 | 0.801 | 0.768 | 0.645 | 0.784 | 34x |
For the complete results and explanations, see the [evaluation
page](https://htmldate.readthedocs.io/en/latest/evaluation.html).
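As a reading aid for the table, the following sketch recomputes the F-Score column from the precision and recall columns. It assumes that the F-Score is the usual F1 measure (harmonic mean of precision and recall); the function name `f_score` is only illustrative.

``` python
def f_score(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score), copied from the table above
rows = {
    "htmldate[all] 1.6.0 (fast)": (0.883, 0.924, 0.903),
    "htmldate[all] 1.6.0 (extensive)": (0.870, 0.993, 0.928),
    "news-please 1.5.35": (0.801, 0.768, 0.784),
}

for name, (precision, recall, reported) in rows.items():
    computed = f_score(precision, recall)
    print(f"{name}: computed {computed:.3f}, reported {reported:.3f}")
```

Differences in the last digit are to be expected, since the inputs are themselves rounded to three decimals.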
## Installation
Htmldate is tested on Linux, macOS and Windows systems and is compatible
with Python 3.8 upwards. It can be installed with `pip` (`pip3`
where applicable) from the PyPI package repository:
- `pip install htmldate`
- (optionally) `pip install htmldate[speed]`
The last version to support Python 3.6 and 3.7 is `htmldate==1.8.1`.
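For setup scripts that have to run on several interpreters, the version rule above can be written down as a small helper. This is only a sketch: the function name `htmldate_requirement` is hypothetical, and interpreters older than Python 3.6 are not covered by this README, so the helper returns `None` for them.

``` python
import sys


def htmldate_requirement(version_info=sys.version_info):
    """Return a pip requirement string for the given interpreter version."""
    major_minor = tuple(version_info[:2])
    if major_minor >= (3, 8):
        return "htmldate"  # current releases
    if major_minor >= (3, 6):
        return "htmldate==1.8.1"  # last version supporting Python 3.6 and 3.7
    return None  # not covered by this README


print(htmldate_requirement())
```

The resulting string can be passed to `pip install` as is.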
## Documentation
For more details on installation, Python & CLI usage, **please refer to
the documentation**:
[htmldate.readthedocs.io](https://htmldate.readthedocs.io/)
## License
This package is distributed under the [Apache 2.0
license](https://www.apache.org/licenses/LICENSE-2.0.html).
Versions prior to v1.8.0 are distributed under the GPLv3+ license.
## Author
This project is part of a set of methods for deriving information from
web documents in order to build [text databases for
research](https://www.dwds.de/d/k-web) (chiefly linguistic analysis and
natural language processing).
Extracting and pre-processing web texts to meet the exacting standards
of research is a significant challenge. It is often not possible to
reliably determine the date of publication or modification using either
the URL or the server response. For more information:
[![JOSS article reference DOI: 10.21105/joss.02439](https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen)](https://doi.org/10.21105/joss.02439)
[![Zenodo archive DOI: 10.5281/zenodo.3459599](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue)](https://doi.org/10.5281/zenodo.3459599)
``` bibtex
@article{barbaresi-2020-htmldate,
title = {{htmldate: A Python package to extract publication dates from web pages}},
author = "Barbaresi, Adrien",
journal = "Journal of Open Source Software",
volume = 5,
number = 51,
pages = 2439,
url = {https://doi.org/10.21105/joss.02439},
publisher = {The Open Journal},
year = 2020,
}
```
- Barbaresi, A. "[htmldate: A Python package to extract publication
dates from web pages](https://doi.org/10.21105/joss.02439)",
Journal of Open Source Software, 5(51), 2439, 2020. DOI:
10.21105/joss.02439
- Barbaresi, A. "[Generic Web Content Extraction with Open-Source
Software](https://hal.archives-ouvertes.fr/hal-02447264/document)",
Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
- Barbaresi, A. "[Efficient construction of metadata-enhanced web
corpora](https://hal.archives-ouvertes.fr/hal-01371704v2/document)",
Proceedings of the [10th Web as Corpus Workshop
(WAC-X)](https://www.sigwac.org.uk/wiki/WAC-X), 2016.
You can contact me via my [contact page](https://adrien.barbaresi.eu/)
or [GitHub](https://github.com/adbar).
## Contributing
[Contributions](https://github.com/adbar/htmldate/blob/master/CONTRIBUTING.md)
are welcome, as are issues filed on the [dedicated
page](https://github.com/adbar/htmldate/issues).
Special thanks to the
[contributors](https://github.com/adbar/htmldate/graphs/contributors)
who have submitted features and bugfixes!
## Acknowledgements
Kudos to the following software libraries:
- [lxml](http://lxml.de/),
[dateparser](https://github.com/scrapinghub/dateparser)
- A few patterns are derived from the
[python-goose](https://github.com/grangier/python-goose),
[metascraper](https://github.com/ianstormtaylor/metascraper),
[newspaper](https://github.com/codelucas/newspaper) and
[articleDateExtractor](https://github.com/Webhose/article-date-extractor)
libraries. This module extends their coverage and robustness
significantly.