Htmldate: Find the Publication Date of Web Pages
================================================

.. image:: https://img.shields.io/pypi/v/htmldate.svg
    :target: https://pypi.python.org/pypi/htmldate
    :alt: Python package

.. image:: https://img.shields.io/pypi/pyversions/htmldate.svg
    :target: https://pypi.python.org/pypi/htmldate
    :alt: Python versions

.. image:: https://readthedocs.org/projects/htmldate/badge/?version=latest
    :target: https://htmldate.readthedocs.org/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://img.shields.io/codecov/c/github/adbar/htmldate.svg
    :target: https://codecov.io/gh/adbar/htmldate
    :alt: Code Coverage

.. image:: https://img.shields.io/pypi/dm/htmldate?color=informational
    :target: https://pepy.tech/project/htmldate
    :alt: Downloads

.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
    :target: https://doi.org/10.21105/joss.02439
    :alt: JOSS article reference DOI: 10.21105/joss.02439

|

.. image:: https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-logo.png
    :alt: Logo as PNG image
    :align: center
    :width: 60%

|

Find **original and updated publication dates** of any web page. **On the command-line or with Python**, all the steps needed from web page download to HTML parsing, scraping, and text analysis are included. The package is used in production on millions of documents and integrated by `multiple libraries <https://github.com/adbar/htmldate/network/dependents>`_.
In a nutshell
-------------

|

.. image:: https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-demo.gif
    :alt: Demo as GIF image
    :align: center
    :width: 80%
    :target: https://htmldate.readthedocs.org/

|

With Python
~~~~~~~~~~~

.. code-block:: python

    >>> from htmldate import find_date
    >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
    '2016-12-23'

On the command-line
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    $ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
    '2016-12-23'

Features
--------
- Flexible input: URLs, HTML files, or HTML trees can be used as input (including batch processing).
- Customizable output: Any date format (defaults to `ISO 8601 YMD <https://en.wikipedia.org/wiki/ISO_8601>`_).
- Detection of both original and updated dates.
- Multilingual.
- Compatible with all recent versions of Python.
How it works
~~~~~~~~~~~~

Htmldate operates by sifting through HTML markup and, if necessary, text elements. It features the following heuristics:

1. **Markup in header**: Common patterns are used to identify relevant elements (e.g. ``link`` and ``meta`` elements) including `Open Graph protocol <http://ogp.me/>`_ attributes.
2. **HTML code**: The whole document is searched for structural markers like ``abbr`` or ``time`` elements and a series of attributes (e.g. ``postmetadata``).
3. **Bare HTML content**: Heuristics are run on text and markup:

   - In ``fast`` mode the HTML page is cleaned and precise patterns are targeted.
   - In ``extensive`` mode all potential dates are collected and a disambiguation algorithm determines the best one.

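
The three stages listed above can be illustrated with a toy cascade built only on the standard library. This is a minimal sketch and not htmldate's actual implementation: the names ``DATE_RE``, ``META_KEYS``, ``DateMarkupParser`` and ``sketch_find_date`` are hypothetical, only ``YYYY-MM-DD`` strings are recognized, and the tie-breaking rule (most frequent, then most recent) is an assumption made for the example.

.. code-block:: python

    import re
    from collections import Counter
    from html.parser import HTMLParser

    # Strict YYYY-MM-DD pattern; real pages use many more date formats.
    DATE_RE = re.compile(
        r"(?<!\d)(?:19|20)\d{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])(?!\d)"
    )
    # Hypothetical shortlist of meta keys, chosen for illustration only.
    META_KEYS = {"article:published_time", "og:published_time", "date", "dc.date"}


    class DateMarkupParser(HTMLParser):
        """Collect date-bearing attributes and the text content of a page."""

        def __init__(self):
            super().__init__()
            self.meta_values = []  # stage 1: <meta> content in the header
            self.time_values = []  # stage 2: <time datetime="..."> markers
            self.text_chunks = []  # stage 3: bare text content

        def handle_starttag(self, tag, attrs):
            attributes = dict(attrs)
            if tag == "meta":
                key = (attributes.get("property") or attributes.get("name") or "").lower()
                if key in META_KEYS and attributes.get("content"):
                    self.meta_values.append(attributes["content"])
            elif tag == "time" and attributes.get("datetime"):
                self.time_values.append(attributes["datetime"])

        def handle_data(self, data):
            self.text_chunks.append(data)


    def first_match(values):
        """Return the first YYYY-MM-DD string found in the given values, if any."""
        for value in values:
            match = DATE_RE.search(value)
            if match:
                return match.group(0)
        return None


    def sketch_find_date(html, extensive=True):
        """Toy cascade: header markup, then structural markers, then text."""
        parser = DateMarkupParser()
        parser.feed(html)
        parser.close()
        found = first_match(parser.meta_values) or first_match(parser.time_values)
        if found:
            return found
        text = " ".join(parser.text_chunks)
        if not extensive:
            # "Fast" flavour: settle for the first precise pattern in the text.
            return first_match([text])
        # "Extensive" flavour: collect every candidate, then disambiguate by
        # frequency, breaking ties in favour of the most recent date.
        candidates = Counter(DATE_RE.findall(text))
        if not candidates:
            return None
        return max(candidates, key=lambda date: (candidates[date], date))

Calling ``sketch_find_date(html)`` on a page whose header carries an ``article:published_time`` meta element returns that date without looking at the body text, which mirrors the idea that the cheaper, more reliable markup signals are tried first.
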
Finally, the output is validated and converted to the chosen format.
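
The validation and conversion step can be pictured roughly as follows. This is a sketch under stated assumptions rather than the package's own code: the function name ``validate_and_format`` is hypothetical, the lower bound of 1995 is an illustrative choice, and the candidate is assumed to arrive as a ``YYYY-MM-DD`` string.

.. code-block:: python

    from datetime import datetime


    def validate_and_format(candidate, outputformat="%Y-%m-%d",
                            min_date=datetime(1995, 1, 1), max_date=None):
        """Return the candidate in the requested format, or None if implausible."""
        try:
            # Rejects impossible calendar dates such as 2016-02-30.
            parsed = datetime.strptime(candidate, "%Y-%m-%d")
        except ValueError:
            return None
        # Assumption: a publication date cannot lie in the future.
        latest = max_date or datetime.now()
        if not min_date <= parsed <= latest:
            return None
        return parsed.strftime(outputformat)

For instance, ``validate_and_format("2016-12-23", "%d/%m/%Y")`` yields ``'23/12/2016'``, while an impossible or out-of-range date yields ``None``.
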
Performance
-----------

=============================== ========= ========= ========= ========= =======
1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10)
-------------------------------------------------------------------------------
Python Package                  Precision Recall    Accuracy  F-Score   Time
=============================== ========= ========= ========= ========= =======
articleDateExtractor 0.20       0.803     0.734     0.622     0.767     5x
date_guesser 2.1.4              0.781     0.600     0.514     0.679     18x
goose3 3.1.17                   0.869     0.532     0.493     0.660     15x
htmldate[all] 1.6.0 (fast)      **0.883** 0.924     0.823     0.903     **1x**
htmldate[all] 1.6.0 (extensive) 0.870     **0.993** **0.865** **0.928** 1.7x
newspaper3k 0.2.8               0.769     0.667     0.556     0.715     15x
news-please 1.5.35              0.801     0.768     0.645     0.784     34x
=============================== ========= ========= ========= ========= =======

For the complete results and explanations, see the `evaluation page <https://htmldate.readthedocs.io/en/latest/evaluation.html>`_.
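
As a reading aid for the table, and assuming the F-Score column is the usual harmonic mean of precision and recall (F1), the relation between the columns can be checked with a few lines of Python. The helper name ``f_score`` is chosen for this sketch only.

.. code-block:: python

    def f_score(precision, recall):
        """Harmonic mean of precision and recall (F1)."""
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)


    # Precision and recall copied from three rows of the table above.
    rows = {
        "goose3 3.1.17": (0.869, 0.532),
        "htmldate[all] 1.6.0 (fast)": (0.883, 0.924),
        "news-please 1.5.35": (0.801, 0.768),
    }
    for package, (precision, recall) in rows.items():
        print(f"{package}: {f_score(precision, recall):.3f}")

Because the published precision and recall values are already rounded, a recomputed score can differ from the table in the last digit.
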
Installation
------------
Htmldate is tested on Linux, macOS and Windows systems and is compatible with Python 3.6 upwards. It can be installed with ``pip`` (``pip3`` where applicable) from the PyPI package repository:
- ``pip install htmldate``
- (optionally) ``pip install htmldate[speed]``
Documentation
-------------
For more details on installation, Python & CLI usage, **please refer to the documentation**: `htmldate.readthedocs.io <https://htmldate.readthedocs.io/>`_
License
-------
This package is distributed under the `Apache 2.0 license <https://www.apache.org/licenses/LICENSE-2.0.html>`_.
Versions prior to v1.8.0 are distributed under the GPLv3+ license.
Author
------
This project is part of a set of methods for deriving information from web documents in order to build `text databases for research <https://www.dwds.de/d/k-web>`_ (chiefly linguistic analysis and natural language processing).

Extracting and pre-processing web texts so that they meet exacting research standards is a significant challenge. It is often not possible to reliably determine the date of publication or modification using either the URL or the server response. For more information:

.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
    :target: https://doi.org/10.21105/joss.02439
    :alt: JOSS article reference DOI: 10.21105/joss.02439

.. image:: https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue
    :target: https://doi.org/10.5281/zenodo.3459599
    :alt: Zenodo archive DOI: 10.5281/zenodo.3459599


.. code-block:: bibtex

    @article{barbaresi-2020-htmldate,
      title = {{htmldate: A Python package to extract publication dates from web pages}},
      author = "Barbaresi, Adrien",
      journal = "Journal of Open Source Software",
      volume = 5,
      number = 51,
      pages = 2439,
      url = {https://doi.org/10.21105/joss.02439},
      publisher = {The Open Journal},
      year = 2020,
    }

- Barbaresi, A. "`htmldate: A Python package to extract publication dates from web pages <https://doi.org/10.21105/joss.02439>`_", Journal of Open Source Software, 5(51), 2439, 2020. DOI: 10.21105/joss.02439
- Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://hal.archives-ouvertes.fr/hal-02447264/document>`_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
- Barbaresi, A. "`Efficient construction of metadata-enhanced web corpora <https://hal.archives-ouvertes.fr/hal-01371704v2/document>`_", Proceedings of the `10th Web as Corpus Workshop (WAC-X) <https://www.sigwac.org.uk/wiki/WAC-X>`_, 2016.
You can contact me via my `contact page <https://adrien.barbaresi.eu/>`_ or `GitHub <https://github.com/adbar>`_.
Contributing
------------
`Contributions <https://github.com/adbar/htmldate/blob/master/CONTRIBUTING.md>`_ are welcome, as are issues filed on the `dedicated page <https://github.com/adbar/htmldate/issues>`_.
Special thanks to the `contributors <https://github.com/adbar/htmldate/graphs/contributors>`_ who have submitted features and bugfixes!
Acknowledgements
----------------
Kudos to the following software libraries:
- `lxml <http://lxml.de/>`_, `dateparser <https://github.com/scrapinghub/dateparser>`_
- A few patterns are derived from the `python-goose <https://github.com/grangier/python-goose>`_, `metascraper <https://github.com/ianstormtaylor/metascraper>`_, `newspaper <https://github.com/codelucas/newspaper>`_ and `articleDateExtractor <https://github.com/Webhose/article-date-extractor>`_ libraries. This module extends their coverage and robustness significantly.
Raw data
{
"_id": null,
"home_page": "https://htmldate.readthedocs.io",
"name": "htmldate",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": null,
"keywords": "datetime, date-parser, entity-extraction, html-extraction, html-parsing, metadata-extraction, webarchives, web-scraping",
"author": "Adrien Barbaresi",
"author_email": "barbaresi@bbaw.de",
"download_url": "https://files.pythonhosted.org/packages/cd/71/ac70cf10ea9b58414a0d8d32593f916ab83e0d9d28c95e91879d26cffd0d/htmldate-1.8.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Fast and robust extraction of original and updated publication dates from URLs and web pages.",
"version": "1.8.1",
"project_urls": {
"Blog": "https://adrien.barbaresi.eu/blog/tag/htmldate.html",
"Homepage": "https://htmldate.readthedocs.io",
"Source": "https://github.com/adbar/htmldate",
"Tracker": "https://github.com/adbar/htmldate/issues"
},
"split_keywords": [
"datetime",
" date-parser",
" entity-extraction",
" html-extraction",
" html-parsing",
" metadata-extraction",
" webarchives",
" web-scraping"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a66ba52c2e6592d37b374c0c4913bca1a3771c262061a7d7fba354874ca9af70",
"md5": "3be12a1e4cf45fc0a91faab67d5a3eb2",
"sha256": "b1209dedfa7bc9bb4d0b812a3f0983ea5d39f1bdfe21745659ad26af4f8b7f32"
},
"downloads": -1,
"filename": "htmldate-1.8.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3be12a1e4cf45fc0a91faab67d5a3eb2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 31831,
"upload_time": "2024-04-11T14:50:18",
"upload_time_iso_8601": "2024-04-11T14:50:18.851679Z",
"url": "https://files.pythonhosted.org/packages/a6/6b/a52c2e6592d37b374c0c4913bca1a3771c262061a7d7fba354874ca9af70/htmldate-1.8.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "cd71ac70cf10ea9b58414a0d8d32593f916ab83e0d9d28c95e91879d26cffd0d",
"md5": "6917985d562b3572d7f8da7ca6ca4002",
"sha256": "caf1686cf75c61dd1f061ede5d7a46e759b15d5f9987cd8e13c8c4237511263d"
},
"downloads": -1,
"filename": "htmldate-1.8.1.tar.gz",
"has_sig": false,
"md5_digest": "6917985d562b3572d7f8da7ca6ca4002",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 45080,
"upload_time": "2024-04-11T14:50:20",
"upload_time_iso_8601": "2024-04-11T14:50:20.771305Z",
"url": "https://files.pythonhosted.org/packages/cd/71/ac70cf10ea9b58414a0d8d32593f916ab83e0d9d28c95e91879d26cffd0d/htmldate-1.8.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-11 14:50:20",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "adbar",
"github_project": "htmldate",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"lcname": "htmldate"
}