Htmldate: Find the Publication Date of Web Pages
================================================

:Name: htmldate
:Version: 1.8.1
:Home page: https://htmldate.readthedocs.io
:Summary: Fast and robust extraction of original and updated publication dates from URLs and web pages.
:Upload time: 2024-04-11 14:50:20
:Author: Adrien Barbaresi
:Requires Python: >=3.6
:License: Apache-2.0
:Keywords: datetime, date-parser, entity-extraction, html-extraction, html-parsing, metadata-extraction, webarchives, web-scraping


.. image:: https://img.shields.io/pypi/v/htmldate.svg
    :target: https://pypi.python.org/pypi/htmldate
    :alt: Python package

.. image:: https://img.shields.io/pypi/pyversions/htmldate.svg
    :target: https://pypi.python.org/pypi/htmldate
    :alt: Python versions

.. image:: https://readthedocs.org/projects/htmldate/badge/?version=latest
    :target: https://htmldate.readthedocs.org/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://img.shields.io/codecov/c/github/adbar/htmldate.svg
    :target: https://codecov.io/gh/adbar/htmldate
    :alt: Code Coverage

.. image:: https://img.shields.io/pypi/dm/htmldate?color=informational
    :target: https://pepy.tech/project/htmldate
    :alt: Downloads

.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
   :target: https://doi.org/10.21105/joss.02439
   :alt: JOSS article reference DOI: 10.21105/joss.02439

|


.. image:: https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-logo.png
    :alt: Logo as PNG image
    :align: center
    :width: 60%

|

Find **original and updated publication dates** of any web page. **On the command-line or with Python**, all the steps needed from web page download to HTML parsing, scraping, and text analysis are included. The package is used in production on millions of documents and integrated by `multiple libraries <https://github.com/adbar/htmldate/network/dependents>`_.


In a nutshell
-------------

|

.. image:: https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-demo.gif
    :alt: Demo as GIF image
    :align: center
    :width: 80%
    :target: https://htmldate.readthedocs.org/

|

With Python
~~~~~~~~~~~

.. code-block:: python

    >>> from htmldate import find_date
    >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
    '2016-12-23'

On the command-line
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    $ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
    '2016-12-23'


Features
--------

- Flexible input: URLs, HTML files, or HTML trees can be used as input (including batch processing).
- Customizable output: Any date format (defaults to `ISO 8601 YMD <https://en.wikipedia.org/wiki/ISO_8601>`_).
- Detection of both original and updated dates.
- Multilingual.
- Compatible with all recent versions of Python.


How it works
~~~~~~~~~~~~

Htmldate operates by sifting through HTML markup and, if necessary, text elements. It applies the following heuristics in order:

1. **Markup in header**: Common patterns are used to identify relevant elements (e.g. ``link`` and ``meta`` elements) including `Open Graph protocol <http://ogp.me/>`_ attributes.
2. **HTML code**: The whole document is searched for structural markers like ``abbr`` or ``time`` elements and a series of attributes (e.g. ``postmetadata``).
3. **Bare HTML content**: Heuristics are run on text and markup:

   - In ``fast`` mode the HTML page is cleaned and precise patterns are targeted.
   - In ``extensive`` mode all potential dates are collected and a disambiguation algorithm determines the best one.
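
The cascade described above can be illustrated with a minimal standard-library sketch. It is not htmldate's implementation: the ``META_KEYS`` set, the ``DateCandidates`` class and the ``first_date`` function are hypothetical names chosen for this example, only ``YYYY-MM-DD`` patterns are recognized, and no disambiguation is attempted.

.. code-block:: python

    import re
    from html.parser import HTMLParser

    # Hypothetical attribute values; the real library knows many more patterns.
    META_KEYS = {"article:published_time", "og:published_time", "date"}
    # A YYYY-MM-DD pattern that is not embedded in a longer run of digits.
    DATE_RE = re.compile(r"(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)")


    class DateCandidates(HTMLParser):
        """Collect date strings from meta elements, time elements and text."""

        def __init__(self):
            super().__init__()
            self.meta = []    # step 1: markup in the header
            self.markup = []  # step 2: structural markers such as <time>
            self.text = []    # step 3: dates found in bare text

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta":
                key = (attrs.get("property") or attrs.get("name") or "").lower()
                if key in META_KEYS and attrs.get("content"):
                    self.meta.append(attrs["content"])
            elif tag == "time" and attrs.get("datetime"):
                self.markup.append(attrs["datetime"])

        def handle_data(self, data):
            self.text.extend(DATE_RE.findall(data))


    def first_date(html):
        """Return the first YYYY-MM-DD found, trying the three sources in order."""
        parser = DateCandidates()
        parser.feed(html)
        for source in (parser.meta, parser.markup, parser.text):
            for candidate in source:
                match = DATE_RE.search(candidate)
                if match:
                    return match.group(0)
        return None


    page = """<html><head>
    <meta property="article:published_time" content="2016-12-23T10:00:00Z">
    </head><body><time datetime="2017-01-05">5 Jan</time>
    <p>Updated on 2018-02-01.</p></body></html>"""
    print(first_date(page))  # 2016-12-23 (the header markup wins)

Trying the sources in a fixed order mirrors the idea that explicit metadata is usually more reliable than dates that merely appear in the text.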


Finally, the output is validated and converted to the chosen format.
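
As a rough illustration of this last step, the sketch below validates a candidate and converts it with standard ``strftime`` format codes. The function name ``validate_and_format`` and the plausibility bounds (1995 to the current date) are assumptions made for this example; they are not taken from htmldate itself.

.. code-block:: python

    from datetime import datetime


    def validate_and_format(candidate, outputformat="%Y-%m-%d",
                            earliest=datetime(1995, 1, 1), latest=None):
        """Return the candidate in ``outputformat``, or None if implausible."""
        latest = latest or datetime.now()
        try:
            parsed = datetime.strptime(candidate, "%Y-%m-%d")
        except ValueError:
            return None  # not a real calendar date, e.g. 2016-02-30
        if not earliest <= parsed <= latest:
            return None  # outside the assumed plausible window
        return parsed.strftime(outputformat)


    print(validate_and_format("2016-12-23"))              # 2016-12-23
    print(validate_and_format("2016-12-23", "%d/%m/%Y"))  # 23/12/2016
    print(validate_and_format("2016-02-30"))              # None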


Performance
-----------

=============================== ========= ========= ========= ========= =======
1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10)
-------------------------------------------------------------------------------
Python Package                  Precision Recall    Accuracy  F-Score   Time
=============================== ========= ========= ========= ========= =======
articleDateExtractor 0.20       0.803     0.734     0.622     0.767     5x
date_guesser 2.1.4              0.781     0.600     0.514     0.679     18x
goose3 3.1.17                   0.869     0.532     0.493     0.660     15x
htmldate[all] 1.6.0 (fast)      **0.883** 0.924     0.823     0.903     **1x**
htmldate[all] 1.6.0 (extensive) 0.870     **0.993** **0.865** **0.928** 1.7x
newspaper3k 0.2.8               0.769     0.667     0.556     0.715     15x
news-please 1.5.35              0.801     0.768     0.645     0.784     34x
=============================== ========= ========= ========= ========= =======

For the complete results and explanations see `evaluation page <https://htmldate.readthedocs.io/en/latest/evaluation.html>`_.
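
Assuming the F-Score column is the usual harmonic mean of precision and recall (F1), the figures in the table can be checked with a few lines of Python. The helper ``f_score`` is written for this example only. Because the table values are rounded, a recomputed score may differ in the last digit for some rows.

.. code-block:: python

    def f_score(precision, recall):
        """Harmonic mean of precision and recall (F1)."""
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)


    # Row "htmldate[all] 1.6.0 (fast)": precision 0.883, recall 0.924
    print(round(f_score(0.883, 0.924), 3))  # 0.903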


Installation
------------

Htmldate is tested on Linux, macOS and Windows systems and is compatible with Python 3.6 and later. It can be installed with ``pip`` (``pip3`` where applicable) from the PyPI package repository:

-  ``pip install htmldate`` 
-  (optionally) ``pip install htmldate[speed]``


Documentation
-------------

For more details on installation, Python & CLI usage, **please refer to the documentation**: `htmldate.readthedocs.io <https://htmldate.readthedocs.io/>`_


License
-------

This package is distributed under the `Apache 2.0 license <https://www.apache.org/licenses/LICENSE-2.0.html>`_.

Versions prior to v1.8.0 are licensed under GPLv3+.


Author
------

This project is part of a set of methods for deriving information from web documents in order to build `text databases for research <https://www.dwds.de/d/k-web>`_ (chiefly linguistic analysis and natural language processing).

Extracting and pre-processing web texts to meet exacting standards is a significant challenge. It is often not possible to reliably determine the date of publication or modification using either the URL or the server response. For more information:

.. image:: https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen
   :target: https://doi.org/10.21105/joss.02439
   :alt: JOSS article reference DOI: 10.21105/joss.02439

.. image:: https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue
   :target: https://doi.org/10.5281/zenodo.3459599
   :alt: Zenodo archive DOI: 10.5281/zenodo.3459599


.. code-block:: bibtex

    @article{barbaresi-2020-htmldate,
      title = {{htmldate: A Python package to extract publication dates from web pages}},
      author = "Barbaresi, Adrien",
      journal = "Journal of Open Source Software",
      volume = 5,
      number = 51,
      pages = 2439,
      url = {https://doi.org/10.21105/joss.02439},
      publisher = {The Open Journal},
      year = 2020,
    }

-  Barbaresi, A. "`htmldate: A Python package to extract publication dates from web pages <https://doi.org/10.21105/joss.02439>`_", Journal of Open Source Software, 5(51), 2439, 2020. DOI: 10.21105/joss.02439
-  Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://hal.archives-ouvertes.fr/hal-02447264/document>`_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
-  Barbaresi, A. "`Efficient construction of metadata-enhanced web corpora <https://hal.archives-ouvertes.fr/hal-01371704v2/document>`_", Proceedings of the `10th Web as Corpus Workshop (WAC-X) <https://www.sigwac.org.uk/wiki/WAC-X>`_, 2016.

You can contact me via my `contact page <https://adrien.barbaresi.eu/>`_ or `GitHub <https://github.com/adbar>`_.


Contributing
------------

`Contributions <https://github.com/adbar/htmldate/blob/master/CONTRIBUTING.md>`_ are welcome, as are issues filed on the `dedicated page <https://github.com/adbar/htmldate/issues>`_.

Special thanks to the `contributors <https://github.com/adbar/htmldate/graphs/contributors>`_ who have submitted features and bugfixes!


Acknowledgements
----------------

Kudos to the following software libraries:

-  `lxml <http://lxml.de/>`_, `dateparser <https://github.com/scrapinghub/dateparser>`_
-  A few patterns are derived from the `python-goose <https://github.com/grangier/python-goose>`_, `metascraper <https://github.com/ianstormtaylor/metascraper>`_, `newspaper <https://github.com/codelucas/newspaper>`_ and `articleDateExtractor <https://github.com/Webhose/article-date-extractor>`_ libraries. This module extends their coverage and robustness significantly.

            
