HTMLArk
=======

:Version: 1.0.0
:Author: David Powell
:License: MIT
:Home page: https://github.com/BitLooter/htmlark
:Summary: Pack a webpage including support files into a single HTML file.
:Upload time: 2016-04-26 16:01:51
:Keywords: development, html, webpage, css, javascript
:Requirements: beautifulsoup4, requests

.. image:: https://img.shields.io/github/downloads/BitLooter/htmlark/total.svg
        :target: https://github.com/BitLooter/htmlark
.. image:: https://img.shields.io/pypi/v/HTMLArk.svg
        :target: https://pypi.python.org/pypi/HTMLArk
.. image:: https://img.shields.io/pypi/l/HTMLArk.svg
        :target: https://raw.githubusercontent.com/BitLooter/htmlark/master/LICENSE.txt

Embed images, CSS, and JavaScript into an HTML file. Through the magic of `data URIs <https://developer.mozilla.org/en-US/docs/Web/HTTP/data_URIs>`_, HTMLArk can save these external dependencies inline right in the HTML. No more keeping around those "reallycoolwebpage_files" directories alongside the HTML files: everything is self-contained.

Note that this will only work with static pages. If an image or other resource is loaded with JavaScript, HTMLArk won't even know it exists.
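
To make the idea of a data URI concrete, here is a minimal sketch of how a resource's bytes can be turned into one using only the standard library. The helper name ``make_data_uri`` is hypothetical and is not part of HTMLArk's API; HTMLArk's own implementation may differ.

.. code-block:: python

    import base64
    import mimetypes


    def make_data_uri(filename, data):
        """Return a base64 data URI for ``data`` (illustrative helper, not HTMLArk API)."""
        # Guess the MIME type from the file name; fall back to a generic binary type.
        mime_type, _ = mimetypes.guess_type(filename)
        if mime_type is None:
            mime_type = "application/octet-stream"
        # Base64-encode the raw bytes so they can be embedded as text in the HTML.
        encoded = base64.b64encode(data).decode("ascii")
        return "data:{};base64,{}".format(mime_type, encoded)


    # The eight-byte PNG signature stands in for a real image file here.
    uri = make_data_uri("pixel.png", b"\x89PNG\r\n\x1a\n")
    print(uri)

A URI built this way can be used as the value of an ``<img>`` tag's ``src`` attribute in place of a path to an external file.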

Installation and Requirements
-----------------------------
Python 3.5 or greater is required for HTMLArk.

Install HTMLArk with ``pip`` like so:

.. code-block:: bash

    pip install htmlark

To use the `lxml <http://lxml.de/>`_ (recommended) or `html5lib <https://github.com/html5lib/html5lib-python>`_ parsers, you will need to install the lxml and/or html5lib Python libraries as well. HTMLArk can also get resources from the web; to enable this functionality, you need `Requests <http://python-requests.org/>`_ installed. You can install HTMLArk with all optional dependencies with this command:

.. code-block:: bash

    pip install htmlark[http,parsers]


If you want to install it manually, the only hard dependency HTMLArk has is `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup/>`_.


Command-line usage
------------------
The usage information below is also available from ``htmlark --help``.

::

    usage: htmlark [-h] [-o OUTPUT] [-E] [-I] [-C] [-J]
                   [-p {html.parser,lxml,html5lib,auto}] [-v] [--version]
                   [webpage]

    Converts a webpage including external resources into a single HTML file. Note
    that resources loaded with JavaScript will not be handled by this program, it
    will only work properly with static pages.

    positional arguments:
      webpage               URL or path of webpage to convert. If not specified,
                            read from STDIN.

    optional arguments:
      -h, --help            show this help message and exit
      -o OUTPUT, --output OUTPUT
                            File to write output. Defaults to STDOUT.
      -E, --ignore-errors   Ignores unreadable resources
      -I, --ignore-images   Ignores images during conversion
      -C, --ignore-css      Ignores stylesheets during conversion
      -J, --ignore-js       Ignores external JavaScript during conversion
      -p {html.parser,lxml,html5lib,auto}, --parser {html.parser,lxml,html5lib,auto}
                            Select HTML parser. If not specifed, htmlark tries to
                            use lxml, html5lib, and html.parser in that order (the
                            'auto' option). See documentation for more
                            information.
      -v, --verbose         Prints information during conversion
      --version             Displays version information
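
The fallback order described for the ``auto`` parser option (lxml, then html5lib, then html.parser) can be sketched as follows. This only illustrates the idea, assuming that availability is checked by looking for an importable module; the function name ``choose_parser`` is hypothetical, and HTMLArk's actual selection logic may differ.

.. code-block:: python

    import importlib.util


    def choose_parser(preferred=("lxml", "html5lib")):
        """Return the first installed parser name, else the built-in html.parser."""
        for name in preferred:
            # find_spec returns None when the module cannot be found.
            if importlib.util.find_spec(name) is not None:
                return name
        # html.parser ships with Python, so it is always available.
        return "html.parser"


    print(choose_parser())

The returned name is a string of the kind Beautiful Soup 4 accepts as its parser argument.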


Using HTMLArk as a module
-------------------------
You can also integrate HTMLArk into your own scripts by importing it and calling ``convert_page``. Example:

.. code-block:: python

    import htmlark
    packed_html = htmlark.convert_page("samplepage.html", ignore_errors=True)

Details::

    def convert_page(page_path: str, parser: str='auto',
                     callback: Callable[[str, str, str], None]=lambda *_: None,
                     ignore_errors: bool=False, ignore_images: bool=False,
                     ignore_css: bool=False, ignore_js: bool=False) -> str

        Take an HTML file or URL and output new HTML with resources as data URIs.

        Parameters:
            page_path (str): URL or path of web page to convert.
        Keyword Arguments:
            parser (str): HTML Parser for Beautiful Soup 4 to use. See
                `BS4's docs. <http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser>`_
                Default: 'auto' - Not an actual parser, but tells the library to
                automatically choose a parser.
            ignore_errors (bool): If ``True`` do not abort on unreadable resources.
                Unprocessable tags (e.g. broken links) will simply be skipped.
                Default: ``False``
            ignore_images (bool): If ``True`` do not process ``<img>`` tags.
                Default: ``False``
            ignore_css (bool): If ``True`` do not process ``<link>`` (stylesheet) tags.
                Default: ``False``
            ignore_js (bool): If ``True`` do not process ``<script>`` tags.
                Default: ``False``
            callback (function): Called before a new resource is processed. Takes
                three parameters: message type ('INFO' or 'ERROR'), a string with
                the category of the callback (usually the tag related to the
                message), and the message data (usually a string to be printed).
        Returns:
            str: The new webpage HTML.
        Raises:
            OSError: Error reading a file
            ValueError: Problem with a path/URL
            requests.exceptions.RequestException: Problem getting remote resource
            NameError: HTMLArk requires Requests to be installed to get resources
                from the web. This error is raised when an external URL is
                encountered.
        Examples:
            A very basic conversion of a local HTML file, using default settings:

            >>> convert_page("webpage.html")
            <Converted page HTML>

            However, that example will fail if there are any problems accessing
            linked resources in the HTML (e.g. a missing image). If you cannot
            verify the validity of links ahead of time (converting a downloaded
            web page, for example) you can disable failing on error:

            >>> convert_page("brokenpage.html", ignore_errors=True)
            <Converted page HTML, tags with broken links untouched>

            You can also skip processing of content types:

            >>> convert_page("webpage.html", ignore_images=True)
            <Converted page HTML, with <img> tags untouched>

            If you want to get feedback on the progress of the conversion, you can
            define a callback function. For example, a callback that prints all
            CSS-related errors to stdout (note that ignore_errors will bypass
            broken links but still report them to the callback):

            >>> def mycallback(message_type, message_category, message):
            ...     if message_type == 'ERROR' and message_category == 'link':
            ...         print(message)
            >>> convert_page("badcss.html", ignore_errors=True, callback=mycallback)
            <Converted page HTML, CSS links untouched, CSS errors printed to screen>


Compatibility
-------------
Data URIs have been supported by every major browser for many years now. The only browser that might cause problems is Internet Explorer (surprise!). IE7 and below have no support for data URIs, but IE8 and above support them for CSS and images. As far as I know, no version of IE allows you to load JavaScript from a data URI, though Edge does support it.

See `Can I Use's page on data URIs <http://caniuse.com/#feat=datauri>`_ for more compatibility information.

License
-------
HTMLArk is released under the MIT license, which may be found in the LICENSE file.
            
