html2markdown


Name: html2markdown
Version: 0.1.7
Home page: https://github.com/dlon/html2markdown
Summary: Conservatively convert html to markdown
Upload time: 2019-02-09 00:04:01
Author: David Lönnhager
Requirements: beautifulsoup4

=============
html2markdown
=============

**Experimental**

**Purpose**: Converts HTML to Markdown while preserving any HTML markup that Markdown cannot express. The goal is to generate Markdown that can be converted back into HTML. This is the major difference between html2markdown and html2text, which does not purport to be reversible.

Usage example
=============
::

	import html2markdown
	print(html2markdown.convert('<h2>Test</h2><pre><code>Here is some code</code></pre>'))

Output::

	## Test
	
	    Here is some code

Information and caveats
=======================

Does not convert the content of block-type tags other than ``<p>`` -- such as ``<div>`` tags -- into Markdown
-------------------------------------------------------------------------------------------------------------

The content of inline-type tags, e.g. ``<span>``, is converted to Markdown.

**Input**: ``<div>this is stuff. <strong>stuff</strong></div>``

**Result**: ``<div>this is stuff. <strong>stuff</strong></div>``  

**Input**: ``<p>this is stuff. <strong>stuff</strong></p>``  

**Result**: ``this is stuff. __stuff__`` (surrounded by a newline on either side)  

**Input**: ``<span style="text-decoration:line-through;">strike <strong>through</strong> some text</span> here``  

**Result**: ``<span style="text-decoration:line-through;">strike __through__ some text</span> here``  
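
The three cases above can be mimicked with a small standard-library sketch. This is not html2markdown's implementation: ``toy_convert`` is a hypothetical helper that only understands ``<strong>`` and looks only at the first tag of the fragment, and the exact newlines placed around a converted ``<p>`` are an assumption.

```python
import re

# Matches a bare <strong>...</strong> pair (no attributes), non-greedily.
STRONG = re.compile(r"<strong>(.*?)</strong>", re.S)


def toy_convert(fragment):
    """Toy model of the block/inline rule (hypothetical, not the library's code)."""
    first = re.match(r"<(\w+)", fragment)
    tag = first.group(1) if first else None

    # Block-type tags other than <p> (here only <div>) are returned untouched.
    if tag == "div":
        return fragment

    # Everywhere else, <strong> becomes __...__.
    body = STRONG.sub(r"__\1__", fragment)

    if tag == "p":
        # Drop the <p> wrapper and surround the text with newlines (assumed).
        inner = re.sub(r"^<p[^>]*>|</p>$", "", body)
        return "\n" + inner + "\n"

    # Inline tags such as <span> keep their markup; only the content changes.
    return body


print(toy_convert('<p>this is stuff. <strong>stuff</strong></p>'))
```

The real library parses the document with beautifulsoup4 rather than regular expressions, so treat this only as an illustration of which fragments are rewritten and which are passed through.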

Except in unprocessed block-type tags, formatting characters are escaped
------------------------------------------------------------------------

**Input**: ``<p>**escape me?**</p>`` (in html, we would use \<strong\> here)  

**Result**: ``\*\*escape me?\*\*``  

**Input**: ``<span>**escape me?**</span>``  

**Result**: ``<span>\*\*escape me?\*\*</span>``  

**Input**: ``<div>**escape me?**</div>``  

**Result**: ``<div>**escape me?**</div>`` (block-type)  
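
The escaping rule can be sketched in a few lines of standard-library Python. The helper names and the set of escaped characters (``*``, ``_`` and the backtick) are assumptions made for illustration; the library may escape a different set.

```python
import re

# Hypothetical set of Markdown formatting characters to escape.
FORMATTING_CHARS = re.compile(r"([*_`])")

# Block-type tags whose content is left unprocessed (only <div> is documented above).
UNPROCESSED_BLOCK_TAGS = {"div"}


def escape_markdown(text):
    """Prefix each formatting character with a backslash."""
    return FORMATTING_CHARS.sub(r"\\\1", text)


def escape_in_tag(tag_name, text):
    """Escape text unless it sits inside an unprocessed block-type tag."""
    if tag_name in UNPROCESSED_BLOCK_TAGS:
        return text
    return escape_markdown(text)


print(escape_in_tag("p", "**escape me?**"))
print(escape_in_tag("div", "**escape me?**"))
```

Leaving ``<div>`` content unescaped is consistent with the previous section: that content is never interpreted as Markdown, so escaping it would alter the HTML that comes back out.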

Attributes not supported by Markdown are kept
---------------------------------------------

**Example**: ``<a href="http://myaddress" title="click me"><strong>link</strong></a>``  

**Result**: ``[__link__](http://myaddress "click me")``  

**Example**: ``<a onclick="javascript:dostuff()" href="http://myaddress" title="click me"><strong>link</strong></a>``  

**Result**: ``<a onclick="javascript:dostuff()" href="http://myaddress" title="click me">__link__</a>`` (the attribute *onclick* is not supported, so the tag is left alone)  
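
A minimal sketch of the decision described above, assuming that ``href`` and ``title`` are the only link attributes Markdown can express. ``link_to_markdown`` is a hypothetical helper, not part of html2markdown's API; it returns ``None`` when the tag should stay as HTML.

```python
# Attributes that Markdown link syntax can represent (assumption for this sketch).
MARKDOWN_LINK_ATTRS = {"href", "title"}


def link_to_markdown(attrs, text):
    """Return Markdown for an <a> tag, or None if the tag must be kept as HTML.

    attrs is a dict of the tag's attributes; text is its already-converted content.
    """
    # An <a> without href is not a link Markdown can express.
    if "href" not in attrs:
        return None
    # Any attribute outside the supported set forces the tag to be kept.
    if set(attrs) - MARKDOWN_LINK_ATTRS:
        return None
    if "title" in attrs:
        return '[{}]({} "{}")'.format(text, attrs["href"], attrs["title"])
    return "[{}]({})".format(text, attrs["href"])


print(link_to_markdown({"href": "http://myaddress", "title": "click me"}, "__link__"))
```

Refusing to convert when any extra attribute is present is what keeps the output reversible: nothing from the original tag is silently dropped.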


Limitations
===========

- Tables are kept as html.

Changes
=======

0.1.7:

- Improve handling of inline tags.
- Fix: Ignore ``<a>`` tags without an href attribute.
- Improve escaping.

0.1.6: Added tests and support for Python versions below 2.7.

0.1.5: Fix Unicode issue in Python 3.

0.1.0: First version.
            
