unmarkd

Name	unmarkd
Version	1.1.3
home_page	https://github.com/ThatXliner/unmarkd
Summary	A markdown reverser
upload_time	2024-02-15 07:30:41
maintainer
docs_url	None
author	Bryan Hu
requires_python	>=3.8,<4.0
license	GPL-3.0-or-later
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

**NOTE: This project is _maintained._** It may seem inactive, but that is because there is nothing to add. If you want an enhancement or want to file a bug report, please go to the [issues](https://github.com/ThatXliner/unmarkd/issues).

# 🔄 Unmarkd

[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v1.json)](https://github.com/charliermarsh/ruff)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/) [![codecov](https://codecov.io/gh/ThatXliner/unmarkd/branch/master/graph/badge.svg?token=PWVIERHTG3)](https://codecov.io/gh/ThatXliner/unmarkd) [![CI](https://github.com/ThatXliner/unmarkd/actions/workflows/ci.yml/badge.svg)](https://github.com/ThatXliner/unmarkd/actions/workflows/ci.yml) [![PyPI - Downloads](https://img.shields.io/pypi/dm/unmarkd)](https://pypi.org/project/unmarkd/)

> A markdown reverser.

---

Unmarkd is a [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)-powered [Markdown](https://en.wikipedia.org/wiki/Markdown) reverser written in Python and for Python.

## Why

This was created as a dependency of [StackSearch](http://github.com/ThatXliner/stacksearch) (one of my other projects). In order to create a better API, I needed a way to reverse HTML back into Markdown. So I created this.

There are [similar projects](https://github.com/xijo/reverse_markdown) (written in Ruby) ~~but I have not found any written in Python (or for Python)~~. Later, I found a popular library, [html2text](https://github.com/Alir3z4/html2text).

## Installation

You know the drill:

```bash
pip install unmarkd
```

## Comparison

**TL;DR: Html2Text is faster. If you don't need much configuration, you could use Html2Text for the small speed gain.**

<details>

<summary>Click to expand</summary>

### Speed

**TL;DR: Unmarkd < Html2Text**

Html2Text is basically faster:

![Benchmark](./assets/benchmark.png)

(The `DOC` variable used can be found [here](./assets/benchmark.html))

Unmarkd sacrifices speed for [power](#configurability).

Html2Text directly uses Python's [`html.parser`](https://docs.python.org/3/library/html.parser.html) module (in the standard library). On the other hand, Unmarkd uses the powerful HTML parsing library, `beautifulsoup4`. BeautifulSoup can be configured to use different HTML parsers. In Unmarkd, we configure it to use Python's `html.parser`, too.

But another layer of code means more code is run.

I hope that's a good explanation of the speed difference.
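
To make "directly uses `html.parser`" concrete, here is a minimal sketch of an event-driven converter built on the standard library alone. It is not Html2Text's code: `TinyConverter` and `convert` are names made up for illustration, and it only handles bold and italics.

```python
from html.parser import HTMLParser


class TinyConverter(HTMLParser):
    """Illustrative only: maps a few inline tags to markdown markers."""

    MARKERS = {"b": "**", "strong": "**", "i": "*", "em": "*"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Unknown tags contribute nothing
        self.parts.append(self.MARKERS.get(tag, ""))

    def handle_endtag(self, tag):
        self.parts.append(self.MARKERS.get(tag, ""))

    def handle_data(self, data):
        self.parts.append(data)


def convert(html):
    parser = TinyConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(convert("<b>I <i>love</i> markdown!</b>"))  # **I *love* markdown!**
```

The parser calls back as it streams through the input, so no tree is built. A tree-based approach such as BeautifulSoup constructs the whole document first and then walks it, which is the extra layer described above.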

### Correctness

**TL;DR: Unmarkd == Html2Text**

I actually found _two_ html-to-markdown libraries. One of them was [Tomd](https://github.com/gaojiuli/tomd), which had an _incorrect implementation_:

![Actual results](./assets/tomd_cant_handle.png)

It seems to be abandoned, anyway.

Now with Html2Text and Unmarkd:

![Epic showdown](./assets/correct.png)

In other words, they _work_.

### Configurability

**TL;DR: Unmarkd > Html2Text**

This is Unmarkd's strong point.

In Html2Text, you only have a limited [set of options](https://github.com/Alir3z4/html2text/blob/master/docs/usage.md#available-options).

In Unmarkd, you can subclass the `BaseUnmarker` and implement conversions for new tags (e.g. `<q>`), etc. In my opinion, it's much easier to extend and configure Unmarkd.

Unmarkd was originally written as a StackSearch dependency.

Html2Text has no options for configuring parsing of code blocks. Unmarkd does.

</details>

## Documentation

Here's an example of basic usage:

```python
import unmarkd
print(unmarkd.unmark("<b>I <i>love</i> markdown!</b>"))
# Output: **I *love* markdown!**
```

or something more complex (shamelessly taken from [here](https://markdowntohtml.com)):

```python
import unmarkd
html_doc = R"""<h1 id="sample-markdown">Sample Markdown</h1>
<p>This is some basic, sample markdown.</p>
<h2 id="second-heading">Second Heading</h2>
<ul>
<li>Unordered lists, and:<ol>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ol>
</li>
<li>More</li>
</ul>
<blockquote>
<p>Blockquote</p>
</blockquote>
<p>And <strong>bold</strong>, <em>italics</em>, and even <em>italics and later <strong>bold</strong></em>. Even <del>strikethrough</del>. <a href="https://markdowntohtml.com">A link</a> to somewhere.</p>
<p>And code highlighting:</p>
<pre><code class="lang-js"><span class="hljs-keyword">var</span> foo = <span class="hljs-string">'bar'</span>;

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">baz</span><span class="hljs-params">(s)</span> </span>{
   <span class="hljs-keyword">return</span> foo + <span class="hljs-string">':'</span> + s;
}
</code></pre>
<p>Or inline code like <code>var foo = &#39;bar&#39;;</code>.</p>
<p>Or an image of bears</p>
<p><img src="http://placebear.com/200/200" alt="bears"></p>
<p>The end ...</p>
"""
print(unmarkd.unmark(html_doc))
```

and the output:

````markdown
    # Sample Markdown


    This is some basic, sample markdown.

    ## Second Heading



    - Unordered lists, and:
     1. One
     2. Two
     3. Three
    - More

    >Blockquote


    And **bold**, *italics*, and even *italics and later **bold***. Even ~~strikethrough~~. [A link](https://markdowntohtml.com) to somewhere.

    And code highlighting:


    ```js
    var foo = 'bar';

    function baz(s) {
       return foo + ':' + s;
    }
    ```


    Or inline code like `var foo = 'bar';`.

    Or an image of bears

    ![bears](http://placebear.com/200/200)

    The end ...
````

### Extending

#### Brief Overview

Most functionality should be covered by the `BasicUnmarker` class defined in `unmarkd.unmarkers`.

If you need to reverse markdown from StackExchange (as in the case for my other project), you may use the `StackOverflowUnmarker` (or its alias, `StackExchangeUnmarker`), which is also defined in `unmarkd.unmarkers`.

#### Customizing

If the above two classes do not suit your needs, you can subclass the `unmarkd.unmarkers.BaseUnmarker` abstract class.

Currently, you can _optionally_ override the following methods:

- `detect_language` (parameters: **1**)
  - **Parameters**:
    - html: `bs4.BeautifulSoup`
  - When a fenced code block is encountered, this method is called with the element the code block was detected from (i.e. `pre`), passed as a `bs4.BeautifulSoup`.
  - This method is responsible for detecting the programming language of the code block (or returning `''` if none was detected).
  - Note: The default implementation differs from the one in `unmarkd.unmarkers.BasicUnmarker`. It is simpler and does less checking/filtering.
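
As a rough sketch of what such detection could look like, the helper below picks a language out of a list of CSS class names, as in the `lang-js` class from the sample HTML above. `language_from_classes` and its `prefixes` parameter are hypothetical names, not part of Unmarkd. In a `detect_language` override you would presumably feed it the class list of the `code` element (BeautifulSoup exposes this via `Tag.get("class", [])`).

```python
def language_from_classes(classes, prefixes=("lang-", "language-")):
    """Return the language named by the first matching class, else ''.

    Hypothetical helper: the prefixes are an assumption about how the
    HTML you are converting marks up its code blocks.
    """
    for cls in classes:
        for prefix in prefixes:
            if cls.startswith(prefix):
                return cls[len(prefix):]
    return ""


print(language_from_classes(["hljs", "lang-js"]))  # js
```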

But Unmarkd is more flexible than that.

##### Customizable constants

There are currently 3 constants you may override:

- Formats:
  NOTE: Use the [**Format String Syntax**](https://docs.python.org/3/library/string.html#formatstrings)
  - `UNORDERED_FORMAT`
    - The string format of unordered (bulleted) lists.
  - `ORDERED_FORMAT`
    - The string format of ordered (numbered) lists.
- Miscellaneous:
  - `ESCAPABLES`
    - A container (preferably a `set`) of length-1 `str` that should be escaped
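
To illustrate how constants like these behave, here is a standalone sketch that does not import Unmarkd. The placeholder names `next_item` and `number_index` are assumptions on my part, so check the library source for the names the real format strings expect. The three functions are hypothetical stand-ins as well.

```python
# Stand-in values; the placeholder names are assumed, not confirmed.
UNORDERED_FORMAT = "\n* {next_item}"
ORDERED_FORMAT = "\n{number_index}) {next_item}"
ESCAPABLES = {"*", "_", "`"}


def render_unordered(items):
    # One format call per list item
    return "".join(UNORDERED_FORMAT.format(next_item=item) for item in items)


def render_ordered(items):
    return "".join(
        ORDERED_FORMAT.format(number_index=index, next_item=item)
        for index, item in enumerate(items, start=1)
    )


def escape(string):
    # Prefix every escapable character with a backslash
    return "".join("\\" + ch if ch in ESCAPABLES else ch for ch in string)
```

A `set` suits `ESCAPABLES` because the only operation needed is a membership test per character, which a set answers in constant time.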

##### Customize converting HTML tags

For an HTML tag `some_tag`, you can customize how it's converted to markdown by overriding a method like so:

```python
from unmarkd.unmarkers import BaseUnmarker
class MyCustomUnmarker(BaseUnmarker):
    def tag_some_tag(self, element) -> str:
        ...  # parse code here
```

To reduce code duplication, if your tag also has aliases (e.g. `strong` is an alias for `b` in HTML) then you may modify the `TAG_ALIASES`.

If you really need to, you may also modify `DEFAULT_TAG_ALIASES`. Be warned: if you do so, **you will also need to implement the aliases** (currently `em` and `strong`).
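
As a mental model only, the toy class below shows how alias lookup and `tag_*` dispatch can fit together. It is not Unmarkd's implementation: `ToyUnmarker` and `handle` are made-up names, and the alias mapping is assumed to be a dict from alias tag name to the tag whose handler is reused.

```python
class ToyUnmarker:
    # Assumed shape: alias tag name -> tag name whose handler is reused
    TAG_ALIASES = {"strong": "b", "em": "i"}

    def handle(self, name, text):
        # Resolve the alias first, then look up a `tag_<name>` method
        name = self.TAG_ALIASES.get(name, name)
        handler = getattr(self, f"tag_{name}", None)
        if handler is None:
            return text  # unknown tag: pass the text through unchanged
        return handler(text)

    def tag_b(self, text):
        return f"**{text}**"

    def tag_i(self, text):
        return f"*{text}*"


print(ToyUnmarker().handle("strong", "bold"))  # **bold**
```

Under this model, removing an entry from the alias mapping means nothing handles that tag any more, which is why you would need to implement it yourself.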

###### Common Patterns

I find myself iterating through the children of the tag a lot. But that would lead to us needing to handle new tags, which could be anything. So here's the template/pattern I recommend:

```python
import bs4
from unmarkd.unmarkers import BaseUnmarker
class MyCustomUnmarker(BaseUnmarker):
    def tag_some_tag(self, element) -> str:
        output = ""
        for child in element.children:
            # Text and other non-tag nodes are handled by the helper
            if non_tag_output := self.parse_non_tags(child):
                output += non_tag_output
                continue
            assert isinstance(child, bs4.Tag), type(child)
            ...   # Do whatever you want with the child
        return output
```

##### Utility functions when overriding

You may use (when extending) the following functions:

- `__parse`, 2 parameters:
  - `html`: _bs4.BeautifulSoup_
    - The html to unmark. This is used internally by the `unmark` method and is slightly faster.
  - `escape`: _bool_
    - Whether to escape the characters inside the string or not. Defaults to `False`.
- `escape`: 1 parameter:
  - `string`: _str_
    - The string to escape and make markdown-safe
- `wrap`: 2 parameters:
  - `element`: _bs4.BeautifulSoup_
    - The element to wrap.
  - `around_with`: _str_
    - The string to wrap around the element. **WILL NOT BE ESCAPED**
- And, of course, `tag_*` and `detect_language`.
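
One caveat, assuming the method really is spelled `__parse` on `BaseUnmarker`: Python mangles double-underscore names per class, so writing `self.__parse(...)` inside your own subclass looks up a different attribute. The sketch below uses stand-in classes (`Base` and `Child` are hypothetical, not Unmarkd's) to show the effect and the workaround.

```python
class Base:
    def __parse(self, text):  # stored on the class as `_Base__parse`
        return text.upper()

    def unmark(self, text):
        # Inside Base, `self.__parse` is rewritten to `self._Base__parse`
        return self.__parse(text)


class Child(Base):
    def broken(self, text):
        # Inside Child this becomes `self._Child__parse`, which does not exist
        return self.__parse(text)

    def working(self, text):
        # Spell out the mangled name of the parent's method
        return self._Base__parse(text)


child = Child()
print(child.working("ok"))  # OK
```

By the same rule, a subclass of `BaseUnmarker` would likely need `self._BaseUnmarker__parse(...)`.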

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ThatXliner/unmarkd",
    "name": "unmarkd",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "",
    "author": "Bryan Hu",
    "author_email": "bryan.hu.2020@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/32/15/b82304503f30d9c80907605523f8c933fc2bac33ab0da19f67c62734a31a/unmarkd-1.1.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPL-3.0-or-later",
    "summary": "A markdown reverser",
    "version": "1.1.3",
    "project_urls": {
        "Documentation": "https://unmarkd.readthedocs.io/en/latest/index.html",
        "Homepage": "https://github.com/ThatXliner/unmarkd"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6ddef04b99ad0d467593c9339f37bae69329d0b60fd04261516f2908c7e101db",
                "md5": "e8436a9469f4f510bb88d54c7c2ea42c",
                "sha256": "6c28fdb2d290e4bdc52334dc8fb8ddd99f5b10bf893442280d869566d1b0b067"
            },
            "downloads": -1,
            "filename": "unmarkd-1.1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e8436a9469f4f510bb88d54c7c2ea42c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 21170,
            "upload_time": "2024-02-15T07:30:40",
            "upload_time_iso_8601": "2024-02-15T07:30:40.106984Z",
            "url": "https://files.pythonhosted.org/packages/6d/de/f04b99ad0d467593c9339f37bae69329d0b60fd04261516f2908c7e101db/unmarkd-1.1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3215b82304503f30d9c80907605523f8c933fc2bac33ab0da19f67c62734a31a",
                "md5": "e62fa28e1392bdbbcf5e525a4ddb02e3",
                "sha256": "09a6c59f9b1370fab1bd5483586ad325c62f3352760e17b99c3a1bac557d3b22"
            },
            "downloads": -1,
            "filename": "unmarkd-1.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "e62fa28e1392bdbbcf5e525a4ddb02e3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 20948,
            "upload_time": "2024-02-15T07:30:41",
            "upload_time_iso_8601": "2024-02-15T07:30:41.858523Z",
            "url": "https://files.pythonhosted.org/packages/32/15/b82304503f30d9c80907605523f8c933fc2bac33ab0da19f67c62734a31a/unmarkd-1.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-15 07:30:41",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ThatXliner",
    "github_project": "unmarkd",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "unmarkd"
}
