html-to-markdown


- Name: html-to-markdown
- Version: 2.5.6
- Home page: https://github.com/Goldziher/html-to-markdown
- Summary: High-performance HTML to Markdown converter powered by Rust with a clean Python API
- Upload time: 2025-10-29 21:19:10
- Requires Python: >=3.10
- Keywords: cli-tool, converter, html, html2markdown, html5, markdown, markup, parser, rust, text-processing

# html-to-markdown

High-performance HTML to Markdown converter with a clean Python API (powered by a Rust core). The same engine also drives the Node.js, Ruby, and WebAssembly bindings, so rendered Markdown stays identical across runtimes. Wheels are published for Linux, macOS, and Windows.

[![Crates.io](https://img.shields.io/crates/v/html-to-markdown-rs.svg)](https://crates.io/crates/html-to-markdown-rs)
[![npm version](https://badge.fury.io/js/html-to-markdown-node.svg)](https://www.npmjs.com/package/html-to-markdown-node)
[![PyPI version](https://badge.fury.io/py/html-to-markdown.svg)](https://pypi.org/project/html-to-markdown/)
[![Gem Version](https://badge.fury.io/rb/html-to-markdown.svg)](https://rubygems.org/gems/html-to-markdown)
[![Python Versions](https://img.shields.io/pypi/pyversions/html-to-markdown.svg)](https://pypi.org/project/html-to-markdown/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)

## Installation

```bash
pip install html-to-markdown
```

## Performance Snapshot

Apple M4 • Real Wikipedia documents • `convert()` (Python)

| Document            | Size  | Latency | Throughput | Docs/sec |
| ------------------- | ----- | ------- | ---------- | -------- |
| Lists (Timeline)    | 129KB | 0.62ms  | 208 MB/s   | 1,613    |
| Tables (Countries)  | 360KB | 2.02ms  | 178 MB/s   | 495      |
| Mixed (Python wiki) | 656KB | 4.56ms  | 144 MB/s   | 219      |

> V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2's Rust engine delivers 60–80× higher throughput.
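
The derived columns in the table follow from size and latency alone. As a quick check, the sketch below recomputes throughput, docs/sec, and the speed-up over the ~2.5 MB/s v1 figure. It assumes decimal units (1 MB = 1000 KB); the names `ROWS` and `derive` are illustrative and not part of the library.

```python
# (document, size in KB, latency in ms), copied from the table above
ROWS = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]
V1_MB_PER_SEC = 2.5  # v1 average quoted above


def derive(size_kb: float, latency_ms: float) -> tuple[float, float, float]:
    """Return (throughput in MB/s, docs per second, speed-up over v1)."""
    throughput = size_kb / latency_ms  # KB per ms equals MB per s in decimal units
    docs_per_sec = 1000.0 / latency_ms
    return throughput, docs_per_sec, throughput / V1_MB_PER_SEC


for name, size_kb, latency_ms in ROWS:
    mb_s, docs, speedup = derive(size_kb, latency_ms)
    print(f"{name:<20} {mb_s:6.0f} MB/s {docs:7.0f} docs/s {speedup:5.0f}x")
```

The speed-up comes out between roughly 58× and 83× for these three rows, in line with the rounded 60–80× range quoted above.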

## Quick Start

```python
from html_to_markdown import convert

html = """
<h1>Welcome</h1>
<p>This is <strong>fast</strong> Rust-powered conversion!</p>
<ul>
    <li>Blazing fast</li>
    <li>Type safe</li>
    <li>Easy to use</li>
</ul>
"""

markdown = convert(html)
print(markdown)
```

## Configuration (v2 API)

```python
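from html_to_markdown import ConversionOptions, convert

# Sample input so the snippet runs on its own
html = "<h1>Title</h1><p>Text with a literal * asterisk</p>"

# Options can be passed to the constructor...
options = ConversionOptions(
    heading_style="atx",
    list_indent_width=2,
    bullets="*+-",
)
# ...or set as attributes afterwards
options.escape_asterisks = True
options.code_language = "python"
options.extract_metadata = True

markdown = convert(html, options)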
```

### HTML Preprocessing

```python
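from html_to_markdown import ConversionOptions, PreprocessingOptions, convert

# Sample input standing in for a scraped page
scraped_html = "<nav><a href='/'>Home</a></nav><article><p>Body text</p></article>"

# The "aggressive" preset strips the most page chrome before conversion
options = ConversionOptions(
    preprocessing=PreprocessingOptions(enabled=True, preset="aggressive"),
)

markdown = convert(scraped_html, options)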
```

### Inline Image Extraction

```python
from html_to_markdown import InlineImageConfig, convert_with_inline_images

markdown, inline_images, warnings = convert_with_inline_images(
    '<p><img src="data:image/png;base64,...==" alt="Pixel" width="1" height="1"></p>',
    image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
)

if inline_images:
    first = inline_images[0]
    print(first["format"], first["dimensions"], first["attributes"])  # e.g. "png", (1, 1), {"width": "1"}
```

Each inline image is returned as a typed dictionary (`bytes` payload, metadata, and relevant HTML attributes). Warnings are human-readable skip reasons.
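
A common next step is writing the extracted payloads to disk. The helper below is a minimal sketch, not part of the library: `save_inline_images` is a hypothetical name, and the `"data"` and `"filename"` keys are assumptions about the dictionary layout (only `"format"`, `"dimensions"`, and `"attributes"` appear in the example above). Check the typed dictionary definition before relying on them.

```python
from pathlib import Path


def save_inline_images(images, out_dir):
    """Write each image payload to out_dir and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, image in enumerate(images):
        payload = image.get("data")  # ASSUMPTION: key holding the bytes payload
        if not isinstance(payload, (bytes, bytearray)):
            continue  # skip entries without a binary payload
        extension = image.get("format") or "bin"
        # ASSUMPTION: "filename" may be present; otherwise build a name
        name = image.get("filename") or f"image_{index}.{extension}"
        target = out / Path(name).name  # drop any directory components
        target.write_bytes(payload)
        written.append(target)
    return written
```

Called as `save_inline_images(inline_images, "extracted")`, it returns the list of files it wrote and silently skips entries that carry no bytes.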

### hOCR (HTML OCR) Support

```python
from html_to_markdown import convert

# hocr_html is a str holding the hOCR output of an OCR engine.
# hOCR documents are detected automatically and emitted as structured Markdown;
# tables are reconstructed without extra configuration.
markdown = convert(hocr_html)
```

## CLI (same engine)

```bash
pipx install html-to-markdown  # or: pip install html-to-markdown

html-to-markdown page.html > page.md
cat page.html | html-to-markdown --heading-style atx > page.md
```
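
The stdin pipeline above can also be driven from Python when a whole directory needs converting. The sketch below shells out to the CLI exactly as in the second command (`html-to-markdown --heading-style atx` reading from stdin). `convert_tree` and its `run` parameter are hypothetical names introduced here; `run` exists only so the subprocess call can be swapped out in tests.

```python
import subprocess
from pathlib import Path

# Same invocation as the stdin example above
CLI_COMMAND = ["html-to-markdown", "--heading-style", "atx"]


def convert_tree(src_dir, dst_dir, run=subprocess.run):
    """Convert every *.html file under src_dir into a mirrored *.md tree."""
    src_root, dst_root = Path(src_dir), Path(dst_dir)
    written = []
    for html_path in sorted(src_root.rglob("*.html")):
        target = dst_root / html_path.relative_to(src_root).with_suffix(".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        result = run(
            CLI_COMMAND,
            input=html_path.read_text(encoding="utf-8"),
            capture_output=True,
            encoding="utf-8",
            check=True,  # raise if the CLI exits with a non-zero status
        )
        target.write_text(result.stdout, encoding="utf-8")
        written.append(target)
    return written
```

Each file costs one process start, so for many small documents calling `convert()` in-process may be the better fit.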

## API Surface

### `ConversionOptions`

Key fields (see docstring for full matrix):

- `heading_style`: `"underlined" | "atx" | "atx_closed"`
- `list_indent_width`: spaces per indent level (default 2)
- `bullets`: cycle of bullet characters (`"*+-"`)
- `strong_em_symbol`: `"*"` or `"_"`
- `code_language`: default fenced code block language
- `wrap`, `wrap_width`: wrap Markdown output
- `strip_tags`: remove specific HTML tags
- `preprocessing`: `PreprocessingOptions`
- `encoding`: input character encoding (informational)

### `PreprocessingOptions`

- `enabled`: enable HTML sanitisation (default: `True` since v2.4.2 for robust malformed HTML handling)
- `preset`: `"minimal" | "standard" | "aggressive"` (default: `"standard"`)
- `remove_navigation`: remove navigation elements (default: `True`)
- `remove_forms`: remove form elements (default: `True`)

**Note:** As of v2.4.2, preprocessing is enabled by default to ensure robust handling of malformed HTML (e.g., bare angle brackets like `1<2` in content). Set `enabled=False` to turn preprocessing off.

### `InlineImageConfig`

- `max_decoded_size_bytes`: reject larger payloads
- `filename_prefix`: generated name prefix (`embedded_image` default)
- `capture_svg`: collect inline `<svg>` (default `True`)
- `infer_dimensions`: decode raster images to obtain dimensions (default `False`)
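
As its name suggests, `max_decoded_size_bytes` is about the payload size after base64 decoding, which is smaller than the `data:` URI text. The helper below is a hypothetical, standard-library-only sketch (not the library's implementation) for estimating that size when choosing a limit.

```python
import base64


def decoded_size(data_uri: str) -> int:
    """Estimate the decoded byte length of a base64 data URI without decoding it."""
    header, _, payload = data_uri.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    payload = payload.strip()
    padding = len(payload) - len(payload.rstrip("="))
    # Every 4 base64 characters encode 3 bytes; each '=' pads one missing byte
    return len(payload) * 3 // 4 - padding


# Build a 1500-byte payload and compare it with a 1024-byte limit
uri = "data:image/png;base64," + base64.b64encode(b"\x00" * 1500).decode("ascii")
print(decoded_size(uri) > 1024)  # True: 1500 decoded bytes exceed the limit
```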

## Performance: V2 vs V1 Compatibility Layer

### ⚠️ Important: Always Use V2 API

The v2 API (`convert()`) is **strongly recommended** for all code. The v1 compatibility layer adds significant overhead and should only be used for gradual migration:

```python
# ✅ RECOMMENDED - V2 Direct API (Fast)
from html_to_markdown import convert, ConversionOptions

markdown = convert(html)  # Simple conversion - FAST
markdown = convert(html, ConversionOptions(heading_style="atx"))  # With options - FAST

# ❌ AVOID - V1 Compatibility Layer (Slow)
from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(html, heading_style="atx")  # Adds 77% overhead
```

### Performance Comparison

Benchmarked on Apple M4 with 25-paragraph HTML document:

| API                      | Throughput          | Relative Performance | Recommendation      |
| ------------------------ | ------------------- | -------------------- | ------------------- |
| **V2 API** (`convert()`) | **129,822 ops/sec** | baseline             | ✅ **Use this**     |
| **V1 Compat Layer**      | **67,673 ops/sec**  | **77% slower**       | ⚠️ Migration only   |
| **CLI**                  | **150-210 MB/s**    | Fastest              | ✅ Batch processing |

The v1 compatibility layer creates extra Python objects and performs additional conversions, significantly impacting performance.
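
The two ops/sec figures can be turned into a relative number in more than one way, and the result depends on the baseline chosen. The short calculation below uses only the two table values; the variable names are illustrative.

```python
# Figures copied from the table above (ops/sec)
V2_OPS_PER_SEC = 129_822
V1_OPS_PER_SEC = 67_673

# Share of v2 throughput lost when going through the compat layer
throughput_drop = 1 - V1_OPS_PER_SEC / V2_OPS_PER_SEC

# Extra wall-clock time per conversion, relative to a v2 call
extra_time_per_call = V2_OPS_PER_SEC / V1_OPS_PER_SEC - 1

# Mean time per conversion in microseconds
v2_us = 1_000_000 / V2_OPS_PER_SEC
v1_us = 1_000_000 / V1_OPS_PER_SEC

print(f"throughput drop:      {throughput_drop:.1%}")
print(f"extra time per call:  {extra_time_per_call:.1%}")
print(f"time per call:        v2 {v2_us:.1f} us, v1 {v1_us:.1f} us")
```

On these two values the compat layer reaches about half the v2 throughput. The 77% quoted in this section may rest on a different run or a different definition of overhead.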

### When to Use Each

- **V2 API (`convert()`)**: All new code, production systems, performance-critical applications ← **Use this**
- **V1 Compat (`convert_to_markdown()`)**: Only for gradual migration from legacy codebases
- **CLI (`html-to-markdown`)**: Batch processing, shell scripts, maximum throughput

## v1 Compatibility

A compatibility layer is provided to ease migration from v1.x:

- **Compat shim**: `html_to_markdown.v1_compat` exposes `convert_to_markdown`, `convert_to_markdown_stream`, and `markdownify`. Keyword mappings are listed in the [changelog](CHANGELOG.md#v200).
- **⚠️ Performance warning**: These compatibility functions add 77% overhead. Migrate to the v2 API as soon as possible.
- **CLI**: The Rust CLI replaces the old Python script. New flags are documented via `html-to-markdown --help`.
- **Removed options**: `code_language_callback`, `strip`, and the streaming APIs are no longer available; use `ConversionOptions`, `PreprocessingOptions`, and the inline-image helpers instead.
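
To plan a migration, it helps to know where the compat functions are still called. The sketch below uses the standard-library `ast` module to list calls to the three function names mentioned above. `find_v1_calls` is a hypothetical helper, and it matches by name only, so an unrelated function with the same name would also be reported.

```python
import ast

# Function names exposed by the compat shim, as listed above
V1_NAMES = {"convert_to_markdown", "convert_to_markdown_stream", "markdownify"}


def find_v1_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, function name) for each call to a v1 function."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):
            name = func.id  # plain call: convert_to_markdown(...)
        elif isinstance(func, ast.Attribute):
            name = func.attr  # qualified call: v1_compat.markdownify(...)
        else:
            continue
        if name in V1_NAMES:
            hits.append((node.lineno, name))
    return sorted(hits)
```

Running it over each `.py` file in a project gives a checklist of call sites to move to `convert()`.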

## Links

- GitHub: [https://github.com/Goldziher/html-to-markdown](https://github.com/Goldziher/html-to-markdown)
- Discord: [https://discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
- Kreuzberg ecosystem: [https://kreuzberg.dev](https://kreuzberg.dev)

## License

MIT License – see [LICENSE](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE).

## Support

If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/Goldziher).


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Goldziher/html-to-markdown",
    "name": "html-to-markdown",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "cli-tool, converter, html, html2markdown, html5, markdown, markup, parser, rust, text-processing",
    "author": null,
    "author_email": "Na'aman Hirschfeld <nhirschfeld@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/12/44/ae69b5994be93041a6d40bb627e0a9876b14b63a75fb38bd981e105ffc07/html_to_markdown-2.5.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "High-performance HTML to Markdown converter powered by Rust with a clean Python API",
    "version": "2.5.6",
    "project_urls": {
        "Changelog": "https://github.com/Goldziher/html-to-markdown/releases",
        "Homepage": "https://github.com/Goldziher/html-to-markdown",
        "Issues": "https://github.com/Goldziher/html-to-markdown/issues",
        "Repository": "https://github.com/Goldziher/html-to-markdown.git"
    },
    "split_keywords": [
        "cli-tool",
        " converter",
        " html",
        " html2markdown",
        " html5",
        " markdown",
        " markup",
        " parser",
        " rust",
        " text-processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "bec910703e3f190d0b9988b9683bdcb7278013f951010a582529e54cbb101309",
                "md5": "bfd79220e3b8ec16e55c114ea813d344",
                "sha256": "9476b5a031ff7ec250e849688d4cd783105535fe36e142b76a031d58c83c3b48"
            },
            "downloads": -1,
            "filename": "html_to_markdown-2.5.6-cp310-abi3-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "bfd79220e3b8ec16e55c114ea813d344",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 4871290,
            "upload_time": "2025-10-29T21:19:04",
            "upload_time_iso_8601": "2025-10-29T21:19:04.730946Z",
            "url": "https://files.pythonhosted.org/packages/be/c9/10703e3f190d0b9988b9683bdcb7278013f951010a582529e54cbb101309/html_to_markdown-2.5.6-cp310-abi3-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f3c1b45e17e5d66110932aca582b4e179d0a9837653a214756c67246e0604ae8",
                "md5": "4d74540d090920d5e17d4b20a65adc81",
                "sha256": "7eb5a08c38251443dfc194694acef826261e3c01a9aeec20073e4f7a26321817"
            },
            "downloads": -1,
            "filename": "html_to_markdown-2.5.6-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl",
            "has_sig": false,
            "md5_digest": "4d74540d090920d5e17d4b20a65adc81",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 5397487,
            "upload_time": "2025-10-29T21:19:07",
            "upload_time_iso_8601": "2025-10-29T21:19:07.306754Z",
            "url": "https://files.pythonhosted.org/packages/f3/c1/b45e17e5d66110932aca582b4e179d0a9837653a214756c67246e0604ae8/html_to_markdown-2.5.6-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d24255a7dce8fda38193e1dfb6e9c60f1d2ea5219bcacf5deef83db88a639d85",
                "md5": "92d10654f9a639a95ccd100fb33d0220",
                "sha256": "277bebbbe731ca75b098ce6900e576d896c457a6ca036d7545207902adf2a66a"
            },
            "downloads": -1,
            "filename": "html_to_markdown-2.5.6-cp310-abi3-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "92d10654f9a639a95ccd100fb33d0220",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 5145598,
            "upload_time": "2025-10-29T21:19:08",
            "upload_time_iso_8601": "2025-10-29T21:19:08.996769Z",
            "url": "https://files.pythonhosted.org/packages/d2/42/55a7dce8fda38193e1dfb6e9c60f1d2ea5219bcacf5deef83db88a639d85/html_to_markdown-2.5.6-cp310-abi3-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1244ae69b5994be93041a6d40bb627e0a9876b14b63a75fb38bd981e105ffc07",
                "md5": "a101f5bb4426fb7b6a0d24d1d6d856cc",
                "sha256": "0a891fc327a4cc8017cd9af93c46d657cdd36c837f49f039319f10a4b2aec605"
            },
            "downloads": -1,
            "filename": "html_to_markdown-2.5.6.tar.gz",
            "has_sig": false,
            "md5_digest": "a101f5bb4426fb7b6a0d24d1d6d856cc",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 1933181,
            "upload_time": "2025-10-29T21:19:10",
            "upload_time_iso_8601": "2025-10-29T21:19:10.590221Z",
            "url": "https://files.pythonhosted.org/packages/12/44/ae69b5994be93041a6d40bb627e0a9876b14b63a75fb38bd981e105ffc07/html_to_markdown-2.5.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-29 21:19:10",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Goldziher",
    "github_project": "html-to-markdown",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "html-to-markdown"
}