html-to-markdown


Name: html-to-markdown
Version: 1.13.0
Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
Upload time: 2025-09-16 05:35:37
Requires Python: >=3.10
License: MIT
Keywords: beautifulsoup, cli-tool, converter, html, html2markdown, markdown, markup, text-extraction, text-processing
Requirements: No requirements were recorded.
# html-to-markdown

A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork
of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety, and support for
Python 3.10+.

## Support This Project

If you find html-to-markdown useful, please consider sponsoring the development:

<a href="https://github.com/sponsors/Goldziher"><img src="https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors" alt="Sponsor on GitHub" height="32"></a>

Your support helps maintain and improve this library for the community.

## Features

- **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
- **Table Support**: Advanced handling of complex tables with rowspan/colspan support
- **Type Safety**: Strict MyPy adherence with comprehensive type hints
- **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
- **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
- **Highlight Support**: Multiple styles for highlighted text (`<mark>` elements)
- **Task List Support**: Converts HTML checkboxes to GitHub-compatible task list syntax
- **Flexible Configuration**: Comprehensive configuration options for customizing conversion behavior
- **CLI Tool**: Full-featured command-line interface with complete API parity
- **Custom Converters**: Extensible converter system for custom HTML tag handling
- **List Formatting**: Configurable list indentation with Discord/Slack compatibility
- **HTML Preprocessing**: Clean messy HTML with configurable aggressiveness levels
- **Whitespace Control**: Normalized or strict whitespace preservation modes
- **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
- **Robustly Tested**: Comprehensive unit tests and integration tests covering all conversion scenarios

## Installation

```shell
pip install html-to-markdown
```

### Optional lxml Parser

For improved performance, you can install with the optional lxml parser:

```shell
pip install html-to-markdown[lxml]
```

The lxml parser offers faster HTML parsing and better handling of malformed HTML compared to the default html.parser.

The library automatically uses lxml when available. You can explicitly specify a parser using the `parser` parameter.
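
If you want to know in advance which parser will be picked (for logging, say), the check can be approximated in user code. This is a minimal sketch using only the standard library, not the library's internal logic; `pick_parser` is a hypothetical helper name.

```python
from importlib.util import find_spec


def pick_parser(preferred: str = "lxml") -> str:
    """Return `preferred` if its module is importable, else "html.parser"."""
    # find_spec returns None when the top-level module cannot be found.
    if find_spec(preferred) is not None:
        return preferred
    return "html.parser"


parser_name = pick_parser()
print(parser_name)  # "lxml" if installed, otherwise "html.parser"
```

The resulting name can then be passed through the `parser` parameter mentioned above.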

## Quick Start

Convert HTML to Markdown with a single function call:

```python
from html_to_markdown import convert_to_markdown

html = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample Document</title>
    <meta name="description" content="A sample HTML document">
</head>
<body>
    <article>
        <h1>Welcome</h1>
        <p>This is a <strong>sample</strong> with a <a href="https://example.com">link</a>.</p>
        <p>Here's some <mark>highlighted text</mark> and a task list:</p>
        <ul>
            <li><input type="checkbox" checked> Completed task</li>
            <li><input type="checkbox"> Pending task</li>
        </ul>
    </article>
</body>
</html>
"""

markdown = convert_to_markdown(html)
print(markdown)
```

Output:

```markdown
<!--
title: Sample Document
meta-description: A sample HTML document
-->

# Welcome

This is a **sample** with a [link](https://example.com).

Here's some ==highlighted text== and a task list:

* [x] Completed task
* [ ] Pending task
```
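
Downstream code may want the extracted metadata as data rather than as a comment. The sketch below splits the comment header from the body. It assumes the header always has the `key: value` line shape shown in the output above; `split_metadata` is a hypothetical helper, not part of the library.

```python
import re

# Matches a leading HTML comment block such as the one in the output above.
_HEADER_RE = re.compile(r"\A\s*<!--\n(.*?)\n-->\s*", re.DOTALL)


def split_metadata(markdown: str) -> tuple[dict[str, str], str]:
    """Split a leading `<!-- key: value -->` header from the Markdown body."""
    match = _HEADER_RE.match(markdown)
    if match is None:
        return {}, markdown
    metadata: dict[str, str] = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a "key: value" shape
            metadata[key.strip()] = value.strip()
    return metadata, markdown[match.end():]


sample = """<!--
title: Sample Document
meta-description: A sample HTML document
-->

# Welcome
"""
meta, body = split_metadata(sample)
print(meta)
print(body)
```

Values that span several lines or contain `-->` are not handled by this sketch.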

### Working with BeautifulSoup

If you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:

```python
from bs4 import BeautifulSoup
from html_to_markdown import convert_to_markdown

# Configure BeautifulSoup with your preferred parser
soup = BeautifulSoup(html, "lxml")  # Note: lxml requires additional installation
markdown = convert_to_markdown(soup)
```

## Common Use Cases

### Discord/Slack Compatible Lists

Discord and Slack require 2-space indentation for nested lists:

**Python:**

```python
from html_to_markdown import convert_to_markdown

html = "<ul><li>Item 1<ul><li>Nested item</li></ul></li></ul>"
markdown = convert_to_markdown(html, list_indent_width=2)
# Output: * Item 1\n  + Nested item
```

**CLI:**

```shell
html_to_markdown --list-indent-width 2 input.html
```

### Cleaning Web-Scraped HTML

Remove navigation, advertisements, and forms from scraped content:

**Python:**

```python
markdown = convert_to_markdown(html, preprocess_html=True, preprocessing_preset="aggressive")
```

**CLI:**

```shell
html_to_markdown --preprocess-html --preprocessing-preset aggressive input.html
```

### Preserving Whitespace for Documentation

Maintain exact whitespace for code documentation or technical content:

**Python:**

```python
markdown = convert_to_markdown(html, whitespace_mode="strict")
```

**CLI:**

```shell
html_to_markdown --whitespace-mode strict input.html
```

### Using Tabs for List Indentation

Some editors and platforms prefer tab-based indentation:

**Python:**

```python
markdown = convert_to_markdown(html, list_indent_type="tabs")
```

**CLI:**

```shell
html_to_markdown --list-indent-type tabs input.html
```

## Advanced Usage

### Configuration Example

```python
from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(
    html,
    # Headers and formatting
    heading_style="atx",
    strong_em_symbol="*",
    bullets="*+-",
    highlight_style="double-equal",
    # List indentation
    list_indent_type="spaces",
    list_indent_width=4,
    # Whitespace handling
    whitespace_mode="normalized",
    # HTML preprocessing
    preprocess_html=True,
    preprocessing_preset="standard",
)
```

### Custom Converters

Custom converters allow you to override the default conversion behavior for any HTML tag. This is particularly useful for customizing header formatting or implementing domain-specific conversion rules.

#### Basic Example: Custom Header Formatting

```python
from bs4.element import Tag
from html_to_markdown import convert_to_markdown

def custom_h1_converter(*, tag: Tag, text: str, **kwargs) -> str:
    """Convert h1 tags with custom formatting."""
    return f"### {text.upper()} ###\n\n"

def custom_h2_converter(*, tag: Tag, text: str, **kwargs) -> str:
    """Convert h2 tags with underline."""
    return f"{text}\n{'=' * len(text)}\n\n"

html = "<h1>Title</h1><h2>Subtitle</h2><p>Content</p>"
markdown = convert_to_markdown(html, custom_converters={"h1": custom_h1_converter, "h2": custom_h2_converter})
print(markdown)
# Output:
# ### TITLE ###
#
# Subtitle
# ========
#
# Content
```

#### Advanced Example: Context-Aware Link Conversion

```python
from bs4.element import Tag
from html_to_markdown import convert_to_markdown


def smart_link_converter(*, tag: Tag, text: str, **kwargs) -> str:
    """Convert links based on their attributes."""
    href = tag.get("href", "")
    title = tag.get("title", "")

    # Handle different link types
    if href.startswith("http"):
        # External link
        return f"[{text}]({href} \"{title or 'External link'}\")"
    elif href.startswith("#"):
        # Anchor link
        return f"[{text}]({href})"
    elif href.startswith("mailto:"):
        # Email link
        return f"[{text}]({href})"
    else:
        # Relative link
        return f"[{text}]({href})"

html = '<a href="https://example.com">External</a> <a href="#section">Anchor</a>'
markdown = convert_to_markdown(html, custom_converters={"a": smart_link_converter})
```

#### Converter Function Signature

All converter functions must follow this signature:

```python
def converter(*, tag: Tag, text: str, **kwargs) -> str:
    """
    Args:
        tag: BeautifulSoup Tag object with access to all HTML attributes
        text: Pre-processed text content of the tag
        **kwargs: Additional context passed through from conversion

    Returns:
        Markdown formatted string
    """
    pass
```

Custom converters take precedence over built-in converters and can be used alongside other configuration options.
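
Because a converter is an ordinary keyword-only function, it can be exercised in isolation before it is registered. In this minimal sketch, `abbr_converter` is a hypothetical example. A plain dict stands in for the `Tag`, which only works because the function uses nothing but `.get()`.

```python
from typing import Any


def abbr_converter(*, tag: Any, text: str, **kwargs: Any) -> str:
    """Render <abbr title="..."> as "text (title)"; fall back to plain text."""
    title = tag.get("title", "")
    return f"{text} ({title})" if title else text


# A plain dict stands in for a BeautifulSoup Tag here, since both expose .get().
print(abbr_converter(tag={"title": "HyperText Markup Language"}, text="HTML"))
print(abbr_converter(tag={}, text="HTML"))
```

Once it behaves as intended, it would be registered the same way as the converters above, for example `custom_converters={"abbr": abbr_converter}`.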

### Streaming API

For processing large documents with memory constraints, use the streaming API:

```python
from html_to_markdown import convert_to_markdown_stream

# Process large HTML in chunks
with open("large_document.html", "r") as f:
    html_content = f.read()

# Returns a generator that yields markdown chunks
for chunk in convert_to_markdown_stream(html_content, chunk_size=2048):
    print(chunk, end="")
```

With progress tracking:

```python
def show_progress(processed: int, total: int):
    if total > 0:
        percent = (processed / total) * 100
        print(f"\rProgress: {percent:.1f}%", end="")

# Stream with progress callback
markdown = convert_to_markdown(html_content, stream_processing=True, chunk_size=4096, progress_callback=show_progress)
```
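
A callback that prints on every update can be noisy for large inputs. The sketch below builds a throttled callback with the same `(processed, total)` signature as `show_progress`; `make_progress_logger` is a hypothetical helper. The loop only simulates updates, so the frequency at which the library actually invokes the callback is not assumed.

```python
from typing import Callable


def make_progress_logger(step_percent: float = 10.0) -> tuple[Callable[[int, int], None], list[float]]:
    """Build a callback that records progress only every `step_percent` points."""
    reported: list[float] = []

    def on_progress(processed: int, total: int) -> None:
        if total <= 0:
            return  # nothing sensible to report
        percent = processed / total * 100
        # Record the first update, then only after a full step has elapsed.
        if not reported or percent - reported[-1] >= step_percent:
            reported.append(percent)

    return on_progress, reported


callback, seen = make_progress_logger(step_percent=25.0)
for done in range(0, 101, 5):  # simulate 21 updates for a 100-unit input
    callback(done, 100)
print(seen)
```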

#### When to Use Streaming vs Regular Processing

Based on comprehensive performance analysis, here are our recommendations:

**📄 Use Regular Processing When:**

- Files < 100KB (simplicity preferred)
- Simple scripts and one-off conversions
- Memory is not a concern
- You want the simplest API

**🌊 Use Streaming Processing When:**

- Files > 100KB (memory efficiency)
- Processing many files in batch
- Memory is constrained
- You need progress reporting
- You want to process results incrementally
- Running in production environments

**📋 Specific Recommendations by File Size:**

| File Size  | Recommendation                                  | Reason                                 |
| ---------- | ----------------------------------------------- | -------------------------------------- |
| < 50KB     | Regular (simplicity) or Streaming (3-5% faster) | Either works well                      |
| 50KB-100KB | Either (streaming slightly preferred)           | Minimal difference                     |
| 100KB-1MB  | Streaming preferred                             | Better performance + memory efficiency |
| > 1MB      | Streaming strongly recommended                  | Significant memory advantages          |

**🔧 Configuration Recommendations:**

- **Default chunk_size: 2048 bytes** (optimal performance balance)
- **For very large files (>10MB)**: Consider `chunk_size=4096`
- **For memory-constrained environments**: Use smaller chunks `chunk_size=1024`

**📈 Performance Benefits:**

Streaming provides consistent **3-5% performance improvement** across all file sizes:

- **Streaming throughput**: ~0.47-0.48 MB/s
- **Regular throughput**: ~0.44-0.47 MB/s
- **Memory usage**: Streaming uses less peak memory for large files
- **Latency**: Streaming allows processing results before completion
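
The recommendations above can be folded into a small helper. This is a sketch under stated assumptions: `choose_strategy` and `Strategy` are hypothetical names, sizes are read as binary kilobytes and megabytes, and inputs under 100KB default to regular processing for simplicity.

```python
from dataclasses import dataclass

KB = 1024  # assumption: the thresholds above are read as binary kilobytes
MB = 1024 * KB


@dataclass(frozen=True)
class Strategy:
    stream: bool
    chunk_size: int


def choose_strategy(size_bytes: int, memory_constrained: bool = False) -> Strategy:
    """Map an input size to the processing mode and chunk size suggested above."""
    if memory_constrained:
        return Strategy(stream=True, chunk_size=1024)
    if size_bytes > 10 * MB:
        return Strategy(stream=True, chunk_size=4096)
    # Below 100KB either mode works; prefer the simpler regular call.
    return Strategy(stream=size_bytes >= 100 * KB, chunk_size=2048)


print(choose_strategy(20 * KB))
print(choose_strategy(5 * MB))
print(choose_strategy(50 * MB))
```

The two fields correspond to the `stream_processing` and `chunk_size` keyword arguments used in the progress example.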

### Preprocessing API

The library provides functions for preprocessing HTML before conversion, useful for cleaning messy or complex HTML:

```python
from html_to_markdown import convert_to_markdown, create_preprocessor, preprocess_html

# Direct preprocessing with custom options
cleaned_html = preprocess_html(
    raw_html,
    remove_navigation=True,
    remove_forms=True,
    remove_scripts=True,
    remove_styles=True,
    remove_comments=True,
    preserve_semantic_structure=True,
    preserve_tables=True,
    preserve_media=True,
)
markdown = convert_to_markdown(cleaned_html)

# Create a preprocessor configuration from presets
config = create_preprocessor(
    preset="aggressive",  # or "minimal", "standard"
    preserve_tables=False,  # Override preset settings
)
markdown = convert_to_markdown(html, **config)
```

### Exception Handling

The library provides specific exception classes for better error handling:

```python
from html_to_markdown import (
    convert_to_markdown,
    HtmlToMarkdownError,
    EmptyHtmlError,
    InvalidParserError,
    ConflictingOptionsError,
    MissingDependencyError
)

try:
    markdown = convert_to_markdown(html, parser='lxml')
except MissingDependencyError:
    # lxml not installed
    markdown = convert_to_markdown(html, parser='html.parser')
except EmptyHtmlError:
    print("No HTML content to convert")
except InvalidParserError as e:
    print(f"Parser error: {e}")
except ConflictingOptionsError as e:
    print(f"Conflicting options: {e}")
except HtmlToMarkdownError as e:
    print(f"Conversion error: {e}")
```

## CLI Usage

Convert HTML files directly from the command line with full access to all API options:

```shell
# Convert a file
html_to_markdown input.html > output.md

# Process stdin
cat input.html | html_to_markdown > output.md

# Use custom options
html_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md

# Discord-compatible lists with HTML preprocessing
html_to_markdown \
  --list-indent-width 2 \
  --preprocess-html \
  --preprocessing-preset aggressive \
  input.html > output.md
```

### Key CLI Options

**Most Common Options:**

```shell
--list-indent-width WIDTH           # Spaces per indent (default: 4, use 2 for Discord)
--list-indent-type {spaces,tabs}    # Indentation type (default: spaces)
--preprocess-html                   # Enable HTML cleaning for web scraping
--whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)
--heading-style {atx,atx_closed,underlined} # Header style
--no-extract-metadata               # Disable metadata extraction
--br-in-tables                      # Use <br> tags for line breaks in table cells
--source-encoding ENCODING          # Override auto-detected encoding (rarely needed)
```

**File Encoding:**

The CLI automatically detects file encoding in most cases. Use `--source-encoding` only when automatic detection fails (typically on some Windows systems or with unusual encodings):

```shell
# Override auto-detection for Latin-1 encoded file
html_to_markdown --source-encoding latin-1 input.html > output.md

# Force UTF-16 encoding when auto-detection fails
html_to_markdown --source-encoding utf-16 input.html > output.md
```
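
To see why an explicit encoding is sometimes needed, consider bytes that are not valid UTF-8. The sketch below tries candidate encodings in order. It is an illustration only and does not reproduce the CLI's detection logic, which this document does not describe; `decode_html` is a hypothetical helper.

```python
def decode_html(raw: bytes, encodings: tuple[str, ...] = ("utf-8", "latin-1")) -> tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the input")


raw = "<p>café</p>".encode("latin-1")  # the byte 0xE9 is not valid UTF-8 here
text, used = decode_html(raw)
print(used, text)
```

Note that `latin-1` accepts any byte sequence, so it belongs last in the candidate list; a wrong guess produces garbled text rather than an error.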

**All Available Options:**
The CLI supports all Python API parameters. Use `html_to_markdown --help` to see the complete list.
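
When the CLI is driven from a script, the argument list can be built from keyword options. This sketch assumes that option names map to flags by replacing underscores with hyphens, as the examples in this document suggest. Negated switches such as `--no-extract-metadata` do not follow that pattern and are not handled. `build_cli_args` is a hypothetical helper.

```python
def build_cli_args(input_path: str, **options: object) -> list[str]:
    """Translate keyword options into html_to_markdown command-line arguments."""
    args = ["html_to_markdown"]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            args.append(flag)  # boolean switch, e.g. --preprocess-html
        elif value is not False and value is not None:
            args.extend([flag, str(value)])  # valued option
    args.append(input_path)
    return args


print(build_cli_args("input.html", list_indent_width=2, preprocess_html=True, preprocessing_preset="aggressive"))
```

The resulting list can be handed to `subprocess.run`; check `html_to_markdown --help` for the authoritative flag names.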

## Migration from Markdownify

For existing projects using Markdownify, a compatibility layer is provided:

```python
# Old code
from markdownify import markdownify as md

# New code - works the same way
from html_to_markdown import markdownify as md
```

The `markdownify` function is an alias for `convert_to_markdown` and provides identical functionality.

**Note**: While the compatibility layer ensures existing code continues to work, new projects should use `convert_to_markdown` directly as it provides better type hints and clearer naming.

## Configuration Reference

### Most Common Parameters

- `list_indent_width` (int, default: `4`): Number of spaces per indentation level (use 2 for Discord/Slack)
- `list_indent_type` (str, default: `'spaces'`): Use `'spaces'` or `'tabs'` for list indentation
- `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)
- `whitespace_mode` (str, default: `'normalized'`): Whitespace handling (`'normalized'` or `'strict'`)
- `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML
- `extract_metadata` (bool, default: `True`): Extract document metadata as comment header

### Text Formatting

- `highlight_style` (str, default: `'double-equal'`): Style for highlighted text (`'double-equal'`, `'html'`, `'bold'`)
- `strong_em_symbol` (str, default: `'*'`): Symbol for strong/emphasized text (`'*'` or `'_'`)
- `bullets` (str, default: `'*+-'`): Characters to use for bullet points in lists
- `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
- `sub_symbol` (str, default: `''`): Custom symbol for subscript text
- `sup_symbol` (str, default: `''`): Custom symbol for superscript text
- `br_in_tables` (bool, default: `False`): Use `<br>` tags for line breaks in table cells instead of spaces

### Document Processing

- `convert_as_inline` (bool, default: `False`): Treat content as inline elements only
- `strip_newlines` (bool, default: `False`): Remove newlines from HTML input before processing
- `convert` (list, default: `None`): List of HTML tags to convert (None = all supported tags)
- `strip` (list, default: `None`): List of HTML tags to remove from output
- `custom_converters` (dict, default: `None`): Mapping of HTML tag names to custom converter functions

### Text Escaping

- `escape_asterisks` (bool, default: `True`): Escape `*` characters to prevent unintended formatting
- `escape_underscores` (bool, default: `True`): Escape `_` characters to prevent unintended formatting
- `escape_misc` (bool, default: `True`): Escape miscellaneous characters to prevent Markdown conflicts
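
The purpose of these options is easiest to see on text that contains literal `*` or `_`. The following sketch shows the general idea of backslash-escaping. It is not the library's implementation, and `escape_markdown` is a hypothetical helper.

```python
import re


def escape_markdown(text: str, *, asterisks: bool = True, underscores: bool = True) -> str:
    """Backslash-escape characters that Markdown would read as emphasis."""
    chars = ("*" if asterisks else "") + ("_" if underscores else "")
    if not chars:
        return text
    # Prefix each selected character with a single backslash.
    return re.sub(f"([{re.escape(chars)}])", r"\\\1", text)


print(escape_markdown("2 * 3 * snake_case"))
print(escape_markdown("snake_case", underscores=False))
```

Without escaping, `2 * 3 * snake_case` could render with unintended emphasis in some Markdown renderers.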

### Links and Media

- `autolinks` (bool, default: `True`): Automatically convert valid URLs to Markdown links
- `default_title` (bool, default: `False`): Use default titles for elements like links
- `keep_inline_images_in` (list, default: `None`): Tags where inline images should be preserved

### Code Blocks

- `code_language` (str, default: `''`): Default language identifier for fenced code blocks
- `code_language_callback` (callable, default: `None`): Function to dynamically determine code block language
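
A callback for `code_language_callback` might look like the sketch below. It assumes the callback receives the parsed element for the code block and returns a language string; that signature is not documented here, so verify it against the library before relying on it. `language_from_class` is a hypothetical name.

```python
from typing import Any


def language_from_class(element: Any) -> str:
    """Derive a fence language from a class such as "language-python"."""
    classes = element.get("class") or []
    if isinstance(classes, str):  # tolerate a space-separated string
        classes = classes.split()
    for name in classes:
        for prefix in ("language-", "lang-"):
            if name.startswith(prefix):
                return name[len(prefix):]
    return ""  # empty string: no language identifier


# A dict stands in for the parsed element; BeautifulSoup tags also expose .get().
print(language_from_class({"class": ["highlight", "language-python"]}))
print(language_from_class({}))
```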

### Text Wrapping

- `wrap` (bool, default: `False`): Enable text wrapping
- `wrap_width` (int, default: `80`): Width for text wrapping

### HTML Processing

- `parser` (str, default: auto-detect): BeautifulSoup parser to use (`'lxml'`, `'html.parser'`, `'html5lib'`)
- `whitespace_mode` (str, default: `'normalized'`): How to handle whitespace (`'normalized'` intelligently cleans whitespace, `'strict'` preserves original)
- `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML
- `preprocessing_preset` (str, default: `'standard'`): Preprocessing aggressiveness (`'minimal'` for basic cleaning, `'standard'` for balanced, `'aggressive'` for heavy cleaning)
- `remove_forms` (bool, default: `True`): Remove form elements during preprocessing
- `remove_navigation` (bool, default: `True`): Remove navigation elements during preprocessing

## Contribution

This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before
submitting PRs to avoid disappointment.

### Local Development

1. Clone the repo

1. Install system dependencies (requires Python 3.10+)

1. Install the project dependencies:

    ```shell
    uv sync --all-extras --dev
    ```

1. Install pre-commit hooks:

    ```shell
    uv run pre-commit install
    ```

1. Run tests to ensure everything works:

    ```shell
    uv run pytest
    ```

1. Run code quality checks:

    ```shell
    uv run pre-commit run --all-files
    ```

1. Make your changes and submit a PR

### Development Commands

```shell
# Run tests with coverage
uv run pytest --cov=html_to_markdown --cov-report=term-missing

# Lint and format code
uv run ruff check --fix .
uv run ruff format .

# Type checking
uv run mypy

# Test CLI during development
uv run python -m html_to_markdown input.html

# Build package
uv build
```

## License

This library uses the MIT license.

## HTML5 Element Support

This library provides comprehensive support for all modern HTML5 elements:

### Semantic Elements

- `<article>`, `<aside>`, `<figcaption>`, `<figure>`, `<footer>`, `<header>`, `<hgroup>`, `<main>`, `<nav>`, `<section>`
- `<abbr>`, `<bdi>`, `<bdo>`, `<cite>`, `<data>`, `<dfn>`, `<kbd>`, `<mark>`, `<samp>`, `<small>`, `<time>`, `<var>`
- `<del>`, `<ins>` (strikethrough and insertion tracking)

### Form Elements

- `<form>`, `<fieldset>`, `<legend>`, `<label>`, `<input>`, `<textarea>`, `<select>`, `<option>`, `<optgroup>`
- `<button>`, `<datalist>`, `<output>`, `<progress>`, `<meter>`
- Task list support: `<input type="checkbox">` converts to `[x]` / `[ ]` task list items (`* [x]` / `* [ ]` with the default bullets)

### Table Elements

- `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`
- **Merged cell support**: Handles `rowspan` and `colspan` attributes for complex table layouts
- **Smart cleanup**: Automatically handles table styling elements for clean Markdown output
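
Markdown tables have no merged cells, so `rowspan` and `colspan` must be flattened into a rectangular grid. The sketch below shows one common way to do that, leaving continuation slots empty. It is an illustration under that assumption, not the library's algorithm; `expand_spans` and the `Cell` tuple are hypothetical.

```python
Cell = tuple[str, int, int]  # (text, rowspan, colspan)


def expand_spans(rows: list[list[Cell]]) -> list[list[str]]:
    """Expand merged cells into a rectangular grid of strings."""
    grid: dict[tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # skip slots claimed by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    # The text goes in the top-left slot; the rest stay empty.
                    grid[(r + dr, c + dc)] = text if dr == 0 and dc == 0 else ""
            c += colspan
    if not grid:
        return []
    height = max(r for r, _ in grid) + 1
    width = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(width)] for r in range(height)]


table = [
    [("Name", 2, 1), ("Scores", 1, 2)],
    [("Math", 1, 1), ("Art", 1, 1)],
    [("Ada", 1, 1), ("90", 1, 1), ("85", 1, 1)],
]
for line in expand_spans(table):
    print("| " + " | ".join(line) + " |")
```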

### Interactive Elements

- `<details>`, `<summary>`, `<dialog>`, `<menu>`

### Ruby Annotations

- `<ruby>`, `<rb>`, `<rt>`, `<rtc>`, `<rp>` (for East Asian typography)

### Media Elements

- `<img>`, `<picture>`, `<audio>`, `<video>`, `<iframe>`
- SVG support with data URI conversion
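
For context, a data URI embeds the SVG markup directly in an image reference. The sketch below uses base64 encoding, which is an assumption; this document does not state which encoding the library emits. `svg_to_data_uri` is a hypothetical helper.

```python
import base64


def svg_to_data_uri(svg: str) -> str:
    """Encode SVG markup as a base64 data URI."""
    payload = base64.b64encode(svg.encode("utf-8")).decode("ascii")
    return f"data:image/svg+xml;base64,{payload}"


svg = '<svg xmlns="http://www.w3.org/2000/svg" width="8" height="8"/>'
uri = svg_to_data_uri(svg)
print(f"![icon]({uri})")
```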

### Math Elements

- `<math>` (MathML support)

## Acknowledgments

Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "html-to-markdown",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "beautifulsoup, cli-tool, converter, html, html2markdown, markdown, markup, text-extraction, text-processing",
    "author": null,
    "author_email": "Na'aman Hirschfeld <nhirschfeld@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/d9/eb/321c391a8f52ff470cdf53bf787d8981be9f3a36d3886fe8b195b549f2e0/html_to_markdown-1.13.0.tar.gz",
    "platform": null,
    "description": "# html-to-markdown\n\nA modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork\nof [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for\nPython 3.9+.\n\n## Support This Project\n\nIf you find html-to-markdown useful, please consider sponsoring the development:\n\n<a href=\"https://github.com/sponsors/Goldziher\"><img src=\"https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors\" alt=\"Sponsor on GitHub\" height=\"32\"></a>\n\nYour support helps maintain and improve this library for the community.\n\n## Features\n\n- **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements\n- **Table Support**: Advanced handling of complex tables with rowspan/colspan support\n- **Type Safety**: Strict MyPy adherence with comprehensive type hints\n- **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers\n- **Streaming Support**: Memory-efficient processing for large documents with progress callbacks\n- **Highlight Support**: Multiple styles for highlighted text (`<mark>` elements)\n- **Task List Support**: Converts HTML checkboxes to GitHub-compatible task list syntax\n- **Flexible Configuration**: Comprehensive configuration options for customizing conversion behavior\n- **CLI Tool**: Full-featured command-line interface with complete API parity\n- **Custom Converters**: Extensible converter system for custom HTML tag handling\n- **List Formatting**: Configurable list indentation with Discord/Slack compatibility\n- **HTML Preprocessing**: Clean messy HTML with configurable aggressiveness levels\n- **Whitespace Control**: Normalized or strict whitespace preservation modes\n- **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances\n- **Robustly 
Tested**: Comprehensive unit tests and integration tests covering all conversion scenarios\n\n## Installation\n\n```shell\npip install html-to-markdown\n```\n\n### Optional lxml Parser\n\nFor improved performance, you can install with the optional lxml parser:\n\n```shell\npip install html-to-markdown[lxml]\n```\n\nThe lxml parser offers faster HTML parsing and better handling of malformed HTML compared to the default html.parser.\n\nThe library automatically uses lxml when available. You can explicitly specify a parser using the `parser` parameter.\n\n## Quick Start\n\nConvert HTML to Markdown with a single function call:\n\n```python\nfrom html_to_markdown import convert_to_markdown\n\nhtml = \"\"\"\n<!DOCTYPE html>\n<html>\n<head>\n    <title>Sample Document</title>\n    <meta name=\"description\" content=\"A sample HTML document\">\n</head>\n<body>\n    <article>\n        <h1>Welcome</h1>\n        <p>This is a <strong>sample</strong> with a <a href=\"https://example.com\">link</a>.</p>\n        <p>Here's some <mark>highlighted text</mark> and a task list:</p>\n        <ul>\n            <li><input type=\"checkbox\" checked> Completed task</li>\n            <li><input type=\"checkbox\"> Pending task</li>\n        </ul>\n    </article>\n</body>\n</html>\n\"\"\"\n\nmarkdown = convert_to_markdown(html)\nprint(markdown)\n```\n\nOutput:\n\n```markdown\n<!--\ntitle: Sample Document\nmeta-description: A sample HTML document\n-->\n\n# Welcome\n\nThis is a **sample** with a [link](https://example.com).\n\nHere's some ==highlighted text== and a task list:\n\n* [x] Completed task\n* [ ] Pending task\n```\n\n### Working with BeautifulSoup\n\nIf you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:\n\n```python\nfrom bs4 import BeautifulSoup\nfrom html_to_markdown import convert_to_markdown\n\n# Configure BeautifulSoup with your preferred parser\nsoup = BeautifulSoup(html, \"lxml\")  # Note: lxml requires additional 
installation\nmarkdown = convert_to_markdown(soup)\n```\n\n## Common Use Cases\n\n### Discord/Slack Compatible Lists\n\nDiscord and Slack require 2-space indentation for nested lists:\n\n**Python:**\n\n```python\nfrom html_to_markdown import convert_to_markdown\n\nhtml = \"<ul><li>Item 1<ul><li>Nested item</li></ul></li></ul>\"\nmarkdown = convert_to_markdown(html, list_indent_width=2)\n# Output: * Item 1\\n  + Nested item\n```\n\n**CLI:**\n\n```shell\nhtml_to_markdown --list-indent-width 2 input.html\n```\n\n### Cleaning Web-Scraped HTML\n\nRemove navigation, advertisements, and forms from scraped content:\n\n**Python:**\n\n```python\nmarkdown = convert_to_markdown(html, preprocess_html=True, preprocessing_preset=\"aggressive\")\n```\n\n**CLI:**\n\n```shell\nhtml_to_markdown --preprocess-html --preprocessing-preset aggressive input.html\n```\n\n### Preserving Whitespace for Documentation\n\nMaintain exact whitespace for code documentation or technical content:\n\n**Python:**\n\n```python\nmarkdown = convert_to_markdown(html, whitespace_mode=\"strict\")\n```\n\n**CLI:**\n\n```shell\nhtml_to_markdown --whitespace-mode strict input.html\n```\n\n### Using Tabs for List Indentation\n\nSome editors and platforms prefer tab-based indentation:\n\n**Python:**\n\n```python\nmarkdown = convert_to_markdown(html, list_indent_type=\"tabs\")\n```\n\n**CLI:**\n\n```shell\nhtml_to_markdown --list-indent-type tabs input.html\n```\n\n## Advanced Usage\n\n### Configuration Example\n\n```python\nfrom html_to_markdown import convert_to_markdown\n\nmarkdown = convert_to_markdown(\n    html,\n    # Headers and formatting\n    heading_style=\"atx\",\n    strong_em_symbol=\"*\",\n    bullets=\"*+-\",\n    highlight_style=\"double-equal\",\n    # List indentation\n    list_indent_type=\"spaces\",\n    list_indent_width=4,\n    # Whitespace handling\n    whitespace_mode=\"normalized\",\n    # HTML preprocessing\n    preprocess_html=True,\n    preprocessing_preset=\"standard\",\n)\n```\n\n### 
Custom Converters\n\nCustom converters allow you to override the default conversion behavior for any HTML tag. This is particularly useful for customizing header formatting or implementing domain-specific conversion rules.\n\n#### Basic Example: Custom Header Formatting\n\n```python\nfrom bs4.element import Tag\nfrom html_to_markdown import convert_to_markdown\n\ndef custom_h1_converter(*, tag: Tag, text: str, **kwargs) -> str:\n    \"\"\"Convert h1 tags with custom formatting.\"\"\"\n    return f\"### {text.upper()} ###\\n\\n\"\n\ndef custom_h2_converter(*, tag: Tag, text: str, **kwargs) -> str:\n    \"\"\"Convert h2 tags with underline.\"\"\"\n    return f\"{text}\\n{'=' * len(text)}\\n\\n\"\n\nhtml = \"<h1>Title</h1><h2>Subtitle</h2><p>Content</p>\"\nmarkdown = convert_to_markdown(html, custom_converters={\"h1\": custom_h1_converter, \"h2\": custom_h2_converter})\nprint(markdown)\n# Output:\n# ### TITLE ###\n#\n# Subtitle\n# ========\n#\n# Content\n```\n\n#### Advanced Example: Context-Aware Link Conversion\n\n```python\ndef smart_link_converter(*, tag: Tag, text: str, **kwargs) -> str:\n    \"\"\"Convert links based on their attributes.\"\"\"\n    href = tag.get(\"href\", \"\")\n    title = tag.get(\"title\", \"\")\n\n    # Handle different link types\n    if href.startswith(\"http\"):\n        # External link\n        return f\"[{text}]({href} \\\"{title or 'External link'}\\\")\"\n    elif href.startswith(\"#\"):\n        # Anchor link\n        return f\"[{text}]({href})\"\n    elif href.startswith(\"mailto:\"):\n        # Email link\n        return f\"[{text}]({href})\"\n    else:\n        # Relative link\n        return f\"[{text}]({href})\"\n\nhtml = '<a href=\"https://example.com\">External</a> <a href=\"#section\">Anchor</a>'\nmarkdown = convert_to_markdown(html, custom_converters={\"a\": smart_link_converter})\n```\n\n#### Converter Function Signature\n\nAll converter functions must follow this signature:\n\n```python\ndef converter(*, tag: Tag, text: 
str, **kwargs) -> str:\n    \"\"\"\n    Args:\n        tag: BeautifulSoup Tag object with access to all HTML attributes\n        text: Pre-processed text content of the tag\n        **kwargs: Additional context passed through from conversion\n\n    Returns:\n        Markdown formatted string\n    \"\"\"\n    pass\n```\n\nCustom converters take precedence over built-in converters and can be used alongside other configuration options.\n\n### Streaming API\n\nFor processing large documents with memory constraints, use the streaming API:\n\n```python\nfrom html_to_markdown import convert_to_markdown_stream\n\n# Process large HTML in chunks\nwith open(\"large_document.html\", \"r\") as f:\n    html_content = f.read()\n\n# Returns a generator that yields markdown chunks\nfor chunk in convert_to_markdown_stream(html_content, chunk_size=2048):\n    print(chunk, end=\"\")\n```\n\nWith progress tracking:\n\n```python\ndef show_progress(processed: int, total: int):\n    if total > 0:\n        percent = (processed / total) * 100\n        print(f\"\\rProgress: {percent:.1f}%\", end=\"\")\n\n# Stream with progress callback\nmarkdown = convert_to_markdown(html_content, stream_processing=True, chunk_size=4096, progress_callback=show_progress)\n```\n\n#### When to Use Streaming vs Regular Processing\n\nBased on comprehensive performance analysis, here are our recommendations:\n\n**\ud83d\udcc4 Use Regular Processing When:**\n\n- Files < 100KB (simplicity preferred)\n- Simple scripts and one-off conversions\n- Memory is not a concern\n- You want the simplest API\n\n**\ud83c\udf0a Use Streaming Processing When:**\n\n- Files > 100KB (memory efficiency)\n- Processing many files in batch\n- Memory is constrained\n- You need progress reporting\n- You want to process results incrementally\n- Running in production environments\n\n**\ud83d\udccb Specific Recommendations by File Size:**\n\n| File Size  | Recommendation                                  | Reason                                 
|\n| ---------- | ----------------------------------------------- | -------------------------------------- |\n| < 50KB     | Regular (simplicity) or Streaming (3-5% faster) | Either works well                      |\n| 50KB-100KB | Either (streaming slightly preferred)           | Minimal difference                     |\n| 100KB-1MB  | Streaming preferred                             | Better performance + memory efficiency |\n| > 1MB      | Streaming strongly recommended                  | Significant memory advantages          |\n\n**\ud83d\udd27 Configuration Recommendations:**\n\n- **Default chunk_size: 2048 bytes** (optimal performance balance)\n- **For very large files (>10MB)**: Consider `chunk_size=4096`\n- **For memory-constrained environments**: Use smaller chunks `chunk_size=1024`\n\n**\ud83d\udcc8 Performance Benefits:**\n\nStreaming provides consistent **3-5% performance improvement** across all file sizes:\n\n- **Streaming throughput**: ~0.47-0.48 MB/s\n- **Regular throughput**: ~0.44-0.47 MB/s\n- **Memory usage**: Streaming uses less peak memory for large files\n- **Latency**: Streaming allows processing results before completion\n\n### Preprocessing API\n\nThe library provides functions for preprocessing HTML before conversion, useful for cleaning messy or complex HTML:\n\n```python\nfrom html_to_markdown import convert_to_markdown, create_preprocessor, preprocess_html\n\n# Direct preprocessing with custom options\ncleaned_html = preprocess_html(\n    raw_html,\n    remove_navigation=True,\n    remove_forms=True,\n    remove_scripts=True,\n    remove_styles=True,\n    remove_comments=True,\n    preserve_semantic_structure=True,\n    preserve_tables=True,\n    preserve_media=True,\n)\nmarkdown = convert_to_markdown(cleaned_html)\n\n# Create a preprocessor configuration from presets\nconfig = create_preprocessor(\n    preset=\"aggressive\",  # or \"minimal\", \"standard\"\n    preserve_tables=False,  # Override preset settings\n)\nmarkdown = convert_to_markdown(html, 
**config)\n```\n\n### Exception Handling\n\nThe library provides specific exception classes for better error handling:\n\n```python\nfrom html_to_markdown import (\n    convert_to_markdown,\n    HtmlToMarkdownError,\n    EmptyHtmlError,\n    InvalidParserError,\n    ConflictingOptionsError,\n    MissingDependencyError\n)\n\ntry:\n    markdown = convert_to_markdown(html, parser='lxml')\nexcept MissingDependencyError:\n    # lxml not installed\n    markdown = convert_to_markdown(html, parser='html.parser')\nexcept EmptyHtmlError:\n    print(\"No HTML content to convert\")\nexcept InvalidParserError as e:\n    print(f\"Parser error: {e}\")\nexcept ConflictingOptionsError as e:\n    print(f\"Conflicting options: {e}\")\nexcept HtmlToMarkdownError as e:\n    print(f\"Conversion error: {e}\")\n```\n\n## CLI Usage\n\nConvert HTML files directly from the command line with full access to all API options:\n\n```shell\n# Convert a file\nhtml_to_markdown input.html > output.md\n\n# Process stdin\ncat input.html | html_to_markdown > output.md\n\n# Use custom options\nhtml_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md\n\n# Discord-compatible lists with HTML preprocessing\nhtml_to_markdown \\\n  --list-indent-width 2 \\\n  --preprocess-html \\\n  --preprocessing-preset aggressive \\\n  input.html > output.md\n```\n\n### Key CLI Options\n\n**Most Common Options:**\n\n```shell\n--list-indent-width WIDTH           # Spaces per indent (default: 4, use 2 for Discord)\n--list-indent-type {spaces,tabs}    # Indentation type (default: spaces)\n--preprocess-html                   # Enable HTML cleaning for web scraping\n--whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)\n--heading-style {atx,atx_closed,underlined} # Header style\n--no-extract-metadata               # Disable metadata extraction\n--br-in-tables                      # Use <br> tags for line breaks in table cells\n--source-encoding ENCODING          # Override 
auto-detected encoding (rarely needed)\n```\n\n**File Encoding:**\n\nThe CLI automatically detects file encoding in most cases. Use `--source-encoding` only when automatic detection fails (typically on some Windows systems or with unusual encodings):\n\n```shell\n# Override auto-detection for Latin-1 encoded file\nhtml_to_markdown --source-encoding latin-1 input.html > output.md\n\n# Force UTF-16 encoding when auto-detection fails\nhtml_to_markdown --source-encoding utf-16 input.html > output.md\n```\n\n**All Available Options:**\nThe CLI supports all Python API parameters. Use `html_to_markdown --help` to see the complete list.\n\n## Migration from Markdownify\n\nFor existing projects using Markdownify, a compatibility layer is provided:\n\n```python\n# Old code\nfrom markdownify import markdownify as md\n\n# New code - works the same way\nfrom html_to_markdown import markdownify as md\n```\n\nThe `markdownify` function is an alias for `convert_to_markdown` and provides identical functionality.\n\n**Note**: While the compatibility layer ensures existing code continues to work, new projects should use `convert_to_markdown` directly as it provides better type hints and clearer naming.\n\n## Configuration Reference\n\n### Most Common Parameters\n\n- `list_indent_width` (int, default: `4`): Number of spaces per indentation level (use 2 for Discord/Slack)\n- `list_indent_type` (str, default: `'spaces'`): Use `'spaces'` or `'tabs'` for list indentation\n- `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)\n- `whitespace_mode` (str, default: `'normalized'`): Whitespace handling (`'normalized'` or `'strict'`)\n- `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML\n- `extract_metadata` (bool, default: `True`): Extract document metadata as comment header\n\n### Text Formatting\n\n- `highlight_style` (str, default: `'double-equal'`): Style for highlighted text (`'double-equal'`, 
`'html'`, `'bold'`)\n- `strong_em_symbol` (str, default: `'*'`): Symbol for strong/emphasized text (`'*'` or `'_'`)\n- `bullets` (str, default: `'*+-'`): Characters to use for bullet points in lists\n- `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)\n- `sub_symbol` (str, default: `''`): Custom symbol for subscript text\n- `sup_symbol` (str, default: `''`): Custom symbol for superscript text\n- `br_in_tables` (bool, default: `False`): Use `<br>` tags for line breaks in table cells instead of spaces\n\n### Parser Options\n\n- `parser` (str, default: auto-detect): BeautifulSoup parser to use (`'lxml'`, `'html.parser'`, `'html5lib'`)\n- `preprocessing_preset` (str, default: `'standard'`): Preprocessing level (`'minimal'`, `'standard'`, `'aggressive'`)\n- `remove_forms` (bool, default: `True`): Remove form elements during preprocessing\n- `remove_navigation` (bool, default: `True`): Remove navigation elements during preprocessing\n\n### Document Processing\n\n- `convert_as_inline` (bool, default: `False`): Treat content as inline elements only\n- `strip_newlines` (bool, default: `False`): Remove newlines from HTML input before processing\n- `convert` (list, default: `None`): List of HTML tags to convert (None = all supported tags)\n- `strip` (list, default: `None`): List of HTML tags to remove from output\n- `custom_converters` (dict, default: `None`): Mapping of HTML tag names to custom converter functions\n\n### Text Escaping\n\n- `escape_asterisks` (bool, default: `True`): Escape `*` characters to prevent unintended formatting\n- `escape_underscores` (bool, default: `True`): Escape `_` characters to prevent unintended formatting\n- `escape_misc` (bool, default: `True`): Escape miscellaneous characters to prevent Markdown conflicts\n\n### Links and Media\n\n- `autolinks` (bool, default: `True`): Automatically convert valid URLs to Markdown links\n- `default_title` (bool, default: `False`): Use default titles for 
elements like links\n- `keep_inline_images_in` (list, default: `None`): Tags where inline images should be preserved\n\n### Code Blocks\n\n- `code_language` (str, default: `''`): Default language identifier for fenced code blocks\n- `code_language_callback` (callable, default: `None`): Function to dynamically determine code block language\n\n### Text Wrapping\n\n- `wrap` (bool, default: `False`): Enable text wrapping\n- `wrap_width` (int, default: `80`): Width for text wrapping\n\n### HTML Processing\n\n- `parser` (str, default: auto-detect): BeautifulSoup parser to use (`'lxml'`, `'html.parser'`, `'html5lib'`)\n- `whitespace_mode` (str, default: `'normalized'`): How to handle whitespace (`'normalized'` intelligently cleans whitespace, `'strict'` preserves original)\n- `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML\n- `preprocessing_preset` (str, default: `'standard'`): Preprocessing aggressiveness (`'minimal'` for basic cleaning, `'standard'` for balanced, `'aggressive'` for heavy cleaning)\n- `remove_forms` (bool, default: `True`): Remove form elements during preprocessing\n- `remove_navigation` (bool, default: `True`): Remove navigation elements during preprocessing\n\n## Contribution\n\nThis library is open to contributions. Feel free to open issues or submit PRs. It's better to discuss issues before\nsubmitting PRs to avoid disappointment.\n\n### Local Development\n\n1. Clone the repo\n\n1. Install system dependencies (requires Python 3.10+)\n\n1. Install the project dependencies:\n\n    ```shell\n    uv sync --all-extras --dev\n    ```\n\n1. Install pre-commit hooks:\n\n    ```shell\n    uv run pre-commit install\n    ```\n\n1. Run tests to ensure everything works:\n\n    ```shell\n    uv run pytest\n    ```\n\n1. Run code quality checks:\n\n    ```shell\n    uv run pre-commit run --all-files\n    ```\n\n1. 
Make your changes and submit a PR\n\n### Development Commands\n\n```shell\n# Run tests with coverage\nuv run pytest --cov=html_to_markdown --cov-report=term-missing\n\n# Lint and format code\nuv run ruff check --fix .\nuv run ruff format .\n\n# Type checking\nuv run mypy\n\n# Test CLI during development\nuv run python -m html_to_markdown input.html\n\n# Build package\nuv build\n```\n\n## License\n\nThis library uses the MIT license.\n\n## HTML5 Element Support\n\nThis library provides comprehensive support for all modern HTML5 elements:\n\n### Semantic Elements\n\n- `<article>`, `<aside>`, `<figcaption>`, `<figure>`, `<footer>`, `<header>`, `<hgroup>`, `<main>`, `<nav>`, `<section>`\n- `<abbr>`, `<bdi>`, `<bdo>`, `<cite>`, `<data>`, `<dfn>`, `<kbd>`, `<mark>`, `<samp>`, `<small>`, `<time>`, `<var>`\n- `<del>`, `<ins>` (strikethrough and insertion tracking)\n\n### Form Elements\n\n- `<form>`, `<fieldset>`, `<legend>`, `<label>`, `<input>`, `<textarea>`, `<select>`, `<option>`, `<optgroup>`\n- `<button>`, `<datalist>`, `<output>`, `<progress>`, `<meter>`\n- Task list support: `<input type=\"checkbox\">` converts to `- [x]` / `- [ ]`\n\n### Table Elements\n\n- `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`\n- **Merged cell support**: Handles `rowspan` and `colspan` attributes for complex table layouts\n- **Smart cleanup**: Automatically handles table styling elements for clean Markdown output\n\n### Interactive Elements\n\n- `<details>`, `<summary>`, `<dialog>`, `<menu>`\n\n### Ruby Annotations\n\n- `<ruby>`, `<rb>`, `<rt>`, `<rtc>`, `<rp>` (for East Asian typography)\n\n### Media Elements\n\n- `<img>`, `<picture>`, `<audio>`, `<video>`, `<iframe>`\n- SVG support with data URI conversion\n\n### Math Elements\n\n- `<math>` (MathML support)\n\n## Acknowledgments\n\nSpecial thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options",
    "version": "1.13.0",
    "project_urls": {
        "Changelog": "https://github.com/Goldziher/html-to-markdown/releases",
        "Homepage": "https://github.com/Goldziher/html-to-markdown",
        "Issues": "https://github.com/Goldziher/html-to-markdown/issues",
        "Repository": "https://github.com/Goldziher/html-to-markdown.git"
    },
    "split_keywords": [
        "beautifulsoup",
        " cli-tool",
        " converter",
        " html",
        " html2markdown",
        " markdown",
        " markup",
        " text-extraction",
        " text-processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "af7393f379959bbf574286c262435a17ddd7922a8db7d3db2265a5f5c96c8ac8",
                "md5": "2a8c3c0f8f2734c106e86f56d111e161",
                "sha256": "99067fcd8ecc1c50953e5e6d1294b640fdec65bd2f7e74a67c342f9618eab234"
            },
            "downloads": -1,
            "filename": "html_to_markdown-1.13.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2a8c3c0f8f2734c106e86f56d111e161",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 35982,
            "upload_time": "2025-09-16T05:35:35",
            "upload_time_iso_8601": "2025-09-16T05:35:35.665426Z",
            "url": "https://files.pythonhosted.org/packages/af/73/93f379959bbf574286c262435a17ddd7922a8db7d3db2265a5f5c96c8ac8/html_to_markdown-1.13.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d9eb321c391a8f52ff470cdf53bf787d8981be9f3a36d3886fe8b195b549f2e0",
                "md5": "1fb20e73c3450387c0c07d7d347bc7fd",
                "sha256": "72c93594ec0b707307eade17ff2852e6cc52feb8d0ceb95f1b3fe6cce78eb48e"
            },
            "downloads": -1,
            "filename": "html_to_markdown-1.13.0.tar.gz",
            "has_sig": false,
            "md5_digest": "1fb20e73c3450387c0c07d7d347bc7fd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 41074,
            "upload_time": "2025-09-16T05:35:37",
            "upload_time_iso_8601": "2025-09-16T05:35:37.386842Z",
            "url": "https://files.pythonhosted.org/packages/d9/eb/321c391a8f52ff470cdf53bf787d8981be9f3a36d3886fe8b195b549f2e0/html_to_markdown-1.13.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-16 05:35:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Goldziher",
    "github_project": "html-to-markdown",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "html-to-markdown"
}
        