# html-to-markdown
A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork
of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for
Python 3.10+.
## Support This Project
If you find html-to-markdown useful, please consider sponsoring the development:
<a href="https://github.com/sponsors/Goldziher"><img src="https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors" alt="Sponsor on GitHub" height="32"></a>
Your support helps maintain and improve this library for the community.
## Features
- **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
- **Table Support**: Advanced handling of complex tables with rowspan/colspan support
- **Type Safety**: Strict MyPy adherence with comprehensive type hints
- **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
- **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
- **Highlight Support**: Multiple styles for highlighted text (`<mark>` elements)
- **Task List Support**: Converts HTML checkboxes to GitHub-compatible task list syntax
- **Flexible Configuration**: Comprehensive configuration options for customizing conversion behavior
- **CLI Tool**: Full-featured command-line interface with complete API parity
- **Custom Converters**: Extensible converter system for custom HTML tag handling
- **List Formatting**: Configurable list indentation with Discord/Slack compatibility
- **HTML Preprocessing**: Clean messy HTML with configurable aggressiveness levels
- **Whitespace Control**: Normalized or strict whitespace preservation modes
- **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
- **Robustly Tested**: Comprehensive unit tests and integration tests covering all conversion scenarios
## Installation
```shell
pip install html-to-markdown
```
### Optional lxml Parser
For improved performance, you can install with the optional lxml parser:
```shell
pip install html-to-markdown[lxml]
```
The lxml parser offers faster HTML parsing and better handling of malformed HTML than Python's built-in `html.parser`.
The library uses lxml automatically when it is installed. To override that choice, pass a parser name through the `parser` parameter.
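If you prefer to make the parser choice explicit in your own code, the check can be sketched with the standard library alone. The helper name `pick_parser` is hypothetical and not part of html-to-markdown; it only mirrors the "use lxml when available" behavior described above.

```python
from importlib.util import find_spec


def pick_parser() -> str:
    """Return "lxml" when the lxml package is importable, else "html.parser"."""
    # find_spec only looks the package up; it does not import it.
    return "lxml" if find_spec("lxml") is not None else "html.parser"


parser_name = pick_parser()
print(parser_name)
```

The result can then be passed along explicitly, for example `convert_to_markdown(html, parser=parser_name)`.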
## Quick Start
Convert HTML to Markdown with a single function call:
```python
from html_to_markdown import convert_to_markdown
html = """
<!DOCTYPE html>
<html>
<head>
<title>Sample Document</title>
<meta name="description" content="A sample HTML document">
</head>
<body>
<article>
<h1>Welcome</h1>
<p>This is a <strong>sample</strong> with a <a href="https://example.com">link</a>.</p>
<p>Here's some <mark>highlighted text</mark> and a task list:</p>
<ul>
<li><input type="checkbox" checked> Completed task</li>
<li><input type="checkbox"> Pending task</li>
</ul>
</article>
</body>
</html>
"""
markdown = convert_to_markdown(html)
print(markdown)
```
Output:
```markdown
<!--
title: Sample Document
meta-description: A sample HTML document
-->
# Welcome
This is a **sample** with a [link](https://example.com).
Here's some ==highlighted text== and a task list:
* [x] Completed task
* [ ] Pending task
```
### Working with BeautifulSoup
If you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:
```python
from bs4 import BeautifulSoup
from html_to_markdown import convert_to_markdown
# Configure BeautifulSoup with your preferred parser
soup = BeautifulSoup(html, "lxml") # Note: lxml requires additional installation
markdown = convert_to_markdown(soup)
```
## Common Use Cases
### Discord/Slack Compatible Lists
Discord and Slack require 2-space indentation for nested lists:
**Python:**
```python
from html_to_markdown import convert_to_markdown
html = "<ul><li>Item 1<ul><li>Nested item</li></ul></li></ul>"
markdown = convert_to_markdown(html, list_indent_width=2)
# Output: * Item 1\n + Nested item
```
**CLI:**
```shell
html_to_markdown --list-indent-width 2 input.html
```
### Cleaning Web-Scraped HTML
Remove navigation, advertisements, and forms from scraped content:
**Python:**
```python
markdown = convert_to_markdown(html, preprocess_html=True, preprocessing_preset="aggressive")
```
**CLI:**
```shell
html_to_markdown --preprocess-html --preprocessing-preset aggressive input.html
```
### Preserving Whitespace for Documentation
Maintain exact whitespace for code documentation or technical content:
**Python:**
```python
markdown = convert_to_markdown(html, whitespace_mode="strict")
```
**CLI:**
```shell
html_to_markdown --whitespace-mode strict input.html
```
### Using Tabs for List Indentation
Some editors and platforms prefer tab-based indentation:
**Python:**
```python
markdown = convert_to_markdown(html, list_indent_type="tabs")
```
**CLI:**
```shell
html_to_markdown --list-indent-type tabs input.html
```
## Advanced Usage
### Configuration Example
```python
from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(
    html,
    # Headers and formatting
    heading_style="atx",
    strong_em_symbol="*",
    bullets="*+-",
    highlight_style="double-equal",
    # List indentation
    list_indent_type="spaces",
    list_indent_width=4,
    # Whitespace handling
    whitespace_mode="normalized",
    # HTML preprocessing
    preprocess_html=True,
    preprocessing_preset="standard",
)
```
### Custom Converters
Custom converters allow you to override the default conversion behavior for any HTML tag. This is particularly useful for customizing header formatting or implementing domain-specific conversion rules.
#### Basic Example: Custom Header Formatting
```python
from bs4.element import Tag
from html_to_markdown import convert_to_markdown


def custom_h1_converter(*, tag: Tag, text: str, **kwargs) -> str:
    """Convert h1 tags with custom formatting."""
    return f"### {text.upper()} ###\n\n"


def custom_h2_converter(*, tag: Tag, text: str, **kwargs) -> str:
    """Convert h2 tags with underline."""
    return f"{text}\n{'=' * len(text)}\n\n"


html = "<h1>Title</h1><h2>Subtitle</h2><p>Content</p>"
markdown = convert_to_markdown(html, custom_converters={"h1": custom_h1_converter, "h2": custom_h2_converter})
print(markdown)
# Output:
# ### TITLE ###
#
# Subtitle
# ========
#
# Content
```
#### Advanced Example: Context-Aware Link Conversion
```python
from bs4.element import Tag
from html_to_markdown import convert_to_markdown


def smart_link_converter(*, tag: Tag, text: str, **kwargs) -> str:
    """Convert links based on their attributes."""
    href = tag.get("href", "")
    title = tag.get("title", "")

    # Handle different link types
    if href.startswith("http"):
        # External link: always emit a title, falling back to a generic one
        return f"[{text}]({href} \"{title or 'External link'}\")"
    elif href.startswith("#"):
        # Anchor link
        return f"[{text}]({href})"
    elif href.startswith("mailto:"):
        # Email link
        return f"[{text}]({href})"
    else:
        # Relative link
        return f"[{text}]({href})"


html = '<a href="https://example.com">External</a> <a href="#section">Anchor</a>'
markdown = convert_to_markdown(html, custom_converters={"a": smart_link_converter})
```
#### Converter Function Signature
All converter functions must follow this signature:
```python
from bs4.element import Tag


def converter(*, tag: Tag, text: str, **kwargs) -> str:
    """
    Args:
        tag: BeautifulSoup Tag object with access to all HTML attributes
        text: Pre-processed text content of the tag
        **kwargs: Additional context passed through from conversion

    Returns:
        Markdown formatted string
    """
    ...
```
Custom converters take precedence over built-in converters and can be used alongside other configuration options.
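As a further illustration of the signature above, here is a minimal sketch of a converter for `<abbr>` elements that appends the expansion from the `title` attribute. The function name `abbr_converter` and the output format are assumptions made for this example, not built-in behavior of the library. The `Tag` import is guarded so the snippet also runs where BeautifulSoup is not installed.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:  # only needed by type checkers; avoids a hard bs4 import here
    from bs4.element import Tag


def abbr_converter(*, tag: "Tag", text: str, **kwargs: Any) -> str:
    """Render <abbr title="...">X</abbr> as "X (expansion)" (hypothetical format)."""
    title = tag.get("title")
    if isinstance(title, str) and title.strip():
        return f"{text} ({title.strip()})"
    # No usable title attribute: keep the already converted text unchanged.
    return text
```

It would be registered like the converters above, for example `convert_to_markdown(html, custom_converters={"abbr": abbr_converter})`.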
### Streaming API
For processing large documents with memory constraints, use the streaming API:
```python
from html_to_markdown import convert_to_markdown_stream

# Read the large HTML document
with open("large_document.html", "r", encoding="utf-8") as f:
    html_content = f.read()

# Returns a generator that yields markdown chunks
for chunk in convert_to_markdown_stream(html_content, chunk_size=2048):
    print(chunk, end="")
```
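Because the streaming function yields chunks, the output can be written to disk as it arrives instead of being collected in one string. The helper below is a sketch under that assumption; `write_chunks` is a hypothetical name and accepts any iterable of strings, so it does not depend on the library itself.

```python
from collections.abc import Iterable
from pathlib import Path


def write_chunks(chunks: Iterable[str], destination: Path) -> int:
    """Write Markdown chunks to destination as they arrive; return characters written."""
    written = 0
    with destination.open("w", encoding="utf-8") as out:
        for chunk in chunks:
            out.write(chunk)
            written += len(chunk)
    return written
```

A possible call would be `write_chunks(convert_to_markdown_stream(html_content, chunk_size=2048), Path("output.md"))`.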
With progress tracking:
```python
def show_progress(processed: int, total: int) -> None:
    if total > 0:
        percent = (processed / total) * 100
        print(f"\rProgress: {percent:.1f}%", end="")


# Stream with progress callback
markdown = convert_to_markdown(
    html_content,
    stream_processing=True,
    chunk_size=4096,
    progress_callback=show_progress,
)
```
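For long-running conversions, printing on every callback can be noisy. The following sketch builds a callback with the same `(processed, total)` signature that reports only once per configurable step. The factory name `make_progress_logger` and its `step_percent` parameter are hypothetical.

```python
from typing import Callable


def make_progress_logger(step_percent: float = 10.0) -> Callable[[int, int], None]:
    """Build a progress callback that prints at most once per step_percent."""
    last_reported = -step_percent  # ensures the very first call is reported

    def on_progress(processed: int, total: int) -> None:
        nonlocal last_reported
        if total <= 0:
            return
        percent = min(100.0, processed / total * 100)
        if percent - last_reported >= step_percent:
            last_reported = percent
            print(f"Progress: {percent:.0f}%")

    return on_progress
```

The returned function can be passed as `progress_callback=make_progress_logger(25)` in the call shown above.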
#### When to Use Streaming vs Regular Processing
Based on comprehensive performance analysis, here are our recommendations:
**📄 Use Regular Processing When:**
- Files < 100KB (simplicity preferred)
- Simple scripts and one-off conversions
- Memory is not a concern
- You want the simplest API
**🌊 Use Streaming Processing When:**
- Files > 100KB (memory efficiency)
- Processing many files in batch
- Memory is constrained
- You need progress reporting
- You want to process results incrementally
- Running in production environments
**📋 Specific Recommendations by File Size:**
| File Size | Recommendation | Reason |
| ---------- | ----------------------------------------------- | -------------------------------------- |
| < 50KB | Regular (simplicity) or Streaming (3-5% faster) | Either works well |
| 50KB-100KB | Either (streaming slightly preferred) | Minimal difference |
| 100KB-1MB | Streaming preferred | Better performance + memory efficiency |
| > 1MB | Streaming strongly recommended | Significant memory advantages |
**🔧 Configuration Recommendations:**
- **Default chunk_size: 2048 bytes** (optimal performance balance)
- **For very large files (>10MB)**: Consider `chunk_size=4096`
- **For memory-constrained environments**: Use smaller chunks `chunk_size=1024`
**📈 Performance Benefits:**
Streaming provides consistent **3-5% performance improvement** across all file sizes:
- **Streaming throughput**: ~0.47-0.48 MB/s
- **Regular throughput**: ~0.44-0.47 MB/s
- **Memory usage**: Streaming uses less peak memory for large files
- **Latency**: Streaming allows processing results before completion
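The recommendations above can be condensed into a small decision helper. This is only a sketch: `recommend_settings` is a hypothetical function, 1 KB is taken as 1024 bytes, and files of exactly 100KB are treated as streaming candidates, which the guidance above leaves open.

```python
KB = 1024
MB = 1024 * KB


def recommend_settings(size_bytes: int, memory_constrained: bool = False) -> dict:
    """Map a file size to the streaming mode and chunk size suggested above."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    # Stream from 100KB upward, or whenever memory is constrained.
    stream = memory_constrained or size_bytes >= 100 * KB
    if memory_constrained:
        chunk_size = 1024  # smaller chunks for memory-constrained environments
    elif size_bytes > 10 * MB:
        chunk_size = 4096  # very large files
    else:
        chunk_size = 2048  # recommended default
    return {"stream_processing": stream, "chunk_size": chunk_size}


print(recommend_settings(500 * KB))
```

The keys mirror the keyword names used in the progress-tracking example, so the size could come from `os.path.getsize(path)` before deciding how to call the converter.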
### Preprocessing API
The library provides functions for preprocessing HTML before conversion, useful for cleaning messy or complex HTML:
```python
from html_to_markdown import convert_to_markdown, create_preprocessor, preprocess_html

# Direct preprocessing with custom options
cleaned_html = preprocess_html(
    raw_html,
    remove_navigation=True,
    remove_forms=True,
    remove_scripts=True,
    remove_styles=True,
    remove_comments=True,
    preserve_semantic_structure=True,
    preserve_tables=True,
    preserve_media=True,
)
markdown = convert_to_markdown(cleaned_html)

# Create a preprocessor configuration from a preset ("minimal", "standard" or "aggressive");
# extra keyword arguments such as preserve_tables override the preset's settings
config = create_preprocessor(preset="aggressive", preserve_tables=False)
markdown = convert_to_markdown(html, **config)
```
### Exception Handling
The library provides specific exception classes for better error handling:
```python
from html_to_markdown import (
    convert_to_markdown,
    HtmlToMarkdownError,
    EmptyHtmlError,
    InvalidParserError,
    ConflictingOptionsError,
    MissingDependencyError,
)

try:
    markdown = convert_to_markdown(html, parser='lxml')
except MissingDependencyError:
    # lxml not installed
    markdown = convert_to_markdown(html, parser='html.parser')
except EmptyHtmlError:
    print("No HTML content to convert")
except InvalidParserError as e:
    print(f"Parser error: {e}")
except ConflictingOptionsError as e:
    print(f"Conflicting options: {e}")
except HtmlToMarkdownError as e:
    print(f"Conversion error: {e}")
```
## CLI Usage
Convert HTML files directly from the command line with full access to all API options:
```shell
# Convert a file
html_to_markdown input.html > output.md
# Process stdin
cat input.html | html_to_markdown > output.md
# Use custom options
html_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md
# Discord-compatible lists with HTML preprocessing
html_to_markdown \
    --list-indent-width 2 \
    --preprocess-html \
    --preprocessing-preset aggressive \
    input.html > output.md
```
### Key CLI Options
**Most Common Options:**
```shell
--list-indent-width WIDTH # Spaces per indent (default: 4, use 2 for Discord)
--list-indent-type {spaces,tabs} # Indentation type (default: spaces)
--preprocess-html # Enable HTML cleaning for web scraping
--whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)
--heading-style {atx,atx_closed,underlined} # Header style
--no-extract-metadata # Disable metadata extraction
--br-in-tables # Use <br> tags for line breaks in table cells
--source-encoding ENCODING # Override auto-detected encoding (rarely needed)
```
**File Encoding:**
The CLI automatically detects file encoding in most cases. Use `--source-encoding` only when automatic detection fails (typically on some Windows systems or with unusual encodings):
```shell
# Override auto-detection for Latin-1 encoded file
html_to_markdown --source-encoding latin-1 input.html > output.md
# Force UTF-16 encoding when auto-detection fails
html_to_markdown --source-encoding utf-16 input.html > output.md
```
**All Available Options:**
The CLI supports all Python API parameters. Use `html_to_markdown --help` to see the complete list.
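When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below only builds the command; `build_cli_command` is a hypothetical helper, and it uses just the flags listed above.

```python
from typing import Optional


def build_cli_command(
    input_path: str,
    *,
    list_indent_width: Optional[int] = None,
    preprocess: bool = False,
    preset: Optional[str] = None,
    whitespace_mode: Optional[str] = None,
) -> list:
    """Assemble an argv list for the html_to_markdown CLI from a few common options."""
    command = ["html_to_markdown"]
    if list_indent_width is not None:
        command += ["--list-indent-width", str(list_indent_width)]
    if preprocess:
        command.append("--preprocess-html")
    if preset is not None:
        command += ["--preprocessing-preset", preset]
    if whitespace_mode is not None:
        command += ["--whitespace-mode", whitespace_mode]
    command.append(input_path)
    return command


print(build_cli_command("input.html", list_indent_width=2, preprocess=True, preset="aggressive"))
```

The resulting list can be handed to `subprocess.run(command, check=True, capture_output=True, text=True)`, whose `stdout` then holds the Markdown that the shell examples redirect to `output.md`.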
## Migration from Markdownify
For existing projects using Markdownify, a compatibility layer is provided:
```python
# Old code
from markdownify import markdownify as md
# New code - works the same way
from html_to_markdown import markdownify as md
```
The `markdownify` function is an alias for `convert_to_markdown` and provides identical functionality.
**Note**: While the compatibility layer ensures existing code continues to work, new projects should use `convert_to_markdown` directly as it provides better type hints and clearer naming.
## Configuration Reference
### Most Common Parameters
- `list_indent_width` (int, default: `4`): Number of spaces per indentation level (use 2 for Discord/Slack)
- `list_indent_type` (str, default: `'spaces'`): Use `'spaces'` or `'tabs'` for list indentation
- `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)
- `whitespace_mode` (str, default: `'normalized'`): Whitespace handling (`'normalized'` or `'strict'`)
- `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML
- `extract_metadata` (bool, default: `True`): Extract document metadata as comment header
### Text Formatting
- `highlight_style` (str, default: `'double-equal'`): Style for highlighted text (`'double-equal'`, `'html'`, `'bold'`)
- `strong_em_symbol` (str, default: `'*'`): Symbol for strong/emphasized text (`'*'` or `'_'`)
- `bullets` (str, default: `'*+-'`): Characters to use for bullet points in lists
- `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
- `sub_symbol` (str, default: `''`): Custom symbol for subscript text
- `sup_symbol` (str, default: `''`): Custom symbol for superscript text
- `br_in_tables` (bool, default: `False`): Use `<br>` tags for line breaks in table cells instead of spaces
### Document Processing
- `convert_as_inline` (bool, default: `False`): Treat content as inline elements only
- `strip_newlines` (bool, default: `False`): Remove newlines from HTML input before processing
- `convert` (list, default: `None`): List of HTML tags to convert (None = all supported tags)
- `strip` (list, default: `None`): List of HTML tags to remove from output
- `custom_converters` (dict, default: `None`): Mapping of HTML tag names to custom converter functions
### Text Escaping
- `escape_asterisks` (bool, default: `True`): Escape `*` characters to prevent unintended formatting
- `escape_underscores` (bool, default: `True`): Escape `_` characters to prevent unintended formatting
- `escape_misc` (bool, default: `True`): Escape miscellaneous characters to prevent Markdown conflicts
### Links and Media
- `autolinks` (bool, default: `True`): Automatically convert valid URLs to Markdown links
- `default_title` (bool, default: `False`): Use default titles for elements like links
- `keep_inline_images_in` (list, default: `None`): Tags where inline images should be preserved
### Code Blocks
- `code_language` (str, default: `''`): Default language identifier for fenced code blocks
- `code_language_callback` (callable, default: `None`): Function to dynamically determine code block language
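A language callback might derive the fence language from a `language-*` class, a common convention for highlighted code. This sketch assumes the callback receives the element being converted and returns a string; that calling convention is an assumption here and should be checked against the library before use. The name `language_from_class` is hypothetical.

```python
def language_from_class(tag) -> str:
    """Return the suffix of the first "language-*" class on the tag, or "" if none."""
    # BeautifulSoup returns multi-valued attributes such as class as a list.
    classes = tag.get("class") or []
    if isinstance(classes, str):
        classes = classes.split()
    for name in classes:
        if name.startswith("language-"):
            return name[len("language-"):]
    return ""
```

Under that assumption it would be passed as `code_language_callback=language_from_class`.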
### Text Wrapping
- `wrap` (bool, default: `False`): Enable text wrapping
- `wrap_width` (int, default: `80`): Width for text wrapping
### HTML Processing
- `parser` (str, default: auto-detect): BeautifulSoup parser to use (`'lxml'`, `'html.parser'`, `'html5lib'`)
- `whitespace_mode` (str, default: `'normalized'`): How to handle whitespace (`'normalized'` intelligently cleans whitespace, `'strict'` preserves original)
- `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML
- `preprocessing_preset` (str, default: `'standard'`): Preprocessing aggressiveness (`'minimal'` for basic cleaning, `'standard'` for balanced, `'aggressive'` for heavy cleaning)
- `remove_forms` (bool, default: `True`): Remove form elements during preprocessing
- `remove_navigation` (bool, default: `True`): Remove navigation elements during preprocessing
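Since several parameters accept only a fixed set of strings, a configuration dictionary can be checked before it is passed on. The sketch below copies the allowed values from the reference above; `ALLOWED_VALUES` and `check_options` are hypothetical names, and the library performs its own validation independently of this.

```python
ALLOWED_VALUES = {
    "heading_style": ("underlined", "atx", "atx_closed"),
    "list_indent_type": ("spaces", "tabs"),
    "whitespace_mode": ("normalized", "strict"),
    "highlight_style": ("double-equal", "html", "bold"),
    "strong_em_symbol": ("*", "_"),
    "newline_style": ("spaces", "backslash"),
    "preprocessing_preset": ("minimal", "standard", "aggressive"),
}


def check_options(options: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        # Options without a fixed value set (for example list_indent_width) are skipped.
        if allowed is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not one of {list(allowed)}")
    return problems


print(check_options({"heading_style": "atx", "list_indent_width": 2}))
```

A checked dictionary can then be unpacked into the call, for example `convert_to_markdown(html, **options)`.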
## Contribution
This library is open to contributions. Feel free to open issues or submit PRs. It's better to discuss an issue before
submitting a PR to avoid disappointment.
### Local Development
1. Clone the repo
1. Install system dependencies (requires Python 3.10+)
1. Install the project dependencies:
```shell
uv sync --all-extras --dev
```
1. Install pre-commit hooks:
```shell
uv run pre-commit install
```
1. Run tests to ensure everything works:
```shell
uv run pytest
```
1. Run code quality checks:
```shell
uv run pre-commit run --all-files
```
1. Make your changes and submit a PR
### Development Commands
```shell
# Run tests with coverage
uv run pytest --cov=html_to_markdown --cov-report=term-missing
# Lint and format code
uv run ruff check --fix .
uv run ruff format .
# Type checking
uv run mypy
# Test CLI during development
uv run python -m html_to_markdown input.html
# Build package
uv build
```
## License
This library uses the MIT license.
## HTML5 Element Support
This library provides comprehensive support for all modern HTML5 elements:
### Semantic Elements
- `<article>`, `<aside>`, `<figcaption>`, `<figure>`, `<footer>`, `<header>`, `<hgroup>`, `<main>`, `<nav>`, `<section>`
- `<abbr>`, `<bdi>`, `<bdo>`, `<cite>`, `<data>`, `<dfn>`, `<kbd>`, `<mark>`, `<samp>`, `<small>`, `<time>`, `<var>`
- `<del>`, `<ins>` (strikethrough and insertion tracking)
### Form Elements
- `<form>`, `<fieldset>`, `<legend>`, `<label>`, `<input>`, `<textarea>`, `<select>`, `<option>`, `<optgroup>`
- `<button>`, `<datalist>`, `<output>`, `<progress>`, `<meter>`
- Task list support: `<input type="checkbox">` converts to `[x]` / `[ ]` task list items (for example `* [x]` with the default bullets)
### Table Elements
- `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`
- **Merged cell support**: Handles `rowspan` and `colspan` attributes for complex table layouts
- **Smart cleanup**: Automatically handles table styling elements for clean Markdown output
### Interactive Elements
- `<details>`, `<summary>`, `<dialog>`, `<menu>`
### Ruby Annotations
- `<ruby>`, `<rb>`, `<rt>`, `<rtc>`, `<rp>` (for East Asian typography)
### Media Elements
- `<img>`, `<picture>`, `<audio>`, `<video>`, `<iframe>`
- SVG support with data URI conversion
### Math Elements
- `<math>` (MathML support)
## Acknowledgments
Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.
Raw data
{
"_id": null,
"home_page": null,
"name": "html-to-markdown",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "beautifulsoup, cli-tool, converter, html, html2markdown, markdown, markup, text-extraction, text-processing",
"author": null,
"author_email": "Na'aman Hirschfeld <nhirschfeld@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/d9/eb/321c391a8f52ff470cdf53bf787d8981be9f3a36d3886fe8b195b549f2e0/html_to_markdown-1.13.0.tar.gz",
"platform": null,
"description": "# html-to-markdown\n\nA modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork\nof [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for\nPython 3.9+.\n\n## Support This Project\n\nIf you find html-to-markdown useful, please consider sponsoring the development:\n\n<a href=\"https://github.com/sponsors/Goldziher\"><img src=\"https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors\" alt=\"Sponsor on GitHub\" height=\"32\"></a>\n\nYour support helps maintain and improve this library for the community.\n\n## Features\n\n- **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements\n- **Table Support**: Advanced handling of complex tables with rowspan/colspan support\n- **Type Safety**: Strict MyPy adherence with comprehensive type hints\n- **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers\n- **Streaming Support**: Memory-efficient processing for large documents with progress callbacks\n- **Highlight Support**: Multiple styles for highlighted text (`<mark>` elements)\n- **Task List Support**: Converts HTML checkboxes to GitHub-compatible task list syntax\n- **Flexible Configuration**: Comprehensive configuration options for customizing conversion behavior\n- **CLI Tool**: Full-featured command-line interface with complete API parity\n- **Custom Converters**: Extensible converter system for custom HTML tag handling\n- **List Formatting**: Configurable list indentation with Discord/Slack compatibility\n- **HTML Preprocessing**: Clean messy HTML with configurable aggressiveness levels\n- **Whitespace Control**: Normalized or strict whitespace preservation modes\n- **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances\n- **Robustly Tested**: 
Comprehensive unit tests and integration tests covering all conversion scenarios\n\n## Installation\n\n```shell\npip install html-to-markdown\n```\n\n### Optional lxml Parser\n\nFor improved performance, you can install with the optional lxml parser:\n\n```shell\npip install html-to-markdown[lxml]\n```\n\nThe lxml parser offers faster HTML parsing and better handling of malformed HTML compared to the default html.parser.\n\nThe library automatically uses lxml when available. You can explicitly specify a parser using the `parser` parameter.\n\n## Quick Start\n\nConvert HTML to Markdown with a single function call:\n\n```python\nfrom html_to_markdown import convert_to_markdown\n\nhtml = \"\"\"\n<!DOCTYPE html>\n<html>\n<head>\n <title>Sample Document</title>\n <meta name=\"description\" content=\"A sample HTML document\">\n</head>\n<body>\n <article>\n <h1>Welcome</h1>\n <p>This is a <strong>sample</strong> with a <a href=\"https://example.com\">link</a>.</p>\n <p>Here's some <mark>highlighted text</mark> and a task list:</p>\n <ul>\n <li><input type=\"checkbox\" checked> Completed task</li>\n <li><input type=\"checkbox\"> Pending task</li>\n </ul>\n </article>\n</body>\n</html>\n\"\"\"\n\nmarkdown = convert_to_markdown(html)\nprint(markdown)\n```\n\nOutput:\n\n```markdown\n<!--\ntitle: Sample Document\nmeta-description: A sample HTML document\n-->\n\n# Welcome\n\nThis is a **sample** with a [link](https://example.com).\n\nHere's some ==highlighted text== and a task list:\n\n* [x] Completed task\n* [ ] Pending task\n```\n\n### Working with BeautifulSoup\n\nIf you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:\n\n```python\nfrom bs4 import BeautifulSoup\nfrom html_to_markdown import convert_to_markdown\n\n# Configure BeautifulSoup with your preferred parser\nsoup = BeautifulSoup(html, \"lxml\") # Note: lxml requires additional installation\nmarkdown = convert_to_markdown(soup)\n```\n\n## Common Use Cases\n\n### Discord/Slack 
Compatible Lists\n\nDiscord and Slack require 2-space indentation for nested lists:\n\n**Python:**\n\n```python\nfrom html_to_markdown import convert_to_markdown\n\nhtml = \"<ul><li>Item 1<ul><li>Nested item</li></ul></li></ul>\"\nmarkdown = convert_to_markdown(html, list_indent_width=2)\n# Output: * Item 1\\n + Nested item\n```\n\n**CLI:**\n\n```shell\nhtml_to_markdown --list-indent-width 2 input.html\n```\n\n### Cleaning Web-Scraped HTML\n\nRemove navigation, advertisements, and forms from scraped content:\n\n**Python:**\n\n```python\nmarkdown = convert_to_markdown(html, preprocess_html=True, preprocessing_preset=\"aggressive\")\n```\n\n**CLI:**\n\n```shell\nhtml_to_markdown --preprocess-html --preprocessing-preset aggressive input.html\n```\n\n### Preserving Whitespace for Documentation\n\nMaintain exact whitespace for code documentation or technical content:\n\n**Python:**\n\n```python\nmarkdown = convert_to_markdown(html, whitespace_mode=\"strict\")\n```\n\n**CLI:**\n\n```shell\nhtml_to_markdown --whitespace-mode strict input.html\n```\n\n### Using Tabs for List Indentation\n\nSome editors and platforms prefer tab-based indentation:\n\n**Python:**\n\n```python\nmarkdown = convert_to_markdown(html, list_indent_type=\"tabs\")\n```\n\n**CLI:**\n\n```shell\nhtml_to_markdown --list-indent-type tabs input.html\n```\n\n## Advanced Usage\n\n### Configuration Example\n\n```python\nfrom html_to_markdown import convert_to_markdown\n\nmarkdown = convert_to_markdown(\n html,\n # Headers and formatting\n heading_style=\"atx\",\n strong_em_symbol=\"*\",\n bullets=\"*+-\",\n highlight_style=\"double-equal\",\n # List indentation\n list_indent_type=\"spaces\",\n list_indent_width=4,\n # Whitespace handling\n whitespace_mode=\"normalized\",\n # HTML preprocessing\n preprocess_html=True,\n preprocessing_preset=\"standard\",\n)\n```\n\n### Custom Converters\n\nCustom converters allow you to override the default conversion behavior for any HTML tag. 
This is particularly useful for customizing header formatting or implementing domain-specific conversion rules.\n\n#### Basic Example: Custom Header Formatting\n\n```python\nfrom bs4.element import Tag\nfrom html_to_markdown import convert_to_markdown\n\ndef custom_h1_converter(*, tag: Tag, text: str, **kwargs) -> str:\n \"\"\"Convert h1 tags with custom formatting.\"\"\"\n return f\"### {text.upper()} ###\\n\\n\"\n\ndef custom_h2_converter(*, tag: Tag, text: str, **kwargs) -> str:\n \"\"\"Convert h2 tags with underline.\"\"\"\n return f\"{text}\\n{'=' * len(text)}\\n\\n\"\n\nhtml = \"<h1>Title</h1><h2>Subtitle</h2><p>Content</p>\"\nmarkdown = convert_to_markdown(html, custom_converters={\"h1\": custom_h1_converter, \"h2\": custom_h2_converter})\nprint(markdown)\n# Output:\n# ### TITLE ###\n#\n# Subtitle\n# ========\n#\n# Content\n```\n\n#### Advanced Example: Context-Aware Link Conversion\n\n```python\ndef smart_link_converter(*, tag: Tag, text: str, **kwargs) -> str:\n \"\"\"Convert links based on their attributes.\"\"\"\n href = tag.get(\"href\", \"\")\n title = tag.get(\"title\", \"\")\n\n # Handle different link types\n if href.startswith(\"http\"):\n # External link\n return f\"[{text}]({href} \\\"{title or 'External link'}\\\")\"\n elif href.startswith(\"#\"):\n # Anchor link\n return f\"[{text}]({href})\"\n elif href.startswith(\"mailto:\"):\n # Email link\n return f\"[{text}]({href})\"\n else:\n # Relative link\n return f\"[{text}]({href})\"\n\nhtml = '<a href=\"https://example.com\">External</a> <a href=\"#section\">Anchor</a>'\nmarkdown = convert_to_markdown(html, custom_converters={\"a\": smart_link_converter})\n```\n\n#### Converter Function Signature\n\nAll converter functions must follow this signature:\n\n```python\ndef converter(*, tag: Tag, text: str, **kwargs) -> str:\n \"\"\"\n Args:\n tag: BeautifulSoup Tag object with access to all HTML attributes\n text: Pre-processed text content of the tag\n **kwargs: Additional context passed through from 
conversion\n\n Returns:\n Markdown formatted string\n \"\"\"\n pass\n```\n\nCustom converters take precedence over built-in converters and can be used alongside other configuration options.\n\n### Streaming API\n\nFor processing large documents with memory constraints, use the streaming API:\n\n```python\nfrom html_to_markdown import convert_to_markdown_stream\n\n# Process large HTML in chunks\nwith open(\"large_document.html\", \"r\") as f:\n html_content = f.read()\n\n# Returns a generator that yields markdown chunks\nfor chunk in convert_to_markdown_stream(html_content, chunk_size=2048):\n print(chunk, end=\"\")\n```\n\nWith progress tracking:\n\n```python\ndef show_progress(processed: int, total: int):\n if total > 0:\n percent = (processed / total) * 100\n print(f\"\\rProgress: {percent:.1f}%\", end=\"\")\n\n# Stream with progress callback\nmarkdown = convert_to_markdown(html_content, stream_processing=True, chunk_size=4096, progress_callback=show_progress)\n```\n\n#### When to Use Streaming vs Regular Processing\n\nBased on comprehensive performance analysis, here are our recommendations:\n\n**\ud83d\udcc4 Use Regular Processing When:**\n\n- Files < 100KB (simplicity preferred)\n- Simple scripts and one-off conversions\n- Memory is not a concern\n- You want the simplest API\n\n**\ud83c\udf0a Use Streaming Processing When:**\n\n- Files > 100KB (memory efficiency)\n- Processing many files in batch\n- Memory is constrained\n- You need progress reporting\n- You want to process results incrementally\n- Running in production environments\n\n**\ud83d\udccb Specific Recommendations by File Size:**\n\n| File Size | Recommendation | Reason |\n| ---------- | ----------------------------------------------- | -------------------------------------- |\n| < 50KB | Regular (simplicity) or Streaming (3-5% faster) | Either works well |\n| 50KB-100KB | Either (streaming slightly preferred) | Minimal difference |\n| 100KB-1MB | Streaming preferred | Better performance + memory 
efficiency |\n| > 1MB | Streaming strongly recommended | Significant memory advantages |\n\n**\ud83d\udd27 Configuration Recommendations:**\n\n- **Default chunk_size: 2048 bytes** (optimal performance balance)\n- **For very large files (>10MB)**: Consider `chunk_size=4096`\n- **For memory-constrained environments**: Use smaller chunks `chunk_size=1024`\n\n**\ud83d\udcc8 Performance Benefits:**\n\nStreaming provides consistent **3-5% performance improvement** across all file sizes:\n\n- **Streaming throughput**: ~0.47-0.48 MB/s\n- **Regular throughput**: ~0.44-0.47 MB/s\n- **Memory usage**: Streaming uses less peak memory for large files\n- **Latency**: Streaming allows processing results before completion\n\n### Preprocessing API\n\nThe library provides functions for preprocessing HTML before conversion, useful for cleaning messy or complex HTML:\n\n```python\nfrom html_to_markdown import preprocess_html, create_preprocessor\n\n# Direct preprocessing with custom options\ncleaned_html = preprocess_html(\n raw_html,\n remove_navigation=True,\n remove_forms=True,\n remove_scripts=True,\n remove_styles=True,\n remove_comments=True,\n preserve_semantic_structure=True,\n preserve_tables=True,\n preserve_media=True,\n)\nmarkdown = convert_to_markdown(cleaned_html)\n\n# Create a preprocessor configuration from presets\nconfig = create_preprocessor(preset=\"aggressive\", preserve_tables=False) # or \"minimal\", \"standard\" # Override preset settings\nmarkdown = convert_to_markdown(html, **config)\n```\n\n### Exception Handling\n\nThe library provides specific exception classes for better error handling:\n\n````python\nfrom html_to_markdown import (\n convert_to_markdown,\n HtmlToMarkdownError,\n EmptyHtmlError,\n InvalidParserError,\n ConflictingOptionsError,\n MissingDependencyError\n)\n\ntry:\n markdown = convert_to_markdown(html, parser='lxml')\nexcept MissingDependencyError:\n # lxml not installed\n markdown = convert_to_markdown(html, parser='html.parser')\nexcept 
EmptyHtmlError:\n print(\"No HTML content to convert\")\nexcept InvalidParserError as e:\n print(f\"Parser error: {e}\")\nexcept ConflictingOptionsError as e:\n print(f\"Conflicting options: {e}\")\nexcept HtmlToMarkdownError as e:\n print(f\"Conversion error: {e}\")\n````\n\n## CLI Usage\n\nConvert HTML files directly from the command line with full access to all API options:\n\n```shell\n# Convert a file\nhtml_to_markdown input.html > output.md\n\n# Process stdin\ncat input.html | html_to_markdown > output.md\n\n# Use custom options\nhtml_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md\n\n# Discord-compatible lists with HTML preprocessing\nhtml_to_markdown \\\n --list-indent-width 2 \\\n --preprocess-html \\\n --preprocessing-preset aggressive \\\n input.html > output.md\n```\n\n### Key CLI Options\n\n**Most Common Options:**\n\n```shell\n--list-indent-width WIDTH # Spaces per indent (default: 4, use 2 for Discord)\n--list-indent-type {spaces,tabs} # Indentation type (default: spaces)\n--preprocess-html # Enable HTML cleaning for web scraping\n--whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)\n--heading-style {atx,atx_closed,underlined} # Header style\n--no-extract-metadata # Disable metadata extraction\n--br-in-tables # Use <br> tags for line breaks in table cells\n--source-encoding ENCODING # Override auto-detected encoding (rarely needed)\n```\n\n**File Encoding:**\n\nThe CLI automatically detects file encoding in most cases. Use `--source-encoding` only when automatic detection fails (typically on some Windows systems or with unusual encodings):\n\n```shell\n# Override auto-detection for a Latin-1 encoded file\nhtml_to_markdown --source-encoding latin-1 input.html > output.md\n\n# Force UTF-16 encoding when auto-detection fails\nhtml_to_markdown --source-encoding utf-16 input.html > output.md\n```\n\n**All Available Options:**\nThe CLI supports all Python API parameters. 
Use `html_to_markdown --help` to see the complete list.\n\n## Migration from Markdownify\n\nFor existing projects using Markdownify, a compatibility layer is provided:\n\n```python\n# Old code\nfrom markdownify import markdownify as md\n\n# New code - works the same way\nfrom html_to_markdown import markdownify as md\n```\n\nThe `markdownify` function is an alias for `convert_to_markdown` and provides identical functionality.\n\n**Note**: While the compatibility layer ensures existing code continues to work, new projects should use `convert_to_markdown` directly as it provides better type hints and clearer naming.\n\n## Configuration Reference\n\n### Most Common Parameters\n\n- `list_indent_width` (int, default: `4`): Number of spaces per indentation level (use 2 for Discord/Slack)\n- `list_indent_type` (str, default: `'spaces'`): Use `'spaces'` or `'tabs'` for list indentation\n- `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)\n- `whitespace_mode` (str, default: `'normalized'`): Whitespace handling (`'normalized'` or `'strict'`)\n- `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML\n- `extract_metadata` (bool, default: `True`): Extract document metadata as comment header\n\n### Text Formatting\n\n- `highlight_style` (str, default: `'double-equal'`): Style for highlighted text (`'double-equal'`, `'html'`, `'bold'`)\n- `strong_em_symbol` (str, default: `'*'`): Symbol for strong/emphasized text (`'*'` or `'_'`)\n- `bullets` (str, default: `'*+-'`): Characters to use for bullet points in lists\n- `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)\n- `sub_symbol` (str, default: `''`): Custom symbol for subscript text\n- `sup_symbol` (str, default: `''`): Custom symbol for superscript text\n- `br_in_tables` (bool, default: `False`): Use `<br>` tags for line breaks in table cells instead of spaces\n\n### Parser Options\n\n- 
`parser` (str, default: auto-detect): BeautifulSoup parser to use (`'lxml'`, `'html.parser'`, `'html5lib'`)\n- `preprocessing_preset` (str, default: `'standard'`): Preprocessing level (`'minimal'`, `'standard'`, `'aggressive'`)\n- `remove_forms` (bool, default: `True`): Remove form elements during preprocessing\n- `remove_navigation` (bool, default: `True`): Remove navigation elements during preprocessing\n\n### Document Processing\n\n- `convert_as_inline` (bool, default: `False`): Treat content as inline elements only\n- `strip_newlines` (bool, default: `False`): Remove newlines from HTML input before processing\n- `convert` (list, default: `None`): List of HTML tags to convert (None = all supported tags)\n- `strip` (list, default: `None`): List of HTML tags to remove from output\n- `custom_converters` (dict, default: `None`): Mapping of HTML tag names to custom converter functions\n\n### Text Escaping\n\n- `escape_asterisks` (bool, default: `True`): Escape `*` characters to prevent unintended formatting\n- `escape_underscores` (bool, default: `True`): Escape `_` characters to prevent unintended formatting\n- `escape_misc` (bool, default: `True`): Escape miscellaneous characters to prevent Markdown conflicts\n\n### Links and Media\n\n- `autolinks` (bool, default: `True`): Automatically convert valid URLs to Markdown links\n- `default_title` (bool, default: `False`): Use default titles for elements like links\n- `keep_inline_images_in` (list, default: `None`): Tags where inline images should be preserved\n\n### Code Blocks\n\n- `code_language` (str, default: `''`): Default language identifier for fenced code blocks\n- `code_language_callback` (callable, default: `None`): Function to dynamically determine code block language\n\n### Text Wrapping\n\n- `wrap` (bool, default: `False`): Enable text wrapping\n- `wrap_width` (int, default: `80`): Width for text wrapping\n\n### HTML Processing\n\n- `parser` (str, default: auto-detect): BeautifulSoup parser to use 
(`'lxml'`, `'html.parser'`, `'html5lib'`)\n- `whitespace_mode` (str, default: `'normalized'`): How to handle whitespace (`'normalized'` intelligently cleans whitespace, `'strict'` preserves original)\n- `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML\n- `preprocessing_preset` (str, default: `'standard'`): Preprocessing aggressiveness (`'minimal'` for basic cleaning, `'standard'` for balanced, `'aggressive'` for heavy cleaning)\n- `remove_forms` (bool, default: `True`): Remove form elements during preprocessing\n- `remove_navigation` (bool, default: `True`): Remove navigation elements during preprocessing\n\n## Contribution\n\nThis library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before\nsubmitting PRs to avoid disappointment.\n\n### Local Development\n\n1. Clone the repo\n\n1. Install system dependencies (requires Python 3.9+)\n\n1. Install the project dependencies:\n\n ```shell\n uv sync --all-extras --dev\n ```\n\n1. Install pre-commit hooks:\n\n ```shell\n uv run pre-commit install\n ```\n\n1. Run tests to ensure everything works:\n\n ```shell\n uv run pytest\n ```\n\n1. Run code quality checks:\n\n ```shell\n uv run pre-commit run --all-files\n ```\n\n1. 
Make your changes and submit a PR\n\n### Development Commands\n\n```shell\n# Run tests with coverage\nuv run pytest --cov=html_to_markdown --cov-report=term-missing\n\n# Lint and format code\nuv run ruff check --fix .\nuv run ruff format .\n\n# Type checking\nuv run mypy\n\n# Test CLI during development\nuv run python -m html_to_markdown input.html\n\n# Build package\nuv build\n```\n\n## License\n\nThis library uses the MIT license.\n\n## HTML5 Element Support\n\nThis library provides comprehensive support for all modern HTML5 elements:\n\n### Semantic Elements\n\n- `<article>`, `<aside>`, `<figcaption>`, `<figure>`, `<footer>`, `<header>`, `<hgroup>`, `<main>`, `<nav>`, `<section>`\n- `<abbr>`, `<bdi>`, `<bdo>`, `<cite>`, `<data>`, `<dfn>`, `<kbd>`, `<mark>`, `<samp>`, `<small>`, `<time>`, `<var>`\n- `<del>`, `<ins>` (strikethrough and insertion tracking)\n\n### Form Elements\n\n- `<form>`, `<fieldset>`, `<legend>`, `<label>`, `<input>`, `<textarea>`, `<select>`, `<option>`, `<optgroup>`\n- `<button>`, `<datalist>`, `<output>`, `<progress>`, `<meter>`\n- Task list support: `<input type=\"checkbox\">` converts to `- [x]` / `- [ ]`\n\n### Table Elements\n\n- `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`\n- **Merged cell support**: Handles `rowspan` and `colspan` attributes for complex table layouts\n- **Smart cleanup**: Automatically handles table styling elements for clean Markdown output\n\n### Interactive Elements\n\n- `<details>`, `<summary>`, `<dialog>`, `<menu>`\n\n### Ruby Annotations\n\n- `<ruby>`, `<rb>`, `<rt>`, `<rtc>`, `<rp>` (for East Asian typography)\n\n### Media Elements\n\n- `<img>`, `<picture>`, `<audio>`, `<video>`, `<iframe>`\n- SVG support with data URI conversion\n\n### Math Elements\n\n- `<math>` (MathML support)\n\n## Acknowledgments\n\nSpecial thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options",
"version": "1.13.0",
"project_urls": {
"Changelog": "https://github.com/Goldziher/html-to-markdown/releases",
"Homepage": "https://github.com/Goldziher/html-to-markdown",
"Issues": "https://github.com/Goldziher/html-to-markdown/issues",
"Repository": "https://github.com/Goldziher/html-to-markdown.git"
},
"split_keywords": [
"beautifulsoup",
" cli-tool",
" converter",
" html",
" html2markdown",
" markdown",
" markup",
" text-extraction",
" text-processing"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "af7393f379959bbf574286c262435a17ddd7922a8db7d3db2265a5f5c96c8ac8",
"md5": "2a8c3c0f8f2734c106e86f56d111e161",
"sha256": "99067fcd8ecc1c50953e5e6d1294b640fdec65bd2f7e74a67c342f9618eab234"
},
"downloads": -1,
"filename": "html_to_markdown-1.13.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2a8c3c0f8f2734c106e86f56d111e161",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 35982,
"upload_time": "2025-09-16T05:35:35",
"upload_time_iso_8601": "2025-09-16T05:35:35.665426Z",
"url": "https://files.pythonhosted.org/packages/af/73/93f379959bbf574286c262435a17ddd7922a8db7d3db2265a5f5c96c8ac8/html_to_markdown-1.13.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "d9eb321c391a8f52ff470cdf53bf787d8981be9f3a36d3886fe8b195b549f2e0",
"md5": "1fb20e73c3450387c0c07d7d347bc7fd",
"sha256": "72c93594ec0b707307eade17ff2852e6cc52feb8d0ceb95f1b3fe6cce78eb48e"
},
"downloads": -1,
"filename": "html_to_markdown-1.13.0.tar.gz",
"has_sig": false,
"md5_digest": "1fb20e73c3450387c0c07d7d347bc7fd",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 41074,
"upload_time": "2025-09-16T05:35:37",
"upload_time_iso_8601": "2025-09-16T05:35:37.386842Z",
"url": "https://files.pythonhosted.org/packages/d9/eb/321c391a8f52ff470cdf53bf787d8981be9f3a36d3886fe8b195b549f2e0/html_to_markdown-1.13.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-16 05:35:37",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Goldziher",
"github_project": "html-to-markdown",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "html-to-markdown"
}