# html-to-markdown
A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork
of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety, and support for
Python 3.10+.
## Features
- **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
- **Enhanced Table Support**: Advanced handling of merged cells with rowspan/colspan support for better table representation
- **Type Safety**: Strict MyPy adherence with comprehensive type hints
- **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
- **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
- **Highlight Support**: Multiple styles for highlighted text (`<mark>` elements)
- **Task List Support**: Converts HTML checkboxes to GitHub-compatible task list syntax
- **Flexible Configuration**: 20+ configuration options for customizing conversion behavior
- **CLI Tool**: Full-featured command-line interface with all API options exposed
- **Custom Converters**: Extensible converter system for custom HTML tag handling
- **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
- **Comprehensive Test Coverage**: 91%+ test coverage with 623+ comprehensive tests
## Installation
```shell
pip install html-to-markdown
```
### Optional lxml Parser
For improved performance, you can install with the optional lxml parser:
```shell
pip install html-to-markdown[lxml]
```
The lxml parser offers:
- **~30% faster HTML parsing** compared to the default html.parser
- Better handling of malformed HTML
- More robust parsing for complex documents
Once installed, lxml is automatically used by default for better performance. You can explicitly specify a parser if needed:
```python
from html_to_markdown import convert_to_markdown

html = "<p>Hello, <strong>world</strong>!</p>"
result = convert_to_markdown(html)  # Auto-detects: uses lxml if available, otherwise html.parser
result = convert_to_markdown(html, parser="lxml")  # Force lxml (requires installation)
result = convert_to_markdown(html, parser="html.parser")  # Force built-in parser
```
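The auto-detection described above can be pictured with a small standard-library sketch. This is not the library's actual logic; `pick_parser` is a hypothetical helper that only illustrates the "lxml if importable, otherwise html.parser" rule.

```python
from importlib.util import find_spec


def pick_parser(preferred=None):
    """Return a BeautifulSoup parser name, preferring lxml when it is importable."""
    if preferred is not None:
        return preferred  # an explicit choice always wins
    # find_spec returns None when the module cannot be located
    return "lxml" if find_spec("lxml") is not None else "html.parser"


print(pick_parser())               # depends on whether lxml is installed
print(pick_parser("html.parser"))  # explicit choice is returned unchanged
```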
## Quick Start
Convert HTML to Markdown with a single function call:
```python
from html_to_markdown import convert_to_markdown
html = """
<!DOCTYPE html>
<html>
<head>
<title>Sample Document</title>
<meta name="description" content="A sample HTML document">
</head>
<body>
<article>
<h1>Welcome</h1>
<p>This is a <strong>sample</strong> with a <a href="https://example.com">link</a>.</p>
<p>Here's some <mark>highlighted text</mark> and a task list:</p>
<ul>
<li><input type="checkbox" checked> Completed task</li>
<li><input type="checkbox"> Pending task</li>
</ul>
</article>
</body>
</html>
"""
markdown = convert_to_markdown(html)
print(markdown)
```
Output:
```markdown
<!--
title: Sample Document
meta-description: A sample HTML document
-->
# Welcome
This is a **sample** with a [link](https://example.com).
Here's some ==highlighted text== and a task list:
* [x] Completed task
* [ ] Pending task
```
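If downstream code needs the extracted metadata back as data, the comment header shown above is simple enough to read with plain string handling. The sketch below assumes the header always has the `key: value` shape from the sample output; `parse_metadata_header` is a hypothetical helper, not part of the library.

```python
def parse_metadata_header(markdown: str) -> dict[str, str]:
    """Read `key: value` lines from a leading <!-- ... --> comment block."""
    lines = markdown.lstrip().splitlines()
    if not lines or lines[0].strip() != "<!--":
        return {}  # no metadata header present
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "-->":
            break  # end of the comment block
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata


sample = """<!--
title: Sample Document
meta-description: A sample HTML document
-->

# Welcome
"""
print(parse_metadata_header(sample))
```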
### Working with BeautifulSoup
If you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:
```python
from bs4 import BeautifulSoup
from html_to_markdown import convert_to_markdown
html = "<p>Hello, <em>world</em>!</p>"
# Configure BeautifulSoup with your preferred parser
soup = BeautifulSoup(html, "lxml") # Note: lxml requires additional installation
markdown = convert_to_markdown(soup)
```
## Advanced Usage
### Customizing Conversion Options
The library offers extensive customization through various options:
```python
from html_to_markdown import convert_to_markdown
html = "<div>Your content here...</div>"
markdown = convert_to_markdown(
    html,
    # Document processing
    extract_metadata=True,  # Extract metadata as comment header
    convert_as_inline=False,  # Treat as block-level content
    strip_newlines=False,  # Preserve original newlines
    # Formatting options
    heading_style="atx",  # Use # style headers
    strong_em_symbol="*",  # Use * for bold/italic
    bullets="*+-",  # Define bullet point characters
    highlight_style="double-equal",  # Use == for highlighted text
    # Text processing
    wrap=True,  # Enable text wrapping
    wrap_width=100,  # Set wrap width
    escape_asterisks=True,  # Escape * characters
    escape_underscores=True,  # Escape _ characters
    escape_misc=True,  # Escape other special characters
    # Code blocks
    code_language="python",  # Default code block language
    # Streaming for large documents
    stream_processing=False,  # Enable for memory efficiency
    chunk_size=1024,  # Chunk size for streaming
)
```
### Custom Converters
You can provide your own conversion functions for specific HTML tags:
```python
from bs4.element import Tag
from html_to_markdown import convert_to_markdown
# Define a custom converter for the <b> tag
def custom_bold_converter(*, tag: Tag, text: str, **kwargs) -> str:
    return f"IMPORTANT: {text}"
html = "<p>This is a <b>bold statement</b>.</p>"
markdown = convert_to_markdown(html, custom_converters={"b": custom_bold_converter})
print(markdown)
# Output: This is a IMPORTANT: bold statement.
```
Custom converters take precedence over the built-in converters and can be used alongside other configuration options.
### Enhanced Table Support
The library now provides better handling of complex tables with merged cells:
```python
from html_to_markdown import convert_to_markdown
# HTML table with merged cells
html = """
<table>
<tr>
<th rowspan="2">Category</th>
<th colspan="2">Sales Data</th>
</tr>
<tr>
<th>Q1</th>
<th>Q2</th>
</tr>
<tr>
<td>Product A</td>
<td>$100K</td>
<td>$150K</td>
</tr>
</table>
"""
markdown = convert_to_markdown(html)
print(markdown)
```
Output:
```markdown
| Category | Sales Data | |
| --- | --- | --- |
| | Q1 | Q2 |
| Product A | $100K | $150K |
```
The library handles:
- **Rowspan**: Inserts empty cells in subsequent rows
- **Colspan**: Properly manages column spanning
- **Clean output**: Removes `<colgroup>` and `<col>` elements that have no Markdown equivalent
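To make the rowspan and colspan behavior concrete, here is a minimal sketch of one way merged cells can be expanded into a rectangular grid. It is an illustration of the general technique, not the library's implementation; `expand_spans` and the `(text, rowspan, colspan)` tuple format are assumptions made for this example.

```python
def expand_spans(rows):
    """Expand (text, rowspan, colspan) cells into a rectangular grid of strings."""
    grid = []
    pending = {}  # column index -> number of upcoming rows still covered by a rowspan
    for row in rows:
        out = []
        cells = iter(row)
        col = 0
        while True:
            if pending.get(col, 0) > 0:
                # This column is covered by a rowspan from an earlier row.
                pending[col] -= 1
                out.append("")
                col += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break  # no more cells in this row
            text, rowspan, colspan = cell
            for offset in range(colspan):
                # Only the first column of a colspan carries the text.
                out.append(text if offset == 0 else "")
                if rowspan > 1:
                    pending[col + offset] = rowspan - 1
            col += colspan
        grid.append(out)
    return grid


table = [
    [("Category", 2, 1), ("Sales Data", 1, 2)],
    [("Q1", 1, 1), ("Q2", 1, 1)],
    [("Product A", 1, 1), ("$100K", 1, 1), ("$150K", 1, 1)],
]
for line in expand_spans(table):
    print(line)
```

The resulting grid has the same cell layout as the Markdown table above: the spanned positions become empty cells.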
### Key Configuration Options
| Option | Type | Default | Description |
| ------------------- | ---- | ---------------- | --------------------------------------------------------------- |
| `extract_metadata` | bool | `True` | Extract document metadata as comment header |
| `convert_as_inline` | bool | `False` | Treat content as inline elements only |
| `heading_style` | str | `'underlined'` | Header style (`'underlined'`, `'atx'`, `'atx_closed'`) |
| `highlight_style` | str | `'double-equal'` | Highlight style (`'double-equal'`, `'html'`, `'bold'`) |
| `stream_processing` | bool | `False` | Enable streaming for large documents |
| `parser` | str | auto-detect | BeautifulSoup parser (auto-detects `'lxml'` or `'html.parser'`) |
| `autolinks` | bool | `True` | Auto-convert URLs to Markdown links |
| `bullets` | str | `'*+-'` | Characters to use for bullet points |
| `escape_asterisks` | bool | `True` | Escape * characters |
| `wrap` | bool | `False` | Enable text wrapping |
| `wrap_width` | int | `80` | Text wrap width |
For a complete list of all 20+ options, see the [Configuration Reference](#configuration-reference) section below.
## CLI Usage
Convert HTML files directly from the command line with full access to all API options:
```shell
# Convert a file
html_to_markdown input.html > output.md
# Process stdin
cat input.html | html_to_markdown > output.md
# Use custom options
html_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md
# Advanced options
html_to_markdown \
--no-extract-metadata \
--convert-as-inline \
--highlight-style html \
--stream-processing \
--show-progress \
input.html > output.md
```
### Key CLI Options
```shell
# Content processing
--convert-as-inline # Treat content as inline elements
--no-extract-metadata # Disable metadata extraction
--strip-newlines # Remove newlines from input
# Formatting
--heading-style {atx,atx_closed,underlined}
--highlight-style {double-equal,html,bold}
--strong-em-symbol {*,_}
--bullets CHARS # e.g., "*+-"
# Text escaping
--no-escape-asterisks # Disable * escaping
--no-escape-underscores # Disable _ escaping
--no-escape-misc # Disable misc character escaping
# Large document processing
--stream-processing # Enable streaming mode
--chunk-size SIZE # Set chunk size (default: 1024)
--show-progress # Show progress for large files
# Text wrapping
--wrap # Enable text wrapping
--wrap-width WIDTH # Set wrap width (default: 80)
```
View all available options:
```shell
html_to_markdown --help
```
## Migration from Markdownify
For existing projects using Markdownify, a compatibility layer is provided:
```python
# Old code
from markdownify import markdownify as md
# New code - works the same way
from html_to_markdown import markdownify as md
```
The `markdownify` function is an alias for `convert_to_markdown` and provides identical functionality.
**Note**: While the compatibility layer ensures existing code continues to work, new projects should use `convert_to_markdown` directly as it provides better type hints and clearer naming.
## Configuration Reference
Complete list of all configuration options:
### Document Processing
- `extract_metadata` (bool, default: `True`): Extract document metadata (title, meta tags) as comment header
- `convert_as_inline` (bool, default: `False`): Treat content as inline elements only (no block elements)
- `strip_newlines` (bool, default: `False`): Remove newlines from HTML input before processing
- `convert` (list, default: `None`): List of HTML tags to convert (None = all supported tags)
- `strip` (list, default: `None`): List of HTML tags to remove from output
- `custom_converters` (dict, default: `None`): Mapping of HTML tag names to custom converter functions
### Streaming Support
- `stream_processing` (bool, default: `False`): Enable streaming processing for large documents
- `chunk_size` (int, default: `1024`): Size of chunks when using streaming processing
- `chunk_callback` (callable, default: `None`): Callback function called with each processed chunk
- `progress_callback` (callable, default: `None`): Callback function called with (processed_bytes, total_bytes)
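The streaming options above can be combined roughly as follows. This is a sketch under the assumption that `progress_callback` is invoked with `(processed_bytes, total_bytes)` as listed; `make_progress_reporter` and `convert_with_progress` are hypothetical helper names.

```python
def make_progress_reporter(log):
    """Build a callback that records progress as whole percentages."""

    def report(processed_bytes, total_bytes):
        # Guard against a zero total so an empty document reports 100%.
        percent = 100 * processed_bytes // total_bytes if total_bytes else 100
        log.append(percent)

    return report


def convert_with_progress(html):
    """Convert with streaming enabled and return (markdown, recorded percentages)."""
    # Imported lazily so the reporter above is usable without the library installed.
    from html_to_markdown import convert_to_markdown

    progress = []
    markdown = convert_to_markdown(
        html,
        stream_processing=True,
        chunk_size=1024,
        progress_callback=make_progress_reporter(progress),
    )
    return markdown, progress
```

Calling `convert_with_progress(large_html)` would then return the Markdown together with the list of percentages seen during processing.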
### Text Formatting
- `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)
- `highlight_style` (str, default: `'double-equal'`): Style for highlighted text (`'double-equal'`, `'html'`, `'bold'`)
- `strong_em_symbol` (str, default: `'*'`): Symbol for strong/emphasized text (`'*'` or `'_'`)
- `bullets` (str, default: `'*+-'`): Characters to use for bullet points in lists
- `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
- `sub_symbol` (str, default: `''`): Custom symbol for subscript text
- `sup_symbol` (str, default: `''`): Custom symbol for superscript text
### Text Escaping
- `escape_asterisks` (bool, default: `True`): Escape `*` characters to prevent unintended formatting
- `escape_underscores` (bool, default: `True`): Escape `_` characters to prevent unintended formatting
- `escape_misc` (bool, default: `True`): Escape miscellaneous characters to prevent Markdown conflicts
### Links and Media
- `autolinks` (bool, default: `True`): Automatically convert valid URLs to Markdown links
- `default_title` (bool, default: `False`): Use default titles for elements like links
- `keep_inline_images_in` (list, default: `None`): Tags where inline images should be preserved
### Code Blocks
- `code_language` (str, default: `''`): Default language identifier for fenced code blocks
- `code_language_callback` (callable, default: `None`): Function to dynamically determine code block language
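As an illustration of `code_language_callback`, the sketch below derives the language from a `language-*` CSS class. It assumes the callback receives the BeautifulSoup tag of the code block and returns the language string, which should be checked against the library's type hints; `language_from_class` is a hypothetical name.

```python
def language_from_class(tag):
    """Return e.g. 'python' for a tag whose class list contains 'language-python'."""
    classes = tag.get("class") or []
    if isinstance(classes, str):
        classes = classes.split()  # tolerate a plain space-separated string
    for name in classes:
        if name.startswith("language-"):
            return name[len("language-"):]
    return ""  # no hint found: leave the fence without a language


# A plain dict stands in for a BeautifulSoup Tag here; both support .get("class").
print(language_from_class({"class": ["highlight", "language-python"]}))
print(repr(language_from_class({})))
```

Under that assumption, it would be passed as `convert_to_markdown(html, code_language_callback=language_from_class)`.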
### Text Wrapping
- `wrap` (bool, default: `False`): Enable text wrapping
- `wrap_width` (int, default: `80`): Width for text wrapping
## Contribution
This library is open to contributions. Feel free to open issues or submit PRs. It's better to discuss an issue before
submitting a PR, so that your work is not rejected for reasons that could have been raised early.
### Local Development
1. Clone the repo
1. Install system dependencies (requires Python 3.10+)
1. Install the project dependencies:
   ```shell
   uv sync --all-extras --dev
   ```
1. Install pre-commit hooks:
   ```shell
   uv run pre-commit install
   ```
1. Run tests to ensure everything works:
   ```shell
   uv run pytest
   ```
1. Run code quality checks:
   ```shell
   uv run pre-commit run --all-files
   ```
1. Make your changes and submit a PR
### Development Commands
```shell
# Run tests with coverage
uv run pytest --cov=html_to_markdown --cov-report=term-missing
# Lint and format code
uv run ruff check --fix .
uv run ruff format .
# Type checking
uv run mypy
# Test CLI during development
uv run python -m html_to_markdown input.html
# Build package
uv build
```
## Performance
The library is optimized for performance with several key features:
- **Efficient ancestor caching**: Reduces repeated DOM traversals using context-aware caching
- **Streaming support**: Process large documents in chunks to minimize memory usage
- **Optional lxml parser**: ~30% faster parsing for complex HTML documents
- **Optimized string operations**: Minimizes string concatenations in hot paths
Typical throughput: ~2 MB/s for regular processing on modern hardware.
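Throughput depends heavily on the input and the machine, so it is worth measuring on your own documents. The helper below is a generic timing sketch, not the benchmark behind the figure above; `measure_throughput` is a hypothetical name, and `str.upper` is only a stand-in for a real converter such as `convert_to_markdown`.

```python
import time


def measure_throughput(convert, html, repeats=5):
    """Return the best observed throughput of convert(html) in MB/s."""
    size_mb = len(html.encode("utf-8")) / 1_000_000
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        convert(html)
        best = min(best, time.perf_counter() - start)
    # Guard against a timer resolution too coarse to register the call.
    return size_mb / best if best > 0 else float("inf")


sample = "<p>Hello <strong>world</strong></p>" * 10_000
print(f"{measure_throughput(str.upper, sample):.1f} MB/s")
```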
## License
This library uses the MIT license.
## HTML5 Element Support
This library provides comprehensive support for all modern HTML5 elements:
### Semantic Elements
- `<article>`, `<aside>`, `<figcaption>`, `<figure>`, `<footer>`, `<header>`, `<hgroup>`, `<main>`, `<nav>`, `<section>`
- `<abbr>`, `<bdi>`, `<bdo>`, `<cite>`, `<data>`, `<dfn>`, `<kbd>`, `<mark>`, `<samp>`, `<small>`, `<time>`, `<var>`
- `<del>`, `<ins>` (strikethrough and insertion tracking)
### Form Elements
- `<form>`, `<fieldset>`, `<legend>`, `<label>`, `<input>`, `<textarea>`, `<select>`, `<option>`, `<optgroup>`
- `<button>`, `<datalist>`, `<output>`, `<progress>`, `<meter>`
- Task list support: `<input type="checkbox">` converts to task list items (`* [x]` / `* [ ]` with the default bullets)
### Table Elements
- `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`
- **Merged cell support**: Handles `rowspan` and `colspan` attributes for complex table layouts
- **Smart cleanup**: Automatically handles table styling elements for clean Markdown output
### Interactive Elements
- `<details>`, `<summary>`, `<dialog>`, `<menu>`
### Ruby Annotations
- `<ruby>`, `<rb>`, `<rt>`, `<rtc>`, `<rp>` (for East Asian typography)
### Media Elements
- `<img>`, `<picture>`, `<audio>`, `<video>`, `<iframe>`
- SVG support with data URI conversion
### Math Elements
- `<math>` (MathML support)
## Advanced Table Support
The library provides sophisticated handling of complex HTML tables, including merged cells and proper structure conversion:
```python
from html_to_markdown import convert_to_markdown
# Complex table with merged cells
html = """
<table>
<caption>Sales Report</caption>
<tr>
<th rowspan="2">Product</th>
<th colspan="2">Quarterly Sales</th>
</tr>
<tr>
<th>Q1</th>
<th>Q2</th>
</tr>
<tr>
<td>Widget A</td>
<td>$50K</td>
<td>$75K</td>
</tr>
</table>
"""
result = convert_to_markdown(html)
```
**Features:**
- **Merged cell support**: Handles `rowspan` and `colspan` attributes intelligently
- **Clean output**: Automatically removes table styling elements that don't translate to Markdown
- **Structure preservation**: Maintains table hierarchy and relationships
## Acknowledgments
Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.