# markdown-to-data
Convert markdown and its elements (tables, lists, code, etc.) into structured, easily processable data formats like lists and hierarchical dictionaries (or JSON), with support for parsing back to markdown.
## Status
- [x] Detect, extract and convert markdown building blocks into Python data structures
- [x] Provide two formats for parsed markdown:
  - [x] List format: each building block as a separate dictionary in a list
  - [x] Dictionary format: nested structure using headers as keys
- [x] Convert parsed markdown to JSON
- [x] Parse markdown data back to markdown formatted string
  - [x] Add options to choose which data gets parsed back to markdown
- [x] Extract specific building blocks (e.g., only tables or lists)
- [x] Support for task lists (checkboxes)
- [x] Enhanced code block handling with language detection
- [x] Comprehensive blockquote support with nesting
- [x] Consistent handling of definition lists
- [x] Provide comprehensive documentation
- [x] Add more test coverage (215 test cases)
- [x] Publish on PyPI
- [ ] Align with edge cases of the [CommonMark Specification](https://spec.commonmark.org/0.31.2/)
## Quick Overview
### Install
```bash
pip install markdown-to-data
```
### Basic Usage
```python
from markdown_to_data import Markdown
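
# Note: ´´´ stands in for a fence of three backticks so that the sample
# does not close this README's own code fence.
markdown = """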
---
title: Example text
author: John Doe
---

# Main Header

- [ ] Pending task
    - [x] Completed subtask
- [x] Completed task

## Table Example
| Column 1 | Column 2 |
|----------|----------|
| Cell 1 | Cell 2 |

´´´python
def hello():
    print("Hello World!")
´´´
"""
md = Markdown(markdown)
# Get parsed markdown as list
print(md.md_list)
# Each building block is a separate dictionary in the list
# Get parsed markdown as nested dictionary
print(md.md_dict)
# Headers are used as keys for nesting content
# Get information about markdown elements
print(md.md_elements)
```
### Output Formats
#### List Format (`md.md_list`)
```python
[
    {'metadata': {'title': 'Example text', 'author': 'John Doe'}},
    {'header': {'level': 1, 'content': 'Main Header'}},
    {
        'list': {
            'type': 'ul',
            'items': [
                {
                    'content': 'Pending task',
                    'items': [
                        {
                            'content': 'Completed subtask',
                            'items': [],
                            'task': 'checked'
                        }
                    ],
                    'task': 'unchecked'
                },
                {'content': 'Completed task', 'items': [], 'task': 'checked'}
            ]
        }
    },
    {'header': {'level': 2, 'content': 'Table Example'}},
    {'table': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']}},
    {
        'code': {
            'language': 'python',
            'content': 'def hello():\n    print("Hello World!")'
        }
    }
]
```
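
Because every building block in `md.md_list` is a single-key dictionary, extracting one kind of block comes down to filtering on that key. The following is a minimal sketch on a literal excerpt of the list above; `blocks_of_type` is a hypothetical helper for illustration, not part of the library's API:

```python
from typing import Any

# Literal excerpt of the `md.md_list` output shown above.
md_list: list[dict[str, Any]] = [
    {'header': {'level': 1, 'content': 'Main Header'}},
    {'header': {'level': 2, 'content': 'Table Example'}},
    {'table': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']}},
]


def blocks_of_type(blocks: list[dict[str, Any]], kind: str) -> list[Any]:
    """Return the payload of every block whose key equals `kind` (hypothetical helper)."""
    return [block[kind] for block in blocks if kind in block]


tables = blocks_of_type(md_list, 'table')
headers = blocks_of_type(md_list, 'header')
print(tables)
print([header['content'] for header in headers])
```

The same filter works for any other block key that appears in the list, such as `'list'` or `'code'`.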
#### Dictionary Format (`md.md_dict`)
```python
{
    'metadata': {'title': 'Example text', 'author': 'John Doe'},
    'Main Header': {
        'list_1': {
            'type': 'ul',
            'items': [
                {
                    'content': 'Pending task',
                    'items': [
                        {
                            'content': 'Completed subtask',
                            'items': [],
                            'task': 'checked'
                        }
                    ],
                    'task': 'unchecked'
                },
                {'content': 'Completed task', 'items': [], 'task': 'checked'}
            ]
        },
        'Table Example': {
            'table_1': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']},
            'code_1': {
                'language': 'python',
                'content': 'def hello():\n    print("Hello World!")'
            }
        }
    }
}
```
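
Since headers become keys, nested content can be reached by chaining header names, and the result can be serialized with Python's standard `json` module. The sketch below works on a literal excerpt of the dictionary above and uses only the standard library; it does not show any JSON helper the package itself may provide:

```python
import json

# Literal excerpt of the `md.md_dict` output shown above.
md_dict = {
    'metadata': {'title': 'Example text', 'author': 'John Doe'},
    'Main Header': {
        'Table Example': {
            'table_1': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']},
        },
    },
}

# Headers act as keys, so a block is addressed by its chain of header names.
table = md_dict['Main Header']['Table Example']['table_1']
print(table['Column 1'])

# This excerpt holds only dicts, lists and strings, so it serializes directly.
as_json = json.dumps(md_dict, indent=2)
print(as_json)
```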
#### MD Elements (`md.md_elements`)
```python
{
    'metadata': {'count': 1, 'positions': [0], 'variants': set()},
    'header': {'count': 2, 'positions': [1, 3], 'variants': set()},
    'list': {'count': 1, 'positions': [2], 'variants': {'ul'}},
    'table': {'count': 1, 'positions': [4], 'variants': set()},
    'code': {'count': 1, 'positions': [5], 'variants': {'python'}}
}
```
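
The `variants` values are Python sets, which the standard `json` module does not serialize by default. If this overview needs to be written out as JSON, one option is a `default` hook that turns sets into sorted lists. This is a sketch on a literal excerpt of the output above; `encode_set` is a hypothetical helper name:

```python
import json

# Literal excerpt of the `md.md_elements` output shown above.
md_elements = {
    'header': {'count': 2, 'positions': [1, 3], 'variants': set()},
    'list': {'count': 1, 'positions': [2], 'variants': {'ul'}},
    'code': {'count': 1, 'positions': [5], 'variants': {'python'}},
}


def encode_set(value):
    """Fallback for json.dumps: represent sets as sorted lists (hypothetical helper)."""
    if isinstance(value, set):
        return sorted(value)
    raise TypeError(f'Cannot serialize {type(value).__name__}')


as_json = json.dumps(md_elements, default=encode_set, indent=2)
print(as_json)
```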
### Parse back to markdown (`to_md`)
The `Markdown` class provides a `to_md` method that converts the parsed data back into a markdown-formatted string.
Its `include`, `exclude` and `spacer` options control which elements are emitted and how many empty lines separate them:
```python
from markdown_to_data import Markdown
markdown = """
---
title: Example
---

# Main Header

- [x] Task 1
    - [ ] Subtask
- [ ] Task 2

## Code Example
´´´python
print("Hello")
´´´
"""
md = Markdown(markdown)
```
**Example 1**: Include specific elements
```python
print(md.to_md(
    include=['header', 'list'],  # Include all headers and lists
    spacer=1  # One empty line between elements
))
```
Output:
```markdown
# Main Header

- [x] Task 1
    - [ ] Subtask
- [ ] Task 2
```
**Example 2**: Include by position and exclude specific types
```python
print(md.to_md(
    include=[0, 1, 2],  # Include first three elements
    exclude=['code'],  # But exclude any code blocks
    spacer=2  # Two empty lines between elements
))
```
Output:
```markdown
---
title: Example
---


# Main Header


- [x] Task 1
    - [ ] Subtask
- [ ] Task 2
```
#### Using `to_md_parser` Function
The `to_md_parser` function can be used directly to convert markdown data structures to markdown text:
```python
from markdown_to_data import to_md_parser

data = [
    {
        'metadata': {
            'title': 'Document'
        }
    },
    {
        'header': {
            'level': 1,
            'content': 'Title'
        }
    },
    {
        'list': {
            'type': 'ul',
            'items': [
                {
                    'content': 'Task 1',
                    'items': [],
                    'task': 'checked'
                }
            ]
        }
    }
]

print(to_md_parser(data=data, spacer=1))
```
Output:
```markdown
---
title: Document
---

# Title

- [x] Task 1
```
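
To illustrate how the nested `items` and `task` fields map back to markdown lines, here is a minimal sketch for unordered lists only. `render_items` and its `indent_width` parameter are hypothetical names; this is not the library's implementation, and the indentation that `to_md_parser` actually emits may differ:

```python
def render_items(items, indent=0, indent_width=4):
    """Render list items (dicts with 'content', 'items', 'task') as markdown lines.

    Hypothetical helper for illustration; handles unordered lists only.
    """
    lines = []
    for item in items:
        task = item.get('task')
        if task == 'checked':
            box = '[x] '
        elif task == 'unchecked':
            box = '[ ] '
        else:
            box = ''  # Regular item without a checkbox
        lines.append(f"{' ' * indent}- {box}{item['content']}")
        # Child items are rendered one indentation level deeper.
        lines.extend(render_items(item['items'], indent + indent_width, indent_width))
    return lines


items = [
    {
        'content': 'Task 1',
        'items': [{'content': 'Subtask', 'items': [], 'task': 'unchecked'}],
        'task': 'checked',
    },
    {'content': 'Task 2', 'items': [], 'task': 'unchecked'},
]
rendered = render_items(items)
print('\n'.join(rendered))
```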
## Supported Markdown Elements
### Metadata (YAML frontmatter)
```python
from markdown_to_data import Markdown

metadata = '''
---
title: Document
author: John Doe
tags: markdown, documentation
---
'''
md = Markdown(metadata)
print(md.md_list)
```
Output:
```python
[
    {
        'metadata': {
            'title': 'Document',
            'author': 'John Doe',
            'tags': ['markdown', 'documentation']
        }
    }
]
```
### Headers
```python
from markdown_to_data import Markdown

headers = '''
# Main Title
## Section
### Subsection
'''
md = Markdown(headers)
print(md.md_list)
```
Output:
```python
[
    {
        'header': {
            'level': 1,
            'content': 'Main Title'
        }
    },
    {
        'header': {
            'level': 2,
            'content': 'Section'
        }
    },
    {
        'header': {
            'level': 3,
            'content': 'Subsection'
        }
    }
]
```
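
Each header entry carries its `level`, which is enough to derive an outline of the document. A minimal sketch on the literal output above (the two-space indent per level is an arbitrary choice for this example):

```python
# Literal copy of the header output shown above.
md_list = [
    {'header': {'level': 1, 'content': 'Main Title'}},
    {'header': {'level': 2, 'content': 'Section'}},
    {'header': {'level': 3, 'content': 'Subsection'}},
]

# Indent each entry by two spaces per header level below the first.
toc_lines = [
    f"{'  ' * (block['header']['level'] - 1)}- {block['header']['content']}"
    for block in md_list
    if 'header' in block
]
toc = '\n'.join(toc_lines)
print(toc)
```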
### Lists (Including Task Lists)
```python
from markdown_to_data import Markdown

lists = '''
- Regular item
    - Nested item
- [x] Completed task
    - [ ] Pending subtask
1. Ordered item
    1. Nested ordered
'''
md = Markdown(lists)
print(md.md_list)
```
Output:
```python
[
    {
        'list': {
            'type': 'ul',
            'items': [
                {
                    'content': 'Regular item',
                    'items': [
                        {'content': 'Nested item', 'items': [], 'task': None}
                    ],
                    'task': None
                },
                {
                    'content': 'Completed task',
                    'items': [
                        {
                            'content': 'Pending subtask',
                            'items': [],
                            'task': 'unchecked'
                        }
                    ],
                    'task': 'checked'
                }
            ]
        }
    },
    {
        'list': {
            'type': 'ol',
            'items': [
                {
                    'content': 'Ordered item',
                    'items': [
                        {'content': 'Nested ordered', 'items': [], 'task': None}
                    ],
                    'task': None
                }
            ]
        }
    }
]
```
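
Because list items nest through their `items` field, processing them usually calls for a small recursive walk. As a sketch, the hypothetical helper `count_tasks` below tallies task states across all nesting levels of the unordered list from the output above:

```python
from collections import Counter

# Literal copy of the 'items' of the unordered list shown above.
items = [
    {
        'content': 'Regular item',
        'items': [{'content': 'Nested item', 'items': [], 'task': None}],
        'task': None,
    },
    {
        'content': 'Completed task',
        'items': [{'content': 'Pending subtask', 'items': [], 'task': 'unchecked'}],
        'task': 'checked',
    },
]


def count_tasks(items):
    """Count 'checked' and 'unchecked' items at every nesting level (hypothetical helper)."""
    counts = Counter()
    for item in items:
        if item['task'] is not None:  # Regular items have task=None
            counts[item['task']] += 1
        counts.update(count_tasks(item['items']))
    return counts


task_counts = count_tasks(items)
print(dict(task_counts))
```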
### Tables
```python
from markdown_to_data import Markdown

tables = '''
| Header 1 | Header 2 |
|----------|----------|
| Value 1 | Value 2 |
| Value 3 | Value 4 |
'''
md = Markdown(tables)
print(md.md_list)
```
Output:
```python
[
    {
        'table': {
            'Header 1': ['Value 1', 'Value 3'],
            'Header 2': ['Value 2', 'Value 4']
        }
    }
]
```
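
Tables are stored column-wise: each header maps to the list of values in that column. When row-wise records are more convenient, the columns can be transposed with `zip`. A minimal sketch on the literal table above, using only built-ins:

```python
# Literal copy of the 'table' payload shown above (column-oriented).
table = {
    'Header 1': ['Value 1', 'Value 3'],
    'Header 2': ['Value 2', 'Value 4'],
}

# zip(*columns) yields one tuple per row; pair each tuple with the headers.
rows = [dict(zip(table.keys(), values)) for values in zip(*table.values())]
print(rows)
```

Note that `zip` stops at the shortest column, so this assumes all columns have the same length.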
### Code Blocks
```python
from markdown_to_data import Markdown

code = '''
´´´python
def example():
    return "Hello"
´´´

´´´javascript
console.log("Hello");
´´´
'''
md = Markdown(code)
print(md.md_list)
```
Output:
```python
[
    {
        'code': {
            'language': 'python',
            'content': 'def example():\n    return "Hello"'
        }
    },
    {
        'code': {
            'language': 'javascript',
            'content': 'console.log("Hello");'
        }
    }
]
```
### Blockquotes
```python
from markdown_to_data import Markdown

blockquotes = '''
> Simple quote
> Multiple lines

> Nested quote
>> Inner quote
> Back to outer
'''
md = Markdown(blockquotes)
print(md.md_list)
```
Output:
```python
[
    {
        'blockquote': [
            {'content': 'Simple quote', 'items': []},
            {'content': 'Multiple lines', 'items': []}
        ]
    },
    {
        'blockquote': [
            {
                'content': 'Nested quote',
                'items': [
                    {'content': 'Inner quote', 'items': []}
                ]
            },
            {'content': 'Back to outer', 'items': []}
        ]
    }
]
```
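
Blockquote entries nest through `items` in the same way list items do. As a sketch, the hypothetical helper `flatten_quote` below walks the second blockquote from the output above and pairs every line with its nesting depth, which is enough to rebuild the `>` prefixes:

```python
# Literal copy of the second 'blockquote' payload shown above.
quote = [
    {
        'content': 'Nested quote',
        'items': [{'content': 'Inner quote', 'items': []}],
    },
    {'content': 'Back to outer', 'items': []},
]


def flatten_quote(entries, depth=1):
    """Return (depth, content) pairs in document order (hypothetical helper)."""
    flat = []
    for entry in entries:
        flat.append((depth, entry['content']))
        flat.extend(flatten_quote(entry['items'], depth + 1))
    return flat


flat = flatten_quote(quote)
# One '>' per nesting level, followed by the quoted text.
quote_lines = [f"{'>' * depth} {text}" for depth, text in flat]
print('\n'.join(quote_lines))
```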
### Definition Lists
```python
from markdown_to_data import Markdown

def_lists = '''
Term
: Definition 1
: Definition 2
'''
md = Markdown(def_lists)
print(md.md_list)
```
Output:
```python
[
    {
        'def_list': {
            'term': 'Term',
            'list': ['Definition 1', 'Definition 2']
        }
    }
]
```
## Limitations
- Some extended markdown flavors might not be supported
- Inline formatting (bold, italic, links) is currently not parsed
- Table alignment specifications are not preserved
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request or open an issue.