# mrkdwn_analysis
`mrkdwn_analysis` is a Python library for analyzing Markdown files. It parses a document and extracts and categorizes its elements: headers, sections, links, images, blockquotes, code blocks, lists, tables, tasks (todos), footnotes, and embedded HTML. This makes it useful for data analysis, content generation, or building other tools that work with Markdown.
## Features
- **File Loading**: Load any given Markdown file by providing its file path.
- **Header Detection**: Identify all headers (ATX `#` to `######`, and Setext `===` and `---`) in the document, giving you a quick overview of its structure.
- **Section Identification (Setext)**: Recognize sections defined by a block of text followed by `=` or `-` lines, helping you understand the document’s conceptual divisions.
- **Paragraph Extraction**: Distinguish regular text (paragraphs) from structured elements like headers, lists, or code blocks, making it easy to isolate the body content.
- **Blockquote Identification**: Extract all blockquotes defined by lines starting with `>`.
- **Code Block Extraction**: Detect fenced code blocks delimited by triple backticks (```), optionally retrieve their language, and separate programming code from regular text.
- **List Recognition**: Identify both ordered and unordered lists, including task lists (`- [ ]`, `- [x]`), and understand their structure and hierarchy.
- **Tables (GFM)**: Detect GitHub-Flavored Markdown tables, parse their headers and rows, and separate structured tabular data for further analysis.
- **Links and Images**: Identify text links (`[text](url)`) and images (`![alt](url)`), as well as reference-style links. This is useful for link validation or content analysis.
- **Footnotes**: Extract and handle Markdown footnotes (`[^note1]`), providing a way to process reference notes in the document.
- **HTML Blocks and Inline HTML**: Handle HTML blocks (`<div>...</div>`) as a single element, and detect inline HTML elements (`<span style="...">...</span>`) as a unified component.
- **Front Matter**: If present, extract YAML front matter at the start of the file.
- **Counting Elements**: Count the occurrences of a given element type (e.g., headers or code blocks).
- **Textual Statistics**: Count the number of words and characters (excluding whitespace). Get a global summary (`analyse()`) of the document’s composition.
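To make the textual statistics in the last item concrete, here is a minimal standalone sketch of the two counts. It assumes that a "word" is any whitespace-separated token and that Markdown markup characters are counted like any other character; the library's own counting rules may differ, and `text_stats` is a hypothetical helper, not part of `mrkdwn_analysis`.

```python
def text_stats(text: str) -> dict:
    """Return word and non-whitespace character counts for a string."""
    # Assumption: a word is any run of characters between whitespace.
    words = text.split()
    # Count every character that is not a space, tab, or newline.
    characters = sum(1 for ch in text if not ch.isspace())
    return {"words": len(words), "characters": characters}


sample = "A major **Python** release\nwith significant improvements..."
print(text_stats(sample))
```

Because the markup is counted as text here, `**Python**` contributes ten characters rather than six.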
## Installation
Install the package from PyPI. It is published as `markdown-analysis` and imported as `mrkdwn_analysis`:
```bash
pip install markdown-analysis
```
## Usage
Using `mrkdwn_analysis` is straightforward. Import `MarkdownAnalyzer`, create an instance with your Markdown file path, and then call the various methods to extract the elements you need.
```python
from mrkdwn_analysis import MarkdownAnalyzer
analyzer = MarkdownAnalyzer("path/to/document.md")
headers = analyzer.identify_headers()
paragraphs = analyzer.identify_paragraphs()
links = analyzer.identify_links()
...
```
### Example
Consider `example.md`:
````markdown
---
title: "Python 3.11 Report"
author: "John Doe"
date: "2024-01-15"
---

Python 3.11
===========

A major **Python** release with significant improvements...

### Performance Details

```python
import math
print(math.factorial(10))
```

> *Quote*: "Python 3.11 brings the speed we needed"

<div class="note">
  <p>HTML block example</p>
</div>

This paragraph contains inline HTML: <span style="color:red;">Red text</span>.

- Unordered list:
  - A basic point
  - [ ] A task to do
  - [x] A completed task

1. Ordered list item 1
2. Ordered list item 2
````
After analysis:
```python
analyzer = MarkdownAnalyzer("example.md")
print(analyzer.identify_headers())
# {"Header": [{"line": X, "level": 1, "text": "Python 3.11"}, {"line": Y, "level": 3, "text": "Performance Details"}]}
print(analyzer.identify_paragraphs())
# {"Paragraph": ["A major **Python** release ...", "This paragraph contains inline HTML: ..."]}
print(analyzer.identify_html_blocks())
# [{"line": Z, "content": "<div class=\"note\">\n <p>HTML block example</p>\n</div>"}]
print(analyzer.identify_html_inline())
# [{"line": W, "html": "<span style=\"color:red;\">Red text</span>"}]
print(analyzer.identify_lists())
# {
# "Ordered list": [["Ordered list item 1", "Ordered list item 2"]],
# "Unordered list": [["A basic point", "A task to do [Task]", "A completed task [Task done]"]]
# }
print(analyzer.identify_code_blocks())
# {"Code block": [{"start_line": X, "content": "import math\nprint(math.factorial(10))", "language": "python"}]}
print(analyzer.analyse())
# {
# 'headers': 2,
# 'paragraphs': 2,
# 'blockquotes': 1,
# 'code_blocks': 1,
# 'ordered_lists': 2,
# 'unordered_lists': 3,
# 'tables': 0,
# 'html_blocks': 1,
# 'html_inline_count': 1,
# 'words': 42,
# 'characters': 250
# }
```
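The results above are plain dictionaries and lists, so they can be post-processed with ordinary Python. As a rough sketch, the following turns a header result into an indented outline. The dictionary literal copies the shape of the `identify_headers()` output shown above, with made-up line numbers in place of `X` and `Y`; `build_outline` is a hypothetical helper, not part of the library.

```python
# Shape copied from the identify_headers() output shown above.
# The line numbers are placeholders, not real analyzer output.
headers = {
    "Header": [
        {"line": 7, "level": 1, "text": "Python 3.11"},
        {"line": 12, "level": 3, "text": "Performance Details"},
    ]
}


def build_outline(header_result: dict) -> list[str]:
    """Render one line per header, indented two spaces per level below 1."""
    return [
        "  " * (h["level"] - 1) + h["text"]
        for h in header_result.get("Header", [])
    ]


outline = build_outline(headers)
print("\n".join(outline))
```

With a real document, `headers` would presumably be replaced by the value returned from `analyzer.identify_headers()`.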
### Key Methods
- `__init__(self, file_path)`: Load the Markdown file.
- `identify_headers()`: Returns all headers.
- `identify_sections()`: Returns setext sections.
- `identify_paragraphs()`: Returns paragraphs.
- `identify_blockquotes()`: Returns blockquotes.
- `identify_code_blocks()`: Returns code blocks with content and language.
- `identify_lists()`: Returns both ordered and unordered lists (including tasks).
- `identify_tables()`: Returns any GFM tables.
- `identify_links()`: Returns text and image links.
- `identify_footnotes()`: Returns footnotes used in the document.
- `identify_html_blocks()`: Returns HTML blocks as single tokens.
- `identify_html_inline()`: Returns inline HTML elements.
- `identify_todos()`: Returns task items.
- `count_elements(element_type)`: Counts occurrences of a specific element type.
- `count_words()`: Counts words in the entire document.
- `count_characters()`: Counts non-whitespace characters.
- `analyse()`: Provides a global summary (headers count, paragraphs count, etc.).
### Checking and Validating Links
- `check_links()`: Validates text links to see if they are broken (e.g., non-200 status) and returns a list of broken links.
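For intuition about what "broken" means here, the sketch below treats any link whose HTTP status is not 200 as broken. It is a standalone illustration built on the standard library, not the implementation behind `check_links()`. The names `http_status` and `find_broken`, the use of `HEAD` requests, and the returned dictionary shape are all assumptions.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status for url, or None if no response was received."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, ValueError, OSError):
        return None  # DNS failure, refused connection, malformed URL, timeout


def find_broken(
    urls: Iterable[str],
    get_status: Callable[[str], Optional[int]] = http_status,
) -> list:
    """Collect every URL whose status is anything other than 200."""
    broken = []
    for url in urls:
        status = get_status(url)
        if status != 200:
            broken.append({"url": url, "status": status})
    return broken


# Offline demonstration: a dictionary stands in for the network.
fake_statuses = {
    "https://example.com/ok": 200,
    "https://example.com/missing": 404,
}
print(find_broken(fake_statuses, get_status=fake_statuses.get))
```

Passing the status function as a parameter keeps the check testable without network access. Some servers reject `HEAD` requests, so a real checker may need to fall back to `GET`.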
### Global Analysis Example
```python
analysis = analyzer.analyse()
print(analysis)
# {
# 'headers': X,
# 'paragraphs': Y,
# 'blockquotes': Z,
# 'code_blocks': A,
# 'ordered_lists': B,
# 'unordered_lists': C,
# 'tables': D,
# 'html_blocks': E,
# 'html_inline_count': F,
# 'words': G,
# 'characters': H
# }
```
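Since the summary is a flat dictionary of counts, derived metrics are easy to compute from it. The sketch below calculates the average number of words per paragraph, using the values from the `example.md` summary shown earlier in this README; `words_per_paragraph` is a hypothetical helper, not a library function.

```python
# Values copied from the example.md summary shown earlier in this README.
example_summary = {
    "headers": 2,
    "paragraphs": 2,
    "blockquotes": 1,
    "code_blocks": 1,
    "ordered_lists": 2,
    "unordered_lists": 3,
    "tables": 0,
    "html_blocks": 1,
    "html_inline_count": 1,
    "words": 42,
    "characters": 250,
}


def words_per_paragraph(summary: dict) -> float:
    """Average words per paragraph, or 0.0 when there are no paragraphs."""
    paragraphs = summary.get("paragraphs", 0)
    if paragraphs == 0:
        return 0.0  # avoid dividing by zero for documents without paragraphs
    return summary.get("words", 0) / paragraphs


print(words_per_paragraph(example_summary))
```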
## Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request for bug reports, feature requests, or code improvements. Your input helps make `mrkdwn_analysis` more robust and versatile.
Raw data
{
"_id": null,
"home_page": "https://github.com/yannbanas/mrkdwn_analysis",
"name": "markdown-analysis",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "yannbanas",
"author_email": "yannbanas@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/77/cb/e9217270ee86149d66d1241e7344b30df2f0013bb397108c051df4dd08ce/markdown_analysis-0.1.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": null,
"version": "0.1.3",
"project_urls": {
"Homepage": "https://github.com/yannbanas/mrkdwn_analysis"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "77cbe9217270ee86149d66d1241e7344b30df2f0013bb397108c051df4dd08ce",
"md5": "f0097161a279e2ad841dc09a7caee288",
"sha256": "e8d5fa8fd7009520e19a265121c4828913c8afa2b58c45d9959ff14ef6796dbb"
},
"downloads": -1,
"filename": "markdown_analysis-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "f0097161a279e2ad841dc09a7caee288",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 10084,
"upload_time": "2025-01-15T20:09:08",
"upload_time_iso_8601": "2025-01-15T20:09:08.955356Z",
"url": "https://files.pythonhosted.org/packages/77/cb/e9217270ee86149d66d1241e7344b30df2f0013bb397108c051df4dd08ce/markdown_analysis-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-15 20:09:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "yannbanas",
"github_project": "mrkdwn_analysis",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "markdown-analysis"
}