markdown-analysis


- Name: markdown-analysis
- Version: 0.1.3
- Home page: https://github.com/yannbanas/mrkdwn_analysis
- Author: yannbanas
- Upload time: 2025-01-15 20:09:08

# mrkdwn_analysis

`mrkdwn_analysis` is a Python library for analyzing Markdown files. It parses a document and extracts and categorizes its elements: headers, sections, links, images, blockquotes, code blocks, lists, tables, tasks (todos), footnotes, and embedded HTML. This makes it useful for data analysis, content generation, and building other tools that work with Markdown.

## Features

- **File Loading**: Load any given Markdown file by providing its file path.

- **Header Detection**: Identify all headers (ATX `#` to `######`, and Setext `===` and `---`) in the document, giving you a quick overview of its structure.

- **Section Identification (Setext)**: Recognize sections defined by a block of text followed by `=` or `-` lines, helping you understand the document’s conceptual divisions.

- **Paragraph Extraction**: Distinguish regular text (paragraphs) from structured elements like headers, lists, or code blocks, making it easy to isolate the body content.

- **Blockquote Identification**: Extract all blockquotes defined by lines starting with `>`.

- **Code Block Extraction**: Detect fenced code blocks delimited by triple backticks (`` ``` ``), optionally retrieve their language, and separate programming code from regular text.

- **List Recognition**: Identify both ordered and unordered lists, including task lists (`- [ ]`, `- [x]`), and understand their structure and hierarchy.

- **Tables (GFM)**: Detect GitHub-Flavored Markdown tables, parse their headers and rows, and separate structured tabular data for further analysis.

- **Links and Images**: Identify text links (`[text](url)`) and images (`![alt](url)`), as well as reference-style links. This is useful for link validation or content analysis.

- **Footnotes**: Extract and handle Markdown footnotes (`[^note1]`), providing a way to process reference notes in the document.

- **HTML Blocks and Inline HTML**: Handle HTML blocks (`<div>...</div>`) as a single element, and detect inline HTML elements (`<span style="...">... </span>`) as a unified component.

- **Front Matter**: If present, extract YAML front matter at the start of the file.

- **Counting Elements**: Count the occurrences of a given element type (e.g., headers or code blocks).

- **Textual Statistics**: Count words and non-whitespace characters, and get a global summary of the document’s composition with `analyse()`.
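
To make the header detection described above more concrete, here is a minimal standalone sketch of how ATX and Setext headers could be recognized with regular expressions. It is an illustration only, not the library's implementation: the function name `find_headers` and both patterns are assumptions, and the sketch ignores front matter and other edge cases.

```python
import re

# Hypothetical patterns for illustration; not taken from mrkdwn_analysis.
ATX_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")   # "# Title" up to "###### Title"
SETEXT_RE = re.compile(r"^(=+|-+)\s*$")              # underline made of = or -
FENCE = "`" * 3                                      # fenced code block delimiter


def find_headers(text):
    """Return a list of {"line", "level", "text"} dicts for ATX and Setext headers."""
    lines = text.splitlines()
    headers = []
    in_fence = False
    for i, line in enumerate(lines):
        # Skip everything inside fenced code blocks, where "#" is not a header.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        atx = ATX_RE.match(line)
        if atx:
            headers.append({"line": i + 1, "level": len(atx.group(1)), "text": atx.group(2)})
            continue
        # Setext: a non-empty text line followed by a line of "=" (level 1) or "-" (level 2).
        is_text = bool(line.strip()) and not SETEXT_RE.match(line)
        if is_text and i + 1 < len(lines) and SETEXT_RE.match(lines[i + 1]):
            level = 1 if lines[i + 1].startswith("=") else 2
            headers.append({"line": i + 1, "level": level, "text": line.strip()})
    return headers


sample = "\n".join([
    "Python 3.11",
    "===========",
    "",
    "Intro text.",
    "",
    "### Performance Details",
    "",
    FENCE + "python",
    "# a comment, not a header",
    FENCE,
])
print(find_headers(sample))
```

Because the sketch does not handle YAML front matter, a closing `---` line directly after a metadata line would be misread as a Setext underline; a real parser needs to strip the front matter first.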

## Installation

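Install the library from PyPI. Note that the distribution is named `markdown-analysis`, while the package you import is `mrkdwn_analysis`: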

```bash
pip install markdown-analysis
```
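
To confirm that the installation worked, you can query the installed distribution with the standard library. This is a small sketch using `importlib.metadata`; the helper name `installed_version` is an assumption made for illustration.

```python
from importlib import metadata


def installed_version(distribution="markdown-analysis"):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if the distribution is installed, otherwise None.
print(installed_version())
```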

## Usage

Import `MarkdownAnalyzer`, create an instance with the path to your Markdown file, and call the methods for the elements you need.

```python
from mrkdwn_analysis import MarkdownAnalyzer

analyzer = MarkdownAnalyzer("path/to/document.md")

headers = analyzer.identify_headers()
paragraphs = analyzer.identify_paragraphs()
links = analyzer.identify_links()
...
```

### Example

Consider `example.md`:

````markdown
---
title: "Python 3.11 Report"
author: "John Doe"
date: "2024-01-15"
---

Python 3.11
===========

A major **Python** release with significant improvements...

### Performance Details

```python
import math
print(math.factorial(10))
```

> *Quote*: "Python 3.11 brings the speed we needed"

<div class="note">
  <p>HTML block example</p>
</div>

This paragraph contains inline HTML: <span style="color:red;">Red text</span>.

- Unordered list:
  - A basic point
  - [ ] A task to do
  - [x] A completed task

1. Ordered list item 1
2. Ordered list item 2
````

After analysis:

```python
from mrkdwn_analysis import MarkdownAnalyzer

analyzer = MarkdownAnalyzer("example.md")

print(analyzer.identify_headers())
# {"Header": [{"line": X, "level": 1, "text": "Python 3.11"}, {"line": Y, "level": 3, "text": "Performance Details"}]}

print(analyzer.identify_paragraphs())
# {"Paragraph": ["A major **Python** release ...", "This paragraph contains inline HTML: ..."]}

print(analyzer.identify_html_blocks())
# [{"line": Z, "content": "<div class=\"note\">\n  <p>HTML block example</p>\n</div>"}]

print(analyzer.identify_html_inline())
# [{"line": W, "html": "<span style=\"color:red;\">Red text</span>"}]

print(analyzer.identify_lists())
# {
#   "Ordered list": [["Ordered list item 1", "Ordered list item 2"]],
#   "Unordered list": [["A basic point", "A task to do [Task]", "A completed task [Task done]"]]
# }

print(analyzer.identify_code_blocks())
# {"Code block": [{"start_line": X, "content": "import math\nprint(math.factorial(10))", "language": "python"}]}

print(analyzer.analyse())
# {
#   'headers': 2,
#   'paragraphs': 2,
#   'blockquotes': 1,
#   'code_blocks': 1,
#   'ordered_lists': 2,
#   'unordered_lists': 3,
#   'tables': 0,
#   'html_blocks': 1,
#   'html_inline_count': 1,
#   'words': 42,
#   'characters': 250
# }
```
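
The dictionaries shown above are plain Python data, so they are easy to post-process. As a sketch, the following builds an indented table of contents from a result shaped like the `identify_headers()` output. The helper `render_toc` is hypothetical, and the line numbers in the sample stand in for the `X` and `Y` placeholders above.

```python
def render_toc(headers_result, indent="  "):
    """Render a nested bullet list from a {"Header": [...]} result."""
    headers = headers_result.get("Header", [])
    if not headers:
        return ""
    # Indent relative to the shallowest header level present.
    top = min(h["level"] for h in headers)
    return "\n".join(
        f"{indent * (h['level'] - top)}- {h['text']}" for h in headers
    )


# Sample data copied from the example output; line numbers are placeholders.
headers_result = {
    "Header": [
        {"line": 7, "level": 1, "text": "Python 3.11"},
        {"line": 12, "level": 3, "text": "Performance Details"},
    ]
}
print(render_toc(headers_result))
# - Python 3.11
#     - Performance Details
```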

### Key Methods

- `__init__(self, file_path)`: Load the Markdown file.
- `identify_headers()`: Returns all headers.
- `identify_sections()`: Returns setext sections.
- `identify_paragraphs()`: Returns paragraphs.
- `identify_blockquotes()`: Returns blockquotes.
- `identify_code_blocks()`: Returns code blocks with content and language.
- `identify_lists()`: Returns both ordered and unordered lists (including tasks).
- `identify_tables()`: Returns any GFM tables.
- `identify_links()`: Returns text and image links.
- `identify_footnotes()`: Returns footnotes used in the document.
- `identify_html_blocks()`: Returns HTML blocks as single tokens.
- `identify_html_inline()`: Returns inline HTML elements.
- `identify_todos()`: Returns task items.
- `count_elements(element_type)`: Counts occurrences of a specific element type.
- `count_words()`: Counts words in the entire document.
- `count_characters()`: Counts non-whitespace characters.
- `analyse()`: Provides a global summary (headers count, paragraphs count, etc.).
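
For the two text statistics in the list above, the sketch below shows one straightforward way to count words and non-whitespace characters in a string. It assumes whitespace-separated words and counts markup characters too, so its numbers may differ from what `count_words()` and `count_characters()` return; the function names are hypothetical.

```python
def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def non_whitespace_chars(text):
    """Count every character that is not whitespace (spaces, tabs, newlines)."""
    return sum(1 for ch in text if not ch.isspace())


sample = "Python 3.11 brings\n\tspeed"
print(word_count(sample))            # 4
print(non_whitespace_chars(sample))  # 21
```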

### Checking and Validating Links

- `check_links()`: Validates text links to see if they are broken (e.g., non-200 status) and returns a list of broken links.
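
The idea behind this kind of link validation can be sketched with the standard library alone. The code below is an assumption-laden illustration, not the library's `check_links()` implementation: it extracts inline `[text](url)` links with a simple regular expression (which does not handle link titles or reference-style links), sends a `HEAD` request to each HTTP(S) URL, and reports every link whose status is not 200. All function names are hypothetical.

```python
import re
import urllib.error
import urllib.request

# Inline links [text](url); the negative lookbehind skips images ![alt](url).
LINK_RE = re.compile(r"(?<!!)\[([^\]]*)\]\((\S+?)\)")


def extract_links(text):
    """Return {"line", "text", "url"} dicts for inline links in the text."""
    return [
        {"line": n, "text": m.group(1), "url": m.group(2)}
        for n, line in enumerate(text.splitlines(), start=1)
        for m in LINK_RE.finditer(line)
    ]


def http_status(url, timeout=5.0):
    """Return the HTTP status of a HEAD request, or None if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, ValueError, OSError):
        return None


def find_broken_links(text, get_status=http_status):
    """Return the HTTP(S) links whose status is not 200, with the status attached."""
    broken = []
    for link in extract_links(text):
        if not link["url"].startswith(("http://", "https://")):
            continue  # relative paths and anchors are not checked here
        status = get_status(link["url"])
        if status != 200:
            broken.append({**link, "status": status})
    return broken
```

Passing the status function as a parameter keeps the checker testable without network access: a test can supply a stub such as `lambda url: 404`.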

### Global Analysis Example

```python
analysis = analyzer.analyse()
print(analysis)
# {
#   'headers': X,
#   'paragraphs': Y,
#   'blockquotes': Z,
#   'code_blocks': A,
#   'ordered_lists': B,
#   'unordered_lists': C,
#   'tables': D,
#   'html_blocks': E,
#   'html_inline_count': F,
#   'words': G,
#   'characters': H
# }
```
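
Since the summary is a flat dictionary, it can be turned into a readable report with a few lines of code. The helper below is a hypothetical sketch, and the sample values are taken from the `example.md` output shown earlier.

```python
def format_summary(summary):
    """Render a summary dict as aligned "key : value" lines."""
    if not summary:
        return ""
    width = max(len(key) for key in summary)
    return "\n".join(
        f"{key.ljust(width)} : {value}" for key, value in summary.items()
    )


# Values copied from the example.md summary above.
summary = {
    "headers": 2,
    "paragraphs": 2,
    "blockquotes": 1,
    "code_blocks": 1,
    "tables": 0,
}
print(format_summary(summary))
```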

## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request for bug reports, feature requests, or code improvements. Your input helps make `mrkdwn_analysis` more robust and versatile.



            
