mkdown


Name: mkdown
Version: 0.14.0
Home page: None
Summary: Markdown helpers & models
Upload time: 2025-10-06 20:30:45
Maintainer: None
Docs URL: None
Author: Philipp Temminghoff
Requires Python: >=3.12
License: MIT License Copyright (c) 2024, Philipp Temminghoff Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
# mkdown

[![PyPI License](https://img.shields.io/pypi/l/mkdown.svg)](https://pypi.org/project/mkdown/)
[![Package status](https://img.shields.io/pypi/status/mkdown.svg)](https://pypi.org/project/mkdown/)
[![Monthly downloads](https://img.shields.io/pypi/dm/mkdown.svg)](https://pypi.org/project/mkdown/)
[![Distribution format](https://img.shields.io/pypi/format/mkdown.svg)](https://pypi.org/project/mkdown/)
[![Wheel availability](https://img.shields.io/pypi/wheel/mkdown.svg)](https://pypi.org/project/mkdown/)
[![Python version](https://img.shields.io/pypi/pyversions/mkdown.svg)](https://pypi.org/project/mkdown/)
[![Implementation](https://img.shields.io/pypi/implementation/mkdown.svg)](https://pypi.org/project/mkdown/)
[![Releases](https://img.shields.io/github/downloads/phil65/mkdown/total.svg)](https://github.com/phil65/mkdown/releases)
[![Github Contributors](https://img.shields.io/github/contributors/phil65/mkdown)](https://github.com/phil65/mkdown/graphs/contributors)
[![Github Discussions](https://img.shields.io/github/discussions/phil65/mkdown)](https://github.com/phil65/mkdown/discussions)
[![Github Forks](https://img.shields.io/github/forks/phil65/mkdown)](https://github.com/phil65/mkdown/forks)
[![Github Issues](https://img.shields.io/github/issues/phil65/mkdown)](https://github.com/phil65/mkdown/issues)
[![Github Pull Requests](https://img.shields.io/github/issues-pr/phil65/mkdown)](https://github.com/phil65/mkdown/pulls)
[![Github Watchers](https://img.shields.io/github/watchers/phil65/mkdown)](https://github.com/phil65/mkdown/watchers)
[![Github Stars](https://img.shields.io/github/stars/phil65/mkdown)](https://github.com/phil65/mkdown/stars)
[![Github Repository size](https://img.shields.io/github/repo-size/phil65/mkdown)](https://github.com/phil65/mkdown)
[![Github last commit](https://img.shields.io/github/last-commit/phil65/mkdown)](https://github.com/phil65/mkdown/commits)
[![Github release date](https://img.shields.io/github/release-date/phil65/mkdown)](https://github.com/phil65/mkdown/releases)
[![Github language count](https://img.shields.io/github/languages/count/phil65/mkdown)](https://github.com/phil65/mkdown)
[![Github commits this month](https://img.shields.io/github/commit-activity/m/phil65/mkdown)](https://github.com/phil65/mkdown)
[![Code coverage](https://codecov.io/gh/phil65/mkdown/branch/main/graph/badge.svg)](https://codecov.io/gh/phil65/mkdown/)
[![PyUp](https://pyup.io/repos/github/phil65/mkdown/shield.svg)](https://pyup.io/repos/github/phil65/mkdown/)

[Read the documentation!](https://phil65.github.io/mkdown/)



## Markdown Conventions for OCR Output

This project uses Markdown as the primary, self-contained format for storing OCR results and their metadata. The goal is a single, versionable, human-readable file per processed document, which simplifies pipeline management and data provenance.

We employ a hybrid approach, using different mechanisms for different types of metadata:

### 1. Metadata Comments (for Non-Visual Markers)

For metadata that should *not* affect the visual rendering of the Markdown (like page boundaries or page-level information), we use specially formatted HTML/XML comments.

**Format:**

```
<!-- docler:data_type {json_payload} -->
```

*   **`data_type`**: A string indicating the kind of metadata (e.g., `page_break`, `chunk_boundary`).
*   **`{json_payload}`**: A serialized JSON object containing the metadata as key-value pairs.

**Defined Types:**

*   **`page_break`**: Marks the transition *to* the specified page number. Placed immediately *before* the content of the new page.
    *   Example Payload: `{"next_page": 2}`
    *   Example Comment: `<!-- docler:page_break {"next_page": 2} -->`
*   **`chunk_boundary`**: Marks a position where the document should be split into semantic chunks.
    *   Example Payload: `{"chunk_id": 1}`
    *   Example Comment: `<!-- docler:chunk_boundary {"chunk_id": 1} -->`
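
The comment format above can be produced with the standard library alone. The following is a minimal sketch, not the library's implementation: `make_comment` is a hypothetical helper name, and the choice of compact, key-sorted JSON is an assumption made so that the same payload always serializes to the same string.

```python
import json

PREFIX = "docler"  # namespace prefix taken from the format shown above


def make_comment(data_type: str, payload: dict) -> str:
    """Hypothetical helper: wrap a JSON payload in a metadata comment."""
    # Assumption: compact separators and sorted keys for stable, diff-friendly output.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    # A double hyphen inside the body could end the comment early or make it invalid XML.
    if "--" in body:
        raise ValueError("payload must not contain '--'")
    return f"<!-- {PREFIX}:{data_type} {body} -->"


print(make_comment("page_break", {"next_page": 2}))
# <!-- docler:page_break {"next_page":2} -->
```

Sorting the keys keeps version-control diffs small when the same metadata is regenerated by a later pipeline run.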

### 2. HTML Figures (for Images and Diagrams)

For visual elements like images or diagrams, especially when they require richer metadata (like source code or bounding boxes), we use standard HTML structures within the Markdown. This allows direct association of metadata and handles complex data like code snippets gracefully.

**Structure:**

We typically use an HTML `<figure>` element:

```html
<figure data-docler-type="diagram" data-diagram-id="sysarch-01">
  <img src="images/system_architecture.png"
       alt="System Architecture Diagram"
       data-page-num="5"
       style="max-width: 100%; height: auto;"
       >
  <figcaption>Figure 2: High-level system data flow.</figcaption>
  <script type="text/docler-mermaid">
    graph LR
        A[Data Ingest] --> B(Processing Queue);
        B --> C{Main Processor};
        D --> F(API Endpoint);
  </script>
</figure>
```

*   **`<figure>`**: The container element.
    *   `data-docler-type`: Indicates the type of figure (e.g., `image`, `diagram`).
    *   Other `data-*` attributes can be added for figure-level metadata.
*   **`<img>`**: The visual representation.
    *   `src`, `alt`: Standard attributes.
    *   `data-*`: Used for image-specific metadata such as `data-page-num`.
    *   `style`: Optional for basic presentation.
*   **`<figcaption>`**: Optional standard HTML caption.
*   **`<script type="text/docler-...">`**: Used to embed source code or other complex textual data.
    *   The `type` attribute is custom (e.g., `text/docler-mermaid`, `text/docler-latex`), so browsers treat the content as data and do not execute it.
    *   The raw code/text is placed inside, preserving formatting.
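
To illustrate how a consumer might read such a figure back, here is a minimal sketch built on Python's standard `html.parser`. The class name `FigureExtractor` and the shape of the resulting dictionaries are assumptions for illustration only; they are not part of the library.

```python
from html.parser import HTMLParser

SCRIPT_PREFIX = "text/docler-"  # custom script type prefix described above


class FigureExtractor(HTMLParser):
    """Hypothetical helper: collect <figure> blocks and their embedded source."""

    def __init__(self) -> None:
        super().__init__()
        self.figures: list[dict] = []
        self._current: dict | None = None  # figure currently being parsed
        self._in_source = False  # True while inside a text/docler-* script

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "figure":
            self._current = {"attrs": attr_map, "img": None,
                             "source_kind": None, "source": ""}
        elif self._current is None:
            return  # ignore tags outside of a figure
        elif tag == "img":
            self._current["img"] = attr_map
        elif tag == "script" and (attr_map.get("type") or "").startswith(SCRIPT_PREFIX):
            self._current["source_kind"] = attr_map["type"][len(SCRIPT_PREFIX):]
            self._in_source = True

    def handle_data(self, data):
        # Script content may arrive in several pieces, so accumulate it.
        if self._in_source and self._current is not None:
            self._current["source"] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_source = False
        elif tag == "figure" and self._current is not None:
            self.figures.append(self._current)
            self._current = None


doc = """<figure data-docler-type="diagram" data-diagram-id="sysarch-01">
  <img src="images/system_architecture.png" alt="System Architecture Diagram" data-page-num="5">
  <script type="text/docler-mermaid">
    graph LR
        A[Data Ingest] --> B(Processing Queue);
  </script>
</figure>"""

parser = FigureExtractor()
parser.feed(doc)
parser.close()
fig = parser.figures[0]
print(fig["attrs"]["data-docler-type"], fig["img"]["data-page-num"], fig["source_kind"])
```

Because the parser treats `<script>` content as raw text, the embedded Mermaid source comes back unmodified, including its indentation.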

### Rationale

*   **Comments** are used for page breaks and chunk boundaries because they do not appear in rendered Markdown, so purely structural information stays invisible.
*   **HTML Figures** are used for images/diagrams because HTML provides standard ways (`data-*`, nested elements like `<script>`) to directly associate rich, potentially complex or multi-line metadata (like source code) with the visual element itself.

### Utilities

Helper functions for creating and parsing these metadata comments and structures are available in `docler.markdown_utils`.
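
For readers who want to see what parsing involves without the library, the sketch below reads metadata comments back with a regular expression and `json.loads`. The names `COMMENT_RE` and `iter_metadata` are hypothetical and do not reflect the actual API of `docler.markdown_utils`.

```python
import json
import re

# Matches <!-- docler:<data_type> {json} -->, tolerating extra whitespace.
COMMENT_RE = re.compile(r"<!--\s*docler:(\w+)\s+(\{.*?\})\s*-->", re.DOTALL)


def iter_metadata(markdown: str):
    """Hypothetical helper: yield (data_type, payload) for each metadata comment."""
    for match in COMMENT_RE.finditer(markdown):
        yield match.group(1), json.loads(match.group(2))


text = (
    "Page one\n"
    '<!-- docler:page_break {"next_page": 2 } -->\n'
    "Page two\n"
    '<!-- docler:chunk_boundary {"chunk_id": 1} -->\n'
)
for data_type, payload in iter_metadata(text):
    print(data_type, payload)
```

The non-greedy payload group is anchored by the closing `-->`, so nested JSON objects are still captured in full before being handed to the JSON parser.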

### Standardized Metadata Types

The library provides standardized metadata types for common use cases:

1. **Page Breaks**: Use the `PAGE_BREAK_TYPE` constant and the `create_metadata_comment()` function to create page transitions:
   ```python
   from docler.markdown_utils import create_metadata_comment, PAGE_BREAK_TYPE

   # Create a page break marker for page 2
   page_break = create_metadata_comment(PAGE_BREAK_TYPE, {"next_page": 2})
   # <!-- docler:page_break {"next_page":2} -->
   ```

2. **Chunk Boundaries**: Use the `CHUNK_BOUNDARY_TYPE` constant and the `create_chunk_boundary()` function to mark semantic chunks in a document:
   ```python
   from docler.markdown_utils import create_chunk_boundary

   # Create a chunk boundary marker with metadata
   chunk_marker = create_chunk_boundary(
       chunk_id=1,
       start_line=10,
       end_line=25,
       keywords=["introduction", "overview"],
       token_count=350,
   )
   # <!-- docler:chunk_boundary {"chunk_id":1,"end_line":25,"keywords":["introduction","overview"],"start_line":10,"token_count":350} -->
   ```
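
Once page break markers are in place, a document can be split back into pages. The following is a minimal sketch using only the standard library; `split_pages` is a hypothetical function, and it assumes that any text before the first marker belongs to page 1.

```python
import json
import re

PAGE_BREAK_RE = re.compile(r"<!--\s*docler:page_break\s+(\{.*?\})\s*-->")


def split_pages(markdown: str, first_page: int = 1) -> dict[int, str]:
    """Hypothetical helper: map page numbers to the text between page_break markers."""
    pages: dict[int, str] = {}
    page, start = first_page, 0
    for match in PAGE_BREAK_RE.finditer(markdown):
        # Text up to this marker belongs to the page we are currently on.
        pages[page] = markdown[start:match.start()].strip()
        # The marker names the page that starts right after it.
        page = json.loads(match.group(1))["next_page"]
        start = match.end()
    pages[page] = markdown[start:].strip()
    return pages


doc = (
    "Intro text\n"
    '<!-- docler:page_break {"next_page":2} -->\n'
    "Second page\n"
    '<!-- docler:page_break {"next_page":3} -->\n'
    "Third page\n"
)
print(split_pages(doc))
# {1: 'Intro text', 2: 'Second page', 3: 'Third page'}
```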

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "mkdown",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.12",
    "maintainer_email": null,
    "keywords": null,
    "author": "Philipp Temminghoff",
    "author_email": "Philipp Temminghoff <philipptemminghoff@googlemail.com>",
    "download_url": "https://files.pythonhosted.org/packages/62/14/6ef87f77fd1389ebe1ac3b66b9633118397f66fdcc81d34e69c468cdd833/mkdown-0.14.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License\n         \n         Copyright (c) 2024, Philipp Temminghoff\n         \n         Permission is hereby granted, free of charge, to any person obtaining a copy\n         of this software and associated documentation files (the \"Software\"), to deal\n         in the Software without restriction, including without limitation the rights\n         to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n         copies of the Software, and to permit persons to whom the Software is\n         furnished to do so, subject to the following conditions:\n         \n         The above copyright notice and this permission notice shall be included in all\n         copies or substantial portions of the Software.\n         \n         THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n         IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n         FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n         AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n         LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n         OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n         SOFTWARE.\n         ",
    "summary": "Markdown helpers & models",
    "version": "0.14.0",
    "project_urls": {
        "Code coverage": "https://app.codecov.io/gh/phil65/mkdown",
        "Discussions": "https://github.com/phil65/mkdown/discussions",
        "Documentation": "https://phil65.github.io/mkdown/",
        "Issues": "https://github.com/phil65/mkdown/issues",
        "Source": "https://github.com/phil65/mkdown"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a85a81faf08a8da2c4c06bef5d8d78f19f5454889f7feb18513b14507cf9d9eb",
                "md5": "a318e38418809ba1130a08b041268942",
                "sha256": "4ea9f487a66fc45ac9e59bcaa829aa9d2e82a56759d82b44c16840822a10e9ea"
            },
            "downloads": -1,
            "filename": "mkdown-0.14.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a318e38418809ba1130a08b041268942",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.12",
            "size": 18101,
            "upload_time": "2025-10-06T20:30:43",
            "upload_time_iso_8601": "2025-10-06T20:30:43.778959Z",
            "url": "https://files.pythonhosted.org/packages/a8/5a/81faf08a8da2c4c06bef5d8d78f19f5454889f7feb18513b14507cf9d9eb/mkdown-0.14.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "62146ef87f77fd1389ebe1ac3b66b9633118397f66fdcc81d34e69c468cdd833",
                "md5": "f089cfff66db831196b9055926e537bb",
                "sha256": "61178bf5158906d04bad6f808326d588c71ebedd12db42142871b1646b7937b5"
            },
            "downloads": -1,
            "filename": "mkdown-0.14.0.tar.gz",
            "has_sig": false,
            "md5_digest": "f089cfff66db831196b9055926e537bb",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.12",
            "size": 16364,
            "upload_time": "2025-10-06T20:30:45",
            "upload_time_iso_8601": "2025-10-06T20:30:45.036460Z",
            "url": "https://files.pythonhosted.org/packages/62/14/6ef87f77fd1389ebe1ac3b66b9633118397f66fdcc81d34e69c468cdd833/mkdown-0.14.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-06 20:30:45",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "phil65",
    "github_project": "mkdown",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "mkdown"
}
        