doxstrux

| Field | Value |
|-------|-------|
| Name | doxstrux |
| Version | 0.2.1 |
| Summary | Document structure extraction tool for markdown, with extensibility to PDF and HTML |
| Upload time | 2025-10-12 23:36:49 |
| Requires Python | >=3.12 |
| License | MIT |
| Keywords | markdown, documentation, analysis, structure-extraction, document-parsing, ai-preprocessing, code-extraction, nested-lists, requirements-extraction |
| Requirements | No requirements were recorded. |

# 🏗️ Doxstrux

[![PyPI version](https://badge.fury.io/py/doxstrux.svg)](https://badge.fury.io/py/doxstrux)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Downloads](https://pepy.tech/badge/doxstrux)](https://pepy.tech/project/doxstrux)

**Document structure extraction tool** for markdown, with extensibility to PDF and HTML.

Extract hierarchical structure, metadata, and content from documents without semantic analysis. Built for RAG pipelines, documentation analysis, and AI preprocessing.

## ✨ Features

- **Zero-regex parsing**: Token-based extraction using markdown-it-py
- **Security-first design**: Three security profiles (strict/moderate/permissive)
- **Document IR**: Clean intermediate representation for RAG chunking
- **Structure extraction**: Headings, lists, tables, code blocks, links, images
- **Content integrity**: Parse without mutation, fail-closed security
- **Extensible architecture**: Ready for PDF and HTML support

## 📦 Installation

```bash
pip install doxstrux
```

## 🚀 Quick Start

```python
from doxstrux.markdown_parser_core import MarkdownParserCore

# Basic usage
content = "# Hello\n\nThis is **markdown**."
parser = MarkdownParserCore(content)
result = parser.parse()

# Access structure
print(result['structure']['headings'])
print(result['metadata']['security']['statistics'])

# With security profile
parser = MarkdownParserCore(content, security_profile='strict')
result = parser.parse()

# With custom config
parser = MarkdownParserCore(
    content,
    config={
        'preset': 'gfm',
        'plugins': ['table', 'strikethrough'],
        'allows_html': False
    },
    security_profile='moderate'
)
result = parser.parse()
```
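The parser takes a content string rather than a path, so reading the file is left to the caller. A minimal sketch of that step follows. `read_markdown` is a hypothetical helper, not part of doxstrux, and the byte cap is an assumed value that mirrors the default profile's 1MB limit (read here as 1024 * 1024 bytes).

```python
from pathlib import Path


def read_markdown(path, max_bytes=1024 * 1024, encoding="utf-8"):
    """Read a markdown file into a string, refusing files larger than max_bytes."""
    file = Path(path)
    size = file.stat().st_size
    if size > max_bytes:
        raise ValueError(f"{file} is {size} bytes; limit is {max_bytes}")
    return file.read_text(encoding=encoding)


# Usage (requires doxstrux to be installed):
# content = read_markdown("README.md")
# result = MarkdownParserCore(content).parse()
```

Checking the size before reading keeps an oversized file from being loaded into memory at all; the parser's own profile limits still apply afterwards.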

## 🏗️ Architecture

### Core Principles

- **Extract everything, analyze nothing**: Focus on structural extraction, not semantics
- **No file I/O in core**: Parser accepts content strings, not paths
- **Plain dict outputs**: Lightweight, no heavy dependencies
- **Security layered throughout**: Size limits, plugin validation, content sanitization
- **Modular extractors** (Phase 7): 11 specialized modules with dependency injection
- **Single responsibility**: Each extractor handles one markdown element type
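The last two principles can be pictured with a small sketch. Everything in it is hypothetical: the token shape (plain dicts), the extractor functions, and `CoreSketch` are illustrations of single-responsibility extractors wired together by dependency injection, not doxstrux's actual modules or signatures.

```python
from typing import Callable

Token = dict  # hypothetical token shape: {"type", "content", "line", ...}
Extractor = Callable[[list, Callable[[int], str]], list]


def extract_headings(tokens: list, get_line: Callable[[int], str]) -> list:
    """Handles headings only; the line lookup is injected, not imported."""
    return [
        {"level": t["level"], "text": t["content"], "source": get_line(t["line"])}
        for t in tokens
        if t["type"] == "heading"
    ]


def extract_code_blocks(tokens: list, get_line: Callable[[int], str]) -> list:
    """Handles fenced code only; every other token type is ignored."""
    return [
        {"language": t.get("info", ""), "content": t["content"]}
        for t in tokens
        if t["type"] == "fence"
    ]


class CoreSketch:
    """Owns shared state (the source lines) and delegates to injected extractors."""

    def __init__(self, content: str, extractors: dict):
        self._lines = content.split("\n")
        self._extractors = extractors

    def get_line(self, index: int) -> str:
        return self._lines[index]

    def extract(self, tokens: list) -> dict:
        # Each extractor receives the tokens plus the helper it depends on.
        return {name: fn(tokens, self.get_line) for name, fn in self._extractors.items()}
```

Because each extractor receives its helpers as arguments, it can be tested in isolation with hand-built tokens and no parser instance.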

### Security Profiles

| Profile | Max Size | Max Lines | Recursion Depth | Use Case |
|---------|----------|-----------|-----------------|----------|
| **strict** | 100KB | 2K | 50 | Untrusted input |
| **moderate** | 1MB | 10K | 100 | Standard use (default) |
| **permissive** | 10MB | 50K | 150 | Trusted documents |
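The limits in the table amount to a lookup plus a check that runs before any parsing. The sketch below is illustrative only: `Limits`, `PROFILE_LIMITS`, and `check_limits` are assumed names, and reading KB/MB as binary units is also an assumption, not a statement about doxstrux's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_bytes: int
    max_lines: int
    max_depth: int


# Values copied from the table above; KB/MB are assumed to be binary units.
PROFILE_LIMITS = {
    "strict": Limits(100 * 1024, 2_000, 50),
    "moderate": Limits(1024 * 1024, 10_000, 100),
    "permissive": Limits(10 * 1024 * 1024, 50_000, 150),
}


def check_limits(content: str, profile: str = "moderate") -> None:
    """Raise ValueError if content exceeds the size or line budget (fail closed)."""
    limits = PROFILE_LIMITS[profile]  # an unknown profile raises KeyError
    size = len(content.encode("utf-8"))
    if size > limits.max_bytes:
        raise ValueError(f"{size} bytes exceeds {profile} limit of {limits.max_bytes}")
    lines = content.count("\n") + 1
    if lines > limits.max_lines:
        raise ValueError(f"{lines} lines exceeds {profile} limit of {limits.max_lines}")
```

Rejecting oversized input up front, rather than truncating it, matches the fail-closed behavior listed under Features.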

### Document IR

Clean intermediate representation for RAG pipelines and chunking:

```python
from doxstrux.markdown_parser_core import MarkdownParserCore
from doxstrux.markdown.ir import DocumentIR, ChunkPolicy

content = "# Hello\n\nThis is **markdown**."

# Parse to IR
parser = MarkdownParserCore(content)
result = parser.parse()
doc_ir = DocumentIR.from_parse_result(result)

# Apply chunking policy
policy = ChunkPolicy(
    max_chunk_tokens=512,
    overlap_tokens=50,
    respect_boundaries=['heading', 'section']
)
chunks = doc_ir.chunk(policy)
```
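To make the policy parameters concrete, here is a rough, standalone sketch of what a token budget with overlap and heading boundaries could mean. It is not doxstrux's chunker: whitespace-separated words stand in for tokens, and `chunk_words` and `chunk_sections` are hypothetical names.

```python
def chunk_words(words, max_tokens, overlap):
    """Split one section's words into windows of at most max_tokens words,
    with consecutive windows sharing `overlap` words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + max_tokens])
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the section
    return chunks


def chunk_sections(sections, max_tokens=512, overlap=50):
    """Chunk each (heading, body) pair separately so no chunk crosses a heading."""
    out = []
    for heading, body in sections:
        for piece in chunk_words(body.split(), max_tokens, overlap):
            out.append({"heading": heading, "text": " ".join(piece)})
    return out
```

Chunking section by section is one simple way to respect heading boundaries; the overlap then only ever repeats text from within the same section.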

## 🧪 Testing

```bash
# Run all tests
pytest

# With coverage
pytest --cov=src/doxstrux

# Type checking
mypy src/doxstrux

# Linting
ruff check src/ tests/
```

## 📊 Project Status

- **Version**: 0.2.1 ✅ **Published on PyPI**
- **Python**: 3.12+
- **Test Coverage**: 69% (working toward 80% target)
- **Tests**: 95/95 pytest passing + 542/542 baseline tests passing
- **Regex Count**: 0 (zero-regex architecture)
- **Core Parser**: 1944 lines (reduced from 2900, -33%)
- **PyPI**: https://pypi.org/project/doxstrux/

### Phase 7: Modular Architecture ✅ COMPLETE

**Completed**: Full modularization of parser into 11 specialized extractors

- ✅ **7.0.5**: Rename from docpipe to doxstrux
- ✅ **7.1**: Create namespace structure
- ✅ **7.2**: Move existing modules to new namespace
- ✅ **7.3**: Extract line & text utilities
- ✅ **7.4**: Extract configuration & budgets
- ✅ **7.5**: Extract simple extractors (media, footnotes, blockquotes, html)
- ✅ **7.6**: Extract complex extractors (lists, codeblocks, tables, links, sections, paragraphs)

**Achievements**:
- Core parser reduced by 33% (2900 → 1944 lines)
- 11 specialized extractor modules created
- 100% baseline test parity maintained
- Clean dependency injection pattern throughout
- Zero behavioral changes (byte-for-byte output identical)
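Byte-for-byte parity can be checked by serializing a parse result deterministically and comparing it with stored baseline bytes. The sketch below assumes JSON with sorted keys as the canonical form; `canonical_bytes` and `matches_baseline` are hypothetical helpers, and the project's actual baseline harness may work differently.

```python
import json


def canonical_bytes(result: dict) -> bytes:
    """Serialize a plain-dict parse result so equal dicts yield equal bytes."""
    text = json.dumps(result, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return text.encode("utf-8")


def matches_baseline(result: dict, baseline: bytes) -> bool:
    """True only if the serialized result is identical to the stored baseline."""
    return canonical_bytes(result) == baseline
```

Sorting keys removes dict ordering as a source of spurious differences, so any mismatch reflects a real change in the extracted output.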

## 🗺️ Roadmap

- [x] **Phase 7**: Modular architecture ✅ **COMPLETE**
- [ ] **Phase 8**: Enhanced testing & documentation
- [ ] **PDF support**: Extract structure from PDF documents
- [ ] **HTML support**: Parse HTML with same IR
- [ ] **Enhanced chunking**: Semantic-aware chunking strategies
- [ ] **Performance**: Cython optimization for hot paths

## 📚 Documentation

- **Architecture**: See `CLAUDE.md` for detailed architecture notes
- **Phase 7 Plan**: See `regex_refactor_docs/DETAILED_TASK_LIST.md`
- **Testing**: See `regex_refactor_docs/REGEX_REFACTOR_POLICY_GATES.md`

## 🤝 Contributing

This project follows a phased refactoring methodology with comprehensive test gates.

1. All changes must pass the full pytest suite (95 tests as of 0.2.1)
2. All changes must maintain byte-for-byte output parity (542 baseline tests)
3. Security-first: No untrusted regex, validated links, sanitized HTML
4. Type-safe: Full mypy strict mode compliance

## 📜 License

MIT License - see `LICENSE` file for details.

## 🙏 Acknowledgments

Built on:
- [markdown-it-py](https://github.com/executablebooks/markdown-it-py) - CommonMark compliant parser
- [mdit-py-plugins](https://github.com/executablebooks/mdit-py-plugins) - Extended markdown features

---

**Previous name**: docpipe (renamed to doxstrux in v0.2.0 for extensibility to PDF/HTML)

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "doxstrux",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.12",
    "maintainer_email": "Doxstrux Team <doxstrux@example.com>",
    "keywords": "markdown, documentation, analysis, structure-extraction, document-parsing, ai-preprocessing, code-extraction, nested-lists, requirements-extraction",
    "author": null,
    "author_email": "Doxstrux Contributors <doxstrux@example.com>",
    "download_url": "https://files.pythonhosted.org/packages/77/00/ddadb0915e13e097a4ce1f8aca49345c3c6c4485dd53bde149e1e011186a/doxstrux-0.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Document structure extraction tool for markdown, with extensibility to PDF and HTML",
    "version": "0.2.1",
    "project_urls": {
        "Changelog": "https://github.com/doxstrux/doxstrux/blob/main/CHANGELOG.md",
        "Documentation": "https://doxstrux.readthedocs.io",
        "Homepage": "https://github.com/doxstrux/doxstrux",
        "Issues": "https://github.com/doxstrux/doxstrux/issues",
        "Repository": "https://github.com/doxstrux/doxstrux.git"
    },
    "split_keywords": [
        "markdown",
        " documentation",
        " analysis",
        " structure-extraction",
        " document-parsing",
        " ai-preprocessing",
        " code-extraction",
        " nested-lists",
        " requirements-extraction"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "0f003973b010842e70cfac6c8c9e4249690f4c2727db39d45207813c9a811cdb",
                "md5": "77bd9913c7a76bfd73fa631ccf931393",
                "sha256": "666c8cf971a0fcc44a793717345278bd4fc6b0f52177dcb3d18b9e8a45f9e5c0"
            },
            "downloads": -1,
            "filename": "doxstrux-0.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "77bd9913c7a76bfd73fa631ccf931393",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.12",
            "size": 62578,
            "upload_time": "2025-10-12T23:36:48",
            "upload_time_iso_8601": "2025-10-12T23:36:48.596385Z",
            "url": "https://files.pythonhosted.org/packages/0f/00/3973b010842e70cfac6c8c9e4249690f4c2727db39d45207813c9a811cdb/doxstrux-0.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7700ddadb0915e13e097a4ce1f8aca49345c3c6c4485dd53bde149e1e011186a",
                "md5": "55890223ec42c543641df77b02932df2",
                "sha256": "c8a1add602a1fea6de6ca20871138ae705828c2bee8422db3f22bfa4a70a0ebc"
            },
            "downloads": -1,
            "filename": "doxstrux-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "55890223ec42c543641df77b02932df2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.12",
            "size": 65350,
            "upload_time": "2025-10-12T23:36:49",
            "upload_time_iso_8601": "2025-10-12T23:36:49.874926Z",
            "url": "https://files.pythonhosted.org/packages/77/00/ddadb0915e13e097a4ce1f8aca49345c3c6c4485dd53bde149e1e011186a/doxstrux-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-12 23:36:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "doxstrux",
    "github_project": "doxstrux",
    "github_not_found": true,
    "lcname": "doxstrux"
}
        