# 🏗️ Doxstrux
**Document structure extraction tool** for markdown, with extensibility to PDF and HTML.
Extract hierarchical structure, metadata, and content from documents without semantic analysis. Built for RAG pipelines, documentation analysis, and AI preprocessing.
## ✨ Features
- **Zero-regex parsing**: Token-based extraction using markdown-it-py
- **Security-first design**: Three security profiles (strict/moderate/permissive)
- **Document IR**: Clean intermediate representation for RAG chunking
- **Structure extraction**: Headings, lists, tables, code blocks, links, images
- **Content integrity**: Parse without mutation, fail-closed security
- **Extensible architecture**: Ready for PDF and HTML support
## 📦 Installation
```bash
pip install doxstrux
```
## 🚀 Quick Start
```python
from doxstrux.markdown_parser_core import MarkdownParserCore
# Basic usage
content = "# Hello\n\nThis is **markdown**."
parser = MarkdownParserCore(content)
result = parser.parse()
# Access structure
print(result['structure']['headings'])
print(result['metadata']['security']['statistics'])
# With security profile
parser = MarkdownParserCore(content, security_profile='strict')
result = parser.parse()
# With custom config
parser = MarkdownParserCore(
    content,
    config={
        'preset': 'gfm',
        'plugins': ['table', 'strikethrough'],
        'allows_html': False
    },
    security_profile='moderate'
)
result = parser.parse()
```
## 🏗️ Architecture
### Core Principles
- **Extract everything, analyze nothing**: Focus on structural extraction, not semantics
- **No file I/O in core**: Parser accepts content strings, not paths
- **Plain dict outputs**: Lightweight, no heavy dependencies
- **Security layered throughout**: Size limits, plugin validation, content sanitization
- **Modular extractors** (Phase 7): 11 specialized modules with dependency injection
- **Single responsibility**: Each extractor handles one markdown element type
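The last two principles can be illustrated with a minimal sketch. It is not doxstrux's actual extractor code: `Token`, `make_heading_extractor`, `make_code_extractor`, `simple_slug`, and `extract_structure` are hypothetical names, and only the token type names (`heading_open`, `inline`, `fence`) mirror those used by markdown-it-py. Each extractor handles one element type, and its helper is passed in rather than imported.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a parser token (not the markdown-it-py class)."""
    type: str
    content: str = ""
    tag: str = ""


class Extractor(Protocol):
    """Anything that turns a token list into a list of plain dicts."""
    def __call__(self, tokens: list[Token]) -> list[dict]: ...


def simple_slug(text: str) -> str:
    """Regex-free slug: lowercase the text and join words with hyphens."""
    return "-".join(text.lower().split())


def make_heading_extractor(slugify: Callable[[str], str]) -> Extractor:
    """Build a heading extractor; the slug helper is injected, not imported."""
    def extract(tokens: list[Token]) -> list[dict]:
        headings = []
        for i, tok in enumerate(tokens):
            # A heading_open token is followed by the inline token with its text.
            if tok.type == "heading_open" and i + 1 < len(tokens):
                text = tokens[i + 1].content
                level = int(tok.tag[1:])  # "h2" -> 2
                headings.append({"level": level, "text": text, "slug": slugify(text)})
        return headings
    return extract


def make_code_extractor() -> Extractor:
    """Build an extractor that only knows about fenced code blocks."""
    def extract(tokens: list[Token]) -> list[dict]:
        return [{"content": t.content} for t in tokens if t.type == "fence"]
    return extract


def extract_structure(tokens: list[Token], extractors: dict[str, Extractor]) -> dict:
    """Run every registered extractor over the same token list."""
    return {name: fn(tokens) for name, fn in extractors.items()}


tokens = [
    Token("heading_open", tag="h1"),
    Token("inline", content="Hello World"),
    Token("heading_close", tag="h1"),
    Token("fence", content="print('hi')\n"),
]
structure = extract_structure(
    tokens,
    {
        "headings": make_heading_extractor(simple_slug),
        "code_blocks": make_code_extractor(),
    },
)
print(structure["headings"])
```

Because the helper is injected, a test can swap `simple_slug` for a stub without touching the extractor itself.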
### Security Profiles
| Profile | Max Size | Max Lines | Recursion Depth | Use Case |
|---------|----------|-----------|-----------------|----------|
| **strict** | 100KB | 2K | 50 | Untrusted input |
| **moderate** | 1MB | 10K | 100 | Standard use (default) |
| **permissive** | 10MB | 50K | 150 | Trusted documents |
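
To make the table concrete, here is a hedged sketch of how such limits could be enforced before parsing. It is not doxstrux's implementation: `Limits`, `PROFILES`, and `check_limits` are hypothetical names, and treating 1KB as 1024 bytes is an assumption, since the table does not say which unit is meant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_bytes: int
    max_lines: int
    max_depth: int


# Values copied from the table above; 1KB = 1024 bytes is an assumption.
PROFILES = {
    "strict": Limits(100 * 1024, 2_000, 50),
    "moderate": Limits(1024 * 1024, 10_000, 100),
    "permissive": Limits(10 * 1024 * 1024, 50_000, 150),
}


def check_limits(content: str, profile: str = "moderate") -> Limits:
    """Return the profile's limits, or raise ValueError if content exceeds them."""
    try:
        limits = PROFILES[profile]
    except KeyError:
        # Fail closed: an unknown profile is an error, not a silent fallback.
        raise ValueError(f"unknown security profile: {profile!r}") from None

    size = len(content.encode("utf-8"))
    if size > limits.max_bytes:
        raise ValueError(f"content is {size} bytes, limit is {limits.max_bytes}")

    lines = content.count("\n") + 1
    if lines > limits.max_lines:
        raise ValueError(f"content has {lines} lines, limit is {limits.max_lines}")

    return limits


limits = check_limits("# Hello\n\nThis is **markdown**.", profile="strict")
print(limits.max_depth)
```

The recursion depth is only returned here; a real parser would have to apply it while walking nested structures.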
### Document IR
Clean intermediate representation for RAG pipelines and chunking:
```python
from doxstrux.markdown_parser_core import MarkdownParserCore
from doxstrux.markdown.ir import DocumentIR, ChunkPolicy
# Parse to IR (content is a markdown string, as in Quick Start)
parser = MarkdownParserCore(content)
result = parser.parse()
doc_ir = DocumentIR.from_parse_result(result)
# Apply chunking policy
policy = ChunkPolicy(
    max_chunk_tokens=512,
    overlap_tokens=50,
    respect_boundaries=['heading', 'section']
)
chunks = doc_ir.chunk(policy)
```
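
For intuition about how `max_chunk_tokens` and `overlap_tokens` might interact, here is a minimal sliding-window sketch. It does not reflect how doxstrux chunks internally: `chunk_with_overlap` is a hypothetical helper, whitespace-split words stand in for tokens, and boundary handling (`respect_boundaries`) is left out.

```python
def chunk_with_overlap(
    tokens: list[str], max_chunk_tokens: int, overlap_tokens: int
) -> list[list[str]]:
    """Split tokens into windows that share overlap_tokens items with the previous one."""
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    if not 0 <= overlap_tokens < max_chunk_tokens:
        raise ValueError("overlap_tokens must be in [0, max_chunk_tokens)")

    step = max_chunk_tokens - overlap_tokens  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_tokens])
        if start + max_chunk_tokens >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_with_overlap(words, max_chunk_tokens=4, overlap_tokens=1):
    print(chunk)
```

With a window of 4 and an overlap of 1, each chunk starts on the last word of the previous one, so context carries across chunk borders.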
## 🧪 Testing
```bash
# Run all tests
pytest
# With coverage
pytest --cov=src/doxstrux
# Type checking
mypy src/doxstrux
# Linting
ruff check src/ tests/
```
## 📊 Project Status
- **Version**: 0.2.1 ✅ **Published on PyPI**
- **Python**: 3.12+
- **Test Coverage**: 69% (working toward 80% target)
- **Tests**: 95/95 pytest passing + 542/542 baseline tests passing
- **Regex Count**: 0 (zero-regex architecture)
- **Core Parser**: 1944 lines (reduced from 2900, -33%)
- **PyPI**: https://pypi.org/project/doxstrux/
### Phase 7: Modular Architecture ✅ COMPLETE
**Completed**: Full modularization of parser into 11 specialized extractors
- ✅ **7.0.5**: Rename from docpipe to doxstrux
- ✅ **7.1**: Create namespace structure
- ✅ **7.2**: Move existing modules to new namespace
- ✅ **7.3**: Extract line & text utilities
- ✅ **7.4**: Extract configuration & budgets
- ✅ **7.5**: Extract simple extractors (media, footnotes, blockquotes, html)
- ✅ **7.6**: Extract complex extractors (lists, codeblocks, tables, links, sections, paragraphs)
**Achievements**:
- Core parser reduced by 33% (2900 → 1944 lines)
- 11 specialized extractor modules created
- 100% baseline test parity maintained
- Clean dependency injection pattern throughout
- Zero behavioral changes (byte-for-byte output identical)
## 🗺️ Roadmap
- [x] **Phase 7**: Modular architecture ✅ **COMPLETE**
- [ ] **Phase 8**: Enhanced testing & documentation
- [ ] **PDF support**: Extract structure from PDF documents
- [ ] **HTML support**: Parse HTML with same IR
- [ ] **Enhanced chunking**: Semantic-aware chunking strategies
- [ ] **Performance**: Cython optimization for hot paths
## 📚 Documentation
- **Architecture**: See `CLAUDE.md` for detailed architecture notes
- **Phase 7 Plan**: See `regex_refactor_docs/DETAILED_TASK_LIST.md`
- **Testing**: See `regex_refactor_docs/REGEX_REFACTOR_POLICY_GATES.md`
## 🤝 Contributing
This project follows a phased refactoring methodology with comprehensive test gates.
1. All changes must pass the full pytest suite (95 tests as of 0.2.1)
2. All changes must maintain byte-for-byte output parity (542 baseline tests)
3. Security-first: No untrusted regex, validated links, sanitized HTML
4. Type-safe: Full mypy strict mode compliance
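
The parity gate in item 2 can be pictured with a small sketch. It is an illustration only, not the project's baseline harness: `canonical_bytes`, `digest`, and `assert_parity` are hypothetical helpers, and serializing with sorted keys is an assumption about how outputs would be made comparable.

```python
import hashlib
import json


def canonical_bytes(result: dict) -> bytes:
    """Serialize a parse result deterministically so two runs can be compared."""
    text = json.dumps(result, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return text.encode("utf-8")


def digest(result: dict) -> str:
    """Short fingerprint of a result, handy in failure messages."""
    return hashlib.sha256(canonical_bytes(result)).hexdigest()


def assert_parity(result: dict, baseline: dict) -> None:
    """Raise AssertionError if the serialized result differs from the baseline."""
    if canonical_bytes(result) != canonical_bytes(baseline):
        raise AssertionError(
            f"output drifted: {digest(result)[:12]} != {digest(baseline)[:12]}"
        )


baseline = {"structure": {"headings": [{"level": 1, "text": "Hello"}]}}
current = {"structure": {"headings": [{"text": "Hello", "level": 1}]}}
assert_parity(current, baseline)  # key order differs, serialized bytes do not
```

Sorting keys means dict ordering alone never fails the gate, while any change to values, list order, or nesting does.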
## 📜 License
MIT License - see `LICENSE` file for details.
## 🙏 Acknowledgments
Built on:
- [markdown-it-py](https://github.com/executablebooks/markdown-it-py) - CommonMark compliant parser
- [mdit-py-plugins](https://github.com/executablebooks/mdit-py-plugins) - Extended markdown features
---
**Previous name**: docpipe (renamed to doxstrux in v0.2.0 for extensibility to PDF/HTML)