# Document Analysis Framework
A lightweight document analysis framework for **text, code, configuration, and other text-based files**, designed for AI/ML data pipelines. Uses only the Python standard library for maximum compatibility.
## 🎯 When to Use This Framework
This is a **fallback framework** for text-based files not handled by our specialized frameworks:
### Specialized Frameworks (Use These First!)
1. **[xml-analysis-framework](https://pypi.org/project/xml-analysis-framework/)** 📑
- For: XML documents of all types
- Includes: 29+ specialized XML handlers (SCAP, RSS, Maven POM, Spring configs, etc.)
- Install: `pip install xml-analysis-framework`
2. **[docling-analysis-framework](https://pypi.org/project/docling-analysis-framework/)** 📄
- For: PDF, Word, Excel, PowerPoint, and images with text
- Features: Docling-powered extraction with OCR support
- Install: `pip install docling-analysis-framework`
3. **[data-analysis-framework](https://pypi.org/project/data-analysis-framework/)** 📊
- For: Structured data that needs AI agent interaction
- Features: Safe query interface for AI agents to analyze data
- Install: `pip install data-analysis-framework`
### Use This Framework For:
- **Code files**: Python, JavaScript, TypeScript, Go, Rust, etc.
- **Config files**: Dockerfile, package.json, .env, INI files, etc.
- **Text/Markup**: Markdown, plain text, LaTeX, AsciiDoc, etc.
- **Data files**: CSV, JSON, YAML, TOML, TSV, etc.
- **Other text-based formats** not covered above
> **Note**: Some file types (like CSV, JSON) can be handled by multiple frameworks. Choose based on your use case:
> - Use `data-analysis-framework` for AI agent querying of structured data
> - Use `document-analysis-framework` for chunking and document analysis
## 🚀 Quick Start
### Document Analysis
```python
from core.analyzer import DocumentAnalyzer
analyzer = DocumentAnalyzer()
result = analyzer.analyze_document("path/to/file.py")
print(f"Document Type: {result['document_type'].type_name}")
print(f"Language: {result['document_type'].language}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")
```
### Smart Chunking
```python
from core.analyzer import DocumentAnalyzer
from core.chunking import ChunkingOrchestrator
# Analyze document
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze_document("file.py")
# Convert format for chunking
chunking_analysis = {
    'document_type': {
        'type_name': analysis['document_type'].type_name,
        'confidence': analysis['document_type'].confidence,
        'category': analysis['document_type'].category
    },
    'analysis': analysis['analysis']
}
# Generate AI-optimized chunks
orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document("file.py", chunking_analysis, strategy='auto')
```
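The actual chunking strategies are internal to the framework, but as a rough, standalone illustration of what a line-based strategy with overlap might resemble (a sketch only, not the framework's implementation):

```python
from typing import List

def chunk_lines(text: str, max_lines: int = 20, overlap: int = 2) -> List[str]:
    """Split text into fixed-size, overlapping line windows (illustrative only)."""
    lines = text.splitlines()
    chunks = []
    step = max_lines - overlap
    for start in range(0, len(lines), step):
        # Each chunk repeats the last `overlap` lines of the previous one
        # so context is preserved across chunk boundaries.
        chunks.append("\n".join(lines[start:start + max_lines]))
        if start + max_lines >= len(lines):
            break
    return chunks

parts = chunk_lines("\n".join(f"line {i}" for i in range(50)))
```

Overlap between adjacent chunks is a common choice for retrieval pipelines, since a sentence or statement split at a chunk boundary is otherwise lost to both chunks.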
## 📋 Currently Supported File Types
| Category | File Types | Extensions | Confidence |
|----------|------------|------------|------------|
| **📝 Text & Data** | Markdown, CSV, JSON, YAML, TOML, Plain Text | .md, .csv, .json, .yaml, .toml, .txt | 90-95% |
| **💻 Code Files** | Python, JavaScript, Java, C++, SQL | .py, .js, .java, .cpp, .sql | 90-95% |
| **⚙️ Configuration** | Dockerfile, package.json, requirements.txt, Makefile | Various | 95% |
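The confidence scores above come from content-based detection rather than extension alone. As a hypothetical sketch of how extension hints plus content markers could combine into a score (illustrative only; the framework's real detector is more sophisticated):

```python
import os

# Hypothetical lookup tables for the sketch; not the framework's data.
EXTENSION_HINTS = {".py": ("Python", 0.7), ".json": ("JSON", 0.7), ".md": ("Markdown", 0.7)}
CONTENT_HINTS = {"Python": ("def ", "import "), "JSON": ("{", "["), "Markdown": ("# ", "- ")}

def detect(filename, text):
    ext = os.path.splitext(filename)[1]
    type_name, confidence = EXTENSION_HINTS.get(ext, ("Plain Text", 0.5))
    # Boost confidence when the content contains markers typical of the type.
    markers = CONTENT_HINTS.get(type_name, ())
    if any(text.lstrip().startswith(m) or m in text for m in markers):
        confidence = min(confidence + 0.25, 0.95)
    return type_name, confidence
```

Combining both signals is what lets a detector stay confident about a `.py` file full of Python code while degrading gracefully for an unknown extension.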
### Coming Soon:
- TypeScript, Go, Rust, Ruby, PHP, Swift, Kotlin
- Shell scripts, PowerShell, R, MATLAB
- INI files, .env files, Apache/Nginx configs
- LaTeX, AsciiDoc, reStructuredText
- Log files, CSS/SCSS, Vue/Svelte components
## 🎯 Key Features
- **🔍 Intelligent Document Detection** - Content-based recognition with confidence scoring
- **🤖 AI-Ready Output** - Structured analysis with quality metrics and use case recommendations
- **⚡ Smart Chunking** - Document-type-aware segmentation strategies
- **🔒 Security & Reliability** - File size limits, safe handling, pure Python stdlib
- **🔄 Extensible** - Easy to add new handlers for additional file types
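The handler API itself is internal to the framework, but a common pattern for this kind of extensibility is an extension-keyed registry. A hypothetical sketch (names and signatures are illustrative, not the framework's actual classes):

```python
from typing import Callable, Dict

HandlerFn = Callable[[str], dict]
_registry: Dict[str, HandlerFn] = {}

def register(extension: str):
    """Decorator registering a handler function for one file extension."""
    def wrap(fn: HandlerFn) -> HandlerFn:
        _registry[extension] = fn
        return fn
    return wrap

@register(".toml")
def analyze_toml(text: str) -> dict:
    # Trivial placeholder analysis: count top-level [section] headers.
    sections = [l for l in text.splitlines() if l.startswith("[") and l.endswith("]")]
    return {"type_name": "TOML", "sections": len(sections)}

def analyze(extension: str, text: str) -> dict:
    handler = _registry.get(extension)
    return handler(text) if handler else {"type_name": "Plain Text"}
```

A registry like this keeps the dispatch logic closed while new file types are added by dropping in one decorated function.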
## 🔧 Installation
```bash
git clone https://github.com/redhat-ai-americas/document-analysis-framework.git
cd document-analysis-framework
pip install -e .
```
## 🧪 Framework Ecosystem
This framework is part of the **AI Building Blocks** document analysis ecosystem:
| Framework | Purpose | Key Features |
|-----------|---------|--------------|
| **[xml-analysis-framework](https://pypi.org/project/xml-analysis-framework/)** | XML document specialist | 29+ handlers, security-focused, enterprise configs |
| **[docling-analysis-framework](https://pypi.org/project/docling-analysis-framework/)** | PDF & Office documents | OCR support, table extraction, figure handling |
| **[data-analysis-framework](https://pypi.org/project/data-analysis-framework/)** | Structured data AI agent | Safe queries, natural language interface |
| **document-analysis-framework** | Everything else (text-based) | Fallback handler, pure Python, extensible |
### Choosing the Right Framework
```mermaid
graph TD
    A[Have a document to analyze?] --> B{What type?}
    B -->|XML| C[xml-analysis-framework]
    B -->|PDF/Word/Excel/PPT| D[docling-analysis-framework]
    B -->|Need AI to query data| E[data-analysis-framework]
    B -->|Text/Code/Config/Other| F[document-analysis-framework]
```
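The decision tree above can be mirrored in code, e.g. when routing files in a pipeline. A minimal sketch (the extension sets and function name are assumptions for illustration):

```python
def pick_framework(path: str, needs_agent_queries: bool = False) -> str:
    """Pick a framework name following the decision tree (heuristic sketch)."""
    ext = path.lower().rsplit(".", 1)[-1] if "." in path else ""
    if ext == "xml":
        return "xml-analysis-framework"
    if ext in {"pdf", "docx", "xlsx", "pptx"}:
        return "docling-analysis-framework"
    if needs_agent_queries and ext in {"csv", "json"}:
        return "data-analysis-framework"
    # Everything else that is text-based falls back to this framework.
    return "document-analysis-framework"
```

Note that, as in the decision tree, the routing for CSV/JSON depends on intent (agent querying vs. chunking), not on the file type alone.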
## 📄 License
MIT License - see [LICENSE](LICENSE) file for details.