# Document Analysis Framework
[Python 3.8+](https://www.python.org/downloads/)
[MIT License](https://opensource.org/licenses/MIT)
[GitHub](https://github.com/redhat-ai-americas/document-analysis-framework)
A lightweight document analysis framework for **text, code, configuration, and other text-based files**, designed for AI/ML data pipelines. It uses only the Python standard library for maximum compatibility.
## 🎯 When to Use This Framework
This is a **fallback framework** for text-based files not handled by our specialized frameworks:
### Specialized Frameworks (Use These First!)
1. **[xml-analysis-framework](https://pypi.org/project/xml-analysis-framework/)** 📑
- For: XML documents of all types
- Includes: 29+ specialized XML handlers (SCAP, RSS, Maven POM, Spring configs, etc.)
- Install: `pip install xml-analysis-framework`
2. **[docling-analysis-framework](https://pypi.org/project/docling-analysis-framework/)** 📄
- For: PDF, Word, Excel, PowerPoint, and images with text
- Features: Docling-powered extraction with OCR support
- Install: `pip install docling-analysis-framework`
3. **[data-analysis-framework](https://pypi.org/project/data-analysis-framework/)** 📊
- For: Structured data that needs AI agent interaction
- Features: Safe query interface for AI agents to analyze data
- Install: `pip install data-analysis-framework`
### Use This Framework For:
- **Code files**: Python, JavaScript, TypeScript, Go, Rust, etc.
- **Config files**: Dockerfile, package.json, .env, INI files, etc.
- **Text/Markup**: Markdown, plain text, LaTeX, AsciiDoc, etc.
- **Data files**: CSV, JSON, YAML, TOML, TSV, etc.
- **Other text-based formats** not covered above
> **Note**: Some file types (like CSV, JSON) can be handled by multiple frameworks. Choose based on your use case:
> - Use `data-analysis-framework` for AI agent querying of structured data
> - Use `document-analysis-framework` for chunking and document analysis
## 🚀 Quick Start
### Document Analysis
```python
from core.analyzer import DocumentAnalyzer
analyzer = DocumentAnalyzer()
result = analyzer.analyze_document("path/to/file.py")
print(f"Document Type: {result['document_type'].type_name}")
print(f"Language: {result['document_type'].language}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")
```
### Smart Chunking
```python
from core.analyzer import DocumentAnalyzer
from core.chunking import ChunkingOrchestrator
# Analyze document
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze_document("file.py")
# Convert format for chunking
chunking_analysis = {
    'document_type': {
        'type_name': analysis['document_type'].type_name,
        'confidence': analysis['document_type'].confidence,
        'category': analysis['document_type'].category
    },
    'analysis': analysis['analysis']
}
# Generate AI-optimized chunks
orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document("file.py", chunking_analysis, strategy='auto')
```
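For intuition, type-aware chunking can be pictured as dispatching on the detected document type. The following stdlib-only sketch is our illustration of the idea, not the framework's actual strategies; the helpers `chunk_markdown`, `chunk_python`, and `chunk_auto` are hypothetical names:

```python
import re

def chunk_markdown(text):
    """Split Markdown on top-level headings; each chunk keeps its heading."""
    parts = re.split(r"(?m)^(?=# )", text)
    return [p for p in parts if p.strip()]

def chunk_python(text):
    """Split Python source before each top-level def/class statement."""
    parts = re.split(r"(?m)^(?=(?:def|class)\s)", text)
    return [p for p in parts if p.strip()]

def chunk_auto(text, type_name):
    # Dispatch on the detected document type; unknown types fall back
    # to a single chunk containing the whole document.
    strategies = {"Markdown": chunk_markdown, "Python": chunk_python}
    return strategies.get(type_name, lambda t: [t])(text)

md = "# Intro\nhello\n# Usage\nworld\n"
print(len(chunk_auto(md, "Markdown")))  # 2 chunks, one per heading
```

The payoff of a strategy like `auto` is that chunk boundaries land on semantic units (headings, function definitions) rather than arbitrary byte offsets.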
## 📋 Currently Supported File Types
| Category | File Types | Extensions | Confidence |
|----------|------------|------------|------------|
| **📝 Text & Data** | Markdown, CSV, JSON, YAML, TOML, Plain Text | .md, .csv, .json, .yaml, .toml, .txt | 90-95% |
| **💻 Code Files** | Python, JavaScript, Java, C++, SQL | .py, .js, .java, .cpp, .sql | 90-95% |
| **⚙️ Configuration** | Dockerfile, package.json, requirements.txt, Makefile | Various | 95% |
### Coming Soon:
- TypeScript, Go, Rust, Ruby, PHP, Swift, Kotlin
- Shell scripts, PowerShell, R, MATLAB
- INI files, .env files, Apache/Nginx configs
- LaTeX, AsciiDoc, reStructuredText
- Log files, CSS/SCSS, Vue/Svelte components
## 🎯 Key Features
- **🔍 Intelligent Document Detection** - Content-based recognition with confidence scoring
- **🤖 AI-Ready Output** - Structured analysis with quality metrics and use case recommendations
- **⚡ Smart Chunking** - Document-type-aware segmentation strategies
- **🔒 Security & Reliability** - File size limits, safe handling, pure Python stdlib
- **🔄 Extensible** - Easy to add new handlers for additional file types
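As a rough picture of what content-based recognition with confidence scoring means, here is a hand-rolled, stdlib-only sketch. The probes and the confidence numbers are illustrative assumptions, not the framework's detector:

```python
import json

def detect_type(text):
    """Return (type_name, confidence) from content alone, no extension needed."""
    stripped = text.lstrip()
    # Parsing cleanly as JSON is a very strong signal.
    try:
        json.loads(text)
        return ("JSON", 0.95)
    except ValueError:
        pass
    # A python shebang is near-certain; common keywords are a weaker hint.
    if stripped.startswith("#!") and "python" in text.splitlines()[0]:
        return ("Python", 0.95)
    if "def " in text or "import " in text:
        return ("Python", 0.90)
    # Lines starting with '#' suggest Markdown headings.
    if any(line.startswith("#") for line in text.splitlines()):
        return ("Markdown", 0.90)
    return ("Plain Text", 0.50)

print(detect_type('{"a": 1}'))  # ('JSON', 0.95)
```

Stacking several cheap structural probes like this is what lets detection work on files with missing or misleading extensions.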
## 🔧 Installation
```bash
git clone https://github.com/redhat-ai-americas/document-analysis-framework.git
cd document-analysis-framework
pip install -e .
```
## 🧪 Framework Ecosystem
This framework is part of the **AI Building Blocks** document analysis ecosystem:
| Framework | Purpose | Key Features |
|-----------|---------|--------------|
| **[xml-analysis-framework](https://pypi.org/project/xml-analysis-framework/)** | XML document specialist | 29+ handlers, security-focused, enterprise configs |
| **[docling-analysis-framework](https://pypi.org/project/docling-analysis-framework/)** | PDF & Office documents | OCR support, table extraction, figure handling |
| **[data-analysis-framework](https://pypi.org/project/data-analysis-framework/)** | Structured data AI agent | Safe queries, natural language interface |
| **document-analysis-framework** | Everything else (text-based) | Fallback handler, pure Python, extensible |
### Choosing the Right Framework
```mermaid
graph TD
A[Have a document to analyze?] --> B{What type?}
B -->|XML| C[xml-analysis-framework]
B -->|PDF/Word/Excel/PPT| D[docling-analysis-framework]
B -->|Need AI to query data| E[data-analysis-framework]
B -->|Text/Code/Config/Other| F[document-analysis-framework]
```
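The flowchart can also be read as a small routing function. In the sketch below, the framework names are the real packages listed above, but the extension table and the `pick_framework` helper are our own illustration:

```python
from pathlib import Path

# Extension-based routing; content sniffing (as in the flowchart) would refine this.
ROUTES = {
    ".xml": "xml-analysis-framework",
    ".pdf": "docling-analysis-framework",
    ".docx": "docling-analysis-framework",
    ".xlsx": "docling-analysis-framework",
    ".pptx": "docling-analysis-framework",
}

def pick_framework(path, agent_queries=False):
    """Return the name of the framework best suited to the given file."""
    ext = Path(path).suffix.lower()
    if ext in ROUTES:
        return ROUTES[ext]
    # Structured data that an AI agent should query goes to the data framework.
    if agent_queries and ext in {".csv", ".json", ".parquet"}:
        return "data-analysis-framework"
    # Everything else text-based lands on the fallback framework.
    return "document-analysis-framework"

print(pick_framework("report.pdf"))  # docling-analysis-framework
```

Note the `agent_queries` flag: it encodes the earlier advice that CSV/JSON can go to either `data-analysis-framework` or this framework depending on use case.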
## 📄 License
MIT License - see [LICENSE](LICENSE) file for details.