# Document Analysis Framework
[Python 3.8+](https://www.python.org/downloads/)
[MIT License](https://opensource.org/licenses/MIT)
[GitHub](https://github.com/redhat-ai-americas/document-analysis-framework)
A lightweight document analysis framework for **text, code, configuration, and other text-based files**, designed for AI/ML data pipelines. It uses only the Python standard library for maximum compatibility.
## 🎯 When to Use This Framework
This is a **fallback framework** for text-based files not handled by our specialized frameworks:
### Specialized Frameworks (Use These First!)
1. **[xml-analysis-framework](https://pypi.org/project/xml-analysis-framework/)** 📑
- For: XML documents of all types
- Includes: 29+ specialized XML handlers (SCAP, RSS, Maven POM, Spring configs, etc.)
- Install: `pip install xml-analysis-framework`
2. **[docling-analysis-framework](https://pypi.org/project/docling-analysis-framework/)** 📄
- For: PDF, Word, Excel, PowerPoint, and images with text
- Features: Docling-powered extraction with OCR support
- Install: `pip install docling-analysis-framework`
3. **[data-analysis-framework](https://pypi.org/project/data-analysis-framework/)** 📊
- For: Structured data that needs AI agent interaction
- Features: Safe query interface for AI agents to analyze data
- Install: `pip install data-analysis-framework`
### Use This Framework For:
- **Code files**: Python, JavaScript, TypeScript, Go, Rust, etc.
- **Config files**: Dockerfile, package.json, .env, INI files, etc.
- **Text/Markup**: Markdown, plain text, LaTeX, AsciiDoc, etc.
- **Data files**: CSV, JSON, YAML, TOML, TSV, etc.
- **Other text-based formats** not covered above
> **Note**: Some file types (like CSV, JSON) can be handled by multiple frameworks. Choose based on your use case:
> - Use `data-analysis-framework` for AI agent querying of structured data
> - Use `document-analysis-framework` for chunking and document analysis
## 🚀 Quick Start
### Document Analysis
```python
from core.analyzer import DocumentAnalyzer
analyzer = DocumentAnalyzer()
result = analyzer.analyze_document("path/to/file.py")
print(f"Document Type: {result['document_type'].type_name}")
print(f"Language: {result['document_type'].language}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")
```
### Smart Chunking
```python
from core.analyzer import DocumentAnalyzer
from core.chunking import ChunkingOrchestrator
# Analyze document
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze_document("file.py")
# Convert format for chunking
chunking_analysis = {
    'document_type': {
        'type_name': analysis['document_type'].type_name,
        'confidence': analysis['document_type'].confidence,
        'category': analysis['document_type'].category
    },
    'analysis': analysis['analysis']
}
# Generate AI-optimized chunks
orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document("file.py", chunking_analysis, strategy='auto')
```
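For intuition, type-aware chunking can be pictured as dispatching on the detected document type. The following stdlib-only sketch is our illustration of the idea, not the framework's actual strategies; the helpers `chunk_markdown`, `chunk_python`, and `chunk_auto` are hypothetical names:

```python
import re

def chunk_markdown(text):
    """Split Markdown on top-level headings; each chunk keeps its heading."""
    parts = re.split(r"(?m)^(?=# )", text)
    return [p for p in parts if p.strip()]

def chunk_python(text):
    """Split Python source before each top-level def/class statement."""
    parts = re.split(r"(?m)^(?=(?:def|class)\s)", text)
    return [p for p in parts if p.strip()]

def chunk_auto(text, type_name):
    # Dispatch on the detected document type; unknown types fall back
    # to a single chunk containing the whole document.
    strategies = {"Markdown": chunk_markdown, "Python": chunk_python}
    return strategies.get(type_name, lambda t: [t])(text)

md = "# Intro\nhello\n# Usage\nworld\n"
print(len(chunk_auto(md, "Markdown")))  # 2 chunks, one per heading
```

The payoff of a strategy like `auto` is that chunk boundaries land on semantic units (headings, function definitions) rather than arbitrary byte offsets.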
## 📋 Currently Supported File Types
| Category | File Types | Extensions | Confidence |
|----------|------------|------------|------------|
| **📝 Text & Data** | Markdown, CSV, JSON, YAML, TOML, Plain Text | .md, .csv, .json, .yaml, .toml, .txt | 90-95% |
| **💻 Code Files** | Python, JavaScript, Java, C++, SQL | .py, .js, .java, .cpp, .sql | 90-95% |
| **⚙️ Configuration** | Dockerfile, package.json, requirements.txt, Makefile | Various | 95% |
### Coming Soon:
- TypeScript, Go, Rust, Ruby, PHP, Swift, Kotlin
- Shell scripts, PowerShell, R, MATLAB
- INI files, .env files, Apache/Nginx configs
- LaTeX, AsciiDoc, reStructuredText
- Log files, CSS/SCSS, Vue/Svelte components
## 🎯 Key Features
- **🔍 Intelligent Document Detection** - Content-based recognition with confidence scoring
- **🤖 AI-Ready Output** - Structured analysis with quality metrics and use case recommendations
- **⚡ Smart Chunking** - Document-type-aware segmentation strategies
- **🔒 Security & Reliability** - File size limits, safe handling, pure Python stdlib
- **🔄 Extensible** - Easy to add new handlers for additional file types
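As a rough picture of what content-based recognition with confidence scoring means, here is a hand-rolled, stdlib-only sketch. The probes and the confidence numbers are illustrative assumptions, not the framework's detector:

```python
import json

def detect_type(text):
    """Return (type_name, confidence) from content alone, no extension needed."""
    stripped = text.lstrip()
    # Parsing cleanly as JSON is a very strong signal.
    try:
        json.loads(text)
        return ("JSON", 0.95)
    except ValueError:
        pass
    # A python shebang is near-certain; common keywords are a weaker hint.
    if stripped.startswith("#!") and "python" in text.splitlines()[0]:
        return ("Python", 0.95)
    if "def " in text or "import " in text:
        return ("Python", 0.90)
    # Lines starting with '#' suggest Markdown headings.
    if any(line.startswith("#") for line in text.splitlines()):
        return ("Markdown", 0.90)
    return ("Plain Text", 0.50)

print(detect_type('{"a": 1}'))  # ('JSON', 0.95)
```

Stacking several cheap structural probes like this is what lets detection work on files with missing or misleading extensions.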
## 🔧 Installation
```bash
git clone https://github.com/redhat-ai-americas/document-analysis-framework.git
cd document-analysis-framework
pip install -e .
```
## 🧪 Framework Ecosystem
This framework is part of the **AI Building Blocks** document analysis ecosystem:
| Framework | Purpose | Key Features |
|-----------|---------|--------------|
| **[xml-analysis-framework](https://pypi.org/project/xml-analysis-framework/)** | XML document specialist | 29+ handlers, security-focused, enterprise configs |
| **[docling-analysis-framework](https://pypi.org/project/docling-analysis-framework/)** | PDF & Office documents | OCR support, table extraction, figure handling |
| **[data-analysis-framework](https://pypi.org/project/data-analysis-framework/)** | Structured data AI agent | Safe queries, natural language interface |
| **document-analysis-framework** | Everything else (text-based) | Fallback handler, pure Python, extensible |
### Choosing the Right Framework
```mermaid
graph TD
A[Have a document to analyze?] --> B{What type?}
B -->|XML| C[xml-analysis-framework]
B -->|PDF/Word/Excel/PPT| D[docling-analysis-framework]
B -->|Need AI to query data| E[data-analysis-framework]
B -->|Text/Code/Config/Other| F[document-analysis-framework]
```
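The flowchart can also be read as a small routing function. In the sketch below, the framework names are the real packages listed above, but the extension table and the `pick_framework` helper are our own illustration:

```python
from pathlib import Path

# Extension-based routing; content sniffing (as in the flowchart) would refine this.
ROUTES = {
    ".xml": "xml-analysis-framework",
    ".pdf": "docling-analysis-framework",
    ".docx": "docling-analysis-framework",
    ".xlsx": "docling-analysis-framework",
    ".pptx": "docling-analysis-framework",
}

def pick_framework(path, agent_queries=False):
    """Return the name of the framework best suited to the given file."""
    ext = Path(path).suffix.lower()
    if ext in ROUTES:
        return ROUTES[ext]
    # Structured data that an AI agent should query goes to the data framework.
    if agent_queries and ext in {".csv", ".json", ".parquet"}:
        return "data-analysis-framework"
    # Everything else text-based lands on the fallback framework.
    return "document-analysis-framework"

print(pick_framework("report.pdf"))  # docling-analysis-framework
```

Note the `agent_queries` flag: it encodes the earlier advice that CSV/JSON can go to either `data-analysis-framework` or this framework depending on use case.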
## 📄 License
MIT License - see [LICENSE](LICENSE) file for details.