document-analysis-framework


- Name: document-analysis-framework
- Version: 1.0.0
- Home page: https://github.com/rdwj/document-analysis-framework
- Summary: Text, code, and configuration file analysis framework for AI/ML data pipelines
- Upload time: 2025-07-28 19:50:01
- Author: Wes Jackson
- Requires Python: >=3.8
- License: MIT
- Keywords: document, analysis, ai, ml, text-processing, code-analysis, configuration, semantic-search
- Requirements: No requirements were recorded.

# Document Analysis Framework

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![AI Ready](https://img.shields.io/badge/AI%20Ready-✓-green.svg)](https://github.com/redhat-ai-americas/document-analysis-framework)

A lightweight document analysis framework for **text, code, configuration, and other text-based files**, designed for AI/ML data pipelines. It uses only the Python standard library for maximum compatibility.

## 🎯 When to Use This Framework

This is a **fallback framework** for text-based files not handled by our specialized frameworks:

### Specialized Frameworks (Use These First!)

1. **[xml-analysis-framework](https://pypi.org/project/xml-analysis-framework/)** 📑
   - For: XML documents of all types
   - Includes: 29+ specialized XML handlers (SCAP, RSS, Maven POM, Spring configs, etc.)
   - Install: `pip install xml-analysis-framework`

2. **[docling-analysis-framework](https://pypi.org/project/docling-analysis-framework/)** 📄
   - For: PDF, Word, Excel, PowerPoint, and images with text
   - Features: Docling-powered extraction with OCR support
   - Install: `pip install docling-analysis-framework`

3. **[data-analysis-framework](https://pypi.org/project/data-analysis-framework/)** 📊
   - For: Structured data that needs AI agent interaction
   - Features: Safe query interface for AI agents to analyze data
   - Install: `pip install data-analysis-framework`

### Use This Framework For:
- **Code files**: Python, JavaScript, TypeScript, Go, Rust, etc.
- **Config files**: Dockerfile, package.json, .env, INI files, etc.
- **Text/Markup**: Markdown, plain text, LaTeX, AsciiDoc, etc.
- **Data files**: CSV, JSON, YAML, TOML, TSV, etc.
- **Other text-based formats** not covered above

> **Note**: Some file types (like CSV, JSON) can be handled by multiple frameworks. Choose based on your use case:
> - Use `data-analysis-framework` for AI agent querying of structured data
> - Use `document-analysis-framework` for chunking and document analysis

## 🚀 Quick Start

### Document Analysis
```python
from core.analyzer import DocumentAnalyzer

# Detect the document type and run the matching analysis
analyzer = DocumentAnalyzer()
result = analyzer.analyze_document("path/to/file.py")

# The result exposes the detected type and the analysis details by key
print(f"Document Type: {result['document_type'].type_name}")
print(f"Language: {result['document_type'].language}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")
```

### Smart Chunking
```python
from core.analyzer import DocumentAnalyzer
from core.chunking import ChunkingOrchestrator

# Analyze document
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze_document("file.py")

# Convert format for chunking
chunking_analysis = {
    'document_type': {
        'type_name': analysis['document_type'].type_name,
        'confidence': analysis['document_type'].confidence,
        'category': analysis['document_type'].category
    },
    'analysis': analysis['analysis']
}

# Generate AI-optimized chunks
orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document("file.py", chunking_analysis, strategy='auto')
```
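
The conversion step above can be wrapped in a small helper so it is not repeated for every file. This is only a sketch: `to_chunking_analysis` is a hypothetical name, not part of the framework, and it reads only the attributes already used in the Quick Start (`type_name`, `confidence`, `category`). The `SimpleNamespace` object is a stand-in for a real `DocumentAnalyzer` result.

```python
from types import SimpleNamespace


def to_chunking_analysis(analysis):
    """Convert an analyze_document() result into the dict layout shown above.

    Hypothetical helper: it only reads the attributes used in the Quick Start.
    """
    doc_type = analysis["document_type"]
    return {
        "document_type": {
            "type_name": doc_type.type_name,
            "confidence": doc_type.confidence,
            "category": doc_type.category,
        },
        "analysis": analysis["analysis"],
    }


# Stand-in result for demonstration only; real results come from DocumentAnalyzer.
example = {
    "document_type": SimpleNamespace(type_name="Python", confidence=0.95, category="code"),
    "analysis": {"note": "placeholder"},
}
converted = to_chunking_analysis(example)
print(converted["document_type"])
```

With a real result, the returned dict would be passed as the second argument to `chunk_document`, as in the example above.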

## 📋 Currently Supported File Types

| Category | File Types | Extensions | Confidence |
|----------|------------|------------|------------|
| **📝 Text & Data** | Markdown, CSV, JSON, YAML, TOML, Plain Text | .md, .csv, .json, .yaml, .toml, .txt | 90-95% |
| **💻 Code Files** | Python, JavaScript, Java, C++, SQL | .py, .js, .java, .cpp, .sql | 90-95% |
| **⚙️ Configuration** | Dockerfile, package.json, requirements.txt, Makefile | Various | 95% |

### Coming Soon:
- TypeScript, Go, Rust, Ruby, PHP, Swift, Kotlin
- Shell scripts, PowerShell, R, MATLAB
- INI files, .env files, Apache/Nginx configs
- LaTeX, AsciiDoc, reStructuredText
- Log files, CSS/SCSS, Vue/Svelte components
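
The supported-types table can be turned into a quick lookup for checking whether a file is covered today. This is a minimal sketch: `SUPPORTED_EXTENSIONS`, `SUPPORTED_FILENAMES`, and `supported_category` are illustrative names, not part of the framework's API, and the entries are copied by hand from the table rather than read from the framework itself.

```python
from pathlib import Path

# Extensions copied from the "Currently Supported File Types" table (illustrative).
SUPPORTED_EXTENSIONS = {
    ".md": "Text & Data",
    ".csv": "Text & Data",
    ".json": "Text & Data",
    ".yaml": "Text & Data",
    ".toml": "Text & Data",
    ".txt": "Text & Data",
    ".py": "Code Files",
    ".js": "Code Files",
    ".java": "Code Files",
    ".cpp": "Code Files",
    ".sql": "Code Files",
}

# Configuration files are recognized by name rather than by extension.
SUPPORTED_FILENAMES = {"Dockerfile", "package.json", "requirements.txt", "Makefile"}


def supported_category(path):
    """Return the table category for a path, or None if it is not listed."""
    p = Path(path)
    # Check the file name first so package.json is not reported as plain JSON.
    if p.name in SUPPORTED_FILENAMES:
        return "Configuration"
    return SUPPORTED_EXTENSIONS.get(p.suffix.lower())


for name in ["src/app.py", "package.json", "main.go"]:
    print(name, "->", supported_category(name))
```

Checking the file name before the extension matters because `package.json` and `requirements.txt` would otherwise fall into the Text & Data category.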

## 🎯 Key Features

- **🔍 Intelligent Document Detection** - Content-based recognition with confidence scoring
- **🤖 AI-Ready Output** - Structured analysis with quality metrics and use case recommendations
- **⚡ Smart Chunking** - Document-type-aware segmentation strategies
- **🔒 Security & Reliability** - File size limits, safe handling, pure Python stdlib
- **🔄 Extensible** - Easy to add new handlers for additional file types
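
To give a feel for what content-based recognition with confidence scoring can look like using only the standard library, here is a rough sketch. It is not the framework's actual detection logic: `detect_type`, the heuristics, and the confidence values are all assumptions chosen for illustration.

```python
import json
import re


def detect_type(text):
    """Guess (type_name, confidence) from content alone.

    Illustrative heuristics only; the thresholds are arbitrary assumptions.
    """
    stripped = text.strip()

    # Valid JSON is a strong signal, so it gets a high confidence.
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "JSON", 0.95
        except json.JSONDecodeError:
            pass

    # Count Python-looking statements; more hits raise the confidence.
    python_hits = len(
        re.findall(r"^\s*(?:def|class|import|from)\s+\w+", text, re.MULTILINE)
    )
    if python_hits:
        return "Python", min(0.5 + 0.1 * python_hits, 0.95)

    # ATX-style headings suggest Markdown.
    if re.search(r"^#{1,6}\s+\S", text, re.MULTILINE):
        return "Markdown", 0.8

    # Nothing matched: fall back to plain text with low confidence.
    return "Plain Text", 0.5


print(detect_type('{"name": "demo"}'))
print(detect_type("import os\n\ndef main():\n    pass\n"))
```

The checks run from the most specific signal to the least specific, so the fallback is only reached when nothing else matches.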

## 🔧 Installation

```bash
git clone https://github.com/redhat-ai-americas/document-analysis-framework.git
cd document-analysis-framework
pip install -e .
```

## 🧪 Framework Ecosystem

This framework is part of the **AI Building Blocks** document analysis ecosystem:

| Framework | Purpose | Key Features |
|-----------|---------|--------------|
| **[xml-analysis-framework](https://pypi.org/project/xml-analysis-framework/)** | XML document specialist | 29+ handlers, security-focused, enterprise configs |
| **[docling-analysis-framework](https://pypi.org/project/docling-analysis-framework/)** | PDF & Office documents | OCR support, table extraction, figure handling |
| **[data-analysis-framework](https://pypi.org/project/data-analysis-framework/)** | Structured data AI agent | Safe queries, natural language interface |
| **document-analysis-framework** | Everything else (text-based) | Fallback handler, pure Python, extensible |

### Choosing the Right Framework

```mermaid
graph TD
    A[Have a document to analyze?] --> B{What type?}
    B -->|XML| C[xml-analysis-framework]
    B -->|PDF/Word/Excel/PPT| D[docling-analysis-framework]
    B -->|Need AI to query data| E[data-analysis-framework]
    B -->|Text/Code/Config/Other| F[document-analysis-framework]
```
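
The decision flow in the diagram can also be expressed as a small routing function. This is a sketch under stated assumptions: `choose_framework`, the `needs_ai_query` flag, and the specific extensions in `DOCLING_EXTENSIONS` are illustrative and not taken from any of the four packages.

```python
from pathlib import Path

# Assumed extensions for "PDF/Word/Excel/PPT"; adjust to your own inputs.
DOCLING_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx"}


def choose_framework(path, needs_ai_query=False):
    """Return the package name suggested by the diagram for a given file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".xml":
        return "xml-analysis-framework"
    if suffix in DOCLING_EXTENSIONS:
        return "docling-analysis-framework"
    if needs_ai_query:
        # Structured data that an AI agent should query interactively.
        return "data-analysis-framework"
    # Fallback for text, code, config, and other text-based files.
    return "document-analysis-framework"


print(choose_framework("pom.xml"))
print(choose_framework("sales.csv", needs_ai_query=True))
print(choose_framework("sales.csv"))
```

The same CSV file routes to different packages depending on the use case, which mirrors the note about file types that several frameworks can handle.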

## 📄 License

MIT License - see [LICENSE](LICENSE) file for details.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/rdwj/document-analysis-framework",
    "name": "document-analysis-framework",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "document, analysis, ai, ml, text-processing, code-analysis, configuration, semantic-search",
    "author": "Wes Jackson",
    "author_email": "AI Building Blocks <wjackson@redhat.com>",
    "download_url": "https://files.pythonhosted.org/packages/2e/ec/7e7d0f7cb6d5d0fc8a32467550ddba7822c60662be6f3ded585ecc1242a5/document_analysis_framework-1.0.0.tar.gz",
    "platform": null,
    "description": "# Document Analysis Framework\n\n[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![AI Ready](https://img.shields.io/badge/AI%20Ready-\u2713-green.svg)](https://github.com/redhat-ai-americas/document-analysis-framework)\n\nA lightweight document analysis framework for **text, code, configuration, and other text-based files** designed for AI/ML data pipelines. Uses only Python standard library for maximum compatibility.\n\n## \ud83c\udfaf When to Use This Framework\n\nThis is a **fallback framework** for text-based files not handled by our specialized frameworks:\n\n### Specialized Frameworks (Use These First!)\n\n1. **[xml-analysis-framework](https://pypi.org/project/xml-analysis-framework/)** \ud83d\udcd1\n   - For: XML documents of all types\n   - Includes: 29+ specialized XML handlers (SCAP, RSS, Maven POM, Spring configs, etc.)\n   - Install: `pip install xml-analysis-framework`\n\n2. **[docling-analysis-framework](https://pypi.org/project/docling-analysis-framework/)** \ud83d\udcc4\n   - For: PDF, Word, Excel, PowerPoint, and images with text\n   - Features: Docling-powered extraction with OCR support\n   - Install: `pip install docling-analysis-framework`\n\n3. 
**[data-analysis-framework](https://pypi.org/project/data-analysis-framework/)** \ud83d\udcca\n   - For: Structured data that needs AI agent interaction\n   - Features: Safe query interface for AI agents to analyze data\n   - Install: `pip install data-analysis-framework`\n\n### Use This Framework For:\n- **Code files**: Python, JavaScript, TypeScript, Go, Rust, etc.\n- **Config files**: Dockerfile, package.json, .env, INI files, etc.\n- **Text/Markup**: Markdown, plain text, LaTeX, AsciiDoc, etc.\n- **Data files**: CSV, JSON, YAML, TOML, TSV, etc.\n- **Other text-based formats** not covered above\n\n> **Note**: Some file types (like CSV, JSON) can be handled by multiple frameworks. Choose based on your use case:\n> - Use `data-analysis-framework` for AI agent querying of structured data\n> - Use `document-analysis-framework` for chunking and document analysis\n\n## \ud83d\ude80 Quick Start\n\n### Document Analysis\n```python\nfrom core.analyzer import DocumentAnalyzer\n\nanalyzer = DocumentAnalyzer()\nresult = analyzer.analyze_document(\"path/to/file.py\")\n\nprint(f\"Document Type: {result['document_type'].type_name}\")\nprint(f\"Language: {result['document_type'].language}\")\nprint(f\"AI Use Cases: {result['analysis'].ai_use_cases}\")\n```\n\n### Smart Chunking\n```python\nfrom core.analyzer import DocumentAnalyzer\nfrom core.chunking import ChunkingOrchestrator\n\n# Analyze document\nanalyzer = DocumentAnalyzer()\nanalysis = analyzer.analyze_document(\"file.py\")\n\n# Convert format for chunking\nchunking_analysis = {\n    'document_type': {\n        'type_name': analysis['document_type'].type_name,\n        'confidence': analysis['document_type'].confidence,\n        'category': analysis['document_type'].category\n    },\n    'analysis': analysis['analysis']\n}\n\n# Generate AI-optimized chunks\norchestrator = ChunkingOrchestrator()\nchunks = orchestrator.chunk_document(\"file.py\", chunking_analysis, strategy='auto')\n```\n\n## \ud83d\udccb Currently 
Supported File Types\n\n| Category | File Types | Extensions | Confidence |\n|----------|------------|------------|------------|\n| **\ud83d\udcdd Text & Data** | Markdown, CSV, JSON, YAML, TOML, Plain Text | .md, .csv, .json, .yaml, .toml, .txt | 90-95% |\n| **\ud83d\udcbb Code Files** | Python, JavaScript, Java, C++, SQL | .py, .js, .java, .cpp, .sql | 90-95% |\n| **\u2699\ufe0f Configuration** | Dockerfile, package.json, requirements.txt, Makefile | Various | 95% |\n\n### Coming Soon:\n- TypeScript, Go, Rust, Ruby, PHP, Swift, Kotlin\n- Shell scripts, PowerShell, R, MATLAB\n- INI files, .env files, Apache/Nginx configs\n- LaTeX, AsciiDoc, reStructuredText\n- Log files, CSS/SCSS, Vue/Svelte components\n\n## \ud83c\udfaf Key Features\n\n- **\ud83d\udd0d Intelligent Document Detection** - Content-based recognition with confidence scoring\n- **\ud83e\udd16 AI-Ready Output** - Structured analysis with quality metrics and use case recommendations  \n- **\u26a1 Smart Chunking** - Document-type-aware segmentation strategies\n- **\ud83d\udd12 Security & Reliability** - File size limits, safe handling, pure Python stdlib\n- **\ud83d\udd04 Extensible** - Easy to add new handlers for additional file types\n\n## \ud83d\udd27 Installation\n\n```bash\ngit clone https://github.com/redhat-ai-americas/document-analysis-framework.git\ncd document-analysis-framework\npip install -e .\n```\n\n## \ud83e\uddea Framework Ecosystem\n\nThis framework is part of the **AI Building Blocks** document analysis ecosystem:\n\n| Framework | Purpose | Key Features |\n|-----------|---------|--------------|\n| **[xml-analysis-framework](https://pypi.org/project/xml-analysis-framework/)** | XML document specialist | 29+ handlers, security-focused, enterprise configs |\n| **[docling-analysis-framework](https://pypi.org/project/docling-analysis-framework/)** | PDF & Office documents | OCR support, table extraction, figure handling |\n| 
**[data-analysis-framework](https://pypi.org/project/data-analysis-framework/)** | Structured data AI agent | Safe queries, natural language interface |\n| **document-analysis-framework** | Everything else (text-based) | Fallback handler, pure Python, extensible |\n\n### Choosing the Right Framework\n\n```mermaid\ngraph TD\n    A[Have a document to analyze?] --> B{What type?}\n    B -->|XML| C[xml-analysis-framework]\n    B -->|PDF/Word/Excel/PPT| D[docling-analysis-framework]\n    B -->|Need AI to query data| E[data-analysis-framework]\n    B -->|Text/Code/Config/Other| F[document-analysis-framework]\n```\n\n## \ud83d\udcc4 License\n\nMIT License - see [LICENSE](LICENSE) file for details.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Text, code, and configuration file analysis framework for AI/ML data pipelines",
    "version": "1.0.0",
    "project_urls": {
        "Documentation": "https://github.com/rdwj/document-analysis-framework/blob/main/README.md",
        "Homepage": "https://github.com/rdwj/document-analysis-framework",
        "Issues": "https://github.com/rdwj/document-analysis-framework/issues",
        "Repository": "https://github.com/rdwj/document-analysis-framework"
    },
    "split_keywords": [
        "document",
        " analysis",
        " ai",
        " ml",
        " text-processing",
        " code-analysis",
        " configuration",
        " semantic-search"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "957bca3b0d17f722d738bc3c6ef84709ccbfd34151848e66cf4cd3c83266cfe0",
                "md5": "5786c253c769689b43e9ad0728c1cb44",
                "sha256": "bfea3ce1e5fdc38b1db15098ec7775affafa9cb0b12d4094e17fa3fb2907608a"
            },
            "downloads": -1,
            "filename": "document_analysis_framework-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5786c253c769689b43e9ad0728c1cb44",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 111096,
            "upload_time": "2025-07-28T19:49:59",
            "upload_time_iso_8601": "2025-07-28T19:49:59.532303Z",
            "url": "https://files.pythonhosted.org/packages/95/7b/ca3b0d17f722d738bc3c6ef84709ccbfd34151848e66cf4cd3c83266cfe0/document_analysis_framework-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2eec7e7d0f7cb6d5d0fc8a32467550ddba7822c60662be6f3ded585ecc1242a5",
                "md5": "8171cd6d335cf006db309c1e2f510e63",
                "sha256": "45de9484b0fcc6f166c6852305273711ccbedfefa14fe6ce1c0130043b44cae4"
            },
            "downloads": -1,
            "filename": "document_analysis_framework-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "8171cd6d335cf006db309c1e2f510e63",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 155192,
            "upload_time": "2025-07-28T19:50:01",
            "upload_time_iso_8601": "2025-07-28T19:50:01.007443Z",
            "url": "https://files.pythonhosted.org/packages/2e/ec/7e7d0f7cb6d5d0fc8a32467550ddba7822c60662be6f3ded585ecc1242a5/document_analysis_framework-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-28 19:50:01",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "rdwj",
    "github_project": "document-analysis-framework",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "document-analysis-framework"
}
        