analysis-framework-base

Name: analysis-framework-base
Version: 1.0.0
Summary: Base interfaces and models for document analysis frameworks
Upload time: 2025-10-28 00:45:17
Requires Python: >=3.8
Keywords: ai, document-analysis, framework, interface
Requirements: No requirements were recorded.
# analysis-framework-base

**Lightweight foundation package defining shared interfaces and contracts for document analysis frameworks.**

[![Python Version](https://img.shields.io/badge/python-3.8%2B-blue)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![Development Status](https://img.shields.io/badge/status-production-brightgreen)](https://github.com/redhat-ai-americas/analysis-framework-base)

## Overview

`analysis-framework-base` provides abstract base classes and data models that establish a consistent interface across multiple specialized document analysis frameworks. This package serves as the foundation for:

- **xml-analysis-framework** - XML and S1000D technical documentation
- **docling-analysis-framework** - PDF, Word, PowerPoint via Docling
- **document-analysis-framework** - General document processing
- **data-analysis-framework** - Structured data analysis

## Key Features

- **Zero Dependencies** - Pure Python standard library only
- **Type Hints** - Full typing support for better IDE integration
- **Multiple Access Patterns** - Dict-style and attribute-style access
- **Extensible** - Easy to implement new frameworks
- **Well Documented** - Comprehensive docstrings and examples

## Installation

```bash
pip install analysis-framework-base
```

### For Development

```bash
pip install "analysis-framework-base[dev]"  # quotes keep shells like zsh from expanding the brackets
```

## Quick Start

### Implementing a Framework Analyzer

```python
from analysis_framework_base import (
    BaseAnalyzer,
    UnifiedAnalysisResult,
    UnsupportedFormatError,
    AnalysisError
)

class MyDocumentAnalyzer(BaseAnalyzer):
    """Custom analyzer for my document format."""

    def analyze_unified(self, file_path: str, **kwargs) -> UnifiedAnalysisResult:
        """Analyze a document and return unified results."""
        # Check file format
        if not file_path.endswith('.mydoc'):
            raise UnsupportedFormatError(f"Unsupported format: {file_path}")

        try:
            # Perform analysis
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()

            return UnifiedAnalysisResult(
                document_type="MyDoc Technical Document",
                confidence=0.95,
                framework="my-doc-analyzer",
                metadata={
                    "version": "1.0",
                    "word_count": len(content.split())
                },
                content=content,
                ai_opportunities=[
                    "Document summarization",
                    "Question answering",
                    "Entity extraction"
                ]
            )
        except Exception as e:
            raise AnalysisError(f"Analysis failed: {e}")

    def get_supported_formats(self) -> list[str]:
        """Return supported file extensions."""
        return ['.mydoc', '.md']
```
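
Because every analyzer advertises its formats via `get_supported_formats()`, callers can route a file to the right framework without hard-coding extensions. A minimal dispatch sketch; the `pick_analyzer` helper is hypothetical, not part of this package, and it reuses `MyDocumentAnalyzer` and the imports from the snippet above:

```python
import os

def pick_analyzer(file_path, analyzers):
    """Return the first analyzer that supports the file's extension (hypothetical helper)."""
    ext = os.path.splitext(file_path)[1].lower()
    for analyzer in analyzers:
        if ext in analyzer.get_supported_formats():
            return analyzer
    raise UnsupportedFormatError(f"No analyzer registered for {ext!r}")

analyzer = pick_analyzer('manual.mydoc', [MyDocumentAnalyzer()])
result = analyzer.analyze_unified('manual.mydoc')
```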

### Implementing a Document Chunker

```python
from __future__ import annotations  # enables list[...] annotations on Python 3.8

from analysis_framework_base import (
    BaseChunker,
    ChunkInfo,
    UnifiedAnalysisResult,
    ChunkingError
)

class MyDocumentChunker(BaseChunker):
    """Custom chunker for my document format."""

    def chunk_document(
        self,
        file_path: str,
        analysis: UnifiedAnalysisResult,
        strategy: str = "auto",
        **kwargs
    ) -> list[ChunkInfo]:
        """Split document into chunks."""
        if strategy not in self.get_supported_strategies():
            raise ChunkingError(f"Unknown strategy: {strategy}")

        # Implement chunking logic
        chunks = []
        content = analysis.content or ""

        # Simple paragraph-based chunking (skip empty paragraphs)
        paragraphs = [p for p in content.split('\n\n') if p.strip()]
        for i, para in enumerate(paragraphs):
            chunk = ChunkInfo(
                chunk_id=f"{file_path}_chunk_{i:04d}",
                content=para.strip(),
                metadata={
                    "paragraph_index": i,
                    "source_file": file_path
                },
                token_count=int(len(para.split()) * 1.3),  # rough estimate; ChunkInfo expects an int
                chunk_type="paragraph"
            )
            chunks.append(chunk)

        return chunks

    def get_supported_strategies(self) -> list[str]:
        """Return supported chunking strategies."""
        return ['auto', 'paragraph', 'sliding_window']
```

### Using the Framework

```python
# Initialize analyzer
analyzer = MyDocumentAnalyzer()

# Analyze a document
result = analyzer.analyze_unified('document.mydoc')

# Access results (multiple patterns supported)
print(result.document_type)           # Attribute access
print(result['confidence'])           # Dict-style access
print(result.get('framework'))        # Dict get method

# Convert to dict for JSON serialization
result_dict = result.to_dict()

# Initialize chunker
chunker = MyDocumentChunker()

# Chunk the document
chunks = chunker.chunk_document('document.mydoc', result, strategy='paragraph')

# Process chunks
for chunk in chunks:
    print(f"Chunk {chunk.chunk_id}: {chunk.token_count} tokens")
    print(chunk.content[:100])  # First 100 chars
```
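
Since `to_dict()` returns plain Python types, results can be persisted with the standard library alone. A short sketch, assuming the framework-specific `metadata` values are themselves JSON-serializable:

```python
import json

# Write the unified result to disk for downstream pipelines.
with open('analysis.json', 'w', encoding='utf-8') as f:
    json.dump(result.to_dict(), f, indent=2)
```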

## Core Interfaces

### BaseAnalyzer

Abstract base class for document analyzers. Requires implementation of:

- `analyze_unified(file_path: str, **kwargs) -> UnifiedAnalysisResult`
- `get_supported_formats() -> List[str]`

### BaseChunker

Abstract base class for document chunkers. Requires implementation of:

- `chunk_document(file_path: str, analysis: UnifiedAnalysisResult, strategy: str, **kwargs) -> List[ChunkInfo]`
- `get_supported_strategies() -> List[str]`
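
Both contracts follow the standard `abc` pattern: a subclass that omits a required method cannot be instantiated. The sketch below is illustrative, not the package's verbatim source:

```python
from abc import ABC, abstractmethod
from typing import List

class BaseAnalyzer(ABC):
    """Sketch of the analyzer contract (illustrative, not the actual source)."""

    @abstractmethod
    def analyze_unified(self, file_path: str, **kwargs) -> "UnifiedAnalysisResult":
        """Analyze a document and return unified results."""

    @abstractmethod
    def get_supported_formats(self) -> List[str]:
        """Return supported file extensions."""
```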

## Data Models

### UnifiedAnalysisResult

Standard result structure with:

- `document_type: str` - Human-readable document type
- `confidence: float` - Confidence score (0.0-1.0)
- `framework: str` - Framework identifier
- `metadata: Dict[str, Any]` - Framework-specific metadata
- `content: Optional[str]` - Extracted text content
- `ai_opportunities: List[str]` - Suggested AI use cases
- `raw_analysis: Dict[str, Any]` - Complete framework results

Supports both attribute and dict-style access:

```python
result.document_type        # Attribute
result['document_type']     # Dict-style
result.get('document_type') # Get method
'document_type' in result   # Contains check
```
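
This dual access pattern is straightforward to provide on a dataclass. The sketch below shows one plausible mechanism; `DualAccessResult` is an illustrative stand-in, not the package's actual implementation:

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class DualAccessResult:
    """Illustrative stand-in for UnifiedAnalysisResult (not the package source)."""
    document_type: str
    confidence: float
    framework: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    content: Optional[str] = None
    ai_opportunities: List[str] = field(default_factory=list)
    raw_analysis: Dict[str, Any] = field(default_factory=dict)

    def __getitem__(self, key: str) -> Any:
        return getattr(self, key)           # result['confidence']

    def get(self, key: str, default: Any = None) -> Any:
        return getattr(self, key, default)  # result.get('framework')

    def __contains__(self, key: str) -> bool:
        return hasattr(self, key)           # 'document_type' in result

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)                 # JSON-ready plain dict
```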

### ChunkInfo

Standard chunk structure with:

- `chunk_id: str` - Unique identifier
- `content: str` - Chunk text content
- `metadata: Dict[str, Any]` - Chunk metadata
- `token_count: int` - Estimated token count
- `chunk_type: str` - Type (text, code, table, etc.)

## Exception Hierarchy

```python
FrameworkError                  # Base exception
├── UnsupportedFormatError     # File format not supported
├── AnalysisError              # Analysis failed
└── ChunkingError              # Chunking failed
```
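
Because all three specific exceptions share `FrameworkError` as a base, one handler can cover every failure mode while still allowing targeted handling. A sketch, assuming `FrameworkError` is importable from the package root like the other names:

```python
from analysis_framework_base import FrameworkError, UnsupportedFormatError

try:
    result = analyzer.analyze_unified('report.mydoc')
except UnsupportedFormatError:
    result = None  # hand the file to a different framework
except FrameworkError as e:
    # the shared base also covers AnalysisError (and ChunkingError from chunkers)
    print(f"Analysis failed: {e}")
```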

## Constants

### ChunkStrategy Enum

Standard chunking strategy names:

- `AUTO` - Framework auto-selects best strategy
- `HIERARCHICAL` - Structure-based (sections, headings)
- `SLIDING_WINDOW` - Fixed-size overlapping chunks
- `CONTENT_AWARE` - Semantic boundary detection
- `STRUCTURAL` - Element-based (paragraphs, tables)
- `TABLE_AWARE` - Special table handling
- `PAGE_AWARE` - Page-boundary chunking

```python
from analysis_framework_base import ChunkStrategy

strategy = ChunkStrategy.HIERARCHICAL
print(strategy.value)  # 'hierarchical'
```
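
Since each member's `.value` is the plain string that the string-based `chunk_document()` API accepts, the enum can also validate user input up front. A sketch, assuming every member maps to a lowercase string the way `HIERARCHICAL` does above:

```python
from analysis_framework_base import ChunkStrategy

requested = 'sliding_window'
valid_strategies = {s.value for s in ChunkStrategy}  # {'auto', 'hierarchical', ...}
if requested in valid_strategies:
    chunks = chunker.chunk_document('document.mydoc', result, strategy=requested)
else:
    raise ValueError(f"Unknown strategy: {requested}")
```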

## Framework Suite

This package is part of a suite of analysis frameworks:

- **xml-analysis-framework** - 29+ specialized XML handlers, S1000D support, hierarchical chunking
- **docling-analysis-framework** - PDF, DOCX, PPTX via Docling, table-aware chunking
- **document-analysis-framework** - General document processing, format detection
- **data-analysis-framework** - CSV, JSON, Parquet with query paradigm

Each framework implements the interfaces defined in this package for consistent usage.

## Development

### Setup Development Environment

```bash
# Clone repository
git clone https://github.com/redhat-ai-americas/analysis-framework-base.git
cd analysis-framework-base

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install with dev dependencies
pip install -e ".[dev]"
```

### Running Tests

```bash
# Run all tests with coverage
pytest

# Run specific test file
pytest tests/test_models.py

# Run with verbose output
pytest -v

# Generate HTML coverage report
pytest --cov-report=html
```

### Code Quality

```bash
# Format code
black src/ tests/

# Check formatting
black --check src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/
```

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run tests and quality checks
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request

### Code Standards

- Follow PEP 8 style guidelines
- Add type hints to all functions
- Write comprehensive docstrings
- Include examples in docstrings
- Maintain test coverage above 80%
- Keep the package dependency-free (stdlib only)

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## Support

- **Issues**: [GitHub Issues](https://github.com/redhat-ai-americas/analysis-framework-base/issues)
- **Documentation**: [GitHub Repository](https://github.com/redhat-ai-americas/analysis-framework-base)

## Authors

Red Hat AI Americas Team

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for version history and changes.
