# analysis-framework-base
**Lightweight foundation package defining shared interfaces and contracts for document analysis frameworks.**
[Python 3.8+](https://www.python.org/downloads/) | [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) | [GitHub](https://github.com/redhat-ai-americas/analysis-framework-base)
## Overview
`analysis-framework-base` provides abstract base classes and data models that establish a consistent interface across multiple specialized document analysis frameworks. This package serves as the foundation for:
- **xml-analysis-framework** - XML and S1000D technical documentation
- **docling-analysis-framework** - PDF, Word, PowerPoint via Docling
- **document-analysis-framework** - General document processing
- **data-analysis-framework** - Structured data analysis
## Key Features
- **Zero Dependencies** - Pure Python standard library only
- **Type Hints** - Full typing support for better IDE integration
- **Multiple Access Patterns** - Dict-style and attribute-style access
- **Extensible** - Easy to implement new frameworks
- **Well Documented** - Comprehensive docstrings and examples
## Installation
```bash
pip install analysis-framework-base
```
### For Development
```bash
pip install "analysis-framework-base[dev]"
```
## Quick Start
### Implementing a Framework Analyzer
```python
from typing import List

from analysis_framework_base import (
    BaseAnalyzer,
    UnifiedAnalysisResult,
    UnsupportedFormatError,
    AnalysisError
)

class MyDocumentAnalyzer(BaseAnalyzer):
    """Custom analyzer for my document format."""

    def analyze_unified(self, file_path: str, **kwargs) -> UnifiedAnalysisResult:
        """Analyze a document and return unified results."""
        # Check the file format against the extensions this analyzer declares
        if not any(file_path.endswith(ext) for ext in self.get_supported_formats()):
            raise UnsupportedFormatError(f"Unsupported format: {file_path}")

        try:
            # Perform analysis
            with open(file_path, 'r') as f:
                content = f.read()

            return UnifiedAnalysisResult(
                document_type="MyDoc Technical Document",
                confidence=0.95,
                framework="my-doc-analyzer",
                metadata={
                    "version": "1.0",
                    "word_count": len(content.split())
                },
                content=content,
                ai_opportunities=[
                    "Document summarization",
                    "Question answering",
                    "Entity extraction"
                ]
            )
        except Exception as e:
            raise AnalysisError(f"Analysis failed: {e}") from e

    def get_supported_formats(self) -> List[str]:
        """Return supported file extensions."""
        return ['.mydoc', '.md']
```
### Implementing a Document Chunker
```python
from typing import List

from analysis_framework_base import (
    BaseChunker,
    ChunkInfo,
    UnifiedAnalysisResult,
    ChunkingError
)

class MyDocumentChunker(BaseChunker):
    """Custom chunker for my document format."""

    def chunk_document(
        self,
        file_path: str,
        analysis: UnifiedAnalysisResult,
        strategy: str = "auto",
        **kwargs
    ) -> List[ChunkInfo]:
        """Split document into chunks."""
        if strategy not in self.get_supported_strategies():
            raise ChunkingError(f"Unknown strategy: {strategy}")

        chunks = []
        content = analysis.content or ""

        # Simple paragraph-based chunking
        paragraphs = content.split('\n\n')
        for i, para in enumerate(paragraphs):
            para = para.strip()
            if not para:
                continue  # skip empty paragraphs
            chunk = ChunkInfo(
                chunk_id=f"{file_path}_chunk_{i:04d}",
                content=para,
                metadata={
                    "paragraph_index": i,
                    "source_file": file_path
                },
                token_count=int(len(para.split()) * 1.3),  # rough estimate; token_count is an int
                chunk_type="paragraph"
            )
            chunks.append(chunk)

        return chunks

    def get_supported_strategies(self) -> List[str]:
        """Return supported chunking strategies."""
        return ['auto', 'paragraph', 'sliding_window']
```
### Using the Framework
```python
# Initialize analyzer
analyzer = MyDocumentAnalyzer()

# Analyze a document
result = analyzer.analyze_unified('document.mydoc')

# Access results (multiple patterns supported)
print(result.document_type)     # Attribute access
print(result['confidence'])     # Dict-style access
print(result.get('framework'))  # Dict get method

# Convert to dict for JSON serialization
result_dict = result.to_dict()

# Initialize chunker
chunker = MyDocumentChunker()

# Chunk the document
chunks = chunker.chunk_document('document.mydoc', result, strategy='paragraph')

# Process chunks
for chunk in chunks:
    print(f"Chunk {chunk.chunk_id}: {chunk.token_count} tokens")
    print(chunk.content[:100])  # First 100 chars
```
## Core Interfaces
### BaseAnalyzer
Abstract base class for document analyzers. Requires implementation of:
- `analyze_unified(file_path: str, **kwargs) -> UnifiedAnalysisResult`
- `get_supported_formats() -> List[str]`
### BaseChunker
Abstract base class for document chunkers. Requires implementation of:
- `chunk_document(file_path: str, analysis: UnifiedAnalysisResult, strategy: str, **kwargs) -> List[ChunkInfo]`
- `get_supported_strategies() -> List[str]`
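Both base classes are abstract, so a subclass that omits a required method fails at instantiation rather than at call time. A minimal sketch, assuming the package follows the standard `abc.ABC` pattern:

```python
from analysis_framework_base import BaseAnalyzer

class IncompleteAnalyzer(BaseAnalyzer):
    """Forgot analyze_unified() and get_supported_formats()."""

analyzer = IncompleteAnalyzer()
# TypeError: Can't instantiate abstract class IncompleteAnalyzer ...
```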
## Data Models
### UnifiedAnalysisResult
Standard result structure with:
- `document_type: str` - Human-readable document type
- `confidence: float` - Confidence score (0.0-1.0)
- `framework: str` - Framework identifier
- `metadata: Dict[str, Any]` - Framework-specific metadata
- `content: Optional[str]` - Extracted text content
- `ai_opportunities: List[str]` - Suggested AI use cases
- `raw_analysis: Dict[str, Any]` - Complete framework results
Supports both attribute and dict-style access:
```python
result.document_type # Attribute
result['document_type'] # Dict-style
result.get('document_type') # Get method
'document_type' in result # Contains check
```
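Because `to_dict()` produces plain built-in types, a result can be written straight to JSON with the standard library:

```python
import json

# Serialize the analysis result produced earlier
with open('analysis.json', 'w') as f:
    json.dump(result.to_dict(), f, indent=2)
```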
### ChunkInfo
Standard chunk structure with:
- `chunk_id: str` - Unique identifier
- `content: str` - Chunk text content
- `metadata: Dict[str, Any]` - Chunk metadata
- `token_count: int` - Estimated token count
- `chunk_type: str` - Type (text, code, table, etc.)
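For illustration, a chunk constructed by hand (all values hypothetical):

```python
from analysis_framework_base import ChunkInfo

chunk = ChunkInfo(
    chunk_id="document.mydoc_chunk_0000",
    content="First paragraph of the document.",
    metadata={"source_file": "document.mydoc", "paragraph_index": 0},
    token_count=7,  # rough estimate
    chunk_type="paragraph"
)
```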
## Exception Hierarchy
```python
FrameworkError # Base exception
├── UnsupportedFormatError # File format not supported
├── AnalysisError # Analysis failed
└── ChunkingError # Chunking failed
```
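Since every exception derives from `FrameworkError`, callers can handle specific failures first and fall back to a single catch-all. A sketch, reusing the analyzer from the Quick Start:

```python
from analysis_framework_base import FrameworkError, UnsupportedFormatError

try:
    result = analyzer.analyze_unified('report.pdf')
except UnsupportedFormatError:
    print("Format not supported; try a different framework")
except FrameworkError as e:
    # Also catches AnalysisError (and ChunkingError from chunkers)
    print(f"Processing failed: {e}")
```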
## Constants
### ChunkStrategy Enum
Standard chunking strategy names:
- `AUTO` - Framework auto-selects best strategy
- `HIERARCHICAL` - Structure-based (sections, headings)
- `SLIDING_WINDOW` - Fixed-size overlapping chunks
- `CONTENT_AWARE` - Semantic boundary detection
- `STRUCTURAL` - Element-based (paragraphs, tables)
- `TABLE_AWARE` - Special table handling
- `PAGE_AWARE` - Page-boundary chunking
```python
from analysis_framework_base import ChunkStrategy
strategy = ChunkStrategy.HIERARCHICAL
print(strategy.value) # 'hierarchical'
```
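Assuming each member's value is its lowercase name (as `HIERARCHICAL` above shows), the enum doubles as a validator for user-supplied strategy strings, e.g. in a hypothetical helper:

```python
from analysis_framework_base import ChunkStrategy, ChunkingError

def parse_strategy(name: str) -> ChunkStrategy:
    """Map a strategy string to the enum, rejecting unknown names."""
    try:
        return ChunkStrategy(name)  # e.g. 'sliding_window' -> ChunkStrategy.SLIDING_WINDOW
    except ValueError:
        raise ChunkingError(f"Unknown strategy: {name}")
```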
## Framework Suite
This package is part of a suite of analysis frameworks:
- **xml-analysis-framework** - 29+ specialized XML handlers, S1000D support, hierarchical chunking
- **docling-analysis-framework** - PDF, DOCX, PPTX via Docling, table-aware chunking
- **document-analysis-framework** - General document processing, format detection
- **data-analysis-framework** - CSV, JSON, Parquet with query paradigm
Each framework implements the interfaces defined in this package for consistent usage.
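That consistency means downstream code can target the base interfaces alone. A hypothetical routing helper, for illustration:

```python
from typing import List

from analysis_framework_base import (
    BaseAnalyzer,
    UnifiedAnalysisResult,
    UnsupportedFormatError
)

def analyze_with_first_match(
    analyzers: List[BaseAnalyzer], file_path: str
) -> UnifiedAnalysisResult:
    """Dispatch a file to the first analyzer that declares support for it."""
    for analyzer in analyzers:
        if any(file_path.endswith(ext) for ext in analyzer.get_supported_formats()):
            return analyzer.analyze_unified(file_path)
    raise UnsupportedFormatError(f"No analyzer supports: {file_path}")
```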
## Development
### Setup Development Environment
```bash
# Clone repository
git clone https://github.com/redhat-ai-americas/analysis-framework-base.git
cd analysis-framework-base
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install with dev dependencies
pip install -e ".[dev]"
```
### Running Tests
```bash
# Run all tests with coverage
pytest
# Run specific test file
pytest tests/test_models.py
# Run with verbose output
pytest -v
# Generate HTML coverage report
pytest --cov-report=html
```
### Code Quality
```bash
# Format code
black src/ tests/
# Check formatting
black --check src/ tests/
# Lint code
flake8 src/ tests/
# Type checking
mypy src/
```
## Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run tests and quality checks
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
### Code Standards
- Follow PEP 8 style guidelines
- Add type hints to all functions
- Write comprehensive docstrings
- Include examples in docstrings
- Maintain test coverage above 80%
- Keep the package dependency-free (stdlib only)
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
## Support
- **Issues**: [GitHub Issues](https://github.com/redhat-ai-americas/analysis-framework-base/issues)
- **Documentation**: [GitHub Repository](https://github.com/redhat-ai-americas/analysis-framework-base)
## Authors
Red Hat AI Americas Team
## Changelog
See [CHANGELOG.md](CHANGELOG.md) for version history and changes.