# XML Analysis Framework
[Python 3.8+](https://www.python.org/downloads/)
[MIT License](https://opensource.org/licenses/MIT)
[Test Results](./test_results)
[Specialized Handlers](./src/handlers)
[AI Integration Guide](./AI_INTEGRATION_ARCHITECTURE.md)
A production-ready XML document analysis and preprocessing framework with **29 specialized handlers** designed for AI/ML data pipelines. Transform any XML document into structured, AI-ready data and optimized chunks, with a **100% success rate** across 71 diverse test files.
## 🚀 Quick Start
### Simple API - Get Started in Seconds
```python
import xml_analysis_framework as xaf
# 🎯 One-line analysis with specialized handlers
result = xaf.analyze("path/to/file.xml")
print(f"Document type: {result['document_type'].type_name}")
print(f"Handler used: {result['handler_used']}")
# 📊 Basic schema analysis
schema = xaf.analyze_schema("path/to/file.xml")
print(f"Elements: {schema.total_elements}, Depth: {schema.max_depth}")
# ✂️ Smart chunking for AI/ML
chunks = xaf.chunk("path/to/file.xml", strategy="auto")
print(f"Created {len(chunks)} optimized chunks")
```
### Advanced Usage
```python
import xml_analysis_framework as xaf
# Enhanced analysis with full results
analysis = xaf.analyze_enhanced("document.xml")
doc_type = analysis['document_type']
specialized = analysis['analysis'] # This contains the SpecializedAnalysis object
print(f"Type: {doc_type.type_name} (confidence: {doc_type.confidence:.2f})")
if specialized:
    print(f"AI use cases: {len(specialized.ai_use_cases)}")
    if specialized.quality_metrics:
        print(f"Quality score: {specialized.quality_metrics.get('completeness_score')}")
    else:
        print("Quality metrics: Not available")
# Different chunking strategies
hierarchical_chunks = xaf.chunk("document.xml", strategy="hierarchical")
sliding_chunks = xaf.chunk("document.xml", strategy="sliding_window")
content_chunks = xaf.chunk("document.xml", strategy="content_aware")
# Process chunks
for chunk in hierarchical_chunks:
    print(f"Chunk {chunk.chunk_id}: {len(chunk.content)} chars")
    print(f"Type: {chunk.chunk_type}, Elements: {len(chunk.elements_included)}")
```
### Expert Usage - Direct Class Access
```python
# For advanced customization, use the classes directly
from xml_analysis_framework import XMLDocumentAnalyzer, ChunkingOrchestrator
analyzer = XMLDocumentAnalyzer(max_file_size_mb=500)
orchestrator = ChunkingOrchestrator(max_file_size_mb=1000)
# Custom analysis
result = analyzer.analyze_document("file.xml")
# Custom chunking with config (result works directly now!)
from xml_analysis_framework.core.chunking import ChunkingConfig
config = ChunkingConfig(
    max_chunk_size=2000,
    min_chunk_size=300,
    overlap_size=150,
    preserve_hierarchy=True
)
chunks = orchestrator.chunk_document("file.xml", result, strategy="auto", config=config)
```
## 🎯 Key Features
### 1. **🏆 Production Proven Results**
- **100% Success Rate**: All 71 test files processed successfully
- **2,752 Chunks Generated**: Average 38.8 optimized chunks per file
- **54 Document Types Detected**: Comprehensive XML format coverage
- **Minimal Dependencies**: Only defusedxml for security + Python stdlib
### 2. **🧠 29 Specialized XML Handlers**
Enterprise-grade document intelligence:
- **Security & Compliance**: SCAP, SAML, SOAP (90-100% confidence)
- **DevOps & Build**: Maven POM, Ant, Ivy, Spring, Log4j (95-100% confidence)
- **Content & Documentation**: RSS/Atom, DocBook, XHTML, SVG
- **Enterprise Systems**: ServiceNow, Hibernate, Struts configurations
- **Data & APIs**: GPX, KML, GraphML, WADL/WSDL, XML Schemas
### 3. **⚡ Intelligent Processing Pipeline**
- **Smart Document Detection**: Confidence scoring with graceful fallbacks
- **Semantic Chunking**: Document-type-aware optimal segmentation
- **Token Optimization**: LLM context window optimized chunks
- **Quality Assessment**: Automated data quality metrics
### 4. **🤖 AI-Ready Integration**
- **Vector Store Ready**: Structured embeddings with rich metadata
- **Graph Database Compatible**: Relationship and dependency mapping
- **LLM Agent Optimized**: Context-aware, actionable insights
- **Complete AI Workflows**: See [AI Integration Guide](./AI_INTEGRATION_ARCHITECTURE.md)
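To make the "vector store ready" output concrete, here is a minimal sketch that packages analysis results and chunks into records for an embedding store. The `embed` callable and the record layout are illustrative assumptions, not part of the framework's API.

```python
import xml_analysis_framework as xaf

def build_vector_records(xml_path, embed):
    """Minimal sketch: package chunks as vector-store records.

    `embed` is a placeholder for your embedding model; the record
    fields below are illustrative, not a fixed framework schema.
    """
    analysis = xaf.analyze(xml_path)
    chunks = xaf.chunk(xml_path, strategy="auto")

    records = []
    for chunk in chunks:
        records.append({
            "id": chunk.chunk_id,
            "vector": embed(chunk.content),  # your embedding model
            "text": chunk.content,
            "metadata": {
                "document_type": analysis["document_type"].type_name,
                "handler_used": analysis["handler_used"],
                "chunk_type": chunk.chunk_type,
                "elements": chunk.elements_included,
            },
        })
    return records
```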
## 📋 Supported Document Types (29 Handlers)
| Category | Handlers | Confidence | Use Cases |
| ------------------------------------- | -------------------------------- | ---------- | -------------------------------------------------------------------------- |
| **🔐 Security & Compliance** | SCAP, SAML, SOAP | 90-100% | Vulnerability assessment, compliance monitoring, security posture analysis |
| **⚙️ DevOps & Build Tools** | Maven POM, Ant, Ivy | 95-100% | Dependency analysis, build optimization, technical debt assessment |
| **🏢 Enterprise Configuration** | Spring, Hibernate, Struts, Log4j | 95-100% | Configuration validation, security scanning, modernization planning |
| **📄 Content & Documentation** | RSS, DocBook, XHTML, SVG | 90-100% | Content intelligence, documentation search, knowledge management |
| **🗂️ Enterprise Systems** | ServiceNow, XML Sitemap | 95-100% | Incident analysis, process automation, system integration |
| **🌍 Geospatial & Data** | GPX, KML, GraphML | 85-95% | Route optimization, geographic analysis, network intelligence |
| **🔌 API & Integration** | WADL, WSDL, XLIFF | 90-95% | Service discovery, integration planning, translation workflows |
| **📐 Schemas & Standards** | XML Schema (XSD) | 100% | Schema validation, data modeling, API documentation |
## 🏗️ Architecture
```
xml-analysis-framework/
├── README.md                    # Project overview
├── LICENSE                      # MIT license
├── requirements.txt             # Dependencies (defusedxml)
├── setup.py                     # Package installation
├── .gitignore                   # Git ignore patterns
├── .github/workflows/           # CI/CD pipelines
│
├── src/                         # Source code
│   ├── core/                    # Core framework
│   │   ├── analyzer.py          # Main analysis engine
│   │   ├── schema_analyzer.py   # XML schema analysis
│   │   └── chunking.py          # Chunking strategies
│   ├── handlers/                # 29 specialized handlers
│   └── utils/                   # Utility functions
│
├── tests/                       # Comprehensive test suite
│   ├── unit/                    # Handler unit tests (16 files)
│   ├── integration/             # Integration tests (11 files)
│   ├── comprehensive/           # Full system tests (4 files)
│   └── run_all_tests.py         # Master test runner
│
├── examples/                    # Usage examples
│   ├── basic_analysis.py        # Simple analysis
│   └── enhanced_analysis.py     # Full featured analysis
│
├── scripts/                     # Utility scripts
│   ├── collect_test_files.py    # Test data collection
│   └── debug/                   # Debug utilities
│
├── docs/                        # Documentation
│   ├── architecture/            # Design documents
│   ├── guides/                  # User guides
│   └── api/                     # API documentation
│
├── sample_data/                 # Test XML files (99+ examples)
│   ├── test_files/              # Real-world examples
│   └── test_files_synthetic/    # Generated test cases
│
└── artifacts/                   # Build artifacts, results
    ├── analysis_results/        # JSON analysis outputs
    └── reports/                 # Generated reports
```
## 🔒 Security
### XML Security Protection
This framework uses **defusedxml** to protect against common XML security vulnerabilities:
- **XXE (XML External Entity) attacks**: Prevents reading local files or making network requests
- **Billion Laughs attack**: Prevents exponential entity expansion DoS attacks
- **DTD retrieval**: Blocks external DTD fetching to prevent data exfiltration
#### Security Features
```python
# All XML parsing is automatically protected
from core.analyzer import XMLDocumentAnalyzer
analyzer = XMLDocumentAnalyzer()
# Safe parsing - malicious XML will be rejected
result = analyzer.analyze_document("potentially_malicious.xml")
if result.get('security_issue'):
    print(f"Security threat detected: {result['error']}")
```
#### Best Practices
1. **Always use the framework's parsers** - Never use `xml.etree.ElementTree` directly
2. **Validate file sizes** - Set reasonable limits for your use case
3. **Sanitize file paths** - Ensure input paths are properly validated
4. **Monitor for security exceptions** - Log and alert on security-blocked parsing attempts
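A minimal sketch that combines these practices, using the `safe_analyze_document` helper and the `security_issue` flag shown elsewhere in this README; the directory allow-list, function name, and logger name are illustrative assumptions:

```python
import logging
from pathlib import Path

from utils import safe_analyze_document, FileSizeLimits

logger = logging.getLogger("xml_analysis")  # illustrative logger name

def analyze_untrusted(path_str, base_dir="data/inbox"):
    # Sanitize file paths: only accept files under the expected directory.
    base = Path(base_dir).resolve()
    path = Path(path_str).resolve()
    if base not in path.parents:
        raise ValueError(f"Refusing to read outside {base}: {path}")

    # Validate file sizes via the built-in limits.
    result = safe_analyze_document(str(path), max_size_mb=FileSizeLimits.PRODUCTION_MEDIUM)

    # Monitor for security exceptions: log and alert on blocked parsing attempts.
    if result.get("security_issue"):
        logger.warning("Blocked potentially malicious XML %s: %s", path, result["error"])
    return result
```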
### File Size Limits
The framework includes built-in file size limits to prevent memory exhaustion:
```python
# Built-in size limits in analyzer and chunking
from core.analyzer import XMLDocumentAnalyzer
from core.chunking import ChunkingOrchestrator
# Create analyzer with 50MB limit
analyzer = XMLDocumentAnalyzer(max_file_size_mb=50.0)
# Create chunking orchestrator with 100MB limit
orchestrator = ChunkingOrchestrator(max_file_size_mb=100.0)
# Utility functions for easy setup
from utils import create_analyzer_with_limits, safe_analyze_document, FileSizeLimits
# Use predefined limits
analyzer = create_analyzer_with_limits(FileSizeLimits.PRODUCTION_MEDIUM) # 50MB
safe_result = safe_analyze_document("file.xml", FileSizeLimits.REAL_TIME) # 5MB
```
## 🔧 Installation
```bash
# Install from PyPI (recommended)
pip install xml-analysis-framework
# Install from source
git clone https://github.com/redhat-ai-americas/xml-analysis-framework.git
cd xml-analysis-framework
pip install -e .
# Or install development dependencies
pip install -e .[dev]
```
### Dependencies
- **defusedxml** (0.7.1+): For secure XML parsing protection
- Python standard library (3.8+) for all other functionality
## 📖 Usage Examples
### Basic Analysis
```python
from core.schema_analyzer import XMLSchemaAnalyzer
analyzer = XMLSchemaAnalyzer()
schema = analyzer.analyze_file('document.xml')
# Access schema properties
print(f"Root element: {schema.root_element}")
print(f"Total elements: {schema.total_elements}")
print(f"Namespaces: {schema.namespaces}")
```
### Enhanced Analysis with Specialized Handlers
```python
from core.analyzer import XMLDocumentAnalyzer
analyzer = XMLDocumentAnalyzer()
result = analyzer.analyze_document('maven-project.xml')
print(f"Document Type: {result['document_type'].type_name}")
print(f"Confidence: {result['confidence']:.2f}")
print(f"Handler Used: {result['handler_used']}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")
```
### Safe Analysis with File Validation
```python
from utils import safe_analyze_document, FileSizeLimits
# Safe analysis with comprehensive validation
result = safe_analyze_document(
    'document.xml',
    max_size_mb=FileSizeLimits.PRODUCTION_MEDIUM
)

if result.get('error'):
    print(f"Analysis failed: {result['error']}")
else:
    print(f"Success: {result['document_type'].type_name}")
```
### Intelligent Chunking
```python
from core.chunking import ChunkingOrchestrator, XMLChunkingStrategy
orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document(
    'large_document.xml',
    specialized_analysis={},  # Analysis result from XMLDocumentAnalyzer
    strategy='auto'
)

# Token estimation
token_estimator = XMLChunkingStrategy()
for chunk in chunks:
    token_count = token_estimator.estimate_tokens(chunk.content)
    print(f"Chunk {chunk.chunk_id}: ~{token_count} tokens")
```
## 🧪 Testing & Validation
### **Production-Tested Performance**
- ✅ **100% Success Rate**: All 71 XML files processed successfully
- ✅ **2,752 Chunks Generated**: Optimal segmentation across diverse document types
- ✅ **54 Document Types**: Comprehensive coverage from ServiceNow to SCAP to Maven
- ✅ **Secure by Default**: Protected against XXE and billion laughs attacks
### **Test Coverage**
```bash
# Run comprehensive end-to-end test
python test_end_to_end_workflow.py
# Run individual component tests
python test_all_chunking.py # Chunking strategies
python test_servicenow_analysis.py # ServiceNow handler validation
python test_scap_analysis.py # Security document analysis
```
### **Real-World Test Data**
- **Enterprise Systems**: ServiceNow incident exports (8 files)
- **Security Documents**: SCAP/XCCDF compliance reports (4 files)
- **Build Configurations**: Maven, Ant, Ivy projects (12 files)
- **Enterprise Config**: Spring, Hibernate, Log4j (15 files)
- **Content & APIs**: DocBook, RSS, WADL, Sitemaps (32 files)
## 🤖 AI Integration & Use Cases
### **AI Workflow Overview**
```mermaid
graph LR
A[XML Documents] --> B[XML Analysis Framework]
B --> C[Document Analysis<br/>29 Specialized Handlers]
B --> D[Smart Chunking<br/>Token-Optimized]
B --> E[AI-Ready Output<br/>Structured JSON]
E --> F[Vector Store<br/>Semantic Search]
E --> G[Graph Database<br/>Relationships]
E --> H[LLM Agent<br/>Intelligence]
F --> I[Security Intelligence]
G --> J[DevOps Automation]
H --> K[Knowledge Management]
style B fill:#e1f5fe,stroke:#01579b,stroke-width:2px
style E fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
style I fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
style J fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
style K fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
```
> **See [Complete AI Integration Guide](./AI_INTEGRATION_ARCHITECTURE.md)** for detailed workflows, implementation examples, and advanced use cases.
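As a small illustration of the branching in the diagram, the sketch below routes an analyzed document toward one of the downstream pipelines based on its detected type. The exact `type_name` strings vary by handler, so the names and the routing policy here are illustrative placeholders, not framework behavior.

```python
import xml_analysis_framework as xaf

def route_document(xml_path):
    """Illustrative routing of analyzed XML toward a downstream pipeline."""
    result = xaf.analyze(xml_path)
    doc_type = result["document_type"].type_name

    # Placeholder type names; check your handlers' actual type_name values.
    if doc_type in {"SCAP/XCCDF Document", "SAML Assertion", "Log4j Configuration"}:
        return "security_intelligence"
    if doc_type in {"Maven POM", "Apache Ant Build", "Spring Configuration"}:
        return "devops_automation"
    return "knowledge_management"  # DocBook, RSS, ServiceNow, generic XML, ...
```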
### **🔐 Security Intelligence Applications**
- **SCAP Compliance Monitoring**: Automated vulnerability assessment and risk scoring
- **SAML Security Analysis**: Authentication flow security validation and threat detection
- **Log4j Vulnerability Detection**: CVE scanning and automated remediation guidance
- **SOAP Security Assessment**: Web service configuration security review
### **⚙️ DevOps & Configuration Intelligence**
- **Dependency Risk Analysis**: Maven/Ant/Ivy vulnerability scanning and upgrade planning
- **Configuration Drift Detection**: Spring/Hibernate consistency monitoring
- **Build Optimization**: Performance analysis and security hardening recommendations
- **Technical Debt Assessment**: Legacy system modernization planning
### **🏢 Enterprise System Intelligence**
- **ServiceNow Process Mining**: Incident pattern analysis and workflow optimization
- **Cross-System Correlation**: Configuration impact analysis and change management
- **Compliance Automation**: Regulatory requirement mapping and validation
### **📚 Knowledge Management Applications**
- **Technical Documentation Search**: Semantic search across DocBook, API documentation
- **Content Intelligence**: RSS/Atom trend analysis and topic extraction
- **API Discovery**: WADL/WSDL service catalog and integration recommendations
## 🔬 Production Metrics & Performance
### **Framework Statistics**
- **✅ 100% Success Rate**: 71/71 files processed without errors
- **📊 2,752 Chunks Generated**: Optimal 38.8 avg chunks per document
- **🎯 54 Document Types**: Comprehensive XML format coverage
- **⚡ High Performance**: 0.015s average processing time per document
- **🔒 Secure Parsing**: defusedxml protection against XML attacks
### **Handler Confidence Levels**
- **100% Confidence**: XML Schema (XSD), Maven POM, Log4j, RSS/Atom, Sitemaps
- **95% Confidence**: ServiceNow, Apache Ant, Ivy, Spring, Hibernate, SAML, SOAP
- **90% Confidence**: SCAP/XCCDF, DocBook, WADL/WSDL
- **Intelligent Fallback**: Generic XML handler for unknown formats
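A brief example of consuming these confidence scores in downstream code; the 0.9 threshold and the fallback policy are illustrative assumptions:

```python
from xml_analysis_framework import XMLDocumentAnalyzer

analyzer = XMLDocumentAnalyzer()
result = analyzer.analyze_document("unknown_format.xml")

doc_type = result["document_type"]
if doc_type.confidence >= 0.9:
    print(f"High-confidence match: {doc_type.type_name} via {result['handler_used']}")
else:
    # Low confidence usually means the generic fallback handler was used;
    # treat specialized findings as advisory.
    print("Low confidence; falling back to basic schema analysis")
```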
## 🚀 Extending the Framework
### Adding New Handlers
```python
from core.analyzer import XMLHandler, SpecializedAnalysis, DocumentTypeInfo
class CustomHandler(XMLHandler):
    def can_handle(self, root, namespaces):
        return root.tag == 'custom-format', 1.0

    def detect_type(self, root, namespaces):
        return DocumentTypeInfo(
            type_name="Custom Format",
            confidence=1.0,
            version="1.0"
        )

    def analyze(self, root, file_path):
        return SpecializedAnalysis(
            document_type="Custom Format",
            key_findings={"custom_data": "value"},
            ai_use_cases=["Custom AI application"],
            structured_data={"extracted": "data"}
        )
```
### Custom Chunking Strategies
```python
from core.chunking import XMLChunkingStrategy, ChunkingOrchestrator
class CustomChunking(XMLChunkingStrategy):
    def chunk_document(self, file_path, specialized_analysis=None):
        chunks = []
        # Custom chunking logic goes here
        return chunks
# Register custom strategy
orchestrator = ChunkingOrchestrator()
orchestrator.strategies['custom'] = CustomChunking
```
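Once registered, the custom strategy can be selected by name. A minimal usage sketch, assuming the analysis result comes from `XMLDocumentAnalyzer.analyze_document`:

```python
from core.analyzer import XMLDocumentAnalyzer

analysis = XMLDocumentAnalyzer().analyze_document("file.xml")
chunks = orchestrator.chunk_document("file.xml", analysis, strategy="custom")
print(f"Custom strategy produced {len(chunks)} chunks")
```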
## 📊 Real Production Output Examples
### **ServiceNow Incident Analysis**
```json
{
  "document_summary": {
    "document_type": "ServiceNow Incident",
    "type_confidence": 0.95,
    "handler_used": "ServiceNowHandler",
    "file_size_mb": 0.029
  },
  "key_insights": {
    "data_highlights": {
      "state": "7", "priority": "4", "impact": "3",
      "assignment_group": "REDACTED_GROUP",
      "resolution_time": "240 days, 0:45:51",
      "journal_analysis": {
        "total_entries": 9,
        "unique_contributors": 1
      }
    },
    "ai_applications": [
      "Incident pattern analysis",
      "Resolution time prediction",
      "Workload optimization"
    ]
  },
  "structured_content": {
    "chunking_strategy": "content_aware_medium",
    "total_chunks": 75,
    "quality_metrics": {
      "overall_readiness": 0.87
    }
  }
}
```
### **Log4j Security Analysis**
```json
{
  "document_summary": {
    "document_type": "Log4j Configuration",
    "type_confidence": 1.0,
    "handler_used": "Log4jConfigHandler"
  },
  "key_insights": {
    "data_highlights": {
      "security_concerns": {
        "security_risks": ["External socket appender detected"],
        "log4shell_vulnerable": false,
        "external_connections": [{"host": "log-server.example.com"}]
      },
      "performance": {
        "async_appenders": 1,
        "performance_risks": ["Location info impacts performance"]
      }
    },
    "ai_applications": [
      "Vulnerability assessment",
      "Performance optimization",
      "Security hardening"
    ]
  },
  "structured_content": {
    "total_chunks": 19,
    "chunking_strategy": "hierarchical_small"
  }
}
```
## 🤝 Contributing
We welcome contributions! Whether you're adding new XML handlers, improving chunking algorithms, or enhancing AI integrations, your contributions help make XML analysis more accessible and powerful.
**Priority contribution areas:**
- 🎯 New XML format handlers (ERP, CRM, healthcare, government)
- ⚡ Enhanced chunking algorithms and strategies
- 🚀 Performance optimizations for large files
- 🤖 Advanced AI/ML integration examples
- 📝 Documentation and usage examples
**👉 See [CONTRIBUTING.md](CONTRIBUTING.md) for complete guidelines, development setup, and submission process.**
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- Designed as part of the **AI Building Blocks** initiative
- Built for the modern AI/ML ecosystem
- Community-driven XML format support