# XML Analysis Framework
[Python 3.8+](https://www.python.org/downloads/)
[MIT License](https://opensource.org/licenses/MIT)
[Test Results](./test_results)
[Specialized Handlers](./src/handlers)
[AI Integration Guide](./AI_INTEGRATION_ARCHITECTURE.md)
A production-ready XML document analysis and preprocessing framework with **29 specialized handlers** designed for AI/ML data pipelines. Transform any XML document into structured, AI-ready data and optimized chunks, with a **100% success rate** across 71 diverse test files.
## 🚀 Quick Start
### Simple API - Get Started in Seconds
```python
import xml_analysis_framework as xaf
# 🎯 One-line analysis with specialized handlers
result = xaf.analyze("path/to/file.xml")
print(f"Document type: {result['document_type'].type_name}")
print(f"Handler used: {result['handler_used']}")
# 📊 Basic schema analysis
schema = xaf.analyze_schema("path/to/file.xml")
print(f"Elements: {schema.total_elements}, Depth: {schema.max_depth}")
# ✂️ Smart chunking for AI/ML
chunks = xaf.chunk("path/to/file.xml", strategy="auto")
print(f"Created {len(chunks)} optimized chunks")
```
### Advanced Usage
```python
import xml_analysis_framework as xaf
# Enhanced analysis with full results
analysis = xaf.analyze_enhanced("document.xml")
doc_type = analysis['document_type']
specialized = analysis['analysis'] # This contains the SpecializedAnalysis object
print(f"Type: {doc_type.type_name} (confidence: {doc_type.confidence:.2f})")
if specialized:
    print(f"AI use cases: {len(specialized.ai_use_cases)}")
    if specialized.quality_metrics:
        print(f"Quality score: {specialized.quality_metrics.get('completeness_score')}")
    else:
        print("Quality metrics: Not available")
# Different chunking strategies
hierarchical_chunks = xaf.chunk("document.xml", strategy="hierarchical")
sliding_chunks = xaf.chunk("document.xml", strategy="sliding_window")
content_chunks = xaf.chunk("document.xml", strategy="content_aware")
# Process chunks
for chunk in hierarchical_chunks:
    print(f"Chunk {chunk.chunk_id}: {len(chunk.content)} chars")
    print(f"Type: {chunk.chunk_type}, Elements: {len(chunk.elements_included)}")
```
### Expert Usage - Direct Class Access
```python
# For advanced customization, use the classes directly
from xml_analysis_framework import XMLDocumentAnalyzer, ChunkingOrchestrator
analyzer = XMLDocumentAnalyzer(max_file_size_mb=500)
orchestrator = ChunkingOrchestrator(max_file_size_mb=1000)
# Custom analysis
result = analyzer.analyze_document("file.xml")
# Custom chunking with config (result works directly now!)
from xml_analysis_framework.core.chunking import ChunkingConfig
config = ChunkingConfig(
    max_chunk_size=2000,
    min_chunk_size=300,
    overlap_size=150,
    preserve_hierarchy=True
)
chunks = orchestrator.chunk_document("file.xml", result, strategy="auto", config=config)
```
## 🎯 Key Features
### 1. **🏆 Production Proven Results**
- **100% Success Rate**: All 71 test files processed successfully
- **2,752 Chunks Generated**: Average 38.8 optimized chunks per file
- **54 Document Types Detected**: Comprehensive XML format coverage
- **Minimal Dependencies**: Only defusedxml for security + Python stdlib
### 2. **🧠 29 Specialized XML Handlers**
Enterprise-grade document intelligence:
- **Security & Compliance**: SCAP, SAML, SOAP (90-100% confidence)
- **DevOps & Build**: Maven POM, Ant, Ivy, Spring, Log4j (95-100% confidence)
- **Content & Documentation**: RSS/Atom, DocBook, XHTML, SVG
- **Enterprise Systems**: ServiceNow, Hibernate, Struts configurations
- **Data & APIs**: GPX, KML, GraphML, WADL/WSDL, XML Schemas
### 3. **⚡ Intelligent Processing Pipeline**
- **Smart Document Detection**: Confidence scoring with graceful fallbacks
- **Semantic Chunking**: Document-type-aware optimal segmentation
- **Token Optimization**: LLM context window optimized chunks
- **Quality Assessment**: Automated data quality metrics
### 4. **🤖 AI-Ready Integration**
- **Vector Store Ready**: Structured embeddings with rich metadata
- **Graph Database Compatible**: Relationship and dependency mapping
- **LLM Agent Optimized**: Context-aware, actionable insights
- **Complete AI Workflows**: See [AI Integration Guide](./AI_INTEGRATION_ARCHITECTURE.md)
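To make the "vector store ready" output concrete, here is a minimal sketch that packages analysis results and chunks into records for an embedding store. The `embed` callable and the record layout are illustrative assumptions, not part of the framework's API.

```python
import xml_analysis_framework as xaf

def build_vector_records(xml_path, embed):
    """Minimal sketch: package chunks as vector-store records.

    `embed` is a placeholder for your embedding model; the record
    fields below are illustrative, not a fixed framework schema.
    """
    analysis = xaf.analyze(xml_path)
    chunks = xaf.chunk(xml_path, strategy="auto")

    records = []
    for chunk in chunks:
        records.append({
            "id": chunk.chunk_id,
            "vector": embed(chunk.content),  # your embedding model
            "text": chunk.content,
            "metadata": {
                "document_type": analysis["document_type"].type_name,
                "handler_used": analysis["handler_used"],
                "chunk_type": chunk.chunk_type,
                "elements": chunk.elements_included,
            },
        })
    return records
```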
## 📋 Supported Document Types (29 Handlers)
| Category | Handlers | Confidence | Use Cases |
| ------------------------------------- | -------------------------------- | ---------- | -------------------------------------------------------------------------- |
| **🔐 Security & Compliance** | SCAP, SAML, SOAP | 90-100% | Vulnerability assessment, compliance monitoring, security posture analysis |
| **⚙️ DevOps & Build Tools** | Maven POM, Ant, Ivy | 95-100% | Dependency analysis, build optimization, technical debt assessment |
| **🏢 Enterprise Configuration** | Spring, Hibernate, Struts, Log4j | 95-100% | Configuration validation, security scanning, modernization planning |
| **📄 Content & Documentation** | RSS, DocBook, XHTML, SVG | 90-100% | Content intelligence, documentation search, knowledge management |
| **🗂️ Enterprise Systems** | ServiceNow, XML Sitemap | 95-100% | Incident analysis, process automation, system integration |
| **🌍 Geospatial & Data** | GPX, KML, GraphML | 85-95% | Route optimization, geographic analysis, network intelligence |
| **🔌 API & Integration** | WADL, WSDL, XLIFF | 90-95% | Service discovery, integration planning, translation workflows |
| **📐 Schemas & Standards** | XML Schema (XSD) | 100% | Schema validation, data modeling, API documentation |
## 🏗️ Architecture
```
xml-analysis-framework/
├── README.md                    # Project overview
├── LICENSE                      # MIT license
├── requirements.txt             # Dependencies (defusedxml)
├── setup.py                     # Package installation
├── .gitignore                   # Git ignore patterns
├── .github/workflows/           # CI/CD pipelines
│
├── src/                         # Source code
│   ├── core/                    # Core framework
│   │   ├── analyzer.py          # Main analysis engine
│   │   ├── schema_analyzer.py   # XML schema analysis
│   │   └── chunking.py          # Chunking strategies
│   ├── handlers/                # 29 specialized handlers
│   └── utils/                   # Utility functions
│
├── tests/                       # Comprehensive test suite
│   ├── unit/                    # Handler unit tests (16 files)
│   ├── integration/             # Integration tests (11 files)
│   ├── comprehensive/           # Full system tests (4 files)
│   └── run_all_tests.py         # Master test runner
│
├── examples/                    # Usage examples
│   ├── basic_analysis.py        # Simple analysis
│   └── enhanced_analysis.py     # Full featured analysis
│
├── scripts/                     # Utility scripts
│   ├── collect_test_files.py    # Test data collection
│   └── debug/                   # Debug utilities
│
├── docs/                        # Documentation
│   ├── architecture/            # Design documents
│   ├── guides/                  # User guides
│   └── api/                     # API documentation
│
├── sample_data/                 # Test XML files (99+ examples)
│   ├── test_files/              # Real-world examples
│   └── test_files_synthetic/    # Generated test cases
│
└── artifacts/                   # Build artifacts, results
    ├── analysis_results/        # JSON analysis outputs
    └── reports/                 # Generated reports
```
## 🔒 Security
### XML Security Protection
This framework uses **defusedxml** to protect against common XML security vulnerabilities:
- **XXE (XML External Entity) attacks**: Prevents reading local files or making network requests
- **Billion Laughs attack**: Prevents exponential entity expansion DoS attacks
- **DTD retrieval**: Blocks external DTD fetching to prevent data exfiltration
#### Security Features
```python
# All XML parsing is automatically protected
from core.analyzer import XMLDocumentAnalyzer
analyzer = XMLDocumentAnalyzer()
# Safe parsing - malicious XML will be rejected
result = analyzer.analyze_document("potentially_malicious.xml")
if result.get('security_issue'):
    print(f"Security threat detected: {result['error']}")
```
#### Best Practices
1. **Always use the framework's parsers** - Never use `xml.etree.ElementTree` directly
2. **Validate file sizes** - Set reasonable limits for your use case
3. **Sanitize file paths** - Ensure input paths are properly validated
4. **Monitor for security exceptions** - Log and alert on security-blocked parsing attempts
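A minimal sketch that combines these practices, using the `safe_analyze_document` helper and the `security_issue` flag shown elsewhere in this README; the directory allow-list, function name, and logger name are illustrative assumptions:

```python
import logging
from pathlib import Path

from utils import safe_analyze_document, FileSizeLimits

logger = logging.getLogger("xml_analysis")  # illustrative logger name

def analyze_untrusted(path_str, base_dir="data/inbox"):
    # Sanitize file paths: only accept files under the expected directory.
    base = Path(base_dir).resolve()
    path = Path(path_str).resolve()
    if base not in path.parents:
        raise ValueError(f"Refusing to read outside {base}: {path}")

    # Validate file sizes via the built-in limits.
    result = safe_analyze_document(str(path), max_size_mb=FileSizeLimits.PRODUCTION_MEDIUM)

    # Monitor for security exceptions: log and alert on blocked parsing attempts.
    if result.get("security_issue"):
        logger.warning("Blocked potentially malicious XML %s: %s", path, result["error"])
    return result
```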
### File Size Limits
The framework includes built-in file size limits to prevent memory exhaustion:
```python
# Built-in size limits in analyzer and chunking
from core.analyzer import XMLDocumentAnalyzer
from core.chunking import ChunkingOrchestrator
# Create analyzer with 50MB limit
analyzer = XMLDocumentAnalyzer(max_file_size_mb=50.0)
# Create chunking orchestrator with 100MB limit
orchestrator = ChunkingOrchestrator(max_file_size_mb=100.0)
# Utility functions for easy setup
from utils import create_analyzer_with_limits, safe_analyze_document, FileSizeLimits
# Use predefined limits
analyzer = create_analyzer_with_limits(FileSizeLimits.PRODUCTION_MEDIUM) # 50MB
safe_result = safe_analyze_document("file.xml", FileSizeLimits.REAL_TIME) # 5MB
```
## 🔧 Installation
```bash
# Install from PyPI (recommended)
pip install xml-analysis-framework
# Install from source
git clone https://github.com/redhat-ai-americas/xml-analysis-framework.git
cd xml-analysis-framework
pip install -e .
# Or install development dependencies
pip install -e .[dev]
```
### Dependencies
- **defusedxml** (0.7.1+): For secure XML parsing protection
- Python standard library (3.8+) for all other functionality
## 📖 Usage Examples
### Basic Analysis
```python
from core.schema_analyzer import XMLSchemaAnalyzer
analyzer = XMLSchemaAnalyzer()
schema = analyzer.analyze_file('document.xml')
# Access schema properties
print(f"Root element: {schema.root_element}")
print(f"Total elements: {schema.total_elements}")
print(f"Namespaces: {schema.namespaces}")
```
### Enhanced Analysis with Specialized Handlers
```python
from core.analyzer import XMLDocumentAnalyzer
analyzer = XMLDocumentAnalyzer()
result = analyzer.analyze_document('maven-project.xml')
print(f"Document Type: {result['document_type'].type_name}")
print(f"Confidence: {result['confidence']:.2f}")
print(f"Handler Used: {result['handler_used']}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")
```
### Safe Analysis with File Validation
```python
from utils import safe_analyze_document, FileSizeLimits
# Safe analysis with comprehensive validation
result = safe_analyze_document(
    'document.xml',
    max_size_mb=FileSizeLimits.PRODUCTION_MEDIUM
)

if result.get('error'):
    print(f"Analysis failed: {result['error']}")
else:
    print(f"Success: {result['document_type'].type_name}")
```
### Intelligent Chunking
```python
from core.chunking import ChunkingOrchestrator, XMLChunkingStrategy
orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document(
    'large_document.xml',
    specialized_analysis={},  # Analysis result from XMLDocumentAnalyzer
    strategy='auto'
)

# Token estimation
token_estimator = XMLChunkingStrategy()
for chunk in chunks:
    token_count = token_estimator.estimate_tokens(chunk.content)
    print(f"Chunk {chunk.chunk_id}: ~{token_count} tokens")
```
## 🧪 Testing & Validation
### **Production-Tested Performance**
- ✅ **100% Success Rate**: All 71 XML files processed successfully
- ✅ **2,752 Chunks Generated**: Optimal segmentation across diverse document types
- ✅ **54 Document Types**: Comprehensive coverage from ServiceNow to SCAP to Maven
- ✅ **Secure by Default**: Protected against XXE and billion laughs attacks
### **Test Coverage**
```bash
# Run comprehensive end-to-end test
python test_end_to_end_workflow.py
# Run individual component tests
python test_all_chunking.py # Chunking strategies
python test_servicenow_analysis.py # ServiceNow handler validation
python test_scap_analysis.py # Security document analysis
```
### **Real-World Test Data**
- **Enterprise Systems**: ServiceNow incident exports (8 files)
- **Security Documents**: SCAP/XCCDF compliance reports (4 files)
- **Build Configurations**: Maven, Ant, Ivy projects (12 files)
- **Enterprise Config**: Spring, Hibernate, Log4j (15 files)
- **Content & APIs**: DocBook, RSS, WADL, Sitemaps (32 files)
## 🤖 AI Integration & Use Cases
### **AI Workflow Overview**
```mermaid
graph LR
A[XML Documents] --> B[XML Analysis Framework]
B --> C[Document Analysis<br/>29 Specialized Handlers]
B --> D[Smart Chunking<br/>Token-Optimized]
B --> E[AI-Ready Output<br/>Structured JSON]
E --> F[Vector Store<br/>Semantic Search]
E --> G[Graph Database<br/>Relationships]
E --> H[LLM Agent<br/>Intelligence]
F --> I[Security Intelligence]
G --> J[DevOps Automation]
H --> K[Knowledge Management]
style B fill:#e1f5fe,stroke:#01579b,stroke-width:2px
style E fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
style I fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
style J fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
style K fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
```
> **See [Complete AI Integration Guide](./AI_INTEGRATION_ARCHITECTURE.md)** for detailed workflows, implementation examples, and advanced use cases.
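As a small illustration of the branching in the diagram, the sketch below routes an analyzed document toward one of the downstream pipelines based on its detected type. The exact `type_name` strings vary by handler, so the names and the routing policy here are illustrative placeholders, not framework behavior.

```python
import xml_analysis_framework as xaf

def route_document(xml_path):
    """Illustrative routing of analyzed XML toward a downstream pipeline."""
    result = xaf.analyze(xml_path)
    doc_type = result["document_type"].type_name

    # Placeholder type names; check your handlers' actual type_name values.
    if doc_type in {"SCAP/XCCDF Document", "SAML Assertion", "Log4j Configuration"}:
        return "security_intelligence"
    if doc_type in {"Maven POM", "Apache Ant Build", "Spring Configuration"}:
        return "devops_automation"
    return "knowledge_management"  # DocBook, RSS, ServiceNow, generic XML, ...
```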
### **🔐 Security Intelligence Applications**
- **SCAP Compliance Monitoring**: Automated vulnerability assessment and risk scoring
- **SAML Security Analysis**: Authentication flow security validation and threat detection
- **Log4j Vulnerability Detection**: CVE scanning and automated remediation guidance
- **SOAP Security Assessment**: Web service configuration security review
### **⚙️ DevOps & Configuration Intelligence**
- **Dependency Risk Analysis**: Maven/Ant/Ivy vulnerability scanning and upgrade planning
- **Configuration Drift Detection**: Spring/Hibernate consistency monitoring
- **Build Optimization**: Performance analysis and security hardening recommendations
- **Technical Debt Assessment**: Legacy system modernization planning
### **🏢 Enterprise System Intelligence**
- **ServiceNow Process Mining**: Incident pattern analysis and workflow optimization
- **Cross-System Correlation**: Configuration impact analysis and change management
- **Compliance Automation**: Regulatory requirement mapping and validation
### **📚 Knowledge Management Applications**
- **Technical Documentation Search**: Semantic search across DocBook, API documentation
- **Content Intelligence**: RSS/Atom trend analysis and topic extraction
- **API Discovery**: WADL/WSDL service catalog and integration recommendations
## 🔬 Production Metrics & Performance
### **Framework Statistics**
- **✅ 100% Success Rate**: 71/71 files processed without errors
- **📊 2,752 Chunks Generated**: Optimal 38.8 avg chunks per document
- **🎯 54 Document Types**: Comprehensive XML format coverage
- **⚡ High Performance**: 0.015s average processing time per document
- **🔒 Secure Parsing**: defusedxml protection against XML attacks
### **Handler Confidence Levels**
- **100% Confidence**: XML Schema (XSD), Maven POM, Log4j, RSS/Atom, Sitemaps
- **95% Confidence**: ServiceNow, Apache Ant, Ivy, Spring, Hibernate, SAML, SOAP
- **90% Confidence**: SCAP/XCCDF, DocBook, WADL/WSDL
- **Intelligent Fallback**: Generic XML handler for unknown formats
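A brief example of consuming these confidence scores in downstream code; the 0.9 threshold and the fallback policy are illustrative assumptions:

```python
from xml_analysis_framework import XMLDocumentAnalyzer

analyzer = XMLDocumentAnalyzer()
result = analyzer.analyze_document("unknown_format.xml")

doc_type = result["document_type"]
if doc_type.confidence >= 0.9:
    print(f"High-confidence match: {doc_type.type_name} via {result['handler_used']}")
else:
    # Low confidence usually means the generic fallback handler was used;
    # treat specialized findings as advisory.
    print("Low confidence; falling back to basic schema analysis")
```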
## 🚀 Extending the Framework
### Adding New Handlers
```python
from core.analyzer import XMLHandler, SpecializedAnalysis, DocumentTypeInfo
class CustomHandler(XMLHandler):
    def can_handle(self, root, namespaces):
        return root.tag == 'custom-format', 1.0

    def detect_type(self, root, namespaces):
        return DocumentTypeInfo(
            type_name="Custom Format",
            confidence=1.0,
            version="1.0"
        )

    def analyze(self, root, file_path):
        return SpecializedAnalysis(
            document_type="Custom Format",
            key_findings={"custom_data": "value"},
            ai_use_cases=["Custom AI application"],
            structured_data={"extracted": "data"}
        )
```
### Custom Chunking Strategies
```python
from core.chunking import XMLChunkingStrategy, ChunkingOrchestrator
class CustomChunking(XMLChunkingStrategy):
    def chunk_document(self, file_path, specialized_analysis=None):
        chunks = []
        # Custom chunking logic goes here
        return chunks
# Register custom strategy
orchestrator = ChunkingOrchestrator()
orchestrator.strategies['custom'] = CustomChunking
```
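Once registered, the custom strategy can be selected by name. A minimal usage sketch, assuming the analysis result comes from `XMLDocumentAnalyzer.analyze_document`:

```python
from core.analyzer import XMLDocumentAnalyzer

analysis = XMLDocumentAnalyzer().analyze_document("file.xml")
chunks = orchestrator.chunk_document("file.xml", analysis, strategy="custom")
print(f"Custom strategy produced {len(chunks)} chunks")
```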
## 📊 Real Production Output Examples
### **ServiceNow Incident Analysis**
```json
{
  "document_summary": {
    "document_type": "ServiceNow Incident",
    "type_confidence": 0.95,
    "handler_used": "ServiceNowHandler",
    "file_size_mb": 0.029
  },
  "key_insights": {
    "data_highlights": {
      "state": "7", "priority": "4", "impact": "3",
      "assignment_group": "REDACTED_GROUP",
      "resolution_time": "240 days, 0:45:51",
      "journal_analysis": {
        "total_entries": 9,
        "unique_contributors": 1
      }
    },
    "ai_applications": [
      "Incident pattern analysis",
      "Resolution time prediction",
      "Workload optimization"
    ]
  },
  "structured_content": {
    "chunking_strategy": "content_aware_medium",
    "total_chunks": 75,
    "quality_metrics": {
      "overall_readiness": 0.87
    }
  }
}
```
### **Log4j Security Analysis**
```json
{
  "document_summary": {
    "document_type": "Log4j Configuration",
    "type_confidence": 1.0,
    "handler_used": "Log4jConfigHandler"
  },
  "key_insights": {
    "data_highlights": {
      "security_concerns": {
        "security_risks": ["External socket appender detected"],
        "log4shell_vulnerable": false,
        "external_connections": [{"host": "log-server.example.com"}]
      },
      "performance": {
        "async_appenders": 1,
        "performance_risks": ["Location info impacts performance"]
      }
    },
    "ai_applications": [
      "Vulnerability assessment",
      "Performance optimization",
      "Security hardening"
    ]
  },
  "structured_content": {
    "total_chunks": 19,
    "chunking_strategy": "hierarchical_small"
  }
}
```
## 🤝 Contributing
We welcome contributions! Whether you're adding new XML handlers, improving chunking algorithms, or enhancing AI integrations, your contributions help make XML analysis more accessible and powerful.
**Priority contribution areas:**
- 🎯 New XML format handlers (ERP, CRM, healthcare, government)
- ⚡ Enhanced chunking algorithms and strategies
- 🚀 Performance optimizations for large files
- 🤖 Advanced AI/ML integration examples
- 📝 Documentation and usage examples
**👉 See [CONTRIBUTING.md](CONTRIBUTING.md) for complete guidelines, development setup, and submission process.**
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- Designed as part of the **AI Building Blocks** initiative
- Built for the modern AI/ML ecosystem
- Community-driven XML format support