<div align="center">
<img src="https://evolvis.ai/wp-content/uploads/2025/08/evie-solutions-03.png" alt="Evolvis AI - Evie Solutions Logo" width="400">
</div>
# Evolvishub Outlook Ingestor
**Production-ready, secure email ingestion system for Microsoft Outlook with advanced processing, monitoring, and hybrid storage capabilities.**
A comprehensive Python library for ingesting, processing, and storing email data from Microsoft Outlook and Exchange systems. Built with enterprise-grade security, performance, and scalability in mind, featuring intelligent hybrid storage architecture for optimal cost and performance.
## Download Statistics
[Downloads](https://pepy.tech/project/evolvishub-outlook-ingestor) ·
[PyPI](https://pypi.org/project/evolvishub-outlook-ingestor/) ·
[License](LICENSE) ·
[Code style: Black](https://github.com/psf/black) ·
[Type checked with mypy](https://mypy.readthedocs.io/)
## Table of Contents
- [Features](#features)
- [Architecture](#architecture)
- [About Evolvis AI](#about-evolvis-ai)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Hybrid Storage Configuration](#hybrid-storage-configuration)
- [Configuration](#configuration)
- [Performance](#performance)
- [Advanced Usage](#advanced-usage)
- [Component Architecture](#component-architecture)
- [Testing](#testing)
- [Performance Benchmarks and Tips](#performance-benchmarks-and-tips)
- [Monitoring](#monitoring)
- [Contributing](#contributing)
- [API Reference](#api-reference)
- [Support and Documentation](#support-and-documentation)
- [Technical Specifications](#technical-specifications)
- [Acknowledgments](#acknowledgments)
- [License](#license)
## Features
### Protocol Support
- **Microsoft Graph API** - Modern OAuth2-based access to Office 365 and Exchange Online
- **Exchange Web Services (EWS)** - Enterprise-grade access to on-premises Exchange servers
- **IMAP/POP3** - Universal email protocol support for legacy systems and third-party providers
### Database Integration
- **PostgreSQL** - High-performance relational database with advanced indexing and async support
- **MongoDB** - Scalable NoSQL document storage for flexible email data structures
- **MySQL** - Reliable relational database support for existing infrastructure
- **SQLite** - Lightweight file-based database for development, testing, and small deployments
- **Microsoft SQL Server** - Enterprise database for Windows-centric environments with advanced features
- **MariaDB** - Open-source MySQL alternative with enhanced performance and features
- **Oracle Database** - Enterprise-grade database for mission-critical applications
- **CockroachDB** - Distributed, cloud-native database for global scale and resilience
- **ClickHouse** - High-performance columnar database for analytics and real-time queries
### Data Lake Integration
- **Delta Lake** - Apache Spark-based ACID transactional storage layer with time travel capabilities
- **Apache Iceberg** - Open table format for large-scale analytics with schema evolution support
- **Hybrid Analytics** - Seamless integration between operational databases and analytical data lakes
### Hybrid Storage Architecture
- **MinIO** - Self-hosted S3-compatible storage for on-premises control and high performance
- **AWS S3** - Enterprise cloud storage with global CDN, lifecycle policies, and encryption
- **Azure Blob Storage** - Microsoft ecosystem integration with hot/cool/archive storage tiers
- **Google Cloud Storage** - Global infrastructure with ML integration and advanced analytics
- **Intelligent Routing** - Size-based and content-type-based storage decisions with configurable rules
- **Content Deduplication** - SHA256-based deduplication to eliminate duplicate attachments
- **Automatic Compression** - GZIP/ZLIB compression for text-based attachments
- **Secure Access** - Pre-signed URLs with configurable expiration for secure attachment access
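The deduplication and routing behaviour described above is easy to reason about in a few lines of Python. The sketch below is illustrative only: `content_digest`, `decide_strategy`, and the thresholds are hypothetical helpers that mirror the decision matrix later in this README, not the library's API.

```python
import hashlib

# Hypothetical thresholds mirroring the storage decision matrix in this README.
DB_MAX = 1 * 1024 * 1024       # <= 1 MB stays in the database
HYBRID_MAX = 5 * 1024 * 1024   # 1-5 MB: content in object storage, metadata in DB

def content_digest(data: bytes) -> str:
    """SHA256 digest used as a deduplication key for attachment content."""
    return hashlib.sha256(data).hexdigest()

def decide_strategy(size: int, content_type: str) -> str:
    """Illustrative size-based routing, not the library's rule engine."""
    if size <= DB_MAX:
        return "database_only"
    if size <= HYBRID_MAX:
        return "hybrid"
    return "storage_only"

seen: set[str] = set()

def is_duplicate(data: bytes) -> bool:
    """Skip storing content whose digest has already been seen."""
    digest = content_digest(data)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```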
### Performance & Scalability
- **Async/Await Architecture** - Non-blocking operations for maximum throughput (1000+ emails/minute)
- **Hybrid Storage Strategy** - Intelligent routing between database and object storage
- **Batch Processing** - Efficient handling of large email volumes with concurrent workers
- **Connection Pooling** - Optimized database connections for enterprise workloads
- **Memory Optimization** - Smart caching and resource management for large datasets
- **Multi-tier Storage** - Automatic lifecycle management between hot/warm/cold storage
### Enterprise Security
- **Credential Encryption** - Fernet symmetric encryption for sensitive data storage
- **Input Sanitization** - Protection against SQL injection, XSS, and other attacks
- **Secure Configuration** - Environment variable-based configuration with validation
- **Audit Logging** - Complete audit trail without sensitive data exposure
- **Access Control** - IAM-based permissions and secure URL generation
### Developer Experience
- **Type Safety** - Full type hints and IDE support for enhanced development experience
- **Comprehensive Testing** - 80%+ test coverage with unit, integration, and performance tests
- **Extensive Documentation** - Complete API reference with examples and best practices
- **Configuration-Based Setup** - Flexible YAML/JSON configuration with validation
- **Error Handling** - Comprehensive exception hierarchy with automatic retry logic
## Architecture
### System Overview
```mermaid
graph TB
subgraph "Email Sources"
A[Microsoft Graph API]
B[Exchange Web Services]
C[IMAP/POP3]
end
subgraph "Evolvishub Outlook Ingestor"
D[Protocol Adapters]
E[Enhanced Attachment Processor]
F[Email Processor]
G[Security Layer]
end
subgraph "Storage Layer"
H[Database Storage]
I[Object Storage]
end
subgraph "Database Backends"
J[PostgreSQL]
K[MongoDB]
L[MySQL]
end
subgraph "Object Storage Backends"
M[MinIO]
N[AWS S3]
O[Azure Blob]
P[Google Cloud Storage]
end
A --> D
B --> D
C --> D
D --> F
D --> E
F --> G
E --> G
G --> H
G --> I
H --> J
H --> K
H --> L
I --> M
I --> N
I --> O
I --> P
```
### Hybrid Storage Strategy
```mermaid
sequenceDiagram
participant E as Email Processor
participant AP as Attachment Processor
participant DB as Database
participant OS as Object Storage
E->>AP: Process Email with Attachments
AP->>AP: Evaluate Storage Rules
alt Small Attachment (<1MB)
AP->>DB: Store in Database
DB-->>AP: Confirmation
else Medium Attachment (1-5MB)
AP->>OS: Store Content
AP->>DB: Store Metadata + Reference
OS-->>AP: Storage Key
DB-->>AP: Confirmation
else Large Attachment (>5MB)
AP->>OS: Store Content Only
OS-->>AP: Storage Key + Metadata
end
AP-->>E: Processing Complete
```
### Data Flow Architecture
```mermaid
flowchart LR
subgraph "Ingestion Layer"
A[Email Source] --> B[Protocol Adapter]
B --> C[Rate Limiter]
C --> D[Authentication]
end
subgraph "Processing Layer"
D --> E[Email Processor]
E --> F[Attachment Processor]
F --> G[Security Scanner]
G --> H[Deduplication Engine]
H --> I[Compression Engine]
end
subgraph "Storage Decision Engine"
I --> J{Storage Rules}
J -->|Small Files| K[Database Storage]
J -->|Medium Files| L[Hybrid Storage]
J -->|Large Files| M[Object Storage]
end
subgraph "Storage Backends"
K --> N[(PostgreSQL/MongoDB)]
L --> N
L --> O[(MinIO/S3/Azure/GCS)]
M --> O
end
```
## About Evolvis AI
**Evolvis AI** is a cutting-edge technology company specializing in AI-powered solutions for enterprise email processing, data ingestion, and intelligent automation. Founded with a mission to revolutionize how organizations handle and analyze their email communications, Evolvis AI develops sophisticated tools that combine artificial intelligence with robust engineering practices.
### Our Focus
- **AI-Powered Email Processing** - Advanced algorithms for intelligent email analysis, classification, and extraction
- **Enterprise Data Solutions** - Scalable systems for large-scale email ingestion and processing
- **Intelligent Automation** - Smart workflows that adapt to organizational needs and patterns
- **Security-First Architecture** - Enterprise-grade security and compliance for sensitive email data
### Innovation at Scale
Evolvis AI's solutions are designed to handle enterprise-scale email processing challenges, from small businesses to global corporations. Our technology stack emphasizes performance, security, and scalability while maintaining ease of use and deployment flexibility.
**Learn more about our solutions:** [https://evolvis.ai](https://evolvis.ai)
## Installation
### Basic Installation
```bash
# Install core package
pip install evolvishub-outlook-ingestor
```
### Feature-Specific Installation
```bash
# Protocol adapters (Microsoft Graph, EWS, IMAP/POP3)
pip install evolvishub-outlook-ingestor[protocols]
# Core database connectors (PostgreSQL, MongoDB, MySQL)
pip install evolvishub-outlook-ingestor[database]
# Individual database connectors
pip install evolvishub-outlook-ingestor[database-sqlite] # SQLite
pip install evolvishub-outlook-ingestor[database-mssql] # SQL Server
pip install evolvishub-outlook-ingestor[database-mariadb] # MariaDB
pip install evolvishub-outlook-ingestor[database-oracle] # Oracle
pip install evolvishub-outlook-ingestor[database-cockroachdb] # CockroachDB
# All database connectors
pip install evolvishub-outlook-ingestor[database-all]
# Data lake connectors
pip install evolvishub-outlook-ingestor[datalake-delta] # Delta Lake
pip install evolvishub-outlook-ingestor[datalake-iceberg] # Apache Iceberg
pip install evolvishub-outlook-ingestor[database-clickhouse] # ClickHouse
# All data lake connectors
pip install evolvishub-outlook-ingestor[datalake-all]
# Object storage support (MinIO S3-compatible)
pip install evolvishub-outlook-ingestor[storage]
# Data processing features (HTML parsing, image processing)
pip install evolvishub-outlook-ingestor[processing]
```
### Cloud Storage Installation
```bash
# AWS S3 support
pip install evolvishub-outlook-ingestor[cloud-aws]
# Azure Blob Storage support
pip install evolvishub-outlook-ingestor[cloud-azure]
# Google Cloud Storage support
pip install evolvishub-outlook-ingestor[cloud-gcp]
# All cloud storage backends
pip install evolvishub-outlook-ingestor[cloud-all]
```
### Complete Installation
```bash
# Install all features and dependencies
pip install evolvishub-outlook-ingestor[all]
# Development installation with testing tools
pip install evolvishub-outlook-ingestor[dev]
```
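After installation, the installed distribution can be verified with the standard library alone:

```python
import importlib.metadata

# Prints the installed version of the evolvishub-outlook-ingestor distribution.
print(importlib.metadata.version("evolvishub-outlook-ingestor"))
```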
### Requirements
- **Python**: 3.9 or higher
- **Operating System**: Linux, macOS, Windows
- **Memory**: Minimum 512MB RAM (2GB+ recommended for large datasets)
- **Storage**: Varies based on email volume and attachment storage strategy
## Quick Start
### Basic Email Ingestion
```python
import asyncio
from evolvishub_outlook_ingestor.protocols.microsoft_graph import GraphAPIAdapter
from evolvishub_outlook_ingestor.connectors.postgresql_connector import PostgreSQLConnector
from evolvishub_outlook_ingestor.processors.email_processor import EmailProcessor
async def basic_email_ingestion():
    # Configure Microsoft Graph API
    graph_config = {
        "client_id": "your_client_id",
        "client_secret": "your_client_secret",
        "tenant_id": "your_tenant_id"
    }

    # Configure PostgreSQL database
    db_config = {
        "host": "localhost",
        "port": 5432,
        "database": "outlook_data",
        "username": "postgres",
        "password": "your_password"
    }

    # Initialize components
    async with GraphAPIAdapter("graph", graph_config) as protocol, \
               PostgreSQLConnector("db", db_config) as connector:

        # Create email processor
        processor = EmailProcessor("email_processor")

        # Fetch and process emails
        emails = await protocol.fetch_emails(limit=10)

        for email in emails:
            # Process email content
            result = await processor.process(email)

            # Store in database
            if result.status.value == "success":
                await connector.store_email(result.processed_data)
                print(f"Stored email: {email.subject}")

asyncio.run(basic_email_ingestion())
```
## Hybrid Storage Configuration
### Enterprise-Grade Attachment Processing
```python
import asyncio
from evolvishub_outlook_ingestor.processors.enhanced_attachment_processor import (
EnhancedAttachmentProcessor,
StorageStrategy
)
from evolvishub_outlook_ingestor.connectors.minio_connector import MinIOConnector
from evolvishub_outlook_ingestor.connectors.aws_s3_connector import AWSS3Connector
from evolvishub_outlook_ingestor.connectors.postgresql_connector import PostgreSQLConnector
async def hybrid_storage_setup():
    # Configure MinIO for hot storage (frequently accessed files)
    minio_config = {
        "endpoint_url": "localhost:9000",
        "access_key": "minioadmin",
        "secret_key": "minioadmin",
        "bucket_name": "email-attachments-hot",
        "use_ssl": False  # Set to True for production
    }

    # Configure AWS S3 for archive storage (long-term storage)
    s3_config = {
        "access_key": "your_aws_access_key",
        "secret_key": "your_aws_secret_key",
        "bucket_name": "email-attachments-archive",
        "region": "us-east-1"
    }

    # Configure enhanced attachment processor with intelligent routing
    processor_config = {
        "storage_strategy": "hybrid",
        "size_threshold": 1024 * 1024,  # 1MB threshold
        "enable_compression": True,
        "enable_deduplication": True,
        "enable_virus_scanning": False,  # Configure as needed
        "default_storage_backend": "hot_storage",

        # Intelligent storage routing rules
        "storage_rules": [
            {
                "name": "large_files",
                "condition": "size > 5*1024*1024",  # Files > 5MB
                "strategy": "storage_only",
                "storage_backend": "archive_storage"
            },
            {
                "name": "medium_files",
                "condition": "size > 1024*1024 and size <= 5*1024*1024",  # 1-5MB
                "strategy": "hybrid",
                "storage_backend": "hot_storage"
            },
            {
                "name": "small_files",
                "condition": "size <= 1024*1024",  # Files <= 1MB
                "strategy": "database_only"
            },
            {
                "name": "compressible_text",
                "condition": "content_type.startswith('text/') and size > 1024",
                "strategy": "hybrid",
                "storage_backend": "hot_storage",
                "compress": True,
                "compression_type": "gzip"
            }
        ]
    }

    # Initialize storage connectors
    minio_connector = MinIOConnector("hot_storage", minio_config)
    s3_connector = AWSS3Connector("archive_storage", s3_config)

    # Initialize enhanced processor
    processor = EnhancedAttachmentProcessor("hybrid_attachments", processor_config)

    async with minio_connector, s3_connector:
        # Add storage backends to processor
        await processor.add_storage_backend("hot_storage", minio_connector)
        await processor.add_storage_backend("archive_storage", s3_connector)

        # Process emails with intelligent attachment routing
        # (email processing code here)

        # Generate secure URLs for attachment access
        storage_info = {
            "storage_location": "2024/01/15/abc123.pdf",
            "storage_backend": "hot_storage"
        }

        backend = processor.storage_backends[storage_info["storage_backend"]]
        secure_url = await backend.generate_presigned_url(
            storage_info["storage_location"],
            expires_in=3600  # 1 hour expiration
        )

        print(f"Secure attachment URL: {secure_url}")

asyncio.run(hybrid_storage_setup())
```
### Storage Strategy Decision Matrix
| File Size | Content Type | Storage Strategy | Backend | Compression |
|-----------|--------------|------------------|---------|-------------|
| < 1MB | Any | Database Only | PostgreSQL/MongoDB | No |
| 1-5MB | Documents/Images | Hybrid | MinIO/Hot Storage | Optional |
| 1-5MB | Text Files | Hybrid | MinIO/Hot Storage | Yes (GZIP) |
| > 5MB | Any | Storage Only | AWS S3/Archive | Optional |
| > 10MB | Any | Storage Only | AWS S3/Glacier | Yes |
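The matrix above can be expressed directly as routing rules. The snippet below is a sketch that mirrors the table using the same rule shape as the `processor_config` example earlier in this README; it assumes rules are evaluated top-down with the first match winning, and the backend names are placeholders for whatever connectors you register.

```python
# Routing rules mirroring the decision matrix above. Backend names
# ("hot_storage", "archive_storage") are placeholders for your own connectors.
decision_matrix_rules = [
    {
        "name": "very_large_files",                   # > 10 MB, archive tier, compressed
        "condition": "size > 10*1024*1024",
        "strategy": "storage_only",
        "storage_backend": "archive_storage",
        "compress": True,
    },
    {
        "name": "large_files",                        # 5-10 MB, archive tier
        "condition": "size > 5*1024*1024",
        "strategy": "storage_only",
        "storage_backend": "archive_storage",
    },
    {
        "name": "medium_text",                        # 1-5 MB text, GZIP compressed
        "condition": "content_type.startswith('text/') and size > 1024*1024",
        "strategy": "hybrid",
        "storage_backend": "hot_storage",
        "compress": True,
        "compression_type": "gzip",
    },
    {
        "name": "medium_files",                       # 1-5 MB documents/images
        "condition": "size > 1024*1024 and size <= 5*1024*1024",
        "strategy": "hybrid",
        "storage_backend": "hot_storage",
    },
    {
        "name": "small_files",                        # < 1 MB stays in the database
        "condition": "size <= 1024*1024",
        "strategy": "database_only",
    },
]
```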
### Database Selection Guide
| Database | Best For | Pros | Cons | Recommended Use Case |
|----------|----------|------|------|---------------------|
| **SQLite** | Development, Testing, Small Scale | Simple setup, no server required, ACID compliant | Single writer, limited concurrency | Development environments, small deployments (<10K emails/day) |
| **PostgreSQL** | General Purpose, High Performance | Excellent performance, rich features, strong consistency | Requires server setup and maintenance | Most production deployments, complex queries |
| **MongoDB** | Flexible Schema, Document Storage | Schema flexibility, horizontal scaling, JSON-native | Eventual consistency, memory usage | Variable email structures, rapid prototyping |
| **MySQL/MariaDB** | Web Applications, Existing Infrastructure | Wide adoption, good performance, familiar | Limited JSON support (older versions) | Web applications, existing MySQL infrastructure |
| **SQL Server** | Windows Environments, Enterprise | Enterprise features, excellent tooling, integration | Windows-centric, licensing costs | Windows-based enterprises, .NET applications |
| **Oracle** | Mission-Critical, Large Enterprise | Proven reliability, advanced features, scalability | High cost, complexity | Large enterprises, mission-critical systems |
| **CockroachDB** | Global Scale, Cloud-Native | Distributed, strong consistency, cloud-native | Newer technology, learning curve | Global deployments, cloud-native applications |
### Data Lake and Analytics Selection Guide
| Platform | Best For | Pros | Cons | Recommended Use Case |
|----------|----------|------|------|---------------------|
| **Delta Lake** | ACID Analytics, Time Travel | ACID transactions, time travel, schema evolution, Spark ecosystem | Requires Spark, Java/Scala ecosystem | Data science workflows, ML pipelines, audit requirements |
| **Apache Iceberg** | Multi-Engine Analytics | Engine agnostic, hidden partitioning, snapshot isolation | Newer ecosystem, complex setup | Multi-tool analytics, data warehouse modernization |
| **ClickHouse** | Real-Time Analytics | Extremely fast queries, columnar storage, SQL interface | Limited transaction support, specialized use case | Real-time dashboards, email analytics, reporting |
### Hybrid Architecture Patterns
| Pattern | Description | Use Case | Benefits |
|---------|-------------|----------|----------|
| **Operational + Analytics** | PostgreSQL for operations, Delta Lake for analytics | Real-time app + historical analysis | Best of both worlds, optimized for each workload |
| **Hot + Cold Storage** | ClickHouse for recent data, Iceberg for historical | Email analytics with time-based access patterns | Cost optimization, query performance |
| **Multi-Engine Lake** | Iceberg with Spark, Trino, and Flink | Complex analytics requiring different compute engines | Flexibility, avoid vendor lock-in |
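As a concrete illustration of the first pattern, the same processed email can be written to both stores in one pass. This is a minimal sketch, assuming `operational` and `analytical` are any two initialized connectors (for example PostgreSQL and Delta Lake) exposing the `store_email` coroutine used in the Quick Start example.

```python
# "Operational + Analytics" dual-write sketch: the operational database serves
# low-latency queries while the analytical store accumulates history for
# reporting and ML. Connector objects are assumed to be initialized elsewhere.
async def dual_write(emails, operational, analytical):
    for email in emails:
        await operational.store_email(email)   # real-time application queries
        await analytical.store_email(email)    # historical analysis / ML pipelines
```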
## Configuration
### Complete Configuration Example
Create a `config.yaml` file for comprehensive system configuration:
```yaml
# Database configuration examples (each `database:` block below is an
# alternative; configure exactly one)

# PostgreSQL
database:
  type: "postgresql"
  host: "localhost"
  port: 5432
  database: "outlook_data"
  username: "postgres"
  password: "your_password"
  pool_size: 20
  max_overflow: 30

# SQLite (for development/testing)
database:
  type: "sqlite"
  database_path: "outlook_data.db"
  enable_wal: true
  timeout: 30.0

# SQL Server
database:
  type: "mssql"
  server: "localhost\\SQLEXPRESS"
  port: 1433
  database: "outlook_data"
  username: "sa"
  password: "your_password"
  trusted_connection: false
  encrypt: true

# MariaDB
database:
  type: "mariadb"
  host: "localhost"
  port: 3306
  database: "outlook_data"
  username: "root"
  password: "your_password"
  charset: "utf8mb4"

# Oracle Database
database:
  type: "oracle"
  host: "localhost"
  port: 1521
  service_name: "XEPDB1"
  username: "outlook_user"
  password: "your_password"

# CockroachDB
database:
  type: "cockroachdb"
  host: "localhost"
  port: 26257
  database: "outlook_data"
  username: "root"
  password: "your_password"
  sslmode: "require"

# Data Lake configuration examples

# Delta Lake (local)
database:
  type: "deltalake"
  table_path: "./delta-tables/emails"
  app_name: "outlook-ingestor"
  master: "local[*]"
  partition_columns: ["received_date_partition", "sender_domain"]
  z_order_columns: ["received_date", "sender_email"]
  enable_time_travel: true

# Delta Lake (AWS S3)
database:
  type: "deltalake"
  table_path: "s3a://my-bucket/delta-tables/emails"
  app_name: "outlook-ingestor-prod"
  master: "spark://spark-master:7077"
  cloud_provider: "aws"
  cloud_config:
    access_key: "your_access_key"
    secret_key: "your_secret_key"
    region: "us-west-2"

# Apache Iceberg (Hadoop catalog)
database:
  type: "iceberg"
  catalog_type: "hadoop"
  warehouse_path: "./iceberg-warehouse"
  namespace: "outlook_data"
  table_name: "emails"
  enable_compaction: true

# Apache Iceberg (AWS Glue catalog)
database:
  type: "iceberg"
  catalog_type: "glue"
  catalog_config:
    warehouse: "s3://my-bucket/iceberg-warehouse"
    region: "us-west-2"
  namespace: "outlook_analytics"
  table_name: "emails"

# ClickHouse (local)
database:
  type: "clickhouse"
  host: "localhost"
  port: 8123
  database: "outlook_data"
  username: "default"
  password: "your_password"
  compression: true

# ClickHouse (cluster)
database:
  type: "clickhouse"
  host: "clickhouse-cluster.example.com"
  port: 8123
  database: "outlook_data"
  username: "analytics_user"
  password: "your_password"
  cluster: "outlook_cluster"
  secure: true

# Protocol configurations
protocols:
  graph_api:
    client_id: "your_client_id"
    client_secret: "your_client_secret"
    tenant_id: "your_tenant_id"
    scopes: ["https://graph.microsoft.com/.default"]
  exchange:
    server: "outlook.office365.com"
    username: "your_email@company.com"
    password: "your_password"
    autodiscover: true

# Storage backend configurations
storage:
  minio:
    endpoint_url: "localhost:9000"
    access_key: "minioadmin"
    secret_key: "minioadmin"
    bucket_name: "email-attachments"
    use_ssl: false
  aws_s3:
    access_key: "your_aws_access_key"
    secret_key: "your_aws_secret_key"
    bucket_name: "email-attachments-prod"
    region: "us-east-1"
  azure_blob:
    connection_string: "DefaultEndpointsProtocol=https;AccountName=..."
    container_name: "email-attachments"

# Enhanced attachment processing
attachment_processing:
  storage_strategy: "hybrid"
  size_threshold: 1048576  # 1MB
  enable_compression: true
  enable_deduplication: true
  enable_virus_scanning: false
  max_attachment_size: 52428800  # 50MB
  storage_rules:
    - name: "large_files"
      condition: "size > 5*1024*1024"
      strategy: "storage_only"
      storage_backend: "aws_s3"
    - name: "medium_files"
      condition: "size > 1024*1024 and size <= 5*1024*1024"
      strategy: "hybrid"
      storage_backend: "minio"
    - name: "small_files"
      condition: "size <= 1024*1024"
      strategy: "database_only"

# Processing settings
processing:
  batch_size: 1000
  max_workers: 10
  timeout_seconds: 300
  retry_attempts: 3
  retry_delay: 1.0

# Email settings
email:
  extract_attachments: true
  include_folders:
    - "Inbox"
    - "Sent Items"
    - "Archive"
  exclude_folders:
    - "Deleted Items"
    - "Junk Email"

# Security settings
security:
  encrypt_credentials: true
  master_key: "your_encryption_key"
  enable_audit_logging: true

# Monitoring settings
monitoring:
  enable_metrics: true
  metrics_port: 8080
  health_check_interval: 30
  log_level: "INFO"
```
### Environment Variables
```bash
# Database settings
export DATABASE__HOST=localhost
export DATABASE__PORT=5432
export DATABASE__USERNAME=postgres
export DATABASE__PASSWORD=your_password
# Graph API settings
export PROTOCOLS__GRAPH_API__CLIENT_ID=your_client_id
export PROTOCOLS__GRAPH_API__CLIENT_SECRET=your_client_secret
export PROTOCOLS__GRAPH_API__TENANT_ID=your_tenant_id
# Storage backend settings
export STORAGE__MINIO__ACCESS_KEY=minioadmin
export STORAGE__MINIO__SECRET_KEY=minioadmin
export STORAGE__AWS_S3__ACCESS_KEY=your_aws_access_key
export STORAGE__AWS_S3__SECRET_KEY=your_aws_secret_key
# Security settings
export SECURITY__MASTER_KEY=your_encryption_key
export SECURITY__ENCRYPT_CREDENTIALS=true
# Load configuration file
export CONFIG_FILE=/path/to/config.yaml
```
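The double-underscore keys follow a nesting convention: each `__` descends one level into the configuration tree. The helper below is only an illustration of that convention (the package ships its own settings loader, which is not shown here and may behave differently).

```python
import os

def nested_from_env(separator: str = "__") -> dict:
    """Fold FOO__BAR=value environment variables into {"foo": {"bar": "value"}}.
    Illustrative only; not the library's settings loader."""
    config: dict = {}
    for key, value in os.environ.items():
        if separator not in key:
            continue  # ignore flat variables
        parts = [part.lower() for part in key.split(separator)]
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return config

# Example: DATABASE__HOST=localhost  ->  {"database": {"host": "localhost"}}
```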
## Performance
### Throughput Benchmarks
| Configuration | Emails/Minute | Attachments/Minute | Memory Usage |
|---------------|---------------|-------------------|--------------|
| Basic (Database Only) | 500-800 | 200-400 | 256MB |
| Hybrid Storage | 800-1200 | 400-800 | 512MB |
| Object Storage Only | 1000-1500 | 600-1200 | 128MB |
| Multi-tier Enterprise | 1200-2000 | 800-1500 | 1GB |
### Performance Optimization
```python
# High-performance configuration
performance_config = {
"processing": {
"batch_size": 2000, # Larger batches for better throughput
"max_workers": 20, # More concurrent workers
"connection_pool_size": 50, # Larger connection pool
"prefetch_count": 100 # Prefetch more emails
},
"attachment_processing": {
"enable_compression": True, # Reduce storage I/O
"enable_deduplication": True, # Avoid duplicate processing
"concurrent_uploads": 10, # Parallel storage uploads
"chunk_size": 8192 # Optimal chunk size for uploads
},
"storage": {
"connection_timeout": 30, # Reasonable timeout
"retry_attempts": 3, # Automatic retries
"use_connection_pooling": True # Reuse connections
}
}
```
### Memory Management
```mermaid
graph LR
A[Email Batch] --> B{Size Check}
B -->|Small| C[Database Storage]
B -->|Medium| D[Hybrid Processing]
B -->|Large| E[Stream to Object Storage]
C --> F[Memory: ~1MB per email]
D --> G[Memory: ~100KB per email]
E --> H[Memory: ~10KB per email]
F --> I[Total: 256MB for 1000 emails]
G --> J[Total: 100MB for 1000 emails]
H --> K[Total: 10MB for 1000 emails]
```
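The memory figures above come from streaming large payloads instead of buffering them. Below is a minimal sketch of that idea, using a chunk size in the same range as the `chunk_size` from the performance configuration; the `upload_stream` call in the final comment is a stand-in, not a documented connector method.

```python
CHUNK_SIZE = 8192  # same order of magnitude as the performance config above

def iter_chunks(fileobj, chunk_size: int = CHUNK_SIZE):
    """Yield an attachment in fixed-size chunks so only one chunk is resident
    in memory at a time instead of the whole payload."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk

# A storage backend that accepts an iterator can then upload a very large
# attachment with a roughly constant memory footprint, e.g.:
#   await object_storage.upload_stream(key, iter_chunks(open(path, "rb")))
```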
### Scaling Recommendations
#### Small Deployments (< 10,000 emails/day)
- **Configuration**: Basic database storage
- **Resources**: 2 CPU cores, 4GB RAM
- **Storage**: PostgreSQL with SSD storage
#### Medium Deployments (10,000 - 100,000 emails/day)
- **Configuration**: Hybrid storage with MinIO
- **Resources**: 4 CPU cores, 8GB RAM
- **Storage**: PostgreSQL + MinIO cluster
#### Large Deployments (100,000+ emails/day)
- **Configuration**: Multi-tier object storage
- **Resources**: 8+ CPU cores, 16GB+ RAM
- **Storage**: PostgreSQL + AWS S3/Azure Blob + CDN
## Advanced Usage
### Protocol Adapters
#### Microsoft Graph API Adapter
- **Features**: OAuth2 authentication, rate limiting, pagination support
- **Configuration**: Client ID, Client Secret, Tenant ID
- **Usage**: Modern REST API for Office 365 and Outlook.com
```python
from evolvishub_outlook_ingestor.protocols import GraphAPIAdapter
adapter = GraphAPIAdapter("graph_api", {
"client_id": "your_client_id",
"client_secret": "your_client_secret",
"tenant_id": "your_tenant_id",
"rate_limit": 100, # requests per minute
})
```
#### Exchange Web Services (EWS) Adapter
- **Features**: Basic and OAuth2 authentication, connection pooling
- **Configuration**: Server URL, credentials, timeout settings
- **Usage**: On-premises Exchange servers and Exchange Online
```python
from evolvishub_outlook_ingestor.protocols import ExchangeWebServicesAdapter
adapter = ExchangeWebServicesAdapter("exchange", {
"server": "outlook.office365.com",
"username": "your_email@company.com",
"password": "your_password",
"auth_type": "basic", # or "oauth2"
})
```
#### IMAP/POP3 Adapter
- **Features**: SSL/TLS support, folder synchronization, UID tracking
- **Configuration**: Server details, authentication credentials
- **Usage**: Standard email protocols for broad compatibility
```python
from evolvishub_outlook_ingestor.protocols import IMAPAdapter
adapter = IMAPAdapter("imap", {
"server": "outlook.office365.com",
"port": 993,
"username": "your_email@company.com",
"password": "your_password",
"use_ssl": True,
})
```
### Database Connectors
#### PostgreSQL Connector
- **Features**: Async operations, connection pooling, JSON fields, full-text search
- **Schema**: Optimized tables with proper indexes for email data
- **Performance**: Batch operations, transaction support
```python
from evolvishub_outlook_ingestor.connectors import PostgreSQLConnector
connector = PostgreSQLConnector("postgresql", {
"host": "localhost",
"port": 5432,
"database": "outlook_data",
"username": "postgres",
"password": "your_password",
"pool_size": 20,
})
```
#### MongoDB Connector
- **Features**: Document storage, GridFS for large attachments, aggregation pipelines
- **Schema**: Flexible document structure with proper indexing
- **Scalability**: Horizontal scaling support, replica sets
```python
from evolvishub_outlook_ingestor.connectors import MongoDBConnector
connector = MongoDBConnector("mongodb", {
"host": "localhost",
"port": 27017,
"database": "outlook_data",
"username": "mongo_user",
"password": "your_password",
})
```
### Data Processors
#### Email Processor
- **Features**: Content normalization, HTML to text conversion, duplicate detection
- **Capabilities**: Email validation, link extraction, encoding detection
- **Configuration**: Customizable processing rules
```python
from evolvishub_outlook_ingestor.processors import EmailProcessor
processor = EmailProcessor("email", {
"normalize_content": True,
"extract_links": True,
"validate_addresses": True,
"html_to_text": True,
"remove_duplicates": True,
})
```
#### Attachment Processor
- **Features**: File type detection, virus scanning hooks, metadata extraction
- **Security**: Size validation, type filtering, content analysis
- **Optimization**: Image compression, hash calculation
```python
from evolvishub_outlook_ingestor.processors import AttachmentProcessor
processor = AttachmentProcessor("attachment", {
"max_attachment_size": 50 * 1024 * 1024, # 50MB
"scan_for_viruses": True,
"extract_metadata": True,
"calculate_hashes": True,
"compress_images": True,
})
```
### Hybrid Storage Configuration
```python
import asyncio
from evolvishub_outlook_ingestor.processors.enhanced_attachment_processor import (
EnhancedAttachmentProcessor,
StorageStrategy
)
from evolvishub_outlook_ingestor.connectors.minio_connector import MinIOConnector
from evolvishub_outlook_ingestor.connectors.aws_s3_connector import AWSS3Connector
async def setup_hybrid_storage():
    # Configure storage backends
    minio_config = {
        "endpoint_url": "localhost:9000",
        "access_key": "minioadmin",
        "secret_key": "minioadmin",
        "bucket_name": "email-attachments-hot",
        "use_ssl": False
    }

    s3_config = {
        "access_key": "your_aws_access_key",
        "secret_key": "your_aws_secret_key",
        "bucket_name": "email-attachments-archive",
        "region": "us-east-1"
    }

    # Initialize storage connectors
    minio_connector = MinIOConnector("hot_storage", minio_config)
    s3_connector = AWSS3Connector("archive_storage", s3_config)

    # Configure enhanced processor with storage rules
    processor_config = {
        "storage_strategy": "hybrid",
        "size_threshold": 1024 * 1024,  # 1MB
        "enable_compression": True,
        "enable_deduplication": True,
        "storage_rules": [
            {
                "name": "large_files",
                "condition": "size > 5*1024*1024",  # Files > 5MB
                "strategy": "storage_only",
                "storage_backend": "archive_storage"
            },
            {
                "name": "medium_files",
                "condition": "size > 1024*1024 and size <= 5*1024*1024",
                "strategy": "hybrid",
                "storage_backend": "hot_storage"
            },
            {
                "name": "small_files",
                "condition": "size <= 1024*1024",
                "strategy": "database_only"
            }
        ]
    }

    # Create enhanced processor
    processor = EnhancedAttachmentProcessor("hybrid_attachments", processor_config)

    # Add storage backends
    async with minio_connector, s3_connector:
        await processor.add_storage_backend("hot_storage", minio_connector)
        await processor.add_storage_backend("archive_storage", s3_connector)

        # Process emails with hybrid storage (`email_with_attachments` is an
        # EmailMessage fetched earlier by one of the protocol adapters)
        result = await processor.process(email_with_attachments)

        # Generate secure URLs for attachment access
        for storage_info in result.metadata.get("storage_infos", []):
            if storage_info.get("storage_backend"):
                backend = processor.storage_backends[storage_info["storage_backend"]]
                secure_url = await backend.generate_presigned_url(
                    storage_info["storage_location"],
                    expires_in=3600  # 1 hour
                )
                print(f"Secure URL: {secure_url}")

asyncio.run(setup_hybrid_storage())
```
### Custom Protocol Adapter
```python
from evolvishub_outlook_ingestor.protocols import BaseProtocol
from evolvishub_outlook_ingestor.core.data_models import EmailMessage
class CustomProtocol(BaseProtocol):
    async def _fetch_emails_impl(self, **kwargs):
        # Implement custom email fetching logic
        emails = []
        # ... fetch emails from custom source
        return emails

# Use custom protocol (`settings` and `config` are assumed to have been loaded
# as described in the Configuration section)
ingestor = OutlookIngestor(
    settings=settings,
    protocol_adapters={"custom": CustomProtocol("custom", config)}
)
```
### Batch Processing with Progress Tracking
```python
from evolvishub_outlook_ingestor.core.data_models import BatchProcessingConfig
async def process_with_progress():
    def progress_callback(processed, total, rate):
        print(f"Progress: {processed}/{total} ({rate:.2f} emails/sec)")

    batch_config = BatchProcessingConfig(
        batch_size=500,
        max_workers=8,
        progress_callback=progress_callback
    )

    # `ingestor` is the OutlookIngestor instance created during setup
    result = await ingestor.process_emails(
        protocol="exchange",
        database="mongodb",
        batch_config=batch_config
    )
```
### Database Transactions
```python
from evolvishub_outlook_ingestor.connectors import PostgreSQLConnector
async def transactional_processing():
    connector = PostgreSQLConnector("postgres", config)
    await connector.initialize()

    async with connector.transaction() as tx:
        # All operations within this block are transactional
        for email in emails:
            await connector.store_email(email, transaction=tx)
        # Automatically commits on success, rolls back on error
```
## Component Architecture
### Component Overview
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Protocols     │    │   Processors     │    │   Connectors    │
│                 │    │                  │    │                 │
│ • Exchange EWS  │───▶│ • Email Proc.    │───▶│ • PostgreSQL    │
│ • Graph API     │    │ • Attachment     │    │ • MongoDB       │
│ • IMAP/POP3     │    │ • Batch Proc.    │    │ • MySQL         │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                     ┌─────────────────────┐
                     │   Core Framework    │
                     │                     │
                     │ • Configuration     │
                     │ • Logging           │
                     │ • Error Handling    │
                     │ • Retry Logic       │
                     │ • Metrics           │
                     └─────────────────────┘
```
### Design Patterns
- **Strategy Pattern**: Interchangeable protocol adapters
- **Factory Pattern**: Dynamic component creation
- **Repository Pattern**: Database abstraction
- **Observer Pattern**: Progress and metrics tracking
- **Circuit Breaker**: Fault tolerance
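To make the first two patterns concrete, the sketch below shows a strategy interface plus a small factory. The `EmailSource` protocol and `make_adapter` registry are hypothetical; the real adapter classes are the ones documented under Advanced Usage.

```python
from typing import Protocol


class EmailSource(Protocol):
    """Strategy interface: any adapter that can fetch emails looks the same to callers."""

    async def fetch_emails(self, limit: int = 100): ...


def make_adapter(kind: str, config: dict) -> EmailSource:
    """Factory sketch: map a configuration key to a concrete adapter class.
    Registry entries are commented out because the imports depend on which
    extras you installed (see Installation)."""
    registry: dict = {
        # "graph_api": GraphAPIAdapter,
        # "exchange": ExchangeWebServicesAdapter,
        # "imap": IMAPAdapter,
    }
    try:
        return registry[kind](kind, config)
    except KeyError:
        raise ValueError(f"Unknown protocol adapter: {kind}") from None
```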
## Testing
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=evolvishub_outlook_ingestor --cov-report=html
# Run specific test categories
pytest -m unit # Unit tests only
pytest -m integration # Integration tests only
pytest -m performance # Performance tests only
# Run tests in parallel
pytest -n auto
```
## Performance Benchmarks and Tips
### Benchmarks
- **Email Processing**: 1000+ emails/minute
- **Memory Usage**: <100MB for 10K emails
- **Database Throughput**: 500+ inserts/second
- **Concurrent Connections**: 50+ simultaneous
### Optimization Tips
1. **Use Batch Processing**: Process emails in batches for better throughput
2. **Enable Connection Pooling**: Reuse database connections
3. **Configure Rate Limiting**: Avoid API throttling
4. **Monitor Memory Usage**: Use streaming for large datasets
5. **Tune Worker Count**: Match your system's CPU cores
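For tip 5, a reasonable starting point is to derive the worker count from the host's CPU count and cap it; the cap below is illustrative, not a library default.

```python
import os

# Derive max_workers from the machine, capped to avoid overwhelming the mail
# server or database with concurrent requests (cap value is illustrative).
max_workers = min(os.cpu_count() or 4, 20)

processing_config = {
    "batch_size": 1000,
    "max_workers": max_workers,
}
```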
## Monitoring
### Metrics Collection
```python
# Enable Prometheus metrics
settings.monitoring.enable_metrics = True
settings.monitoring.metrics_port = 8000
# Access metrics endpoint
# http://localhost:8000/metrics
```
### Health Checks
```python
# Check component health
status = await ingestor.get_status()
print(f"Protocol Status: {status['protocols']}")
print(f"Database Status: {status['database']}")
```
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
### Development Setup
```bash
# Clone repository
git clone https://github.com/evolvisai/metcal.git
cd metcal/shared/libs/evolvis-outlook-ingestor
# Install development dependencies
pip install -e ".[dev]"
# Run pre-commit hooks
pre-commit install
# Run tests
pytest
```
## API Reference
### Core Components
**Protocol Adapters**
- `GraphAPIAdapter`: Microsoft Graph API integration
- `ExchangeWebServicesAdapter`: Exchange Web Services (EWS) support
- `IMAPAdapter`: IMAP protocol support
**Database Connectors**
- `PostgreSQLConnector`: PostgreSQL database integration
- `MongoDBConnector`: MongoDB database integration
**Data Processors**
- `EmailProcessor`: Email content processing and normalization
- `AttachmentProcessor`: Attachment handling and security scanning
**Security Utilities**
- `SecureCredentialManager`: Credential encryption and management
- `CredentialMasker`: Sensitive data masking for logs
- `InputSanitizer`: Input validation and sanitization
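The credential encryption named under Enterprise Security is based on Fernet. The snippet below illustrates the primitive directly with the `cryptography` package; `SecureCredentialManager` wraps this kind of workflow, and its exact API is documented separately rather than shown here.

```python
from cryptography.fernet import Fernet

# Illustration of Fernet symmetric encryption, the primitive used for
# credential storage. Keep the master key outside source control, e.g. in the
# SECURITY__MASTER_KEY environment variable shown in the Configuration section.
key = Fernet.generate_key()
fernet = Fernet(key)

token = fernet.encrypt(b"smtp-password")   # ciphertext is safe to persist
plaintext = fernet.decrypt(token)          # recoverable only with the key
```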
### Configuration Reference
```python
# Complete configuration example
config = {
"graph_api": {
"client_id": "your_client_id",
"client_secret": "your_client_secret",
"tenant_id": "your_tenant_id",
"rate_limit": 100, # requests per minute
"timeout": 30, # request timeout in seconds
},
"database": {
"host": "localhost",
"port": 5432,
"database": "outlook_ingestor",
"username": "ingestor_user",
"password": "secure_password",
"ssl_mode": "require",
"enable_connection_pooling": True,
"pool_size": 10,
},
"email_processing": {
"normalize_content": True,
"extract_links": True,
"validate_addresses": True,
"html_to_text": True,
"remove_duplicates": True,
},
"attachment_processing": {
"max_attachment_size": 10 * 1024 * 1024, # 10MB
"extract_metadata": True,
"calculate_hashes": True,
"scan_for_viruses": False,
}
}
```
### Error Handling
```python
from evolvishub_outlook_ingestor.core.exceptions import (
ConnectionError,
AuthenticationError,
DatabaseError,
ProcessingError,
ValidationError,
)
try:
    await protocol.fetch_emails()
except AuthenticationError:
    # Handle authentication issues
    print("Check your API credentials")
except ConnectionError:
    # Handle network/connection issues
    print("Check network connectivity")
except ProcessingError as e:
    # Handle processing errors
    print(f"Processing failed: {e}")
```
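These exceptions pair naturally with the automatic retry logic advertised in the feature list. The helper below is a generic retry-with-backoff sketch, not the library's built-in mechanism, and reuses the exception classes imported in the previous block.

```python
import asyncio
import random


async def with_retries(operation, attempts: int = 3, base_delay: float = 1.0):
    """Retry a coroutine-returning callable with jittered exponential backoff.
    Illustrative pattern only; the library's built-in retry behaviour is
    configured via retry_attempts / retry_delay in the processing settings."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except (ConnectionError, ProcessingError):
            if attempt == attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1) + random.random())

# Usage: emails = await with_retries(lambda: protocol.fetch_emails(limit=100))
```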
## Support and Documentation
### Documentation Resources
- **[Storage Architecture Guide](docs/STORAGE_ARCHITECTURE.md)** - Comprehensive guide to hybrid storage configuration
- **[Migration Guide](docs/MIGRATION_GUIDE.md)** - Step-by-step migration from basic to hybrid storage
- **[API Reference](docs/API_REFERENCE.md)** - Complete API documentation with examples
- **[Performance Tuning](docs/PERFORMANCE_TUNING.md)** - Optimization guidelines for large-scale deployments
### Community and Support
- **[GitHub Issues](https://github.com/evolvisai/metcal/issues)** - Bug reports and feature requests
- **[GitHub Discussions](https://github.com/evolvisai/metcal/discussions)** - Community discussions and Q&A
- **[Examples Directory](examples/)** - Comprehensive usage examples and tutorials
### Enterprise Support
For enterprise deployments requiring dedicated support, custom integrations, or professional services, please contact our team for tailored solutions and SLA-backed support options.
## Technical Specifications
### Supported Platforms
- **Operating Systems**: Linux (Ubuntu 18.04+, CentOS 7+), macOS (10.15+), Windows (10+)
- **Python Versions**: 3.9, 3.10, 3.11, 3.12
- **Database Systems**: PostgreSQL 12+, MongoDB 4.4+, MySQL 8.0+
- **Object Storage**: MinIO, AWS S3, Azure Blob Storage, Google Cloud Storage
### Performance Characteristics
- **Throughput**: Up to 2,000 emails/minute with hybrid storage
- **Concurrency**: Support for 50+ concurrent processing workers
- **Memory Efficiency**: <10KB per email with object storage strategy
- **Storage Optimization**: Up to 70% reduction in database size with intelligent routing
### Security Compliance
- **Encryption**: AES-256 encryption for credentials and sensitive data
- **Authentication**: OAuth2, Basic Auth, and certificate-based authentication
- **Access Control**: Role-based access control and audit logging
- **Compliance**: GDPR, HIPAA, and SOX compliance features available
## Acknowledgments
This project is built on top of excellent open-source technologies:
- **[Pydantic](https://pydantic.dev/)** - Data validation and settings management
- **[SQLAlchemy](https://sqlalchemy.org/)** - Database ORM with async support
- **[asyncio](https://docs.python.org/3/library/asyncio.html)** - Asynchronous programming framework
- **[pytest](https://pytest.org/)** - Testing framework with async support
- **[Black](https://black.readthedocs.io/)**, **[isort](https://pycqa.github.io/isort/)**, **[mypy](https://mypy.readthedocs.io/)** - Code quality and type checking tools
## License
### Evolvis AI License
This software is proprietary to **Evolvis AI** and is protected by copyright and other intellectual property laws.
#### License Terms
- **Evaluation and Non-Commercial Use**: This package is available for evaluation, research, and non-commercial use
- **Commercial Use Restrictions**: Commercial or production use of this library requires a valid Evolvis AI License
- **Redistribution Prohibited**: Redistribution or commercial use without proper licensing is strictly prohibited
#### Commercial Licensing
For commercial licensing, production deployments, or enterprise use, please contact:
**Montgomery Miralles**
**Email**: [m.miralles@evolvis.ai](mailto:m.miralles@evolvis.ai)
**Company**: Evolvis AI
**Website**: [https://evolvis.ai](https://evolvis.ai)
#### Important Notice
> **Commercial users must obtain proper licensing before deploying this software in production environments.** Unauthorized commercial use may result in legal action. Contact Montgomery Miralles for licensing agreements and compliance requirements.
#### Full License
For complete license terms and conditions, see the [LICENSE](LICENSE) file included with this distribution.
---
**Evolvishub Outlook Ingestor** - Enterprise-grade email ingestion with intelligent hybrid storage architecture.
Raw data
{
"_id": null,
"home_page": null,
"name": "evolvishub-outlook-ingestor",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "Kevin Medina G\u00f3mez <k.medina@evolvis.ai>",
"keywords": "outlook, email, ingestion, exchange, graph-api, imap, pop3, database, async, batch-processing, security, monitoring, performance, postgresql, mongodb, enterprise",
"author": null,
"author_email": "\"Alban Maxhuni, PhD\" <a.maxhuni@evolvis.ai>",
"download_url": "https://files.pythonhosted.org/packages/88/f2/2dfa77bd7b374a68007e712f301082248b9dd8763768eed9fa44c0c0beff/evolvishub_outlook_ingestor-1.0.2.tar.gz",
"platform": null,
"description": "<div align=\"center\">\n <img src=\"https://evolvis.ai/wp-content/uploads/2025/08/evie-solutions-03.png\" alt=\"Evolvis AI - Evie Solutions Logo\" width=\"400\">\n</div>\n\n# Evolvishub Outlook Ingestor\n\n**Production-ready, secure email ingestion system for Microsoft Outlook with advanced processing, monitoring, and hybrid storage capabilities.**\n\nA comprehensive Python library for ingesting, processing, and storing email data from Microsoft Outlook and Exchange systems. Built with enterprise-grade security, performance, and scalability in mind, featuring intelligent hybrid storage architecture for optimal cost and performance.\n\n## Download Statistics\n\n[](https://pepy.tech/project/evolvishub-outlook-ingestor)\n[](https://pepy.tech/project/evolvishub-outlook-ingestor)\n[](https://pypi.org/project/evolvishub-outlook-ingestor/)\n[](https://pypi.org/project/evolvishub-outlook-ingestor/)\n[](LICENSE)\n[](https://github.com/psf/black)\n[](https://mypy.readthedocs.io/)\n\n## Table of Contents\n\n- [Features](#features)\n- [Architecture](#architecture)\n- [About Evolvis AI](#about-evolvis-ai)\n- [Installation](#installation)\n- [Quick Start](#quick-start)\n- [Hybrid Storage Configuration](#hybrid-storage-configuration)\n- [Configuration](#configuration)\n- [Performance](#performance)\n- [Advanced Usage](#advanced-usage)\n- [Support and Documentation](#support-and-documentation)\n- [Technical Specifications](#technical-specifications)\n- [Acknowledgments](#acknowledgments)\n- [License](#license)\n\n## Features\n\n### Protocol Support\n- **Microsoft Graph API** - Modern OAuth2-based access to Office 365 and Exchange Online\n- **Exchange Web Services (EWS)** - Enterprise-grade access to on-premises Exchange servers\n- **IMAP/POP3** - Universal email protocol support for legacy systems and third-party providers\n\n### Database Integration\n- **PostgreSQL** - High-performance relational database with advanced indexing and async support\n- **MongoDB** - Scalable NoSQL document storage for flexible email data structures\n- **MySQL** - Reliable relational database support for existing infrastructure\n- **SQLite** - Lightweight file-based database for development, testing, and small deployments\n- **Microsoft SQL Server** - Enterprise database for Windows-centric environments with advanced features\n- **MariaDB** - Open-source MySQL alternative with enhanced performance and features\n- **Oracle Database** - Enterprise-grade database for mission-critical applications\n- **CockroachDB** - Distributed, cloud-native database for global scale and resilience\n- **ClickHouse** - High-performance columnar database for analytics and real-time queries\n\n### Data Lake Integration\n- **Delta Lake** - Apache Spark-based ACID transactional storage layer with time travel capabilities\n- **Apache Iceberg** - Open table format for large-scale analytics with schema evolution support\n- **Hybrid Analytics** - Seamless integration between operational databases and analytical data lakes\n\n### Hybrid Storage Architecture\n- **MinIO** - Self-hosted S3-compatible storage for on-premises control and high performance\n- **AWS S3** - Enterprise cloud storage with global CDN, lifecycle policies, and encryption\n- **Azure Blob Storage** - Microsoft ecosystem integration with hot/cool/archive storage tiers\n- **Google Cloud Storage** - Global infrastructure with ML integration and advanced analytics\n- **Intelligent Routing** - Size-based and content-type-based storage decisions with configurable rules\n- 
**Content Deduplication** - SHA256-based deduplication to eliminate duplicate attachments\n- **Automatic Compression** - GZIP/ZLIB compression for text-based attachments\n- **Secure Access** - Pre-signed URLs with configurable expiration for secure attachment access\n\n### Performance & Scalability\n- **Async/Await Architecture** - Non-blocking operations for maximum throughput (1000+ emails/minute)\n- **Hybrid Storage Strategy** - Intelligent routing between database and object storage\n- **Batch Processing** - Efficient handling of large email volumes with concurrent workers\n- **Connection Pooling** - Optimized database connections for enterprise workloads\n- **Memory Optimization** - Smart caching and resource management for large datasets\n- **Multi-tier Storage** - Automatic lifecycle management between hot/warm/cold storage\n\n### Enterprise Security\n- **Credential Encryption** - Fernet symmetric encryption for sensitive data storage\n- **Input Sanitization** - Protection against SQL injection, XSS, and other attacks\n- **Secure Configuration** - Environment variable-based configuration with validation\n- **Audit Logging** - Complete audit trail without sensitive data exposure\n- **Access Control** - IAM-based permissions and secure URL generation\n\n### Developer Experience\n- **Type Safety** - Full type hints and IDE support for enhanced development experience\n- **Comprehensive Testing** - 80%+ test coverage with unit, integration, and performance tests\n- **Extensive Documentation** - Complete API reference with examples and best practices\n- **Configuration-Based Setup** - Flexible YAML/JSON configuration with validation\n- **Error Handling** - Comprehensive exception hierarchy with automatic retry logic\n\n## Architecture\n\n### System Overview\n\n```mermaid\ngraph TB\n subgraph \"Email Sources\"\n A[Microsoft Graph API]\n B[Exchange Web Services]\n C[IMAP/POP3]\n end\n\n subgraph \"Evolvishub Outlook Ingestor\"\n D[Protocol Adapters]\n E[Enhanced Attachment Processor]\n F[Email Processor]\n G[Security Layer]\n end\n\n subgraph \"Storage Layer\"\n H[Database Storage]\n I[Object Storage]\n end\n\n subgraph \"Database Backends\"\n J[PostgreSQL]\n K[MongoDB]\n L[MySQL]\n end\n\n subgraph \"Object Storage Backends\"\n M[MinIO]\n N[AWS S3]\n O[Azure Blob]\n P[Google Cloud Storage]\n end\n\n A --> D\n B --> D\n C --> D\n D --> F\n D --> E\n F --> G\n E --> G\n G --> H\n G --> I\n H --> J\n H --> K\n H --> L\n I --> M\n I --> N\n I --> O\n I --> P\n```\n\n### Hybrid Storage Strategy\n\n```mermaid\nsequenceDiagram\n participant E as Email Processor\n participant AP as Attachment Processor\n participant DB as Database\n participant OS as Object Storage\n\n E->>AP: Process Email with Attachments\n AP->>AP: Evaluate Storage Rules\n\n alt Small Attachment (<1MB)\n AP->>DB: Store in Database\n DB-->>AP: Confirmation\n else Medium Attachment (1-5MB)\n AP->>OS: Store Content\n AP->>DB: Store Metadata + Reference\n OS-->>AP: Storage Key\n DB-->>AP: Confirmation\n else Large Attachment (>5MB)\n AP->>OS: Store Content Only\n OS-->>AP: Storage Key + Metadata\n end\n\n AP-->>E: Processing Complete\n```\n\n### Data Flow Architecture\n\n```mermaid\nflowchart LR\n subgraph \"Ingestion Layer\"\n A[Email Source] --> B[Protocol Adapter]\n B --> C[Rate Limiter]\n C --> D[Authentication]\n end\n\n subgraph \"Processing Layer\"\n D --> E[Email Processor]\n E --> F[Attachment Processor]\n F --> G[Security Scanner]\n G --> H[Deduplication Engine]\n H --> I[Compression Engine]\n end\n\n subgraph \"Storage 
Decision Engine\"\n I --> J{Storage Rules}\n J -->|Small Files| K[Database Storage]\n J -->|Medium Files| L[Hybrid Storage]\n J -->|Large Files| M[Object Storage]\n end\n\n subgraph \"Storage Backends\"\n K --> N[(PostgreSQL/MongoDB)]\n L --> N\n L --> O[(MinIO/S3/Azure/GCS)]\n M --> O\n end\n```\n\n## About Evolvis AI\n\n**Evolvis AI** is a cutting-edge technology company specializing in AI-powered solutions for enterprise email processing, data ingestion, and intelligent automation. Founded with a mission to revolutionize how organizations handle and analyze their email communications, Evolvis AI develops sophisticated tools that combine artificial intelligence with robust engineering practices.\n\n### Our Focus\n- **AI-Powered Email Processing** - Advanced algorithms for intelligent email analysis, classification, and extraction\n- **Enterprise Data Solutions** - Scalable systems for large-scale email ingestion and processing\n- **Intelligent Automation** - Smart workflows that adapt to organizational needs and patterns\n- **Security-First Architecture** - Enterprise-grade security and compliance for sensitive email data\n\n### Innovation at Scale\nEvolvis AI's solutions are designed to handle enterprise-scale email processing challenges, from small businesses to global corporations. Our technology stack emphasizes performance, security, and scalability while maintaining ease of use and deployment flexibility.\n\n**Learn more about our solutions:** [https://evolvis.ai](https://evolvis.ai)\n\n## Installation\n\n### Basic Installation\n\n```bash\n# Install core package\npip install evolvishub-outlook-ingestor\n```\n\n### Feature-Specific Installation\n\n```bash\n# Protocol adapters (Microsoft Graph, EWS, IMAP/POP3)\npip install evolvishub-outlook-ingestor[protocols]\n\n# Core database connectors (PostgreSQL, MongoDB, MySQL)\npip install evolvishub-outlook-ingestor[database]\n\n# Individual database connectors\npip install evolvishub-outlook-ingestor[database-sqlite] # SQLite\npip install evolvishub-outlook-ingestor[database-mssql] # SQL Server\npip install evolvishub-outlook-ingestor[database-mariadb] # MariaDB\npip install evolvishub-outlook-ingestor[database-oracle] # Oracle\npip install evolvishub-outlook-ingestor[database-cockroachdb] # CockroachDB\n\n# All database connectors\npip install evolvishub-outlook-ingestor[database-all]\n\n# Data lake connectors\npip install evolvishub-outlook-ingestor[datalake-delta] # Delta Lake\npip install evolvishub-outlook-ingestor[datalake-iceberg] # Apache Iceberg\npip install evolvishub-outlook-ingestor[database-clickhouse] # ClickHouse\n\n# All data lake connectors\npip install evolvishub-outlook-ingestor[datalake-all]\n\n# Object storage support (MinIO S3-compatible)\npip install evolvishub-outlook-ingestor[storage]\n\n# Data processing features (HTML parsing, image processing)\npip install evolvishub-outlook-ingestor[processing]\n```\n\n### Cloud Storage Installation\n\n```bash\n# AWS S3 support\npip install evolvishub-outlook-ingestor[cloud-aws]\n\n# Azure Blob Storage support\npip install evolvishub-outlook-ingestor[cloud-azure]\n\n# Google Cloud Storage support\npip install evolvishub-outlook-ingestor[cloud-gcp]\n\n# All cloud storage backends\npip install evolvishub-outlook-ingestor[cloud-all]\n```\n\n### Complete Installation\n\n```bash\n# Install all features and dependencies\npip install evolvishub-outlook-ingestor[all]\n\n# Development installation with testing tools\npip install evolvishub-outlook-ingestor[dev]\n```\n\n### 
Requirements\n\n- **Python**: 3.9 or higher\n- **Operating System**: Linux, macOS, Windows\n- **Memory**: Minimum 512MB RAM (2GB+ recommended for large datasets)\n- **Storage**: Varies based on email volume and attachment storage strategy\n\n## Quick Start\n\n### Basic Email Ingestion\n\n```python\nimport asyncio\nfrom evolvishub_outlook_ingestor.protocols.microsoft_graph import GraphAPIAdapter\nfrom evolvishub_outlook_ingestor.connectors.postgresql_connector import PostgreSQLConnector\nfrom evolvishub_outlook_ingestor.processors.email_processor import EmailProcessor\n\nasync def basic_email_ingestion():\n # Configure Microsoft Graph API\n graph_config = {\n \"client_id\": \"your_client_id\",\n \"client_secret\": \"your_client_secret\",\n \"tenant_id\": \"your_tenant_id\"\n }\n\n # Configure PostgreSQL database\n db_config = {\n \"host\": \"localhost\",\n \"port\": 5432,\n \"database\": \"outlook_data\",\n \"username\": \"postgres\",\n \"password\": \"your_password\"\n }\n\n # Initialize components\n async with GraphAPIAdapter(\"graph\", graph_config) as protocol, \\\n PostgreSQLConnector(\"db\", db_config) as connector:\n\n # Create email processor\n processor = EmailProcessor(\"email_processor\")\n\n # Fetch and process emails\n emails = await protocol.fetch_emails(limit=10)\n\n for email in emails:\n # Process email content\n result = await processor.process(email)\n\n # Store in database\n if result.status.value == \"success\":\n await connector.store_email(result.processed_data)\n print(f\"Stored email: {email.subject}\")\n\nasyncio.run(basic_email_ingestion())\n```\n\n## Hybrid Storage Configuration\n\n### Enterprise-Grade Attachment Processing\n\n```python\nimport asyncio\nfrom evolvishub_outlook_ingestor.processors.enhanced_attachment_processor import (\n EnhancedAttachmentProcessor,\n StorageStrategy\n)\nfrom evolvishub_outlook_ingestor.connectors.minio_connector import MinIOConnector\nfrom evolvishub_outlook_ingestor.connectors.aws_s3_connector import AWSS3Connector\nfrom evolvishub_outlook_ingestor.connectors.postgresql_connector import PostgreSQLConnector\n\nasync def hybrid_storage_setup():\n # Configure MinIO for hot storage (frequently accessed files)\n minio_config = {\n \"endpoint_url\": \"localhost:9000\",\n \"access_key\": \"minioadmin\",\n \"secret_key\": \"minioadmin\",\n \"bucket_name\": \"email-attachments-hot\",\n \"use_ssl\": False # Set to True for production\n }\n\n # Configure AWS S3 for archive storage (long-term storage)\n s3_config = {\n \"access_key\": \"your_aws_access_key\",\n \"secret_key\": \"your_aws_secret_key\",\n \"bucket_name\": \"email-attachments-archive\",\n \"region\": \"us-east-1\"\n }\n\n # Configure enhanced attachment processor with intelligent routing\n processor_config = {\n \"storage_strategy\": \"hybrid\",\n \"size_threshold\": 1024 * 1024, # 1MB threshold\n \"enable_compression\": True,\n \"enable_deduplication\": True,\n \"enable_virus_scanning\": False, # Configure as needed\n \"default_storage_backend\": \"hot_storage\",\n\n # Intelligent storage routing rules\n \"storage_rules\": [\n {\n \"name\": \"large_files\",\n \"condition\": \"size > 5*1024*1024\", # Files > 5MB\n \"strategy\": \"storage_only\",\n \"storage_backend\": \"archive_storage\"\n },\n {\n \"name\": \"medium_files\",\n \"condition\": \"size > 1024*1024 and size <= 5*1024*1024\", # 1-5MB\n \"strategy\": \"hybrid\",\n \"storage_backend\": \"hot_storage\"\n },\n {\n \"name\": \"small_files\",\n \"condition\": \"size <= 1024*1024\", # Files <= 1MB\n \"strategy\": 
\"database_only\"\n },\n {\n \"name\": \"compressible_text\",\n \"condition\": \"content_type.startswith('text/') and size > 1024\",\n \"strategy\": \"hybrid\",\n \"storage_backend\": \"hot_storage\",\n \"compress\": True,\n \"compression_type\": \"gzip\"\n }\n ]\n }\n\n # Initialize storage connectors\n minio_connector = MinIOConnector(\"hot_storage\", minio_config)\n s3_connector = AWSS3Connector(\"archive_storage\", s3_config)\n\n # Initialize enhanced processor\n processor = EnhancedAttachmentProcessor(\"hybrid_attachments\", processor_config)\n\n async with minio_connector, s3_connector:\n # Add storage backends to processor\n await processor.add_storage_backend(\"hot_storage\", minio_connector)\n await processor.add_storage_backend(\"archive_storage\", s3_connector)\n\n # Process emails with intelligent attachment routing\n # (email processing code here)\n\n # Generate secure URLs for attachment access\n storage_info = {\n \"storage_location\": \"2024/01/15/abc123.pdf\",\n \"storage_backend\": \"hot_storage\"\n }\n\n backend = processor.storage_backends[storage_info[\"storage_backend\"]]\n secure_url = await backend.generate_presigned_url(\n storage_info[\"storage_location\"],\n expires_in=3600 # 1 hour expiration\n )\n\n print(f\"Secure attachment URL: {secure_url}\")\n\nasyncio.run(hybrid_storage_setup())\n```\n\n### Storage Strategy Decision Matrix\n\n| File Size | Content Type | Storage Strategy | Backend | Compression |\n|-----------|--------------|------------------|---------|-------------|\n| < 1MB | Any | Database Only | PostgreSQL/MongoDB | No |\n| 1-5MB | Documents/Images | Hybrid | MinIO/Hot Storage | Optional |\n| 1-5MB | Text Files | Hybrid | MinIO/Hot Storage | Yes (GZIP) |\n| > 5MB | Any | Storage Only | AWS S3/Archive | Optional |\n| > 10MB | Any | Storage Only | AWS S3/Glacier | Yes |\n\n### Database Selection Guide\n\n| Database | Best For | Pros | Cons | Recommended Use Case |\n|----------|----------|------|------|---------------------|\n| **SQLite** | Development, Testing, Small Scale | Simple setup, no server required, ACID compliant | Single writer, limited concurrency | Development environments, small deployments (<10K emails/day) |\n| **PostgreSQL** | General Purpose, High Performance | Excellent performance, rich features, strong consistency | Requires server setup and maintenance | Most production deployments, complex queries |\n| **MongoDB** | Flexible Schema, Document Storage | Schema flexibility, horizontal scaling, JSON-native | Eventual consistency, memory usage | Variable email structures, rapid prototyping |\n| **MySQL/MariaDB** | Web Applications, Existing Infrastructure | Wide adoption, good performance, familiar | Limited JSON support (older versions) | Web applications, existing MySQL infrastructure |\n| **SQL Server** | Windows Environments, Enterprise | Enterprise features, excellent tooling, integration | Windows-centric, licensing costs | Windows-based enterprises, .NET applications |\n| **Oracle** | Mission-Critical, Large Enterprise | Proven reliability, advanced features, scalability | High cost, complexity | Large enterprises, mission-critical systems |\n| **CockroachDB** | Global Scale, Cloud-Native | Distributed, strong consistency, cloud-native | Newer technology, learning curve | Global deployments, cloud-native applications |\n\n### Data Lake and Analytics Selection Guide\n\n| Platform | Best For | Pros | Cons | Recommended Use Case |\n|----------|----------|------|------|---------------------|\n| **Delta Lake** | ACID Analytics, Time 
### Database Selection Guide

| Database | Best For | Pros | Cons | Recommended Use Case |
|----------|----------|------|------|---------------------|
| **SQLite** | Development, Testing, Small Scale | Simple setup, no server required, ACID compliant | Single writer, limited concurrency | Development environments, small deployments (<10K emails/day) |
| **PostgreSQL** | General Purpose, High Performance | Excellent performance, rich features, strong consistency | Requires server setup and maintenance | Most production deployments, complex queries |
| **MongoDB** | Flexible Schema, Document Storage | Schema flexibility, horizontal scaling, JSON-native | Eventual consistency, memory usage | Variable email structures, rapid prototyping |
| **MySQL/MariaDB** | Web Applications, Existing Infrastructure | Wide adoption, good performance, familiar | Limited JSON support (older versions) | Web applications, existing MySQL infrastructure |
| **SQL Server** | Windows Environments, Enterprise | Enterprise features, excellent tooling, integration | Windows-centric, licensing costs | Windows-based enterprises, .NET applications |
| **Oracle** | Mission-Critical, Large Enterprise | Proven reliability, advanced features, scalability | High cost, complexity | Large enterprises, mission-critical systems |
| **CockroachDB** | Global Scale, Cloud-Native | Distributed, strong consistency, cloud-native | Newer technology, learning curve | Global deployments, cloud-native applications |

### Data Lake and Analytics Selection Guide

| Platform | Best For | Pros | Cons | Recommended Use Case |
|----------|----------|------|------|---------------------|
| **Delta Lake** | ACID Analytics, Time Travel | ACID transactions, time travel, schema evolution, Spark ecosystem | Requires Spark, Java/Scala ecosystem | Data science workflows, ML pipelines, audit requirements |
| **Apache Iceberg** | Multi-Engine Analytics | Engine agnostic, hidden partitioning, snapshot isolation | Newer ecosystem, complex setup | Multi-tool analytics, data warehouse modernization |
| **ClickHouse** | Real-Time Analytics | Extremely fast queries, columnar storage, SQL interface | Limited transaction support, specialized use case | Real-time dashboards, email analytics, reporting |

### Hybrid Architecture Patterns

| Pattern | Description | Use Case | Benefits |
|---------|-------------|----------|----------|
| **Operational + Analytics** | PostgreSQL for operations, Delta Lake for analytics | Real-time app + historical analysis | Best of both worlds, optimized for each workload |
| **Hot + Cold Storage** | ClickHouse for recent data, Iceberg for historical | Email analytics with time-based access patterns | Cost optimization, query performance |
| **Multi-Engine Lake** | Iceberg with Spark, Trino, and Flink | Complex analytics requiring different compute engines | Flexibility, avoid vendor lock-in |

## Configuration

### Complete Configuration Example

Create a `config.yaml` file for comprehensive system configuration. The `database:` blocks below are alternatives; keep exactly one of them in the file you actually use:

```yaml
# Database configuration examples (choose one `database:` block)

# PostgreSQL
database:
  type: "postgresql"
  host: "localhost"
  port: 5432
  database: "outlook_data"
  username: "postgres"
  password: "your_password"
  pool_size: 20
  max_overflow: 30

# SQLite (for development/testing)
database:
  type: "sqlite"
  database_path: "outlook_data.db"
  enable_wal: true
  timeout: 30.0

# SQL Server
database:
  type: "mssql"
  server: "localhost\\SQLEXPRESS"
  port: 1433
  database: "outlook_data"
  username: "sa"
  password: "your_password"
  trusted_connection: false
  encrypt: true

# MariaDB
database:
  type: "mariadb"
  host: "localhost"
  port: 3306
  database: "outlook_data"
  username: "root"
  password: "your_password"
  charset: "utf8mb4"

# Oracle Database
database:
  type: "oracle"
  host: "localhost"
  port: 1521
  service_name: "XEPDB1"
  username: "outlook_user"
  password: "your_password"

# CockroachDB
database:
  type: "cockroachdb"
  host: "localhost"
  port: 26257
  database: "outlook_data"
  username: "root"
  password: "your_password"
  sslmode: "require"

# Data Lake configuration examples

# Delta Lake (local)
database:
  type: "deltalake"
  table_path: "./delta-tables/emails"
  app_name: "outlook-ingestor"
  master: "local[*]"
  partition_columns: ["received_date_partition", "sender_domain"]
  z_order_columns: ["received_date", "sender_email"]
  enable_time_travel: true

# Delta Lake (AWS S3)
database:
  type: "deltalake"
  table_path: "s3a://my-bucket/delta-tables/emails"
  app_name: "outlook-ingestor-prod"
  master: "spark://spark-master:7077"
  cloud_provider: "aws"
  cloud_config:
    access_key: "your_access_key"
    secret_key: "your_secret_key"
    region: "us-west-2"

# Apache Iceberg (Hadoop catalog)
database:
  type: "iceberg"
  catalog_type: "hadoop"
  warehouse_path: "./iceberg-warehouse"
  namespace: "outlook_data"
  table_name: "emails"
  enable_compaction: true

# Apache Iceberg (AWS Glue catalog)
database:
  type: "iceberg"
  catalog_type: "glue"
  catalog_config:
    warehouse: "s3://my-bucket/iceberg-warehouse"
    region: "us-west-2"
  namespace: "outlook_analytics"
  table_name: "emails"

# ClickHouse (local)
database:
  type: "clickhouse"
  host: "localhost"
  port: 8123
  database: "outlook_data"
  username: "default"
  password: "your_password"
  compression: true

# ClickHouse (cluster)
database:
  type: "clickhouse"
  host: "clickhouse-cluster.example.com"
  port: 8123
  database: "outlook_data"
  username: "analytics_user"
  password: "your_password"
  cluster: "outlook_cluster"
  secure: true

# Protocol configurations
protocols:
  graph_api:
    client_id: "your_client_id"
    client_secret: "your_client_secret"
    tenant_id: "your_tenant_id"
    scopes: ["https://graph.microsoft.com/.default"]

  exchange:
    server: "outlook.office365.com"
    username: "your_email@company.com"
    password: "your_password"
    autodiscover: true

# Storage backend configurations
storage:
  minio:
    endpoint_url: "localhost:9000"
    access_key: "minioadmin"
    secret_key: "minioadmin"
    bucket_name: "email-attachments"
    use_ssl: false

  aws_s3:
    access_key: "your_aws_access_key"
    secret_key: "your_aws_secret_key"
    bucket_name: "email-attachments-prod"
    region: "us-east-1"

  azure_blob:
    connection_string: "DefaultEndpointsProtocol=https;AccountName=..."
    container_name: "email-attachments"

# Enhanced attachment processing
attachment_processing:
  storage_strategy: "hybrid"
  size_threshold: 1048576  # 1MB
  enable_compression: true
  enable_deduplication: true
  enable_virus_scanning: false
  max_attachment_size: 52428800  # 50MB

  storage_rules:
    - name: "large_files"
      condition: "size > 5*1024*1024"
      strategy: "storage_only"
      storage_backend: "aws_s3"
    - name: "medium_files"
      condition: "size > 1024*1024 and size <= 5*1024*1024"
      strategy: "hybrid"
      storage_backend: "minio"
    - name: "small_files"
      condition: "size <= 1024*1024"
      strategy: "database_only"

# Processing settings
processing:
  batch_size: 1000
  max_workers: 10
  timeout_seconds: 300
  retry_attempts: 3
  retry_delay: 1.0

# Email settings
email:
  extract_attachments: true
  include_folders:
    - "Inbox"
    - "Sent Items"
    - "Archive"
  exclude_folders:
    - "Deleted Items"
    - "Junk Email"

# Security settings
security:
  encrypt_credentials: true
  master_key: "your_encryption_key"
  enable_audit_logging: true

# Monitoring settings
monitoring:
  enable_metrics: true
  metrics_port: 8080
  health_check_interval: 30
  log_level: "INFO"
```
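A minimal sketch of consuming this file, assuming PyYAML is installed: it loads the YAML and dispatches on `database.type` to one of the connector classes shown later in this README. The `CONNECTOR_CLASSES` mapping is illustrative and only covers two of the supported backends.

```python
import yaml  # pip install pyyaml

from evolvishub_outlook_ingestor.connectors import MongoDBConnector, PostgreSQLConnector

# Illustrative mapping from the `database.type` value to a connector class.
CONNECTOR_CLASSES = {
    "postgresql": PostgreSQLConnector,
    "mongodb": MongoDBConnector,
}

with open("config.yaml", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)

db_config = config["database"]
connector = CONNECTOR_CLASSES[db_config["type"]](db_config["type"], db_config)
```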
\"us-west-2\"\n namespace: \"outlook_analytics\"\n table_name: \"emails\"\n\n# ClickHouse (local)\ndatabase:\n type: \"clickhouse\"\n host: \"localhost\"\n port: 8123\n database: \"outlook_data\"\n username: \"default\"\n password: \"your_password\"\n compression: true\n\n# ClickHouse (cluster)\ndatabase:\n type: \"clickhouse\"\n host: \"clickhouse-cluster.example.com\"\n port: 8123\n database: \"outlook_data\"\n username: \"analytics_user\"\n password: \"your_password\"\n cluster: \"outlook_cluster\"\n secure: true\n\n# Protocol configurations\nprotocols:\n graph_api:\n client_id: \"your_client_id\"\n client_secret: \"your_client_secret\"\n tenant_id: \"your_tenant_id\"\n scopes: [\"https://graph.microsoft.com/.default\"]\n\n exchange:\n server: \"outlook.office365.com\"\n username: \"your_email@company.com\"\n password: \"your_password\"\n autodiscover: true\n\n# Storage backend configurations\nstorage:\n minio:\n endpoint_url: \"localhost:9000\"\n access_key: \"minioadmin\"\n secret_key: \"minioadmin\"\n bucket_name: \"email-attachments\"\n use_ssl: false\n\n aws_s3:\n access_key: \"your_aws_access_key\"\n secret_key: \"your_aws_secret_key\"\n bucket_name: \"email-attachments-prod\"\n region: \"us-east-1\"\n\n azure_blob:\n connection_string: \"DefaultEndpointsProtocol=https;AccountName=...\"\n container_name: \"email-attachments\"\n\n# Enhanced attachment processing\nattachment_processing:\n storage_strategy: \"hybrid\"\n size_threshold: 1048576 # 1MB\n enable_compression: true\n enable_deduplication: true\n enable_virus_scanning: false\n max_attachment_size: 52428800 # 50MB\n\n storage_rules:\n - name: \"large_files\"\n condition: \"size > 5*1024*1024\"\n strategy: \"storage_only\"\n storage_backend: \"aws_s3\"\n - name: \"medium_files\"\n condition: \"size > 1024*1024 and size <= 5*1024*1024\"\n strategy: \"hybrid\"\n storage_backend: \"minio\"\n - name: \"small_files\"\n condition: \"size <= 1024*1024\"\n strategy: \"database_only\"\n\n# Processing settings\nprocessing:\n batch_size: 1000\n max_workers: 10\n timeout_seconds: 300\n retry_attempts: 3\n retry_delay: 1.0\n\n# Email settings\nemail:\n extract_attachments: true\n include_folders:\n - \"Inbox\"\n - \"Sent Items\"\n - \"Archive\"\n exclude_folders:\n - \"Deleted Items\"\n - \"Junk Email\"\n\n# Security settings\nsecurity:\n encrypt_credentials: true\n master_key: \"your_encryption_key\"\n enable_audit_logging: true\n\n# Monitoring settings\nmonitoring:\n enable_metrics: true\n metrics_port: 8080\n health_check_interval: 30\n log_level: \"INFO\"\n```\n\n### Environment Variables\n\n```bash\n# Database settings\nexport DATABASE__HOST=localhost\nexport DATABASE__PORT=5432\nexport DATABASE__USERNAME=postgres\nexport DATABASE__PASSWORD=your_password\n\n# Graph API settings\nexport PROTOCOLS__GRAPH_API__CLIENT_ID=your_client_id\nexport PROTOCOLS__GRAPH_API__CLIENT_SECRET=your_client_secret\nexport PROTOCOLS__GRAPH_API__TENANT_ID=your_tenant_id\n\n# Storage backend settings\nexport STORAGE__MINIO__ACCESS_KEY=minioadmin\nexport STORAGE__MINIO__SECRET_KEY=minioadmin\nexport STORAGE__AWS_S3__ACCESS_KEY=your_aws_access_key\nexport STORAGE__AWS_S3__SECRET_KEY=your_aws_secret_key\n\n# Security settings\nexport SECURITY__MASTER_KEY=your_encryption_key\nexport SECURITY__ENCRYPT_CREDENTIALS=true\n\n# Load configuration file\nexport CONFIG_FILE=/path/to/config.yaml\n```\n\n## Performance\n\n### Throughput Benchmarks\n\n| Configuration | Emails/Minute | Attachments/Minute | Memory Usage 
## Performance

### Throughput Benchmarks

| Configuration | Emails/Minute | Attachments/Minute | Memory Usage |
|---------------|---------------|--------------------|--------------|
| Basic (Database Only) | 500-800 | 200-400 | 256MB |
| Hybrid Storage | 800-1200 | 400-800 | 512MB |
| Object Storage Only | 1000-1500 | 600-1200 | 128MB |
| Multi-tier Enterprise | 1200-2000 | 800-1500 | 1GB |

In practice this corresponds to sustained rates of 1,000+ emails per minute, 500+ database inserts per second, and 50+ concurrent connections, with under 100MB of memory for 10,000 emails when the object-storage strategy is used.

### Performance Optimization

```python
# High-performance configuration
performance_config = {
    "processing": {
        "batch_size": 2000,              # Larger batches for better throughput
        "max_workers": 20,               # More concurrent workers
        "connection_pool_size": 50,      # Larger connection pool
        "prefetch_count": 100            # Prefetch more emails
    },

    "attachment_processing": {
        "enable_compression": True,      # Reduce storage I/O
        "enable_deduplication": True,    # Avoid duplicate processing
        "concurrent_uploads": 10,        # Parallel storage uploads
        "chunk_size": 8192               # Optimal chunk size for uploads
    },

    "storage": {
        "connection_timeout": 30,        # Reasonable timeout
        "retry_attempts": 3,             # Automatic retries
        "use_connection_pooling": True   # Reuse connections
    }
}
```

### Memory Management

```mermaid
graph LR
    A[Email Batch] --> B{Size Check}
    B -->|Small| C[Database Storage]
    B -->|Medium| D[Hybrid Processing]
    B -->|Large| E[Stream to Object Storage]

    C --> F[Memory: ~256KB per email]
    D --> G[Memory: ~100KB per email]
    E --> H[Memory: ~10KB per email]

    F --> I[Total: 256MB for 1000 emails]
    G --> J[Total: 100MB for 1000 emails]
    H --> K[Total: 10MB for 1000 emails]
```

### Scaling Recommendations

#### Small Deployments (< 10,000 emails/day)
- **Configuration**: Basic database storage
- **Resources**: 2 CPU cores, 4GB RAM
- **Storage**: PostgreSQL with SSD storage

#### Medium Deployments (10,000 - 100,000 emails/day)
- **Configuration**: Hybrid storage with MinIO
- **Resources**: 4 CPU cores, 8GB RAM
- **Storage**: PostgreSQL + MinIO cluster

#### Large Deployments (100,000+ emails/day)
- **Configuration**: Multi-tier object storage
- **Resources**: 8+ CPU cores, 16GB+ RAM
- **Storage**: PostgreSQL + AWS S3/Azure Blob + CDN

### Optimization Tips

1. **Use Batch Processing**: Process emails in batches for better throughput
2. **Enable Connection Pooling**: Reuse database connections
3. **Configure Rate Limiting**: Avoid API throttling
4. **Monitor Memory Usage**: Use streaming for large datasets
5. **Tune Worker Count**: Match your system's CPU cores (see the sketch below)
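Tips 1 and 5 can be combined into a small sizing rule. The snippet below derives a starting configuration from the host machine using the `BatchProcessingConfig` shown in the batch-processing example further down; the specific numbers are illustrative defaults, not recommendations baked into the library.

```python
import os

from evolvishub_outlook_ingestor.core.data_models import BatchProcessingConfig

# Match worker count to the machine and keep batches large enough to amortise
# per-batch overhead; adjust both against the throughput benchmarks above.
cpu_cores = os.cpu_count() or 4
batch_config = BatchProcessingConfig(
    batch_size=1000,
    max_workers=min(cpu_cores, 20),
)
```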
IMAPAdapter(\"imap\", {\n \"server\": \"outlook.office365.com\",\n \"port\": 993,\n \"username\": \"your_email@company.com\",\n \"password\": \"your_password\",\n \"use_ssl\": True,\n})\n```\n\n### Database Connectors\n\n#### PostgreSQL Connector\n- **Features**: Async operations, connection pooling, JSON fields, full-text search\n- **Schema**: Optimized tables with proper indexes for email data\n- **Performance**: Batch operations, transaction support\n\n```python\nfrom evolvishub_outlook_ingestor.connectors import PostgreSQLConnector\n\nconnector = PostgreSQLConnector(\"postgresql\", {\n \"host\": \"localhost\",\n \"port\": 5432,\n \"database\": \"outlook_data\",\n \"username\": \"postgres\",\n \"password\": \"your_password\",\n \"pool_size\": 20,\n})\n```\n\n#### MongoDB Connector\n- **Features**: Document storage, GridFS for large attachments, aggregation pipelines\n- **Schema**: Flexible document structure with proper indexing\n- **Scalability**: Horizontal scaling support, replica sets\n\n```python\nfrom evolvishub_outlook_ingestor.connectors import MongoDBConnector\n\nconnector = MongoDBConnector(\"mongodb\", {\n \"host\": \"localhost\",\n \"port\": 27017,\n \"database\": \"outlook_data\",\n \"username\": \"mongo_user\",\n \"password\": \"your_password\",\n})\n```\n\n### Data Processors\n\n#### Email Processor\n- **Features**: Content normalization, HTML to text conversion, duplicate detection\n- **Capabilities**: Email validation, link extraction, encoding detection\n- **Configuration**: Customizable processing rules\n\n```python\nfrom evolvishub_outlook_ingestor.processors import EmailProcessor\n\nprocessor = EmailProcessor(\"email\", {\n \"normalize_content\": True,\n \"extract_links\": True,\n \"validate_addresses\": True,\n \"html_to_text\": True,\n \"remove_duplicates\": True,\n})\n```\n\n#### Attachment Processor\n- **Features**: File type detection, virus scanning hooks, metadata extraction\n- **Security**: Size validation, type filtering, content analysis\n- **Optimization**: Image compression, hash calculation\n\n```python\nfrom evolvishub_outlook_ingestor.processors import AttachmentProcessor\n\nprocessor = AttachmentProcessor(\"attachment\", {\n \"max_attachment_size\": 50 * 1024 * 1024, # 50MB\n \"scan_for_viruses\": True,\n \"extract_metadata\": True,\n \"calculate_hashes\": True,\n \"compress_images\": True,\n})\n```\n\n## \ud83d\udd27 Advanced Usage\n\n### Hybrid Storage Configuration\n\n```python\nimport asyncio\nfrom evolvishub_outlook_ingestor.processors.enhanced_attachment_processor import (\n EnhancedAttachmentProcessor,\n StorageStrategy\n)\nfrom evolvishub_outlook_ingestor.connectors.minio_connector import MinIOConnector\nfrom evolvishub_outlook_ingestor.connectors.aws_s3_connector import AWSS3Connector\n\nasync def setup_hybrid_storage():\n # Configure storage backends\n minio_config = {\n \"endpoint_url\": \"localhost:9000\",\n \"access_key\": \"minioadmin\",\n \"secret_key\": \"minioadmin\",\n \"bucket_name\": \"email-attachments-hot\",\n \"use_ssl\": False\n }\n\n s3_config = {\n \"access_key\": \"your_aws_access_key\",\n \"secret_key\": \"your_aws_secret_key\",\n \"bucket_name\": \"email-attachments-archive\",\n \"region\": \"us-east-1\"\n }\n\n # Initialize storage connectors\n minio_connector = MinIOConnector(\"hot_storage\", minio_config)\n s3_connector = AWSS3Connector(\"archive_storage\", s3_config)\n\n # Configure enhanced processor with storage rules\n processor_config = {\n \"storage_strategy\": \"hybrid\",\n \"size_threshold\": 1024 * 1024, # 
### Hybrid Storage Configuration

```python
import asyncio

from evolvishub_outlook_ingestor.processors.enhanced_attachment_processor import (
    EnhancedAttachmentProcessor,
    StorageStrategy
)
from evolvishub_outlook_ingestor.connectors.minio_connector import MinIOConnector
from evolvishub_outlook_ingestor.connectors.aws_s3_connector import AWSS3Connector

async def setup_hybrid_storage():
    # Configure storage backends
    minio_config = {
        "endpoint_url": "localhost:9000",
        "access_key": "minioadmin",
        "secret_key": "minioadmin",
        "bucket_name": "email-attachments-hot",
        "use_ssl": False
    }

    s3_config = {
        "access_key": "your_aws_access_key",
        "secret_key": "your_aws_secret_key",
        "bucket_name": "email-attachments-archive",
        "region": "us-east-1"
    }

    # Initialize storage connectors
    minio_connector = MinIOConnector("hot_storage", minio_config)
    s3_connector = AWSS3Connector("archive_storage", s3_config)

    # Configure enhanced processor with storage rules
    processor_config = {
        "storage_strategy": "hybrid",
        "size_threshold": 1024 * 1024,  # 1MB
        "enable_compression": True,
        "enable_deduplication": True,
        "storage_rules": [
            {
                "name": "large_files",
                "condition": "size > 5*1024*1024",  # Files > 5MB
                "strategy": "storage_only",
                "storage_backend": "archive_storage"
            },
            {
                "name": "medium_files",
                "condition": "size > 1024*1024 and size <= 5*1024*1024",
                "strategy": "hybrid",
                "storage_backend": "hot_storage"
            },
            {
                "name": "small_files",
                "condition": "size <= 1024*1024",
                "strategy": "database_only"
            }
        ]
    }

    # Create enhanced processor
    processor = EnhancedAttachmentProcessor("hybrid_attachments", processor_config)

    # Add storage backends
    async with minio_connector, s3_connector:
        await processor.add_storage_backend("hot_storage", minio_connector)
        await processor.add_storage_backend("archive_storage", s3_connector)

        # Process emails with hybrid storage
        # (assumes `email_with_attachments` was fetched earlier by a protocol adapter)
        result = await processor.process(email_with_attachments)

        # Generate secure URLs for attachment access
        for storage_info in result.metadata.get("storage_infos", []):
            if storage_info.get("storage_backend"):
                backend = processor.storage_backends[storage_info["storage_backend"]]
                secure_url = await backend.generate_presigned_url(
                    storage_info["storage_location"],
                    expires_in=3600  # 1 hour
                )
                print(f"Secure URL: {secure_url}")

asyncio.run(setup_hybrid_storage())
```
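The `enable_deduplication` and `calculate_hashes` options rely on content hashing. How the library derives its keys internally is not shown here, but the idea can be sketched with the standard library alone; the `store_once` helper below is purely illustrative.

```python
import hashlib
from typing import Callable, Dict

def attachment_dedup_key(content: bytes) -> str:
    """SHA-256 digest of the raw attachment bytes, usable as a deduplication key."""
    return hashlib.sha256(content).hexdigest()

_seen: Dict[str, str] = {}  # digest -> storage location

def store_once(content: bytes, upload: Callable[[bytes], str]) -> str:
    """Upload the first copy of an attachment and reuse its location afterwards."""
    key = attachment_dedup_key(content)
    if key not in _seen:
        _seen[key] = upload(content)
    return _seen[key]
```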
### Custom Protocol Adapter

```python
from evolvishub_outlook_ingestor.protocols import BaseProtocol
from evolvishub_outlook_ingestor.core.data_models import EmailMessage

class CustomProtocol(BaseProtocol):
    async def _fetch_emails_impl(self, **kwargs):
        # Implement custom email fetching logic
        emails = []
        # ... fetch emails from custom source
        return emails

# Use custom protocol
# (`OutlookIngestor`, `settings`, and `config` are set up as in the Quick Start section)
ingestor = OutlookIngestor(
    settings=settings,
    protocol_adapters={"custom": CustomProtocol("custom", config)}
)
```

### Batch Processing with Progress Tracking

```python
from evolvishub_outlook_ingestor.core.data_models import BatchProcessingConfig

async def process_with_progress():
    def progress_callback(processed, total, rate):
        print(f"Progress: {processed}/{total} ({rate:.2f} emails/sec)")

    batch_config = BatchProcessingConfig(
        batch_size=500,
        max_workers=8,
        progress_callback=progress_callback
    )

    # `ingestor` is an initialized OutlookIngestor instance
    result = await ingestor.process_emails(
        protocol="exchange",
        database="mongodb",
        batch_config=batch_config
    )
```

### Database Transactions

```python
from evolvishub_outlook_ingestor.connectors import PostgreSQLConnector

async def transactional_processing():
    # `config` and `emails` are assumed to be provided by the surrounding application
    connector = PostgreSQLConnector("postgres", config)
    await connector.initialize()

    async with connector.transaction() as tx:
        # All operations within this block are transactional
        for email in emails:
            await connector.store_email(email, transaction=tx)
        # Automatically commits on success, rolls back on error
```
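The `retry_attempts` and `retry_delay` settings in the processing configuration suggest wrapping transactional writes in a retry loop. The following is a generic sketch of that pattern with exponential backoff, built on the transaction API from the example above; the helper itself is not part of the library.

```python
import asyncio

async def store_batch_with_retry(connector, emails, attempts: int = 3, delay: float = 1.0) -> None:
    """Retry a transactional batch insert with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            async with connector.transaction() as tx:
                for email in emails:
                    await connector.store_email(email, transaction=tx)
            return  # committed successfully
        except Exception:
            if attempt == attempts:
                raise  # out of retries; propagate the original error
            await asyncio.sleep(delay * 2 ** (attempt - 1))
```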
## Architecture

### Component Overview

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│    Protocols    │     │    Processors    │     │    Connectors   │
│                 │     │                  │     │                 │
│ • Exchange EWS  │────▶│ • Email Proc.    │────▶│ • PostgreSQL    │
│ • Graph API     │     │ • Attachment     │     │ • MongoDB       │
│ • IMAP/POP3     │     │ • Batch Proc.    │     │ • MySQL         │
└─────────────────┘     └──────────────────┘     └─────────────────┘
         │                       │                        │
         └───────────────────────┼────────────────────────┘
                                 │
                      ┌─────────────────────┐
                      │   Core Framework    │
                      │                     │
                      │ • Configuration     │
                      │ • Logging           │
                      │ • Error Handling    │
                      │ • Retry Logic       │
                      │ • Metrics           │
                      └─────────────────────┘
```

### Design Patterns

- **Strategy Pattern**: Interchangeable protocol adapters
- **Factory Pattern**: Dynamic component creation
- **Repository Pattern**: Database abstraction
- **Observer Pattern**: Progress and metrics tracking
- **Circuit Breaker**: Fault tolerance

## Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=evolvishub_outlook_ingestor --cov-report=html

# Run specific test categories
pytest -m unit          # Unit tests only
pytest -m integration   # Integration tests only
pytest -m performance   # Performance tests only

# Run tests in parallel
pytest -n auto
```

## Monitoring

### Metrics Collection

```python
# Enable Prometheus metrics
settings.monitoring.enable_metrics = True
settings.monitoring.metrics_port = 8000

# Access metrics endpoint
# http://localhost:8000/metrics
```

### Health Checks

```python
# Check component health
status = await ingestor.get_status()
print(f"Protocol Status: {status['protocols']}")
print(f"Database Status: {status['database']}")
```
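A lightweight watchdog can be built on top of `get_status()`. The loop below is a sketch: the component names and the exact shape of the status payload are assumptions, so adapt the check to what your deployment actually reports.

```python
import asyncio
import logging

logger = logging.getLogger("outlook_ingestor.health")

async def watch_health(ingestor, interval: float = 30.0) -> None:
    """Periodically poll component status and log anything that looks unhealthy."""
    while True:
        status = await ingestor.get_status()
        for component, state in status.items():
            if state in (None, False, "unhealthy", "disconnected"):  # assumed failure markers
                logger.warning("component %s reported %s", component, state)
        await asyncio.sleep(interval)
```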
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Development Setup

```bash
# Clone repository
git clone https://github.com/evolvisai/metcal.git
cd metcal/shared/libs/evolvis-outlook-ingestor

# Install development dependencies
pip install -e ".[dev]"

# Run pre-commit hooks
pre-commit install

# Run tests
pytest
```

## API Reference

### Core Components

**Protocol Adapters**
- `GraphAPIAdapter`: Microsoft Graph API integration
- `ExchangeWebServicesAdapter`: Exchange Web Services (EWS) support
- `IMAPAdapter`: IMAP protocol support

**Database Connectors**
- `PostgreSQLConnector`: PostgreSQL database integration
- `MongoDBConnector`: MongoDB database integration

**Data Processors**
- `EmailProcessor`: Email content processing and normalization
- `AttachmentProcessor`: Attachment handling and security scanning

**Security Utilities**
- `SecureCredentialManager`: Credential encryption and management
- `CredentialMasker`: Sensitive data masking for logs
- `InputSanitizer`: Input validation and sanitization

### Configuration Reference

```python
# Complete configuration example
config = {
    "graph_api": {
        "client_id": "your_client_id",
        "client_secret": "your_client_secret",
        "tenant_id": "your_tenant_id",
        "rate_limit": 100,  # requests per minute
        "timeout": 30,      # request timeout in seconds
    },
    "database": {
        "host": "localhost",
        "port": 5432,
        "database": "outlook_ingestor",
        "username": "ingestor_user",
        "password": "secure_password",
        "ssl_mode": "require",
        "enable_connection_pooling": True,
        "pool_size": 10,
    },
    "email_processing": {
        "normalize_content": True,
        "extract_links": True,
        "validate_addresses": True,
        "html_to_text": True,
        "remove_duplicates": True,
    },
    "attachment_processing": {
        "max_attachment_size": 10 * 1024 * 1024,  # 10MB
        "extract_metadata": True,
        "calculate_hashes": True,
        "scan_for_viruses": False,
    }
}
```

### Error Handling

```python
from evolvishub_outlook_ingestor.core.exceptions import (
    ConnectionError,
    AuthenticationError,
    DatabaseError,
    ProcessingError,
    ValidationError,
)

try:
    await protocol.fetch_emails()
except AuthenticationError:
    # Handle authentication issues
    print("Check your API credentials")
except ConnectionError:
    # Handle network/connection issues
    print("Check network connectivity")
except ProcessingError as e:
    # Handle processing errors
    print(f"Processing failed: {e}")
```

## Support and Documentation

### Documentation Resources
- **[Storage Architecture Guide](docs/STORAGE_ARCHITECTURE.md)** - Comprehensive guide to hybrid storage configuration
- **[Migration Guide](docs/MIGRATION_GUIDE.md)** - Step-by-step migration from basic to hybrid storage
- **[API Reference](docs/API_REFERENCE.md)** - Complete API documentation with examples
- **[Performance Tuning](docs/PERFORMANCE_TUNING.md)** - Optimization guidelines for large-scale deployments

### Community and Support
- **[GitHub Issues](https://github.com/evolvisai/metcal/issues)** - Bug reports and feature requests
- **[GitHub Discussions](https://github.com/evolvisai/metcal/discussions)** - Community discussions and Q&A
- **[Examples Directory](examples/)** - Comprehensive usage examples and tutorials

### Enterprise Support
For enterprise deployments requiring dedicated support, custom integrations, or professional services, please contact our team for tailored solutions and SLA-backed support options.

## Technical Specifications

### Supported Platforms
- **Operating Systems**: Linux (Ubuntu 18.04+, CentOS 7+), macOS (10.15+), Windows (10+)
- **Python Versions**: 3.9, 3.10, 3.11, 3.12
- **Database Systems**: PostgreSQL 12+, MongoDB 4.4+, MySQL 8.0+
- **Object Storage**: MinIO, AWS S3, Azure Blob Storage, Google Cloud Storage

### Performance Characteristics
- **Throughput**: Up to 2,000 emails/minute with hybrid storage
- **Concurrency**: Support for 50+ concurrent processing workers
- **Memory Efficiency**: <10KB per email with the object storage strategy
- **Storage Optimization**: Up to 70% reduction in database size with intelligent routing

### Security Compliance
- **Encryption**: AES-256 encryption for credentials and sensitive data
- **Authentication**: OAuth2, Basic Auth, and certificate-based authentication
- **Access Control**: Role-based access control and audit logging
- **Compliance**: GDPR, HIPAA, and SOX compliance features available

## Acknowledgments

This project is built on top of excellent open-source technologies:

- **[Pydantic](https://pydantic.dev/)** - Data validation and settings management
- **[SQLAlchemy](https://sqlalchemy.org/)** - Database ORM with async support
- **[asyncio](https://docs.python.org/3/library/asyncio.html)** - Asynchronous programming framework
- **[pytest](https://pytest.org/)** - Testing framework with async support
- **[Black](https://black.readthedocs.io/)**, **[isort](https://pycqa.github.io/isort/)**, **[mypy](https://mypy.readthedocs.io/)** - Code quality and type checking tools

## License

### Evolvis AI License

This software is proprietary to **Evolvis AI** and is protected by copyright and other intellectual property laws.

#### License Terms

- **Evaluation and Non-Commercial Use**: This package is available for evaluation, research, and non-commercial use
- **Commercial Use Restrictions**: Commercial or production use of this library requires a valid Evolvis AI License
- **Redistribution Prohibited**: Redistribution or commercial use without proper licensing is strictly prohibited

#### Commercial Licensing

For commercial licensing, production deployments, or enterprise use, please contact:

**Montgomery Miralles**
**Email**: [m.miralles@evolvis.ai](mailto:m.miralles@evolvis.ai)
**Company**: Evolvis AI
**Website**: [https://evolvis.ai](https://evolvis.ai)

#### Important Notice

> **Commercial users must obtain proper licensing before deploying this software in production environments.** Unauthorized commercial use may result in legal action. Contact Montgomery Miralles for licensing agreements and compliance requirements.

#### Full License

For complete license terms and conditions, see the [LICENSE](LICENSE) file included with this distribution.

---

**Evolvishub Outlook Ingestor** - Enterprise-grade email ingestion with intelligent hybrid storage architecture.
"bugtrack_url": null,
"license": "Evolvis AI License",
"summary": "Production-ready, secure email ingestion system for Microsoft Outlook with advanced processing, monitoring, and database integration",
"version": "1.0.2",
"project_urls": {
"Changelog": "https://github.com/evolvisai/metcal/blob/main/shared/libs/evolvis-outlook-ingestor/CHANGELOG.md",
"Documentation": "https://github.com/evolvisai/metcal/tree/main/shared/libs/evolvis-outlook-ingestor/docs",
"Examples": "https://github.com/evolvisai/metcal/tree/main/shared/libs/evolvis-outlook-ingestor/examples",
"Homepage": "https://github.com/evolvisai/metcal",
"Issues": "https://github.com/evolvisai/metcal/issues",
"Repository": "https://github.com/evolvisai/metcal.git"
},
"split_keywords": [
"outlook",
" email",
" ingestion",
" exchange",
" graph-api",
" imap",
" pop3",
" database",
" async",
" batch-processing",
" security",
" monitoring",
" performance",
" postgresql",
" mongodb",
" enterprise"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "1e324b6fb96f75e6fea90edcac84d329e66b4cc8700dcbcd6a5057781ecc6b61",
"md5": "b02777a55c545ee5d395ad37b0c18e66",
"sha256": "690bd9d04c272ff4771628fb590a6e3ead227ef78575b08c7130820a6b0c652c"
},
"downloads": -1,
"filename": "evolvishub_outlook_ingestor-1.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b02777a55c545ee5d395ad37b0c18e66",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 153970,
"upload_time": "2025-10-06T22:41:10",
"upload_time_iso_8601": "2025-10-06T22:41:10.933571Z",
"url": "https://files.pythonhosted.org/packages/1e/32/4b6fb96f75e6fea90edcac84d329e66b4cc8700dcbcd6a5057781ecc6b61/evolvishub_outlook_ingestor-1.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "88f22dfa77bd7b374a68007e712f301082248b9dd8763768eed9fa44c0c0beff",
"md5": "d885fa67eeb8c38e608d3fa773c2c55e",
"sha256": "dcc5248cffd7d4c5208b4f940e6ec5238c4244de4c2e188e223ed0f67ca4aff2"
},
"downloads": -1,
"filename": "evolvishub_outlook_ingestor-1.0.2.tar.gz",
"has_sig": false,
"md5_digest": "d885fa67eeb8c38e608d3fa773c2c55e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 185062,
"upload_time": "2025-10-06T22:41:12",
"upload_time_iso_8601": "2025-10-06T22:41:12.580331Z",
"url": "https://files.pythonhosted.org/packages/88/f2/2dfa77bd7b374a68007e712f301082248b9dd8763768eed9fa44c0c0beff/evolvishub_outlook_ingestor-1.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-06 22:41:12",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "evolvisai",
"github_project": "metcal",
"github_not_found": true,
"lcname": "evolvishub-outlook-ingestor"
}