# Splurge LazyFrame Comparison Framework
A comprehensive Python framework for comparing two Polars LazyFrames with configurable schemas, primary keys, and column mappings. The framework identifies value differences and missing records, and produces detailed comparison reports.
## Features
- **Service-oriented architecture** - Modular design with separate services for comparison, validation, reporting, and data preparation
- **Multi-column primary key support** - Compare datasets using composite keys
- **Flexible schema definition** - Define friendly names and data types for columns using either direct Polars datatypes or human-readable string names
- **Column-to-column mapping** - Map columns between datasets with different naming conventions
- **Three core comparison patterns**:
- Value differences (same keys, different values)
- Left-only records (records only in left dataset)
- Right-only records (records only in right dataset)
- **Comprehensive reporting** - Generate summary and detailed reports with multiple table formats
- **Data validation** - Built-in data quality validation capabilities
- **Type-safe implementation** - Full type annotations and validation
- **Performance optimized** - Leverages Polars' lazy evaluation for memory efficiency
- **Export capabilities** - Export results to CSV, Parquet, and JSON formats
- **Auto-configuration** - Automatically generate comparison configurations from LazyFrames with identical schemas
- **Configuration management** - Load, save, and manage comparison configurations with environment variable support
- **Production-ready logging** - Structured logging with Python's logging module, configurable log levels, and performance monitoring
- **Error handling** - Robust exception handling with custom exceptions and graceful error recovery
## Installation
```bash
pip install splurge-lazyframe-compare
```
## Quick Start
```python
import polars as pl
from datetime import date
from splurge_lazyframe_compare import (
    LazyFrameComparator,
    ComparisonConfig,
    ComparisonSchema,
    ColumnDefinition,
    ColumnMapping,
)

# Define schemas (demonstrating mixed datatype usage)
left_schema = ComparisonSchema(
    columns={
        "customer_id": ColumnDefinition("customer_id", "Customer ID", "Int64", False),  # String name
        "order_date": ColumnDefinition("order_date", "Order Date", pl.Date, False),  # Direct datatype
        "amount": ColumnDefinition("amount", "Order Amount", "Float64", False),  # String name
        "status": ColumnDefinition("status", "Order Status", pl.Utf8, True),  # Direct datatype
    },
    pk_columns=["customer_id", "order_date"]
)

right_schema = ComparisonSchema(
    columns={
        "cust_id": ColumnDefinition("cust_id", "Customer ID", "Int64", False),  # String name
        "order_dt": ColumnDefinition("order_dt", "Order Date", "Date", False),  # String name
        "total_amount": ColumnDefinition("total_amount", "Order Amount", pl.Float64, False),  # Direct datatype
        "order_status": ColumnDefinition("order_status", "Order Status", "String", True),  # String name
    },
    pk_columns=["cust_id", "order_dt"]
)

# Define column mappings
mappings = [
    ColumnMapping(left="customer_id", right="cust_id", name="customer_id"),
    ColumnMapping(left="order_date", right="order_dt", name="order_date"),
    ColumnMapping(left="amount", right="total_amount", name="amount"),
    ColumnMapping(left="status", right="order_status", name="status"),
]

# Create configuration
config = ComparisonConfig(
    left_schema=left_schema,
    right_schema=right_schema,
    column_mappings=mappings,
    pk_columns=["customer_id", "order_date"]
)
# Create sample data
left_data = {
    "customer_id": [1, 2, 3],
    "order_date": [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)],
    "amount": [100.0, 200.0, 300.0],
    "status": ["pending", "completed", "pending"],
}

right_data = {
    "cust_id": [1, 2, 3],
    "order_dt": [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)],
    "total_amount": [100.0, 250.0, 300.0],  # Different amount for customer 2
    "order_status": ["pending", "completed", "cancelled"],  # Different status for customer 3
}
left_df = pl.LazyFrame(left_data)
right_df = pl.LazyFrame(right_data)
# Execute comparison
comparator = LazyFrameComparator(config)
results = comparator.compare(left=left_df, right=right_df)
# Generate report using ReportingService
from splurge_lazyframe_compare import ReportingService
reporter = ReportingService()
summary_report = reporter.generate_summary_report(results=results)
print(summary_report)
```
## Auto-Configuration from LazyFrames
For cases where your LazyFrames have identical column names and you want to quickly set up a comparison, you can use the automatic configuration generator:
```python
from splurge_lazyframe_compare.utils import create_comparison_config_from_lazyframes
# Your LazyFrames with identical column names
left_data = {
"customer_id": [1, 2, 3, 4, 5],
"name": ["Alice", "Bob", "Charlie", "David", "Eve"],
"email": ["alice@example.com", "bob@example.com", "charlie@example.com", "david@example.com", "eve@example.com"],
"balance": [100.50, 250.00, 75.25, 300.00, 150.75],
"active": [True, True, False, True, True]
}
right_data = {
"customer_id": [1, 2, 3, 4, 6], # ID 5 missing, ID 6 added
"name": ["Alice", "Bob", "Charlie", "Dave", "Frank"], # David -> Dave, Eve -> Frank
"email": ["alice@example.com", "bob@example.com", "charlie@example.com", "dave@example.com", "frank@example.com"],
"balance": [100.50, 250.00, 75.25, 320.00, 200.00], # David's balance changed
"active": [True, True, False, True, False] # Frank is inactive
}
left_df = pl.LazyFrame(left_data)
right_df = pl.LazyFrame(right_data)
# Specify primary key columns
primary_keys = ["customer_id"]
# Generate ComparisonConfig automatically (keyword-only parameters)
config = create_comparison_config_from_lazyframes(
left=left_df,
right=right_df,
pk_columns=primary_keys
)
print(f"Auto-generated config with {len(config.column_mappings)} column mappings")
print(f"Primary key columns: {config.pk_columns}")
# Use immediately for comparison
from splurge_lazyframe_compare.services.comparison_service import ComparisonService
comparison_service = ComparisonService()
results = comparison_service.execute_comparison(
left=left_df,
right=right_df,
config=config
)
print(f"Comparison completed - found {results.summary.value_differences_count} differences")
```
**Key Benefits:**
- **No Manual Configuration**: Eliminates need for manual schema and mapping definition
- **Type Safety**: Automatically infers Polars data types from your LazyFrames
- **Error Prevention**: Validates primary key columns exist before comparison
- **Keyword-Only API**: Explicit parameter names prevent argument order errors
- **Ready-to-Use**: Generated config works immediately with all comparison services
## Advanced Usage
### Numeric Tolerance
```python
# Allow small differences in numeric columns
config = ComparisonConfig(
left_schema=left_schema,
right_schema=right_schema,
column_mappings=mappings,
pk_columns=["customer_id", "order_date"],
tolerance={"amount": 0.01} # Allow 1 cent difference
)
```
### Case-Insensitive String Comparison
```python
# Ignore case in string comparisons
config = ComparisonConfig(
left_schema=left_schema,
right_schema=right_schema,
column_mappings=mappings,
pk_columns=["customer_id", "order_date"],
ignore_case=True
)
```
### Null Value Handling
```python
# Configure null value comparison behavior
config = ComparisonConfig(
left_schema=left_schema,
right_schema=right_schema,
column_mappings=mappings,
pk_columns=["customer_id", "order_date"],
null_equals_null=True # Treat null values as equal
)
```
### Export Results
```python
# Export comparison results to files using ReportingService
reporter = ReportingService()
exported_files = reporter.export_results(
results=results,
format="csv",
output_dir="./comparison_results"
)
print(f"Exported files: {list(exported_files.keys())}")
# Generate detailed report with different table formats
detailed_report = reporter.generate_detailed_report(
results=results,
max_samples=5
)
print(detailed_report)
# Generate summary table in different formats
summary_grid = reporter.generate_summary_table(
results=results,
table_format="grid"
)
print(summary_grid)
summary_pipe = reporter.generate_summary_table(
results=results,
table_format="pipe"
)
print(summary_pipe)
```
## Configuration Management
The framework provides comprehensive configuration management utilities for loading, saving, and manipulating comparison configurations:
### Configuration Files
```python
from splurge_lazyframe_compare.utils.config_helpers import (
load_config_from_file,
save_config_to_file,
create_default_config,
validate_config
)
# Create a default configuration template
default_config = create_default_config()
print("Default config keys:", list(default_config.keys()))
# Save configuration to file
save_config_to_file(default_config, "my_comparison_config.json")
# Load configuration from file
loaded_config = load_config_from_file("my_comparison_config.json")
# Validate configuration
validation_errors = validate_config(loaded_config)
if validation_errors:
    print("Configuration errors:", validation_errors)
else:
    print("Configuration is valid")
```
### Environment Variable Configuration
```python
from splurge_lazyframe_compare.utils.config_helpers import (
get_env_config,
apply_environment_overrides,
merge_configs
)
# Load configuration from environment variables (prefixed with SPLURGE_)
env_config = get_env_config()
print("Environment config:", env_config)
# Apply environment overrides to existing config
final_config = apply_environment_overrides(default_config)
# Merge multiple configuration sources
custom_config = {"comparison": {"max_samples": 100}}
merged_config = merge_configs(default_config, custom_config)
```
### Configuration Utilities
```python
from splurge_lazyframe_compare.utils.config_helpers import (
get_config_value,
create_config_from_dataframes
)
# Get nested configuration values
max_samples = get_config_value(merged_config, "reporting.max_samples", default_value=10)
# Create configuration from existing DataFrames
basic_config = create_config_from_dataframes(
left_df=left_df,
right_df=right_df,
primary_keys=["customer_id"],
auto_map_columns=True
)
```
## Service Architecture
The framework follows a modular service-oriented architecture with clear separation of concerns:
### Core Services
#### `ComparisonOrchestrator`
Main entry point that coordinates all comparison activities and manages service dependencies.
#### `ComparisonService`
Handles the core comparison logic, schema validation, and result generation.
#### `ValidationService`
Provides comprehensive data quality validation including schema validation, primary key checks, data type validation, and completeness checks.
#### `ReportingService`
Generates human-readable reports in multiple formats and handles data export functionality.
#### `DataPreparationService`
Manages data preprocessing, column mapping, and schema transformations.
### Logging Utilities
#### `get_logger(name: str)`
Factory function to create configured loggers with proper naming hierarchy.
#### `performance_monitor(service_name: str, operation: str)`
Context manager for automatic performance monitoring and logging.
#### `log_service_initialization(service_name: str, config: dict = None)`
Logs service initialization with optional configuration details.
#### `log_service_operation(service_name: str, operation: str, status: str, message: str = None)`
Logs service operations with status and optional details.
#### `log_performance(service_name: str, operation: str, duration_ms: float, details: dict = None)`
Logs performance metrics with automatic slow operation detection.
### Service Usage Pattern
```python
from splurge_lazyframe_compare import ComparisonOrchestrator
# Services are automatically managed by the orchestrator
orchestrator = ComparisonOrchestrator()
results = orchestrator.compare_dataframes(
config=comparison_config,
left=left_df,
right=right_df
)
```
### Individual Service Usage
```python
from splurge_lazyframe_compare import (
ValidationService,
ReportingService
)
# Use validation service independently
validator = ValidationService()
validation_result = validator.validate_dataframe_schema(
df=df,
expected_schema=comparison_schema
)
# Use reporting service independently
reporter = ReportingService()
summary_report = reporter.generate_summary_report(results=results)
```
## Logging and Monitoring
The framework includes comprehensive logging and monitoring capabilities using Python's standard logging module:
### Logger Configuration
```python
from splurge_lazyframe_compare.utils.logging_helpers import (
get_logger,
configure_logging,
)
# Configure logging at application startup (no side-effects on import)
configure_logging(level="INFO", fmt='[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s')
# Get a logger for your module
logger = get_logger(__name__)
```
### Performance Monitoring
```python
from splurge_lazyframe_compare.utils.logging_helpers import performance_monitor
# Automatically log performance metrics
with performance_monitor("ComparisonService", "find_differences") as ctx:
    result = perform_comparison()  # Your actual operation here
    # Add custom metrics (result may not always support len())
    if hasattr(result, "__len__"):
        ctx["records_processed"] = len(result)
    ctx["operation_status"] = "completed"
```
### Service Logging
```python
from splurge_lazyframe_compare.utils.logging_helpers import (
log_service_initialization,
log_service_operation
)
# Log service initialization
log_service_initialization("ComparisonService", {"version": "1.0"})
# Log service operations
log_service_operation("ComparisonService", "compare", "success", "Comparison completed")
```
### Log Output Format
```
[2025-01-29 10:30:45,123] [INFO] [splurge_lazyframe_compare.ComparisonService] [initialization] Service initialized successfully Details: config={'version': '1.0'}
[2025-01-29 10:30:45,234] [WARNING] [splurge_lazyframe_compare.ComparisonService] [find_differences] SLOW OPERATION: 150.50ms (150.50ms) Details: records=1000
[2025-01-29 10:30:45,345] [ERROR] [splurge_lazyframe_compare.ValidationService] [validate_schema] Schema validation failed: Invalid column type
```
### Interpreting Validation Errors
- `SchemaValidationError`: indicates schema mismatches such as missing columns, wrong dtypes, nullability violations, or unmapped primary keys. The exception message names the offending columns or dtype mismatches, so inspect it for actionable details.
- `PrimaryKeyViolationError`: raised when primary key constraints are violated, typically because key values are duplicated or key columns are missing from a mapping. Ensure all primary key columns exist and are unique in the input LazyFrames.
Services preserve exception types and original tracebacks; messages include the service name and operation context for clarity.
## CLI Usage
The package provides a `slc` CLI.
```bash
slc --help
slc compare --dry-run
slc report --dry-run
slc export --dry-run
```
The dry-run subcommands validate inputs and demonstrate execution without running a full comparison.
### CLI Errors and Exit Codes
- The CLI surfaces domain errors using custom exceptions and stable exit codes:
- Configuration issues (e.g., invalid JSON, failed validation) raise `ConfigError` and exit with code `2`.
- Data source issues (e.g., missing files, unsupported extensions) raise `DataSourceError` and exit with code `2`.
- Unexpected errors exit with code `1` and include a brief message; enable debug logging for full tracebacks.
Example messages:
```
Configuration error: Invalid configuration: <details>
Compare failed: Unsupported file extension: .txt
Export failed: Data file not found: <path>
```
## Large-data Export Tips
- Prefer the Parquet format for performance and compression; CSV is human-readable but slower for large datasets.
- Ensure sufficient temporary disk space when exporting large LazyFrames, since Parquet writes may buffer data.
- JSON summaries are exported as a compact, versioned envelope:
```json
{
"schema_version": "1.0",
"summary": {
"total_left_records": 123,
"total_right_records": 123,
"matching_records": 120,
"value_differences_count": 3,
"left_only_count": 0,
"right_only_count": 0,
"comparison_timestamp": "2025-01-01T00:00:00"
}
}
```
### Log Levels
- **DEBUG**: Detailed debugging information and performance metrics
- **INFO**: General information about operations and service lifecycle
- **WARNING**: Warning conditions that don't prevent operation
- **ERROR**: Error conditions that may affect functionality
## API Reference
### Core Classes
#### `ComparisonSchema`
Defines the structure and constraints for a dataset.
```python
schema = ComparisonSchema(
columns={
"id": ColumnDefinition("id", "ID", pl.Int64, False),
"name": ColumnDefinition("name", "Name", pl.Utf8, True),
},
pk_columns=["id"]
)
```
#### `ColumnDefinition`
Defines a column with metadata for comparison.
**Using Direct Polars Datatypes:**
```python
col_def = ColumnDefinition(
name="customer_id",
alias="Customer ID",
datatype=pl.Int64,
nullable=False
)
```
**Using String Datatype Names (Recommended):**
```python
# More readable and user-friendly
col_def = ColumnDefinition(
name="customer_id",
alias="Customer ID",
datatype="Int64", # String name instead of pl.Int64
nullable=False
)
# Supports all Polars datatypes
complex_col = ColumnDefinition(
name="metadata",
alias="Metadata",
datatype="Struct", # Complex datatype as string
nullable=True
)
timestamp_col = ColumnDefinition(
name="created_at",
alias="Created At",
datatype="Datetime", # Automatically configured with defaults
nullable=False
)
```
**Mixed Usage in Schemas:**
```python
schema = ComparisonSchema(
columns={
# Mix string names and direct datatypes
"id": ColumnDefinition("id", "ID", "Int64", False), # String
"name": ColumnDefinition("name", "Name", pl.Utf8, True), # Direct
"created": ColumnDefinition("created", "Created", "Datetime", False), # String
"tags": ColumnDefinition("tags", "Tags", pl.List(pl.Utf8), True), # Direct
},
pk_columns=["id"]
)
```
**⚠️ Important Notes for Complex Types:**
- **List types** require an inner type: `pl.List(pl.Utf8)` ✅, not `pl.List` ❌
- **Struct types** require field definitions: `pl.Struct([])` ✅ or `pl.Struct({"field": pl.Utf8})` ✅, not `pl.Struct` ❌
- The framework will provide clear error messages if you accidentally use unparameterized complex types
#### `ColumnMapping`
Maps columns between left and right datasets.
```python
mapping = ColumnMapping(
left="customer_id",
right="cust_id",
name="customer_id"
)
```
#### `ComparisonConfig`
Configuration for comparing two datasets.
```python
config = ComparisonConfig(
left_schema=left_schema,
right_schema=right_schema,
column_mappings=mappings,
pk_columns=["customer_id", "order_date"],
ignore_case=False,
null_equals_null=True,
tolerance={"amount": 0.01}
)
```
#### `LazyFrameComparator`
Main comparison engine.
```python
comparator = LazyFrameComparator(config)
results = comparator.compare(left=left_df, right=right_df)
```
### Results and Reporting
#### `ComparisonResult`
Container for all comparison results.
```python
# Access summary statistics
print(f"Matching records: {results.summary.matching_records}")
print(f"Value differences: {results.summary.value_differences_count}")
print(f"Left-only records: {results.summary.left_only_count}")
print(f"Right-only records: {results.summary.right_only_count}")
# Access result DataFrames
value_diffs = results.value_differences.collect()
left_only = results.left_only_records.collect()
right_only = results.right_only_records.collect()
```
#### `ReportingService`
Generate human-readable reports with multiple formats.
```python
from splurge_lazyframe_compare import ReportingService
reporter = ReportingService()
# Summary report
summary_report = reporter.generate_summary_report(results=results)
print(summary_report)
# Detailed report with samples
detailed_report = reporter.generate_detailed_report(
results=results,
max_samples=10
)
print(detailed_report)
# Summary table in different formats
summary_grid = reporter.generate_summary_table(
results=results,
table_format="grid" # Options: grid, simple, pipe, orgtbl
)
print(summary_grid)
# Export results to files
exported_files = reporter.export_results(
results=results,
format="csv", # Options: csv, parquet, json
output_dir="./comparison_results"
)
```
#### `ComparisonOrchestrator` (Extended Methods)
```python
from splurge_lazyframe_compare.services.orchestrator import ComparisonOrchestrator
orchestrator = ComparisonOrchestrator()
# Get comparison summary as string
summary_str = orchestrator.get_comparison_summary(result=results)
# Get comparison table in various formats
table_str = orchestrator.get_comparison_table(
result=results,
table_format="grid" # Options: grid, simple, pipe, orgtbl
)
# Generate report from existing result
report = orchestrator.generate_report_from_result(
result=results,
report_type="detailed", # Options: summary, detailed, table
max_samples=10
)
```
### Configuration Management API
#### `create_comparison_config_from_lazyframes()`
```python
from splurge_lazyframe_compare.utils import create_comparison_config_from_lazyframes
# Auto-generate ComparisonConfig from LazyFrames
config = create_comparison_config_from_lazyframes(
left=left_lf, # Left LazyFrame
right=right_lf, # Right LazyFrame
pk_columns=["id"] # Primary key columns
)
# Returns: ComparisonConfig ready for use
```
#### Configuration File Operations
```python
from splurge_lazyframe_compare.utils.config_helpers import (
load_config_from_file,
save_config_to_file,
create_default_config,
validate_config
)
# Create default configuration template
config = create_default_config()
# Save to file
save_config_to_file(config, "comparison_config.json")
# Load from file
loaded_config = load_config_from_file("comparison_config.json")
# Validate configuration
errors = validate_config(loaded_config) # Returns list of error messages
```
#### Environment Configuration
```python
from splurge_lazyframe_compare.utils.config_helpers import (
get_env_config,
apply_environment_overrides,
merge_configs,
get_config_value
)
# Get configuration from environment variables (SPLURGE_ prefix)
env_config = get_env_config()
# Apply environment overrides to existing config
final_config = apply_environment_overrides(base_config)
# Merge multiple configurations
merged = merge_configs(base_config, override_config)
# Get nested configuration values
value = get_config_value(config, "reporting.max_samples", default_value=10)
```
#### DataFrame Configuration Generation
```python
from splurge_lazyframe_compare.utils.config_helpers import create_config_from_dataframes
# Generate configuration from DataFrame schemas
config = create_config_from_dataframes(
left_df=left_df,
right_df=right_df,
primary_keys=["customer_id"],
auto_map_columns=True # Auto-map columns with same names
)
```
### Utility Classes and Constants
#### `ConfigConstants`
```python
from splurge_lazyframe_compare.utils.config_helpers import ConfigConstants
# Configuration constants
prefix = ConfigConstants.ENV_PREFIX # "SPLURGE_"
config_file = ConfigConstants.DEFAULT_CONFIG_FILE # "comparison_config.json"
schema_file = ConfigConstants.DEFAULT_SCHEMA_FILE # "schemas.json"
# Valid policy values
null_policies = ConfigConstants.VALID_NULL_POLICIES # ["equals", "not_equals", "ignore"]
case_policies = ConfigConstants.VALID_CASE_POLICIES # ["sensitive", "insensitive", "preserve"]
```
## Data Quality Validation
The framework includes comprehensive data quality validation through the ValidationService:
```python
from splurge_lazyframe_compare import ValidationService
validator = ValidationService()
# Validate schema against expected structure
schema_result = validator.validate_dataframe_schema(
df=df,
expected_schema=comparison_schema
)
# Check primary key uniqueness
pk_result = validator.validate_primary_key_uniqueness(
df=df,
pk_columns=["customer_id", "order_date"]
)
# Validate data completeness
completeness_result = validator.validate_completeness(
df=df,
required_columns=["customer_id", "amount"]
)
# Validate data types
dtype_result = validator.validate_data_types(
df=df,
expected_types={"customer_id": pl.Int64, "amount": pl.Float64}
)
# Validate numeric ranges
range_result = validator.validate_numeric_ranges(
df=df,
column_ranges={"amount": {"min": 0, "max": 10000}}
)
# Validate string patterns (regex)
pattern_result = validator.validate_string_patterns(
df=df,
column_patterns={"email": r"^[^@]+@[^@]+\.[^@]+$"}
)
# Validate uniqueness constraints
uniqueness_result = validator.validate_uniqueness(
df=df,
unique_columns=["customer_id"]
)
# Access validation results
if schema_result.is_valid:
    print("Schema validation passed!")
else:
    print("Schema validation failed:")
    for error in schema_result.errors:
        print(f"  - {error}")
```
## Error Handling
The framework provides custom exceptions for different error scenarios:
```python
from splurge_lazyframe_compare.exceptions import (
    SchemaValidationError,
    PrimaryKeyViolationError,
    ColumnMappingError,
    ConfigError,
    DataSourceError,
)

try:
    results = comparator.compare(left=left_df, right=right_df)
except SchemaValidationError as e:
    print(f"Schema validation failed: {e}")
    if hasattr(e, "validation_errors"):
        for error in e.validation_errors:
            print(f"  - {error}")
except PrimaryKeyViolationError as e:
    print(f"Primary key violation: {e}")
    if hasattr(e, "duplicate_keys"):
        print(f"Duplicate keys found: {e.duplicate_keys}")
except ColumnMappingError as e:
    print(f"Column mapping error: {e}")
    if hasattr(e, "mapping_errors"):
        for error in e.mapping_errors:
            print(f"  - {error}")
except ConfigError as e:
    print(f"Configuration error: {e}")
except DataSourceError as e:
    print(f"Data source error: {e}")
```
## Examples
See the `examples/` directory for comprehensive working examples demonstrating all framework capabilities:
### Core Usage Examples
- **`basic_comparison_example.py`** - Basic usage demonstration with schema definition, column mapping, and report generation
- **`service_example.py`** - Service-oriented architecture patterns and dependency injection
- **`auto_config_example.py`** - Automatic configuration generation from LazyFrames (NEW)
### Performance Examples
- **`performance_comparison_example.py`** - Performance benchmarking with large datasets (100K+ records)
- **`detailed_performance_benchmark.py`** - Comprehensive performance analysis with statistical reporting
### Reporting Examples
- **`tabulated_report_example.py`** - Multiple table formats (grid, simple, pipe, orgtbl) and export functionality
### Running Examples
```bash
# Basic comparison
python examples/basic_comparison_example.py
# Auto-configuration example (NEW)
python examples/auto_config_example.py
# Service architecture
python examples/service_example.py
# Performance testing
python examples/performance_comparison_example.py
# Detailed benchmarking
python examples/detailed_performance_benchmark.py
# Tabulated reporting
python examples/tabulated_report_example.py
```
Each example includes comprehensive documentation and demonstrates real-world usage patterns.
## Performance Characteristics
Based on the included performance benchmarks, the framework demonstrates excellent performance:
### Key Metrics
- **Processing speed**: 2.2M+ records/second on large datasets
- **Memory efficiency**: Lazy evaluation avoids materializing full datasets in memory
- **Scalability**: Per-record throughput improves on larger datasets as fixed overheads amortize
- **Time per record**: ~0.0004 ms/record at peak throughput
### Benchmark Results Summary
- **1K-5K records**: 118K-475K records/second
- **10K-25K records**: 841K-1.4M records/second
- **50K+ records**: 2.2M+ records/second
The framework leverages Polars' lazy evaluation and vectorized operations for optimal performance across all dataset sizes.
## Development
### Running Tests
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=splurge_lazyframe_compare
# Run specific test file
pytest tests/test_comparator.py
```
### Logging in Development
The framework uses structured logging throughout. During development, you can enable debug logging to see detailed information:
```python
import logging
# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
# Now all framework operations will be logged with detailed information
```
### Code Quality
```bash
# Run linting
ruff check .
# Run formatting
ruff format .
# Type checking (if mypy is configured)
mypy splurge_lazyframe_compare
```
### Recent Improvements
- **Production-ready logging**: Replaced all `print()` statements with proper Python logging module
- **Structured logging**: Consistent log format with timestamps, service names, and operation context
- **Performance monitoring**: Built-in performance tracking and slow operation detection
- **Error handling**: Robust exception handling with custom exceptions and graceful recovery
- **Polars integration**: Seamless integration with Polars LazyFrames for efficient data processing
- **Improved error messages**: Clear guidance for proper usage of complex data types
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request
## License
MIT License - see LICENSE file for details.
## Requirements
- **Python**: 3.10 or higher
- **Polars**: >= 1.32.0 (core data processing)
- **tabulate**: >= 0.9.0 (table formatting for reports)
- **logging** and **typing**: Python standard library (no additional installation required)
### Optional Dependencies
- Additional packages may be required for specific export formats or advanced features
### Installation Options
```bash
# Install with all dependencies
pip install splurge-lazyframe-compare
# Install in development mode
pip install -e .
```
## Changelog
### 2025.2.0 (2025-09-03)
- Added domain exceptions at CLI boundary: `ConfigError`, `DataSourceError`.
- Standardized CLI exit codes: `2` for domain errors, `1` for unexpected errors.
- CLI now catches `ComparisonError` first, preserving clear user-facing messages.
- Services error handling preserves exception type and chains the original cause; messages now include service name and context.
- Documentation updates: README now documents CLI errors/exit codes and new exceptions.
- New CLI capabilities and flags: `compare`, `report`, `export`, `--dry-run`, `--format`, `--output-dir`, `--log-level`.
- Logging improvements: introduced `configure_logging()`; removed import-time handler side-effects; consistent log formatting.
- Packaging/entrypoints: ensured `slc` console script and `__main__.py` entry for `python -m splurge_lazyframe_compare`.
- Docs: added CLI usage, logging configuration, and large-data export guidance.
### 2025.1.1 (2025-08-29)
- Removed extraneous folders and plan documents.
### 2025.1.0 (2025-08-29)
- Initial Commit
\"balance\": [100.50, 250.00, 75.25, 320.00, 200.00], # David's balance changed\r\n \"active\": [True, True, False, True, False] # Frank is inactive\r\n}\r\n\r\nleft_df = pl.LazyFrame(left_data)\r\nright_df = pl.LazyFrame(right_data)\r\n\r\n# Specify primary key columns\r\nprimary_keys = [\"customer_id\"]\r\n\r\n# Generate ComparisonConfig automatically (keyword-only parameters)\r\nconfig = create_comparison_config_from_lazyframes(\r\n left=left_df,\r\n right=right_df,\r\n pk_columns=primary_keys\r\n)\r\n\r\nprint(f\"Auto-generated config with {len(config.column_mappings)} column mappings\")\r\nprint(f\"Primary key columns: {config.pk_columns}\")\r\n\r\n# Use immediately for comparison\r\nfrom splurge_lazyframe_compare.services.comparison_service import ComparisonService\r\n\r\ncomparison_service = ComparisonService()\r\nresults = comparison_service.execute_comparison(\r\n left=left_df,\r\n right=right_df,\r\n config=config\r\n)\r\n\r\nprint(f\"Comparison completed - found {results.summary.value_differences_count} differences\")\r\n```\r\n\r\n**Key Benefits:**\r\n- **No Manual Configuration**: Eliminates need for manual schema and mapping definition\r\n- **Type Safety**: Automatically infers Polars data types from your LazyFrames\r\n- **Error Prevention**: Validates primary key columns exist before comparison\r\n- **Keyword-Only API**: Explicit parameter names prevent argument order errors\r\n- **Ready-to-Use**: Generated config works immediately with all comparison services\r\n\r\n## Advanced Usage\r\n\r\n### Numeric Tolerance\r\n\r\n```python\r\n# Allow small differences in numeric columns\r\nconfig = ComparisonConfig(\r\n left_schema=left_schema,\r\n right_schema=right_schema,\r\n column_mappings=mappings,\r\n pk_columns=[\"customer_id\", \"order_date\"],\r\n tolerance={\"amount\": 0.01} # Allow 1 cent difference\r\n)\r\n```\r\n\r\n### Case-Insensitive String Comparison\r\n\r\n```python\r\n# Ignore case in string comparisons\r\nconfig = ComparisonConfig(\r\n 
left_schema=left_schema,\r\n right_schema=right_schema,\r\n column_mappings=mappings,\r\n pk_columns=[\"customer_id\", \"order_date\"],\r\n ignore_case=True\r\n)\r\n```\r\n\r\n### Null Value Handling\r\n\r\n```python\r\n# Configure null value comparison behavior\r\nconfig = ComparisonConfig(\r\n left_schema=left_schema,\r\n right_schema=right_schema,\r\n column_mappings=mappings,\r\n pk_columns=[\"customer_id\", \"order_date\"],\r\n null_equals_null=True # Treat null values as equal\r\n)\r\n```\r\n\r\n### Export Results\r\n\r\n```python\r\n# Export comparison results to files using ReportingService\r\nreporter = ReportingService()\r\nexported_files = reporter.export_results(\r\n results=results,\r\n format=\"csv\",\r\n output_dir=\"./comparison_results\"\r\n)\r\nprint(f\"Exported files: {list(exported_files.keys())}\")\r\n\r\n# Generate detailed report with different table formats\r\ndetailed_report = reporter.generate_detailed_report(\r\n results=results,\r\n max_samples=5\r\n)\r\nprint(detailed_report)\r\n\r\n# Generate summary table in different formats\r\nsummary_grid = reporter.generate_summary_table(\r\n results=results,\r\n table_format=\"grid\"\r\n)\r\nprint(summary_grid)\r\n\r\nsummary_pipe = reporter.generate_summary_table(\r\n results=results,\r\n table_format=\"pipe\"\r\n)\r\nprint(summary_pipe)\r\n```\r\n\r\n\r\n\r\n## Configuration Management\r\n\r\nThe framework provides comprehensive configuration management utilities for loading, saving, and manipulating comparison configurations:\r\n\r\n### Configuration Files\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.utils.config_helpers import (\r\n load_config_from_file,\r\n save_config_to_file,\r\n create_default_config,\r\n validate_config\r\n)\r\n\r\n# Create a default configuration template\r\ndefault_config = create_default_config()\r\nprint(\"Default config keys:\", list(default_config.keys()))\r\n\r\n# Save configuration to file\r\nsave_config_to_file(default_config, 
\"my_comparison_config.json\")\r\n\r\n# Load configuration from file\r\nloaded_config = load_config_from_file(\"my_comparison_config.json\")\r\n\r\n# Validate configuration\r\nvalidation_errors = validate_config(loaded_config)\r\nif validation_errors:\r\n print(\"Configuration errors:\", validation_errors)\r\nelse:\r\n print(\"Configuration is valid\")\r\n```\r\n\r\n### Environment Variable Configuration\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.utils.config_helpers import (\r\n get_env_config,\r\n apply_environment_overrides,\r\n merge_configs\r\n)\r\n\r\n# Load configuration from environment variables (prefixed with SPLURGE_)\r\nenv_config = get_env_config()\r\nprint(\"Environment config:\", env_config)\r\n\r\n# Apply environment overrides to existing config\r\nfinal_config = apply_environment_overrides(default_config)\r\n\r\n# Merge multiple configuration sources\r\ncustom_config = {\"comparison\": {\"max_samples\": 100}}\r\nmerged_config = merge_configs(default_config, custom_config)\r\n```\r\n\r\n### Configuration Utilities\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.utils.config_helpers import (\r\n get_config_value,\r\n create_config_from_dataframes\r\n)\r\n\r\n# Get nested configuration values\r\nmax_samples = get_config_value(merged_config, \"reporting.max_samples\", default_value=10)\r\n\r\n# Create configuration from existing DataFrames\r\nbasic_config = create_config_from_dataframes(\r\n left_df=left_df,\r\n right_df=right_df,\r\n primary_keys=[\"customer_id\"],\r\n auto_map_columns=True\r\n)\r\n```\r\n\r\n## Service Architecture\r\n\r\nThe framework follows a modular service-oriented architecture with clear separation of concerns:\r\n\r\n### Core Services\r\n\r\n#### `ComparisonOrchestrator`\r\nMain entry point that coordinates all comparison activities and manages service dependencies.\r\n\r\n#### `ComparisonService`\r\nHandles the core comparison logic, schema validation, and result generation.\r\n\r\n#### 
`ValidationService`\r\nProvides comprehensive data quality validation including schema validation, primary key checks, data type validation, and completeness checks.\r\n\r\n#### `ReportingService`\r\nGenerates human-readable reports in multiple formats and handles data export functionality.\r\n\r\n#### `DataPreparationService`\r\nManages data preprocessing, column mapping, and schema transformations.\r\n\r\n### Logging Utilities\r\n\r\n#### `get_logger(name: str)`\r\nFactory function to create configured loggers with proper naming hierarchy.\r\n\r\n#### `performance_monitor(service_name: str, operation: str)`\r\nContext manager for automatic performance monitoring and logging.\r\n\r\n#### `log_service_initialization(service_name: str, config: dict = None)`\r\nLogs service initialization with optional configuration details.\r\n\r\n#### `log_service_operation(service_name: str, operation: str, status: str, message: str = None)`\r\nLogs service operations with status and optional details.\r\n\r\n#### `log_performance(service_name: str, operation: str, duration_ms: float, details: dict = None)`\r\nLogs performance metrics with automatic slow operation detection.\r\n\r\n### Service Usage Pattern\r\n```python\r\nfrom splurge_lazyframe_compare import ComparisonOrchestrator\r\n\r\n# Services are automatically managed by the orchestrator\r\norchestrator = ComparisonOrchestrator()\r\nresults = orchestrator.compare_dataframes(\r\n config=comparison_config,\r\n left=left_df,\r\n right=right_df\r\n)\r\n```\r\n\r\n### Individual Service Usage\r\n```python\r\nfrom splurge_lazyframe_compare import (\r\n ValidationService,\r\n ReportingService\r\n)\r\n\r\n# Use validation service independently\r\nvalidator = ValidationService()\r\nvalidation_result = validator.validate_dataframe_schema(\r\n df=df,\r\n expected_schema=comparison_schema\r\n)\r\n\r\n# Use reporting service independently\r\nreporter = ReportingService()\r\nsummary_report = 
reporter.generate_summary_report(results=results)\r\n```\r\n\r\n## Logging and Monitoring\r\n\r\nThe framework includes comprehensive logging and monitoring capabilities using Python's standard logging module:\r\n\r\n### Logger Configuration\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.utils.logging_helpers import (\r\n get_logger,\r\n configure_logging,\r\n)\r\n\r\n# Configure logging at application startup (no side-effects on import)\r\nconfigure_logging(level=\"INFO\", fmt='[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s')\r\n\r\n# Get a logger for your module\r\nlogger = get_logger(__name__)\r\n```\r\n\r\n### Performance Monitoring\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.utils.logging_helpers import performance_monitor\r\n\r\n# Automatically log performance metrics\r\nwith performance_monitor(\"ComparisonService\", \"find_differences\") as ctx:\r\n result = perform_comparison() # Your actual operation here\r\n # Add custom metrics (result may not always have len() method)\r\n if hasattr(result, '__len__'):\r\n ctx[\"records_processed\"] = len(result)\r\n ctx[\"operation_status\"] = \"completed\"\r\n```\r\n\r\n### Service Logging\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.utils.logging_helpers import (\r\n log_service_initialization,\r\n log_service_operation\r\n)\r\n\r\n# Log service initialization\r\nlog_service_initialization(\"ComparisonService\", {\"version\": \"1.0\"})\r\n\r\n# Log service operations\r\nlog_service_operation(\"ComparisonService\", \"compare\", \"success\", \"Comparison completed\")\r\n```\r\n\r\n### Log Output Format\r\n\r\n```\r\n[2025-01-29 10:30:45,123] [INFO] [splurge_lazyframe_compare.ComparisonService] [initialization] Service initialized successfully Details: config={'version': '1.0'}\r\n[2025-01-29 10:30:45,234] [WARNING] [splurge_lazyframe_compare.ComparisonService] [find_differences] SLOW OPERATION: 150.50ms (150.50ms) Details: records=1000\r\n[2025-01-29 10:30:45,345] [ERROR] [splurge_lazyframe_compare.ValidationService] [validate_schema] Schema validation failed: Invalid column type\r\n```\r\n\r\n### Log Levels\r\n\r\n- **DEBUG**: Detailed debugging information and performance metrics\r\n- **INFO**: General information about operations and service lifecycle\r\n- **WARNING**: Warning conditions that don't prevent operation\r\n- **ERROR**: Error conditions that may affect functionality\r\n\r\n### Interpreting Validation Errors\r\n\r\n- 
SchemaValidationError: indicates schema mismatches (missing columns, wrong dtypes, nullability constraints, or unmapped primary keys). Inspect the message for actionable details such as the missing columns or mismatched dtypes. Primary key violations surface as duplicates or missing mappings.\r\n- PrimaryKeyViolationError: raised when primary key constraints are violated. Ensure all primary key columns exist and are unique in the input LazyFrames.\r\n\r\nServices preserve exception types and original tracebacks; messages include the service name and operation context for clarity.\r\n\r\n## CLI Usage\r\n\r\nThe package provides an `slc` CLI:\r\n\r\n```bash\r\nslc --help\r\nslc compare --dry-run\r\nslc report --dry-run\r\nslc export --dry-run\r\n```\r\n\r\nThe dry-run subcommands validate inputs and demonstrate execution without running a full comparison.\r\n\r\n### CLI Errors and Exit Codes\r\n\r\nThe CLI surfaces domain errors using custom exceptions and stable exit codes:\r\n\r\n- Configuration issues (e.g., invalid JSON, failed validation) raise `ConfigError` and exit with code `2`.\r\n- Data source issues (e.g., missing files, unsupported extensions) raise `DataSourceError` and exit with code `2`.\r\n- Unexpected errors exit with code `1` and include a brief message; enable debug logging for full tracebacks.\r\n\r\nExample messages:\r\n```\r\nConfiguration error: Invalid configuration: <details>\r\nCompare failed: Unsupported file extension: .txt\r\nExport failed: Data file not found: <path>\r\n```\r\n\r\n## Large-data Export Tips\r\n\r\n- Prefer the Parquet format for performance and compression; 
CSV is human-friendly but slower for large datasets.\r\n- Ensure sufficient temporary disk space when exporting large LazyFrames; Parquet writes may buffer data.\r\n- JSON summaries are exported as a compact, versioned envelope:\r\n\r\n```json\r\n{\r\n \"schema_version\": \"1.0\",\r\n \"summary\": {\r\n \"total_left_records\": 123,\r\n \"total_right_records\": 123,\r\n \"matching_records\": 120,\r\n \"value_differences_count\": 3,\r\n \"left_only_count\": 0,\r\n \"right_only_count\": 0,\r\n \"comparison_timestamp\": \"2025-01-01T00:00:00\"\r\n }\r\n}\r\n```\r\n\r\n## API Reference\r\n\r\n### Core Classes\r\n\r\n#### `ComparisonSchema`\r\nDefines the structure and constraints for a dataset.\r\n\r\n```python\r\nschema = ComparisonSchema(\r\n columns={\r\n \"id\": ColumnDefinition(\"id\", \"ID\", pl.Int64, False),\r\n \"name\": ColumnDefinition(\"name\", \"Name\", pl.Utf8, True),\r\n },\r\n pk_columns=[\"id\"]\r\n)\r\n```\r\n\r\n#### `ColumnDefinition`\r\nDefines a column with metadata for comparison.\r\n\r\n**Using Direct Polars Datatypes:**\r\n```python\r\ncol_def = ColumnDefinition(\r\n name=\"customer_id\",\r\n alias=\"Customer ID\",\r\n datatype=pl.Int64,\r\n nullable=False\r\n)\r\n```\r\n\r\n**Using String Datatype Names (Recommended):**\r\n```python\r\n# More readable and user-friendly\r\ncol_def = ColumnDefinition(\r\n name=\"customer_id\",\r\n alias=\"Customer ID\",\r\n datatype=\"Int64\", # String name instead of pl.Int64\r\n nullable=False\r\n)\r\n\r\n# Supports all Polars datatypes\r\ncomplex_col = 
ColumnDefinition(\r\n name=\"metadata\",\r\n alias=\"Metadata\",\r\n datatype=\"Struct\", # Complex datatype as string\r\n nullable=True\r\n)\r\n\r\ntimestamp_col = ColumnDefinition(\r\n name=\"created_at\",\r\n alias=\"Created At\",\r\n datatype=\"Datetime\", # Automatically configured with defaults\r\n nullable=False\r\n)\r\n```\r\n\r\n**Mixed Usage in Schemas:**\r\n```python\r\nschema = ComparisonSchema(\r\n columns={\r\n # Mix string names and direct datatypes\r\n \"id\": ColumnDefinition(\"id\", \"ID\", \"Int64\", False), # String\r\n \"name\": ColumnDefinition(\"name\", \"Name\", pl.Utf8, True), # Direct\r\n \"created\": ColumnDefinition(\"created\", \"Created\", \"Datetime\", False), # String\r\n \"tags\": ColumnDefinition(\"tags\", \"Tags\", pl.List(pl.Utf8), True), # Direct\r\n },\r\n pk_columns=[\"id\"]\r\n)\r\n```\r\n\r\n**\u26a0\ufe0f Important Notes for Complex Types:**\r\n\r\n- **List types** require an inner type: `pl.List(pl.Utf8)` \u2705, not `pl.List` \u274c\r\n- **Struct types** require field definitions: `pl.Struct([])` \u2705 or `pl.Struct({\"field\": pl.Utf8})` \u2705, not `pl.Struct` \u274c\r\n- The framework will provide clear error messages if you accidentally use unparameterized complex types\r\n\r\n#### `ColumnMapping`\r\nMaps columns between left and right datasets.\r\n\r\n```python\r\nmapping = ColumnMapping(\r\n left=\"customer_id\",\r\n right=\"cust_id\",\r\n name=\"customer_id\"\r\n)\r\n```\r\n\r\n#### `ComparisonConfig`\r\nConfiguration for comparing two datasets.\r\n\r\n```python\r\nconfig = ComparisonConfig(\r\n left_schema=left_schema,\r\n right_schema=right_schema,\r\n column_mappings=mappings,\r\n pk_columns=[\"customer_id\", \"order_date\"],\r\n ignore_case=False,\r\n null_equals_null=True,\r\n tolerance={\"amount\": 0.01}\r\n)\r\n```\r\n\r\n#### `LazyFrameComparator`\r\nMain comparison engine.\r\n\r\n```python\r\ncomparator = LazyFrameComparator(config)\r\nresults = comparator.compare(left=left_df, 
right=right_df)\r\n```\r\n\r\n### Results and Reporting\r\n\r\n#### `ComparisonResult`\r\nContainer for all comparison results.\r\n\r\n```python\r\n# Access summary statistics\r\nprint(f\"Matching records: {results.summary.matching_records}\")\r\nprint(f\"Value differences: {results.summary.value_differences_count}\")\r\nprint(f\"Left-only records: {results.summary.left_only_count}\")\r\nprint(f\"Right-only records: {results.summary.right_only_count}\")\r\n\r\n# Access result DataFrames\r\nvalue_diffs = results.value_differences.collect()\r\nleft_only = results.left_only_records.collect()\r\nright_only = results.right_only_records.collect()\r\n```\r\n\r\n#### `ReportingService`\r\nGenerate human-readable reports with multiple formats.\r\n\r\n```python\r\nfrom splurge_lazyframe_compare import ReportingService\r\n\r\nreporter = ReportingService()\r\n\r\n# Summary report\r\nsummary_report = reporter.generate_summary_report(results=results)\r\nprint(summary_report)\r\n\r\n# Detailed report with samples\r\ndetailed_report = reporter.generate_detailed_report(\r\n results=results,\r\n max_samples=10\r\n)\r\nprint(detailed_report)\r\n\r\n# Summary table in different formats\r\nsummary_grid = reporter.generate_summary_table(\r\n results=results,\r\n table_format=\"grid\" # Options: grid, simple, pipe, orgtbl\r\n)\r\nprint(summary_grid)\r\n\r\n# Export results to files\r\nexported_files = reporter.export_results(\r\n results=results,\r\n format=\"csv\", # Options: csv, parquet, json\r\n output_dir=\"./comparison_results\"\r\n)\r\n```\r\n\r\n#### `ComparisonOrchestrator` (Extended Methods)\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.services.orchestrator import ComparisonOrchestrator\r\n\r\norchestrator = ComparisonOrchestrator()\r\n\r\n# Get comparison summary as string\r\nsummary_str = orchestrator.get_comparison_summary(result=results)\r\n\r\n# Get comparison table in various formats\r\ntable_str = orchestrator.get_comparison_table(\r\n result=results,\r\n 
table_format=\"grid\" # Options: grid, simple, pipe, orgtbl\r\n)\r\n\r\n# Generate report from existing result\r\nreport = orchestrator.generate_report_from_result(\r\n result=results,\r\n report_type=\"detailed\", # Options: summary, detailed, table\r\n max_samples=10\r\n)\r\n\r\n\r\n```\r\n\r\n### Configuration Management API\r\n\r\n#### `create_comparison_config_from_lazyframes()`\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.utils import create_comparison_config_from_lazyframes\r\n\r\n# Auto-generate ComparisonConfig from LazyFrames\r\nconfig = create_comparison_config_from_lazyframes(\r\n left=left_lf, # Left LazyFrame\r\n right=right_lf, # Right LazyFrame\r\n pk_columns=[\"id\"] # Primary key columns\r\n)\r\n\r\n# Returns: ComparisonConfig ready for use\r\n```\r\n\r\n#### Configuration File Operations\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.utils.config_helpers import (\r\n load_config_from_file,\r\n save_config_to_file,\r\n create_default_config,\r\n validate_config\r\n)\r\n\r\n# Create default configuration template\r\nconfig = create_default_config()\r\n\r\n# Save to file\r\nsave_config_to_file(config, \"comparison_config.json\")\r\n\r\n# Load from file\r\nloaded_config = load_config_from_file(\"comparison_config.json\")\r\n\r\n# Validate configuration\r\nerrors = validate_config(loaded_config) # Returns list of error messages\r\n```\r\n\r\n#### Environment Configuration\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.utils.config_helpers import (\r\n get_env_config,\r\n apply_environment_overrides,\r\n merge_configs,\r\n get_config_value\r\n)\r\n\r\n# Get configuration from environment variables (SPLURGE_ prefix)\r\nenv_config = get_env_config()\r\n\r\n# Apply environment overrides to existing config\r\nfinal_config = apply_environment_overrides(base_config)\r\n\r\n# Merge multiple configurations\r\nmerged = merge_configs(base_config, override_config)\r\n\r\n# Get nested configuration values\r\nvalue = get_config_value(config, 
\"reporting.max_samples\", default_value=10)\r\n```\r\n\r\n#### DataFrame Configuration Generation\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.utils.config_helpers import create_config_from_dataframes\r\n\r\n# Generate configuration from DataFrame schemas\r\nconfig = create_config_from_dataframes(\r\n left_df=left_df,\r\n right_df=right_df,\r\n primary_keys=[\"customer_id\"],\r\n auto_map_columns=True # Auto-map columns with same names\r\n)\r\n```\r\n\r\n### Utility Classes and Constants\r\n\r\n#### `ConfigConstants`\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.utils.config_helpers import ConfigConstants\r\n\r\n# Configuration constants\r\nprefix = ConfigConstants.ENV_PREFIX # \"SPLURGE_\"\r\nconfig_file = ConfigConstants.DEFAULT_CONFIG_FILE # \"comparison_config.json\"\r\nschema_file = ConfigConstants.DEFAULT_SCHEMA_FILE # \"schemas.json\"\r\n\r\n# Valid policy values\r\nnull_policies = ConfigConstants.VALID_NULL_POLICIES # [\"equals\", \"not_equals\", \"ignore\"]\r\ncase_policies = ConfigConstants.VALID_CASE_POLICIES # [\"sensitive\", \"insensitive\", \"preserve\"]\r\n```\r\n\r\n## Data Quality Validation\r\n\r\nThe framework includes comprehensive data quality validation through the ValidationService:\r\n\r\n```python\r\nfrom splurge_lazyframe_compare import ValidationService\r\n\r\nvalidator = ValidationService()\r\n\r\n# Validate schema against expected structure\r\nschema_result = validator.validate_dataframe_schema(\r\n df=df,\r\n expected_schema=comparison_schema\r\n)\r\n\r\n# Check primary key uniqueness\r\npk_result = validator.validate_primary_key_uniqueness(\r\n df=df,\r\n pk_columns=[\"customer_id\", \"order_date\"]\r\n)\r\n\r\n# Validate data completeness\r\ncompleteness_result = validator.validate_completeness(\r\n df=df,\r\n required_columns=[\"customer_id\", \"amount\"]\r\n)\r\n\r\n# Validate data types\r\ndtype_result = validator.validate_data_types(\r\n df=df,\r\n expected_types={\"customer_id\": pl.Int64, \"amount\": 
pl.Float64}\r\n)\r\n\r\n# Validate numeric ranges\r\nrange_result = validator.validate_numeric_ranges(\r\n df=df,\r\n column_ranges={\"amount\": {\"min\": 0, \"max\": 10000}}\r\n)\r\n\r\n# Validate string patterns (regex)\r\npattern_result = validator.validate_string_patterns(\r\n df=df,\r\n column_patterns={\"email\": r\"^[^@]+@[^@]+\\.[^@]+$\"}\r\n)\r\n\r\n# Validate uniqueness constraints\r\nuniqueness_result = validator.validate_uniqueness(\r\n df=df,\r\n unique_columns=[\"customer_id\"]\r\n)\r\n\r\n# Access validation results\r\nif schema_result.is_valid:\r\n print(\"Schema validation passed!\")\r\nelse:\r\n print(\"Schema validation failed:\")\r\n for error in schema_result.errors:\r\n print(f\" - {error}\")\r\n```\r\n\r\n## Error Handling\r\n\r\nThe framework provides custom exceptions for different error scenarios:\r\n\r\n```python\r\nfrom splurge_lazyframe_compare.exceptions import (\r\n SchemaValidationError,\r\n PrimaryKeyViolationError,\r\n ColumnMappingError,\r\n ConfigError,\r\n DataSourceError,\r\n)\r\n\r\ntry:\r\n results = comparator.compare(left=left_df, right=right_df)\r\nexcept SchemaValidationError as e:\r\n print(f\"Schema validation failed: {e}\")\r\n if hasattr(e, 'validation_errors'):\r\n for error in e.validation_errors:\r\n print(f\" - {error}\")\r\nexcept PrimaryKeyViolationError as e:\r\n print(f\"Primary key violation: {e}\")\r\n if hasattr(e, 'duplicate_keys'):\r\n print(f\"Duplicate keys found: {e.duplicate_keys}\")\r\nexcept ColumnMappingError as e:\r\n print(f\"Column mapping error: {e}\")\r\n if hasattr(e, 'mapping_errors'):\r\n for error in e.mapping_errors:\r\n print(f\" - {error}\")\r\nexcept ConfigError as e:\r\n print(f\"Configuration error: {e}\")\r\nexcept DataSourceError as e:\r\n print(f\"Data source error: {e}\")\r\n```\r\n\r\n## Examples\r\n\r\nSee the `examples/` directory for comprehensive working examples demonstrating all framework capabilities:\r\n\r\n### Core Usage Examples\r\n- **`basic_comparison_example.py`** - 
Basic usage demonstration with schema definition, column mapping, and report generation\r\n- **`service_example.py`** - Service-oriented architecture patterns and dependency injection\r\n- **`auto_config_example.py`** - Automatic ComparisonConfig generation from LazyFrames with identical column names\r\n\r\n### Performance Examples\r\n- **`performance_comparison_example.py`** - Performance benchmarking with large datasets (100K+ records)\r\n- **`detailed_performance_benchmark.py`** - Comprehensive performance analysis with statistical reporting\r\n\r\n### Reporting Examples\r\n- **`tabulated_report_example.py`** - Multiple table formats (grid, simple, pipe, orgtbl) and export functionality\r\n\r\n### Running Examples\r\n```bash\r\n# Basic comparison\r\npython examples/basic_comparison_example.py\r\n\r\n# Auto-configuration\r\npython examples/auto_config_example.py\r\n\r\n# Service architecture\r\npython examples/service_example.py\r\n\r\n# Performance testing\r\npython examples/performance_comparison_example.py\r\n\r\n# Detailed benchmarking\r\npython examples/detailed_performance_benchmark.py\r\n\r\n# Tabulated reporting\r\npython examples/tabulated_report_example.py\r\n```\r\n\r\nEach example includes comprehensive documentation and demonstrates real-world usage patterns.\r\n\r\n## Performance Characteristics\r\n\r\nBased on the included performance benchmarks, the framework demonstrates strong performance:\r\n\r\n### Key Metrics\r\n- **Processing Speed**: 2.2M+ records/second for large datasets\r\n- **Memory Efficiency**: Lazy evaluation avoids materializing full datasets in memory\r\n- **Scalability**: Throughput improves with larger datasets due to Polars optimizations\r\n- **Time per Record**: ~0.0004 ms/record at peak throughput\r\n\r\n### Benchmark Results Summary\r\n- **1K-5K records**: 118K-475K 
records/second\r\n- **10K-25K records**: 841K-1.4M records/second\r\n- **50K+ records**: 2.2M+ records/second\r\n\r\nThe framework leverages Polars' lazy evaluation and vectorized operations for optimal performance across all dataset sizes.\r\n\r\n## Development\r\n\r\n### Running Tests\r\n\r\n```bash\r\n# Run all tests\r\npytest\r\n\r\n# Run with coverage\r\npytest --cov=splurge_lazyframe_compare\r\n\r\n# Run specific test file\r\npytest tests/test_comparator.py\r\n```\r\n\r\n### Logging in Development\r\n\r\nThe framework uses structured logging throughout. During development, you can enable debug logging to see detailed information:\r\n\r\n```python\r\nimport logging\r\n\r\n# Enable debug logging\r\nlogging.basicConfig(level=logging.DEBUG)\r\n\r\n# Now all framework operations will be logged with detailed information\r\n```\r\n\r\n### Code Quality\r\n\r\n```bash\r\n# Run linting\r\nruff check .\r\n\r\n# Run formatting\r\nruff format .\r\n\r\n# Type checking (if mypy is configured)\r\nmypy splurge_lazyframe_compare\r\n```\r\n\r\n## Recent Improvements\r\n\r\n- **Production-ready logging**: Replaced all `print()` statements with the Python logging module\r\n- **Structured logging**: Consistent log format with timestamps, service names, and operation context\r\n- **Performance monitoring**: Built-in performance tracking and slow operation detection\r\n- **Error handling**: Robust exception handling with custom exceptions and graceful recovery\r\n- **Polars integration**: Seamless integration with Polars LazyFrames for efficient data processing\r\n- **Improved error messages**: Clear guidance for proper usage of complex data types\r\n\r\n## Contributing\r\n\r\n1. Fork the repository\r\n2. Create a feature branch\r\n3. Make your changes\r\n4. 
Add tests for new functionality\r\n5. Run the test suite\r\n6. Submit a pull request\r\n\r\n## License\r\n\r\nMIT License - see the LICENSE file for details.\r\n\r\n## Requirements\r\n\r\n- **Python**: 3.10 or higher\r\n- **Polars**: >= 1.32.0 (core data processing)\r\n- **tabulate**: >= 0.9.0 (table formatting for reports)\r\n\r\nThe `logging` and `typing` modules used by the framework are part of the Python standard library and require no additional installation.\r\n\r\n### Optional Dependencies\r\n- Additional packages may be required for specific export formats or advanced features\r\n\r\n### Installation Options\r\n```bash\r\n# Install with all dependencies\r\npip install splurge-lazyframe-compare\r\n\r\n# Install in development mode\r\npip install -e .\r\n```\r\n\r\n## Changelog\r\n\r\n### 2025.2.0 (2025-09-03)\r\n- Added domain exceptions at the CLI boundary: `ConfigError`, `DataSourceError`.\r\n- Standardized CLI exit codes: `2` for domain errors, `1` for unexpected errors.\r\n- CLI now catches `ComparisonError` first, preserving clear user-facing messages.\r\n- Service error handling preserves the exception type and chains the original cause; messages now include the service name and context.\r\n- Documentation updates: README now documents CLI errors/exit codes and new exceptions.\r\n- New CLI capabilities and flags: `compare`, `report`, `export`, `--dry-run`, `--format`, `--output-dir`, `--log-level`.\r\n- Logging improvements: introduced `configure_logging()`; removed import-time handler side effects; consistent log formatting.\r\n- Packaging/entrypoints: ensured the `slc` console script and a `__main__.py` entry for `python -m splurge_lazyframe_compare`.\r\n- Docs: added CLI usage, logging configuration, and large-data export guidance.\r\n\r\n### 2025.1.1 (2025-08-29)\r\n- Removed extraneous folders and plan documents.\r\n\r\n### 2025.1.0 (2025-08-29)\r\n- Initial commit.\r\n",
"bugtrack_url": null,
"license": null,
"summary": "A Python package for comparing polars lazyframes",
"version": "2025.2.0",
"project_urls": {
"Homepage": "https://github.com/jim-schilling/splurge-lazyframe-compare"
},
"split_keywords": [
"splurge",
" lazyframe",
" compare",
" polars",
" dataframe",
" comparison"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "731ff5dd6ba3a2faaa52f28c4d2bf2878c6de17204b35fec1f442bf9d7e03469",
"md5": "15c521d37ca68ea8a6be5b22f23b1a5f",
"sha256": "8105298da94b48a091d1012ddb60d20bc943c76c4021ab8cf1808f28e1815c59"
},
"downloads": -1,
"filename": "splurge_lazyframe_compare-2025.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "15c521d37ca68ea8a6be5b22f23b1a5f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 63649,
"upload_time": "2025-09-03T16:06:40",
"upload_time_iso_8601": "2025-09-03T16:06:40.623625Z",
"url": "https://files.pythonhosted.org/packages/73/1f/f5dd6ba3a2faaa52f28c4d2bf2878c6de17204b35fec1f442bf9d7e03469/splurge_lazyframe_compare-2025.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "962168eba5e9981953b16663512c389a394201b7d41f0bd2a452b923fdd793c0",
"md5": "443ac8af3e57b40d53020d2de96d127d",
"sha256": "a8166e75b323f147520774eb12f49d4078cd7cf382639bfc6c8809894ae7f740"
},
"downloads": -1,
"filename": "splurge_lazyframe_compare-2025.2.0.tar.gz",
"has_sig": false,
"md5_digest": "443ac8af3e57b40d53020d2de96d127d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 66417,
"upload_time": "2025-09-03T16:06:41",
"upload_time_iso_8601": "2025-09-03T16:06:41.996275Z",
"url": "https://files.pythonhosted.org/packages/96/21/68eba5e9981953b16663512c389a394201b7d41f0bd2a452b923fdd793c0/splurge_lazyframe_compare-2025.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-03 16:06:41",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "jim-schilling",
"github_project": "splurge-lazyframe-compare",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "splurge-lazyframe-compare"
}