# What's New in v0.17.1
**Notebook Fixes & Beginner Experience:**
- Fixed confusion matrix example in classification and advanced workflow notebooks to match API signature
- Audited all example notebooks for beginner-friendliness and error-free execution
- All notebooks now run without unnecessary errors for new users
**Major Examples & Documentation Update (v0.17.0):**
- Added interactive Jupyter notebooks for all major workflows
- All guides and onboarding instructions updated to reference new features and examples
- Verified and enhanced documentation for:
  - `highlight_anomalies`
  - `create_lag_features`
  - `display_facet_grid`
  - `scale_features`
  - `group_rare_categories`
  - `export_figure`
- Improved onboarding and user guidance in ReadTheDocs and README
- Minor bug fixes and consistency improvements across docs and codebase
**Why Upgrade?**
This release ensures all users have access to complete, copy-paste-ready examples and documentation. The onboarding experience is now smoother, and all advanced features are fully documented.
See the full documentation at [edaflow.readthedocs.io](https://edaflow.readthedocs.io)
**Major Documentation Overhaul for Education:**
- Added a dedicated Learning Path for new and aspiring data scientists
- Consolidated ML workflow steps into a single, copy-paste-safe guide
- Expanded examples: classification, regression, and computer vision
- Improved navigation: clear table of contents, user guide, API reference, and best practices
- Advanced features and troubleshooting tips for power users
**Why Upgrade?**
This release makes edaflow best-in-class for educational value, with a structured progression for learners and educators. All documentation is now easier to follow, with practical code and hands-on exercises.
See the full documentation at [edaflow.readthedocs.io](https://edaflow.readthedocs.io)
# edaflow
[Documentation Status](https://edaflow.readthedocs.io/en/latest/?badge=latest)
[PyPI version](https://badge.fury.io/py/edaflow)
[Python](https://www.python.org/downloads/)
[License: MIT](https://opensource.org/licenses/MIT)
[Downloads](https://pepy.tech/project/edaflow)
**Quick Navigation**:
[Documentation](https://edaflow.readthedocs.io) |
[PyPI Package](https://pypi.org/project/edaflow/) |
[Quick Start](https://edaflow.readthedocs.io/en/latest/quickstart.html) |
[Changelog](#-changelog) |
[Issues](https://github.com/evanlow/edaflow/issues)
A Python package for streamlined exploratory data analysis workflows.
> **📦 Current Version: v0.16.4** - [Latest Release](https://pypi.org/project/edaflow/0.16.4/) adds a complete examples directory, improved onboarding, and fully documented advanced features. *Updated: September 12, 2025*
## Table of Contents
- [Description](#description)
- [🚨 Critical Fixes in v0.15.0](#-critical-fixes-in-v0150)
- [✨ What's New](#-whats-new)
- [Features](#features)
- [Recent Updates](#-recent-updates)
- [Documentation](#-documentation)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Changelog](#-changelog)
- [Support](#support)
- [Roadmap](#roadmap)
## Description
`edaflow` is designed to simplify and accelerate the exploratory data analysis (EDA) process by providing a collection of tools and utilities for data scientists and analysts. The package integrates popular data science libraries to create a cohesive workflow for data exploration, visualization, and preprocessing.
## What's New in v0.15.1
**NEW:** `setup_ml_experiment` now supports a `primary_metric` argument, making metric selection robust and error-free for all ML workflows. All documentation, tests, and downstream code are updated for consistency. A new test ensures the metric is set and accessible throughout the workflow.
**Upgrade recommended for all users who want reliable, copy-paste-safe ML workflows with dynamic metric selection.**
---
## 🚨 Critical Fixes in v0.15.0
**(Previous release)**
### 🎯 **Issues Resolved**:
- **FIXED**: `RandomForestClassifier instance is not fitted yet` errors
- **FIXED**: `TypeError: unexpected keyword argument` errors
- **FIXED**: Missing imports and undefined variables in examples
- **FIXED**: Duplicate step numbering in documentation
- ✅ **RESULT**: All ML workflow examples now work perfectly!
### **What This Means For You**:
- **Copy-paste examples that work immediately**
- 🎯 **No more confusing error messages**
- **Complete, beginner-friendly documentation**
- **Smooth learning experience for new users**
**Upgrade recommended for all users following ML workflow documentation.**
## ✨ What's New
### 🚨 Critical ML Documentation Fixes (v0.15.0)
**MAJOR DOCUMENTATION UPDATE**: Fixed critical issues that were causing user errors when following ML workflow examples.
**Problems Resolved**:
- ✅ **Model Fitting**: Added missing `model.fit()` steps that were causing "not fitted" errors
- ✅ **Function Parameters**: Fixed incorrect parameter names in all examples
- ✅ **Missing Context**: Added imports and data preparation context
- ✅ **Step Numbering**: Corrected duplicate step numbers in documentation
- ✅ **Enhanced Warnings**: Added prominent warnings about critical requirements
**Result**: All ML workflow documentation now works perfectly out-of-the-box!
### 🎯 Enhanced rank_models Function (v0.14.x)
**DUAL RETURN FORMAT SUPPORT**: Major enhancement based on user requests.
```python
# Both formats now supported:
df_results = ml.rank_models(results, 'accuracy') # DataFrame (default)
list_results = ml.rank_models(results, 'accuracy', return_format='list') # List of dicts
# User-requested pattern now works:
best_model = ml.rank_models(results, 'accuracy', return_format='list')[0]["model_name"]
```
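
The list-of-dicts format is convenient because the best entry can be indexed directly. The sketch below illustrates that access pattern in plain Python. It is not edaflow's implementation: `rank_results` is a hypothetical helper, the scores are made up, and it assumes each result is a dict with a `model_name` key and that higher metric values are better.

```python
# Illustrative only: `rank_results` is a hypothetical helper, not part of edaflow.
def rank_results(results, metric):
    """Return per-model result dicts sorted by `metric`, highest first."""
    return sorted(results, key=lambda row: row[metric], reverse=True)


# Made-up scores, used only to demonstrate the access pattern.
results = [
    {"model_name": "LogisticRegression", "accuracy": 0.81},
    {"model_name": "RandomForest", "accuracy": 0.88},
    {"model_name": "GradientBoosting", "accuracy": 0.86},
]

ranked = rank_results(results, "accuracy")
best_model_name = ranked[0]["model_name"]  # index 0 is the top-ranked entry
print(best_model_name)  # RandomForest
```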
### ML Expansion (v0.13.0+)
**COMPLETE MACHINE LEARNING SUBPACKAGE**: Extended edaflow into full ML workflows.
**New ML Modules Added**:
- **`ml.config`**: ML experiment setup and data validation
- **`ml.leaderboard`**: Multi-model comparison and ranking
- **`ml.tuning`**: Advanced hyperparameter optimization
- **`ml.curves`**: Learning curves and performance visualization
- **`ml.artifacts`**: Model persistence and experiment tracking
**Key ML Features**:
```python
# Complete ML workflow in one package
import edaflow.ml as ml
# Setup experiment with flexible parameter support
# Both calling patterns work:
experiment = ml.setup_ml_experiment(df, 'target') # DataFrame style
# OR
experiment = ml.setup_ml_experiment(X=X, y=y, val_size=0.15) # sklearn style
# Compare multiple models
results = ml.compare_models(models, **experiment)
# Optimize hyperparameters with multiple strategies
best_model = ml.optimize_hyperparameters(model, params, **experiment)
# Generate comprehensive visualizations
ml.plot_learning_curves(model, **experiment)
```
### Previous: API Improvement (v0.12.33)
**NEW CLEAN APIs**: Introduced consistent, user-friendly encoding functions that eliminate confusion and crashes.
**Root Cause Solved**: The inconsistent return type of `apply_smart_encoding()` (sometimes DataFrame, sometimes tuple) was causing AttributeError crashes and user confusion.
**New Functions Added**:
```python
# ✅ NEW: Clean, consistent DataFrame return (RECOMMENDED)
df_encoded = edaflow.apply_encoding(df) # Always returns DataFrame
# ✅ NEW: Explicit tuple return when encoders needed
df_encoded, encoders = edaflow.apply_encoding_with_encoders(df) # Always returns tuple
# ⚠️ DEPRECATED: Inconsistent behavior (still works with warnings)
df_encoded = edaflow.apply_smart_encoding(df, return_encoders=True) # Sometimes tuple!
```
**Benefits**:
- 🎯 **Zero Breaking Changes**: All existing workflows continue working exactly the same
- 🛡️ **Better Error Messages**: Helpful guidance when mistakes are made
- **Migration Path**: Multiple options for users who want cleaner APIs
- **Clear Documentation**: Explicit examples showing best practices
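
The general pattern behind this change (one function per return shape, with the flag-driven function kept as a deprecated wrapper) can be sketched in plain Python. All names below are hypothetical and the toy label encoder stands in for the real encoding logic; this is not edaflow's source code.

```python
import warnings


# Hypothetical names throughout; a toy label encoder stands in for real encoding.
def _encode(labels):
    """Map each distinct label to an integer code; return (codes, mapping)."""
    mapping = {label: code for code, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def encode(labels):
    """Always returns only the encoded values."""
    codes, _ = _encode(labels)
    return codes


def encode_with_encoders(labels):
    """Always returns an (encoded values, mapping) tuple."""
    return _encode(labels)


def smart_encode(labels, return_encoders=False):
    """Deprecated style: the return type depends on a flag."""
    warnings.warn(
        "smart_encode is deprecated; use encode or encode_with_encoders",
        DeprecationWarning,
        stacklevel=2,
    )
    return _encode(labels) if return_encoders else encode(labels)


labels = ["red", "blue", "red", "green"]
codes = encode(labels)                              # a list, every time
codes_again, mapping = encode_with_encoders(labels)  # a tuple, every time
```

Because each function has exactly one return shape, callers never need to check whether they received a tuple.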
### Critical Input Validation Fix (v0.12.32)
**RESOLVED**: Fixed AttributeError: 'tuple' object has no attribute 'empty' in visualization functions when `apply_smart_encoding(..., return_encoders=True)` result is used incorrectly.
**Problem Solved**: Users who passed the tuple result from `apply_smart_encoding` directly to visualization functions without unpacking were experiencing crashes in step 14 of EDA workflows.
**Enhanced Error Messages**: Added intelligent input validation with helpful error messages guiding users to the correct usage pattern:
```python
# ❌ WRONG - This causes the AttributeError:
df_encoded = edaflow.apply_smart_encoding(df, return_encoders=True) # Returns (df, encoders) tuple!
edaflow.visualize_scatter_matrix(df_encoded) # Crashes with AttributeError
# ✅ CORRECT - Unpack the tuple:
df_encoded, encoders = edaflow.apply_smart_encoding(df, return_encoders=True)
edaflow.visualize_scatter_matrix(df_encoded) # Should work well!
```
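
The kind of input check described here can be sketched as follows. This is a minimal illustration, not edaflow's actual validation code: `require_table` is a hypothetical name, and plain dicts stand in for the DataFrame and the encoders.

```python
# Hypothetical validator; edaflow's actual check and message may differ.
def require_table(obj):
    """Reject a (data, encoders) tuple with a hint instead of failing later."""
    if isinstance(obj, tuple):
        raise TypeError(
            "Expected a single table but received a tuple. "
            "Did you forget to unpack it? Try: data, encoders = ..."
        )
    return obj


table = {"price": [999, 25, 75]}      # stand-in for a DataFrame
encoders = {"category": "one-hot"}    # stand-in for fitted encoders

try:
    require_table((table, encoders))  # passing the tuple by mistake
except TypeError as exc:
    message = str(exc)

unpacked, _ = (table, encoders)       # correct usage: unpack first
checked = require_table(unpacked)
```

Failing early with a message that names the fix is more helpful than letting the tuple reach code that expects DataFrame attributes such as `.empty`.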
### 🎨 BREAKTHROUGH: Universal Dark Mode Compatibility (v0.12.30)
- **NEW FUNCTION**: `optimize_display()` - The **FIRST** EDA library with universal notebook compatibility!
- **Universal Platform Support**: Improved visibility across Google Colab, JupyterLab, VS Code, and Classic Jupyter
- **Automatic Detection**: Zero configuration needed - automatically detects your environment
- **Accessibility Support**: Built-in high contrast mode for improved accessibility
- **One-Line Solution**: `edaflow.optimize_display()` fixes all visibility issues instantly
### Critical KeyError Hotfix (v0.12.31)
- **Fixed KeyError**: Resolved "KeyError: 'type'" in `summarize_eda_insights()` function
- **Enhanced Error Handling**: Added robust exception handling for target analysis edge cases
- **Improved Stability**: Function now handles missing or invalid target columns gracefully
### Platform Benefits:
- ✅ **Google Colab**: Auto light/dark mode detection with improved text visibility
- ✅ **JupyterLab**: Dark theme compatibility with custom theme support
- ✅ **VS Code**: Native theme integration with seamless notebook experience
- ✅ **Classic Jupyter**: Full compatibility with enhanced readability options
```python
import edaflow
# ⭐ NEW: Improved visibility everywhere!
edaflow.optimize_display() # Universal dark mode fix!
# All functions now display beautifully
edaflow.check_null_columns(df)
edaflow.visualize_histograms(df)
```
### ✨ NEW FUNCTION: `summarize_eda_insights()` (Added in v0.12.28)
- **Comprehensive Analysis**: Generate complete EDA insights and actionable recommendations after completing your analysis workflow
- **Smart Recommendations**: Provides intelligent next steps for modeling, preprocessing, and data quality improvements
- **Target-Aware Analysis**: Supports both classification and regression scenarios with specific insights
- **Function Tracking**: Knows which edaflow functions you've already used in your workflow
- **Structured Output**: Returns organized dictionary with dataset overview, data quality assessment, and recommendations
### 🎨 Display Formatting Excellence
- **Enhanced Visual Experience**: Refined Rich console styling with optimized panel borders and alignment
- **Google Colab Optimized**: Improved display formatting specifically tailored for notebook environments
- **Consistent Design**: Professional rounded borders, proper width constraints, and refined color schemes
- **Universal Compatibility**: Beautiful output rendering across all major Python environments and notebooks
### Recent Fixes (v0.12.24-0.12.26)
- **LBP Warning Resolution**: Fixed scikit-image UserWarning in texture analysis functions
- **Parameter Documentation**: Corrected `analyze_image_features` documentation mismatches
- **RTD Synchronization**: Updated Read the Docs changelog with all recent improvements
### Rich Styling (v0.12.20-0.12.21)
- **Vibrant Output**: ALL major EDA functions now feature professional, color-coded styling
- **Smart Indicators**: Color-coded severity levels (✅ CLEAN, ⚠️ WARNING, 🚨 CRITICAL)
- **Professional Tables**: Beautiful formatted output with rich library integration
- **Actionable Insights**: Context-aware recommendations and visual status indicators
## Features
### **Exploratory Data Analysis**
- **Missing Data Analysis**: Color-coded analysis of null values with customizable thresholds
- **Categorical Data Insights**: *FIXED in v0.12.29* Identify object columns that might be numeric, detect data type issues (now handles unhashable types)
- **Automatic Data Type Conversion**: Smart conversion of object columns to numeric when appropriate
- **Categorical Values Visualization**: Detailed exploration of categorical column values with insights
- **Column Type Classification**: Simple categorization of DataFrame columns into categorical and numerical types
- **Data Type Detection**: Smart analysis to flag potential data conversion needs
- **EDA Insights Summary**: ⭐ *NEW in v0.12.28* Comprehensive EDA insights and actionable recommendations after completing analysis workflow
### **Advanced Visualizations**
- **Numerical Distribution Visualization**: Advanced boxplot analysis with outlier detection and statistical summaries
- **Interactive Boxplot Visualization**: Interactive Plotly Express boxplots with zoom, hover, and statistical tooltips
- **Comprehensive Heatmap Visualizations**: Correlation matrices, missing data patterns, values heatmaps, and cross-tabulations
- **Statistical Histogram Analysis**: Advanced histogram visualization with skewness detection, normality testing, and distribution analysis
- **Scatter Matrix Analysis**: Advanced pairwise relationship visualization with customizable matrix layouts, regression lines, and statistical insights
### 🤖 **Machine Learning Preprocessing** ⭐ *Introduced in v0.12.0*
- **Intelligent Encoding Analysis**: Automatic detection of optimal encoding strategies for categorical variables
- **Smart Encoding Application**: Automated categorical encoding with support for:
  - One-Hot Encoding for low cardinality categories
  - Target Encoding for high cardinality with target correlation
  - Ordinal Encoding for ordinal relationships
  - Binary Encoding for medium cardinality
  - Text Vectorization (TF-IDF) for text features
  - Leave Unchanged for numeric columns
- **Memory-Efficient Processing**: Intelligent handling of high-cardinality features to prevent memory issues
- **Comprehensive Encoding Pipeline**: End-to-end preprocessing solution for ML model preparation
### 🤖 **Machine Learning Workflows** ⭐ *NEW in v0.13.0*
The powerful `edaflow.ml` subpackage provides comprehensive machine learning workflow capabilities:
#### **ML Experiment Setup (`ml.config`)**
- **Smart Data Validation**: Automatic data quality assessment and problem type detection
- **Intelligent Data Splitting**: Train/validation/test splits with stratification support
- **ML Pipeline Configuration**: Automated preprocessing pipeline setup for ML workflows
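
To make the train/validation/test idea concrete, here is a minimal plain-Python sketch of a three-way split. It is not edaflow's implementation: `three_way_split` is a hypothetical helper, it assumes `test_size` and `val_size` are both fractions of the full dataset, and it does not stratify.

```python
import random


# Hypothetical helper for illustration; not edaflow's implementation.
def three_way_split(rows, test_size=0.2, val_size=0.15, random_state=42):
    """Shuffle rows reproducibly and cut them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_size)
    n_val = round(len(shuffled) * val_size)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 65 15 20
```

Every row lands in exactly one of the three lists, so validation and test scores are computed on data the model never saw during training.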
#### **Model Comparison & Ranking (`ml.leaderboard`)**
- **Multi-Model Evaluation**: Compare multiple models with comprehensive metrics
- **Smart Leaderboards**: Automatically rank models by performance with visual displays
- **Export Capabilities**: Save comparison results for reporting and analysis
#### **Hyperparameter Optimization (`ml.tuning`)**
- **Multiple Search Strategies**: Grid search, random search, and Bayesian optimization
- **Cross-Validation Integration**: Built-in CV with customizable scoring metrics
- **Parallel Processing**: Multi-core hyperparameter optimization for faster results
#### **Learning & Performance Curves (`ml.curves`)**
- **Learning Curves**: Visualize model performance vs training size
- **Validation Curves**: Analyze hyperparameter impact on model performance
- **ROC & Precision-Recall Curves**: Comprehensive classification performance analysis
- **Feature Importance**: Visual analysis of model feature contributions
#### **Model Persistence & Tracking (`ml.artifacts`)**
- **Complete Model Artifacts**: Save models, configs, and metadata
- **Experiment Tracking**: Track multiple experiments with organized storage
- **Model Reports**: Generate comprehensive model performance reports
- **Version Management**: Organized model versioning and retrieval
**Quick ML Example:**
```python
import edaflow.ml as ml
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression  # was missing from the original example

# Setup ML experiment - Multiple parameter patterns supported
# Method 1: DataFrame + target column (recommended)
experiment = ml.setup_ml_experiment(df, target_column='target')

# Method 2: sklearn-style (also supported)
X = df.drop('target', axis=1)
y = df['target']
experiment = ml.setup_ml_experiment(
    X=X, y=y,
    test_size=0.2,
    val_size=0.15,  # Alternative to validation_size
    experiment_name="my_ml_project",
    stratify=True,
    random_state=42
)

# Compare multiple models
models = {
    'RandomForest': RandomForestClassifier(),
    'LogisticRegression': LogisticRegression()
}
comparison = ml.compare_models(models, **experiment)

# Rank models with flexible access patterns
# Method 1: Easy dictionary access (recommended for getting best model)
best_model_name = ml.rank_models(comparison, 'accuracy', return_format='list')[0]['model_name']

# Method 2: Traditional DataFrame format
ranked_df = ml.rank_models(comparison, 'accuracy')
best_model_traditional = ranked_df.iloc[0]['model']

# Both methods give the same result
print(f"Best model: {best_model_name}")  # Easy access
print(f"Best model: {best_model_traditional}")  # Traditional access

# Optimize hyperparameters
# --- Copy-paste-safe hyperparameter optimization example ---
model_name = 'LogisticRegression'  # or 'RandomForest' or 'GradientBoosting'
if model_name == 'RandomForest':
    param_distributions = {
        'n_estimators': [100, 200, 300],
        'max_depth': [5, 10, 15, None],
        'min_samples_split': [2, 5, 10]
    }
    model = RandomForestClassifier()
    method = 'grid'
elif model_name == 'GradientBoosting':
    param_distributions = {
        'n_estimators': (50, 200),
        'learning_rate': (0.01, 0.3),
        'max_depth': (3, 8)
    }
    model = GradientBoostingClassifier()
    method = 'bayesian'
elif model_name == 'LogisticRegression':
    param_distributions = {
        'C': [0.01, 0.1, 1, 10, 100],
        'penalty': ['l1', 'l2', 'elasticnet', 'none'],
        'solver': ['lbfgs', 'liblinear', 'saga']
    }
    model = LogisticRegression(max_iter=1000)
    method = 'grid'
else:
    raise ValueError(f"Unknown model_name: {model_name}")

results = ml.optimize_hyperparameters(
    model,
    param_distributions=param_distributions,
    **experiment
)

# Generate learning curves
ml.plot_learning_curves(results['best_model'], **experiment)

# Save complete artifacts
ml.save_model_artifacts(
    model=results['best_model'],
    model_name='optimized_rf',
    experiment_config=experiment,
    performance_metrics=results['cv_results']
)
```
### 🖼️ **Computer Vision Support**
- **Computer Vision EDA**: Class-wise image sample visualization and comprehensive quality assessment for image classification datasets
- **Image Quality Assessment**: Automated detection of corrupted images, quality issues, blur, artifacts, and dataset health metrics
### Usage Examples
### Basic Usage
```python
import edaflow
# Verify installation
message = edaflow.hello()
print(message) # Output: "Hello from edaflow! Ready for exploratory data analysis."
```
### Missing Data Analysis with `check_null_columns`
The `check_null_columns` function provides a color-coded analysis of missing data in your DataFrame:
```python
import pandas as pd
import edaflow
# Create sample data with missing values
df = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
'age': [25, None, 35, None, 45],
'email': [None, None, None, None, None], # All missing
'purchase_amount': [100.5, 250.0, 75.25, None, 320.0]
})
# Analyze missing data with default threshold (10%)
styled_result = edaflow.check_null_columns(df)
styled_result # Display in Jupyter notebook for color-coded styling
# Use custom threshold (20%) to change color coding sensitivity
styled_result = edaflow.check_null_columns(df, threshold=20)
styled_result
# Access underlying data if needed
data = styled_result.data
print(data)
```
**Color Coding:**
- 🔴 **Red**: > 20% missing (high concern)
- 🟡 **Yellow**: 10-20% missing (medium concern)
- 🟨 **Light Yellow**: 1-10% missing (low concern)
- ⬜ **Gray**: 0% missing (no issues)
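
The bands above can be expressed as a small lookup. The sketch below is illustrative only: `null_color` is a hypothetical helper, it hard-codes the bands listed for the default settings, and its handling of the exact boundaries (10% counts as yellow, anything above 0% counts as at least light yellow) is an assumption. It does not model how a custom `threshold` shifts the bands.

```python
# Hypothetical helper; boundary handling is an assumption, not edaflow's code.
def null_color(null_pct):
    """Map a column's missing-value percentage to the color band listed above."""
    if null_pct > 20:
        return "red"
    if null_pct >= 10:
        return "yellow"
    if null_pct > 0:
        return "light yellow"
    return "gray"


# Same columns as the sample DataFrame, written as plain lists.
columns = {
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", None, "Diana", "Eve"],
    "age": [25, None, 35, None, 45],
    "email": [None, None, None, None, None],
    "purchase_amount": [100.5, 250.0, 75.25, None, 320.0],
}

bands = {
    name: null_color(100 * sum(value is None for value in values) / len(values))
    for name, values in columns.items()
}
print(bands)
```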
### Categorical Data Analysis with `analyze_categorical_columns`
The `analyze_categorical_columns` function helps identify data type issues and provides insights into object-type columns:
```python
import pandas as pd
import edaflow
# Create sample data with mixed categorical types
df = pd.DataFrame({
'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
'price_str': ['999', '25', '75', '450'], # Numbers stored as strings
'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics'],
'rating': [4.5, 3.8, 4.2, 4.7], # Already numeric
'mixed_ids': ['001', '002', 'ABC', '004'], # Mixed format
'status': ['active', 'inactive', 'active', 'pending']
})
# Analyze categorical columns with default threshold (35%)
edaflow.analyze_categorical_columns(df)
# Use custom threshold (50%) to be more lenient about mixed data
edaflow.analyze_categorical_columns(df, threshold=50)
```
**Output Interpretation:**
- 🔴🔵 **Highlighted in Red/Blue**: Potentially numeric columns that might need conversion
- 🟡⚫ **Highlighted in Yellow/Black**: Shows unique values for potential numeric columns
- **Regular text**: Truly categorical columns with statistics
- **"not an object column"**: Already properly typed numeric columns
### Data Type Conversion with `convert_to_numeric`
After analyzing your categorical columns, you can automatically convert appropriate columns to numeric:
```python
import pandas as pd
import edaflow
# Create sample data with string numbers
df = pd.DataFrame({
'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
'price_str': ['999', '25', '75', '450'], # Should convert
'mixed_ids': ['001', '002', 'ABC', '004'], # Mixed data
'category': ['Electronics', 'Accessories', 'Electronics', 'Electronics']
})
# Convert appropriate columns to numeric (threshold=35% by default)
df_converted = edaflow.convert_to_numeric(df, threshold=35)
# Or modify the original DataFrame in place
edaflow.convert_to_numeric(df, threshold=35, inplace=True)
# Use a stricter threshold (only convert if <20% non-numeric values)
df_strict = edaflow.convert_to_numeric(df, threshold=20)
```
**Function Features:**
- ✅ **Smart Detection**: Only converts columns with few non-numeric values
- ✅ **Customizable Threshold**: Control conversion sensitivity
- ✅ **Safe Conversion**: Non-numeric values become NaN (not errors)
- ✅ **Inplace Option**: Modify original DataFrame or create new one
- ✅ **Detailed Output**: Shows exactly what was converted and why
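
The threshold rule can be illustrated on a single column in plain Python. This is a sketch under stated assumptions, not edaflow's implementation: `to_number` and `convert_if_mostly_numeric` are hypothetical helpers, lists stand in for pandas columns, and a column is converted only when its share of non-numeric entries is strictly below the threshold.

```python
import math


# Hypothetical helpers; edaflow itself operates on pandas columns.
def to_number(text):
    """Parse text as a float, returning NaN when it is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return math.nan


def convert_if_mostly_numeric(values, threshold=35):
    """Convert when the share of non-numeric entries is below `threshold` percent."""
    parsed = [to_number(value) for value in values]
    non_numeric_pct = 100 * sum(math.isnan(p) for p in parsed) / len(values)
    return parsed if non_numeric_pct < threshold else list(values)


mixed_ids = ["001", "002", "ABC", "004"]  # 25% non-numeric
converted = convert_if_mostly_numeric(mixed_ids, threshold=35)  # 25 < 35: converted, 'ABC' becomes NaN
unchanged = convert_if_mostly_numeric(mixed_ids, threshold=20)  # 25 >= 20: left as strings
```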
### Categorical Data Visualization with `visualize_categorical_values`
After cleaning your data, explore categorical columns in detail to understand value distributions:
```python
import pandas as pd
import edaflow
# Example DataFrame with categorical data
df = pd.DataFrame({
'department': ['Sales', 'Marketing', 'Sales', 'HR', 'Marketing', 'Sales', 'IT'],
'status': ['Active', 'Inactive', 'Active', 'Pending', 'Active', 'Active', 'Inactive'],
'priority': ['High', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low'],
'employee_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007], # Numeric (ignored)
'salary': [50000, 60000, 70000, 45000, 58000, 62000, 70000] # Numeric (ignored)
})
# Visualize all categorical columns
edaflow.visualize_categorical_values(df)
```
**Advanced Usage Examples:**
```python
# Handle high-cardinality data (many unique values)
large_df = pd.DataFrame({
'product_id': [f'PROD_{i:04d}' for i in range(100)], # 100 unique values
'category': ['Electronics'] * 40 + ['Clothing'] * 35 + ['Books'] * 25,
'status': ['Available'] * 80 + ['Out of Stock'] * 15 + ['Discontinued'] * 5
})
# Limit display for high-cardinality columns
edaflow.visualize_categorical_values(large_df, max_unique_values=5)
```
```python
# DataFrame with missing values for comprehensive analysis
df_with_nulls = pd.DataFrame({
'region': ['North', 'South', None, 'East', 'West', 'North', None],
'customer_type': ['Premium', 'Standard', 'Premium', None, 'Standard', 'Premium', 'Standard'],
'transaction_id': [f'TXN_{i}' for i in range(7)], # Mostly unique (ID-like)
})
# Get detailed insights including missing value analysis
edaflow.visualize_categorical_values(df_with_nulls)
```
### Column Type Classification with `display_column_types`
The `display_column_types` function provides a simple way to categorize DataFrame columns into categorical and numerical types:
```python
import pandas as pd
import edaflow
# Create sample data with mixed types
data = {
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['NYC', 'LA', 'Chicago'],
'salary': [50000, 60000, 70000],
'is_active': [True, False, True]
}
df = pd.DataFrame(data)
# Display column type classification
result = edaflow.display_column_types(df)
# Access the categorized column lists
categorical_cols = result['categorical'] # ['name', 'city']
numerical_cols = result['numerical'] # ['age', 'salary', 'is_active']
```
**Example Output:**
```
Column Type Analysis
==================================================
Categorical Columns (2 total):
1. name (unique values: 3)
2. city (unique values: 3)
🔢 Numerical Columns (3 total):
1. age (dtype: int64)
2. salary (dtype: int64)
3. is_active (dtype: bool)
Summary:
Total columns: 5
Categorical: 2 (40.0%)
Numerical: 3 (60.0%)
```
**Function Features:**
- **Simple Classification**: Separates columns into categorical (object dtype) and numerical (all other dtypes)
- **Detailed Information**: Shows unique value counts for categorical columns and data types for numerical columns
- **Summary Statistics**: Provides percentage breakdown of column types
- 🎯 **Return Values**: Returns dictionary with categorized column lists for programmatic use
- ⚡ **Fast Processing**: Efficient classification based on pandas data types
- 🛡️ **Error Handling**: Validates input and handles edge cases like empty DataFrames
### Data Imputation with `impute_numerical_median` and `impute_categorical_mode`
After analyzing your data, you often need to handle missing values. The edaflow package provides two specialized imputation functions for this purpose:
#### Numerical Imputation with `impute_numerical_median`
The `impute_numerical_median` function fills missing values in numerical columns using the median value:
```python
import pandas as pd
import edaflow
# Create sample data with missing numerical values
df = pd.DataFrame({
'age': [25, None, 35, None, 45],
'salary': [50000, 60000, None, 70000, None],
'score': [85.5, None, 92.0, 88.5, None],
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
})
# Impute all numerical columns with median values
df_imputed = edaflow.impute_numerical_median(df)
# Impute specific columns only
df_imputed = edaflow.impute_numerical_median(df, columns=['age', 'salary'])
# Impute in place (modifies original DataFrame)
edaflow.impute_numerical_median(df, inplace=True)
```
**Function Features:**
- 🔢 **Smart Detection**: Automatically identifies numerical columns (int, float, etc.)
- **Median Imputation**: Uses median values which are robust to outliers
- 🎯 **Selective Imputation**: Option to specify which columns to impute
- **Inplace Option**: Modify original DataFrame or create new one
- 🛡️ **Safe Handling**: Gracefully handles edge cases like all-missing columns
- **Detailed Reporting**: Shows exactly what was imputed and summary statistics
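
For readers who want to see the underlying idea, median imputation on a single column can be sketched with the standard library. `impute_median` is a hypothetical helper operating on a plain list with `None` for missing values; it is not edaflow's implementation, and leaving an all-missing column untouched is an assumption about the "safe handling" behavior.

```python
from statistics import median


# Hypothetical helper working on a plain list; None marks a missing value.
def impute_median(values):
    """Fill missing entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # all missing: no median to compute, leave as-is
    fill = median(observed)
    return [fill if v is None else v for v in values]


age = [25, None, 35, None, 45]
print(impute_median(age))  # [25, 35, 35, 35, 45]
```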
#### Categorical Imputation with `impute_categorical_mode`
The `impute_categorical_mode` function fills missing values in categorical columns using the mode (most frequent value):
```python
import pandas as pd
import edaflow
# Create sample data with missing categorical values
df = pd.DataFrame({
'category': ['A', 'B', 'A', None, 'A'],
'status': ['Active', None, 'Active', 'Inactive', None],
'priority': ['High', 'Medium', None, 'Low', 'High'],
'age': [25, 30, 35, 40, 45]
})
# Impute all categorical columns with mode values
df_imputed = edaflow.impute_categorical_mode(df)
# Impute specific columns only
df_imputed = edaflow.impute_categorical_mode(df, columns=['category', 'status'])
# Impute in place (modifies original DataFrame)
edaflow.impute_categorical_mode(df, inplace=True)
```
**Function Features:**
- **Smart Detection**: Automatically identifies categorical (object) columns
- 🎯 **Mode Imputation**: Uses most frequent value for each column
- ⚖️ **Tie Handling**: Gracefully handles mode ties (multiple values with same frequency)
- **Inplace Option**: Modify original DataFrame or create new one
- 🛡️ **Safe Handling**: Gracefully handles edge cases like all-missing columns
- **Detailed Reporting**: Shows exactly what was imputed and mode tie warnings
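
Mode imputation with explicit tie handling can be sketched the same way. `impute_mode` is a hypothetical helper, not edaflow's code, and breaking ties by taking the first candidate in sorted order is an assumption chosen to keep the result deterministic.

```python
from collections import Counter


# Hypothetical helper working on a plain list; None marks a missing value.
def impute_mode(values):
    """Fill missing entries with the most frequent value; also return tied candidates."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values), []  # all missing: nothing to impute with
    counts = Counter(observed)
    top_count = max(counts.values())
    tied = sorted(v for v, count in counts.items() if count == top_count)
    fill = tied[0]  # assumption: break ties by sorted order
    return [fill if v is None else v for v in values], tied


priority = ["High", "Medium", None, "Low", "High"]
filled, candidates = impute_mode(priority)            # clear winner: 'High'
tie_filled, tie_candidates = impute_mode(["A", "B", None])  # 'A' and 'B' tie
```

Returning the tied candidates lets the caller warn the user when the chosen fill value was not a unique mode.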
#### Complete Imputation Workflow Example
```python
import pandas as pd
import edaflow
# Sample data with both numerical and categorical missing values
df = pd.DataFrame({
'age': [25, None, 35, None, 45],
'salary': [50000, None, 70000, 80000, None],
'category': ['A', 'B', None, 'A', None],
'status': ['Active', None, 'Active', 'Inactive', None],
'score': [85.5, 92.0, None, 88.5, None]
})
print("Original DataFrame:")
print(df)
print("\n" + "="*50)
# Step 1: Impute numerical columns
print("STEP 1: Numerical Imputation")
df_step1 = edaflow.impute_numerical_median(df)
# Step 2: Impute categorical columns
print("\nSTEP 2: Categorical Imputation")
df_final = edaflow.impute_categorical_mode(df_step1)
print("\nFinal DataFrame (all missing values imputed):")
print(df_final)
# Verify no missing values remain
print(f"\nMissing values remaining: {df_final.isnull().sum().sum()}")
```
**Expected Output:**
```
🔢 Numerical Missing Value Imputation (Median)
=======================================================
age - Imputed 2 values with median: 35.0
salary - Imputed 2 values with median: 70000.0
score - Imputed 2 values with median: 88.5
Imputation Summary:
Columns processed: 3
Columns imputed: 3
Total values imputed: 6
Categorical Missing Value Imputation (Mode)
=======================================================
category - Imputed 2 values with mode: 'A'
status - Imputed 2 values with mode: 'Active'
Imputation Summary:
Columns processed: 2
Columns imputed: 2
Total values imputed: 4
```
### Numerical Distribution Analysis with `visualize_numerical_boxplots`
Analyze numerical columns to detect outliers, understand distributions, and assess skewness:
```python
import pandas as pd
import edaflow
# Create sample dataset with outliers
df = pd.DataFrame({
'age': [25, 30, 35, 40, 45, 28, 32, 38, 42, 100], # 100 is an outlier
'salary': [50000, 60000, 75000, 80000, 90000, 55000, 65000, 70000, 85000, 250000], # 250000 is outlier
'experience': [2, 5, 8, 12, 15, 3, 6, 9, 13, 30], # 30 might be an outlier
'score': [85, 92, 78, 88, 95, 82, 89, 91, 86, 20], # 20 is an outlier
'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'] # Non-numerical
})
# Basic boxplot analysis
edaflow.visualize_numerical_boxplots(
df,
title="Employee Data Analysis - Outlier Detection",
show_skewness=True
)
# Custom layout and specific columns
edaflow.visualize_numerical_boxplots(
df,
columns=['age', 'salary'],
rows=1,
cols=2,
title="Age vs Salary Analysis",
orientation='vertical',
color_palette='viridis'
)
```
**Expected Output:**
```
Creating boxplots for 4 numerical column(s): age, salary, experience, score
Summary Statistics:
==================================================
age:
Range: 25.00 to 100.00
Median: 36.50
IQR: 11.00 (Q1: 30.50, Q3: 41.50)
Skewness: 2.66 (highly skewed)
Outliers: 1 values outside [14.00, 58.00]
Outlier values: [100]
salary:
Range: 50000.00 to 250000.00
Median: 72500.00
IQR: 22500.00 (Q1: 61250.00, Q3: 83750.00)
Skewness: 2.88 (highly skewed)
Outliers: 1 values outside [27500.00, 117500.00]
Outlier values: [250000]
experience:
Range: 2.00 to 30.00
Median: 8.50
IQR: 7.50 (Q1: 5.25, Q3: 12.75)
Skewness: 1.69 (highly skewed)
Outliers: 1 values outside [-6.00, 24.00]
Outlier values: [30]
score:
Range: 20.00 to 95.00
Median: 87.00
IQR: 7.75 (Q1: 82.75, Q3: 90.50)
Skewness: -2.87 (highly skewed)
Outliers: 1 values outside [71.12, 102.12]
Outlier values: [20]
```
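The outlier bounds in the output above follow the usual 1.5 × IQR rule. As a rough sketch (plain Python, not edaflow's own code), the `age` figures can be reproduced with the standard library; the choice of `method="inclusive"` is an assumption that happens to match the quartiles shown for this sample.

```python
from statistics import median, quantiles

ages = [25, 30, 35, 40, 45, 28, 32, 38, 42, 100]

# "inclusive" interpolates linearly between data points,
# which reproduces the Q1/Q3 values listed above for this column.
q1, _, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [v for v in ages if v < lower or v > upper]

print(f"Median: {median(ages):.2f}")
print(f"IQR: {iqr:.2f} (Q1: {q1:.2f}, Q3: {q3:.2f})")
print(f"Outliers: {len(outliers)} values outside [{lower:.2f}, {upper:.2f}]")
print(f"Outlier values: {outliers}")
```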
### Complete EDA Workflow Example
```python
import edaflow
import pandas as pd
# Test the installation
print(edaflow.hello())
# Load your data
df = pd.read_csv('your_data.csv')
# Complete EDA workflow with all core functions:
# 1. Analyze missing data with styled output
null_analysis = edaflow.check_null_columns(df, threshold=10)
# 2. Analyze categorical columns to identify data type issues
edaflow.analyze_categorical_columns(df, threshold=35)
# 3. Convert appropriate object columns to numeric automatically
df_cleaned = edaflow.convert_to_numeric(df, threshold=35)
# 4. Visualize categorical column values
edaflow.visualize_categorical_values(df_cleaned)
# 5. Display column type classification
edaflow.display_column_types(df_cleaned)
# 6. Impute missing values
df_numeric_imputed = edaflow.impute_numerical_median(df_cleaned)
df_fully_imputed = edaflow.impute_categorical_mode(df_numeric_imputed)
# 7. Statistical distribution analysis with advanced insights
edaflow.visualize_histograms(df_fully_imputed, kde=True, show_normal_curve=True)
# 8. Comprehensive relationship analysis
edaflow.visualize_heatmap(df_fully_imputed, heatmap_type='correlation')
edaflow.visualize_scatter_matrix(df_fully_imputed, show_regression=True)
# 9. Generate comprehensive EDA insights and recommendations
insights = edaflow.summarize_eda_insights(df_fully_imputed, target_column='your_target_col')
print(insights) # View insights dictionary
# 10. Outlier detection and visualization
edaflow.visualize_numerical_boxplots(df_fully_imputed, show_skewness=True)
edaflow.visualize_interactive_boxplots(df_fully_imputed)
# 11. Advanced heatmap analysis
edaflow.visualize_heatmap(df_fully_imputed, heatmap_type='missing')
edaflow.visualize_heatmap(df_fully_imputed, heatmap_type='values')
# 12. Final data cleaning with outlier handling
df_final = edaflow.handle_outliers_median(df_fully_imputed, method='iqr', verbose=True)
# 13. Results verification
edaflow.visualize_scatter_matrix(df_final, title="Clean Data Relationships")
edaflow.visualize_numerical_boxplots(df_final, title="Final Clean Distribution")
```
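The outlier-handling step in the workflow replaces extreme values with the median. The sketch below shows one way such a replacement could work in plain Python. The helper `replace_outliers_with_median` is hypothetical, and using the median of the full column (outliers included) is an assumption, not a statement about how `handle_outliers_median` is implemented.

```python
from statistics import median, quantiles


def replace_outliers_with_median(values, factor=1.5):
    """Return a copy where values outside the IQR fences become the median."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    center = median(values)  # assumption: median of the full column
    return [center if (v < lower or v > upper) else v for v in values]


scores = [85, 92, 78, 88, 95, 82, 89, 91, 86, 20]
cleaned = replace_outliers_with_median(scores)
print(cleaned)
```

Replacing rather than dropping keeps the row count unchanged, which matters when other columns in the same rows are still valid.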
### 🤖 **Complete ML Workflow** ⭐ *Enhanced in v0.14.0*
```python
import edaflow.ml as ml
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Continue from cleaned data above...
df_final['target'] = your_target_data # Add your target column
# 1. Setup ML experiment ⭐ NEW: Enhanced parameters in v0.14.0
experiment = ml.setup_ml_experiment(
    df_final, 'target',
    test_size=0.2,  # Test set: 20%
    val_size=0.15,  # ⭐ NEW: Validation set: 15%
    experiment_name="production_ml_pipeline",  # ⭐ NEW: Experiment tracking
    random_state=42,
    stratify=True
)
# Alternative: sklearn-style calling (also enhanced)
# X = df_final.drop('target', axis=1)
# y = df_final['target']
# experiment = ml.setup_ml_experiment(X=X, y=y, val_size=0.15, experiment_name="sklearn_workflow")
print(f"Training: {len(experiment['X_train'])}, Validation: {len(experiment['X_val'])}, Test: {len(experiment['X_test'])}")
# 2. Compare multiple models ⭐ Enhanced with validation set support
models = {
    'RandomForest': RandomForestClassifier(random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
    'LogisticRegression': LogisticRegression(random_state=42),
    'SVM': SVC(random_state=42, probability=True)
}
# Fit all models before comparing them
for name, model in models.items():
    model.fit(experiment['X_train'], experiment['y_train'])
# ⭐ Enhanced compare_models with experiment_config support
comparison = ml.compare_models(
    models=models,
    experiment_config=experiment,  # ⭐ NEW: Automatically uses validation set
    verbose=True
)
print(comparison) # Professional styled output
# ⭐ Enhanced rank_models with flexible return formats
# Quick access to best model (list format - NEW)
best_model = ml.rank_models(comparison, 'accuracy', return_format='list')[0]['model_name']
print(f"Best model: {best_model}")
# Detailed ranking analysis (DataFrame format - traditional)
ranked_models = ml.rank_models(comparison, 'accuracy')
print("Top 3 models:")
print(ranked_models.head(3)[['model', 'accuracy', 'f1', 'rank']])
# Advanced: Multi-metric weighted ranking
weighted_ranking = ml.rank_models(
    comparison,
    'accuracy',
    weights={'accuracy': 0.4, 'f1': 0.3, 'precision': 0.3},
    return_format='list'
)
print(f"🎯 Best by weighted score: {weighted_ranking[0]['model_name']}")
# 3. Hyperparameter optimization ⭐ Enhanced with validation set
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}
best_results = ml.optimize_hyperparameters(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    X_train=experiment['X_train'],
    y_train=experiment['y_train'],
    method='grid_search',
    cv=5
)
# 4. Generate comprehensive performance visualizations
ml.plot_learning_curves(
    best_results['best_model'],
    X_train=experiment['X_train'], y_train=experiment['y_train']
)
ml.plot_roc_curves(
    {'optimized_model': best_results['best_model']},
    X_test=experiment['X_test'], y_test=experiment['y_test']
)
ml.plot_feature_importance(
    best_results['best_model'],
    feature_names=experiment['feature_names']
)
# 5. Save complete model artifacts with experiment tracking
ml.save_model_artifacts(
    model=best_results['best_model'],
    model_name=f"{experiment['experiment_name']}_optimized_model",  # ⭐ NEW: Uses experiment name
    experiment_config=experiment,
    performance_metrics={
        'cv_score': best_results['best_score'],
        'test_score': best_results['best_model'].score(experiment['X_test'], experiment['y_test']),
        'model_type': 'RandomForestClassifier'
    },
    metadata={
        'experiment_name': experiment['experiment_name'],  # ⭐ NEW: Experiment tracking
        'data_shape': df_final.shape,
        'feature_count': len(experiment['feature_names'])
    }
)
print(f"✅ Complete ML pipeline finished! Experiment: {experiment['experiment_name']}")
```
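To make the `test_size=0.2` / `val_size=0.15` settings concrete, here is a small plain-Python sketch of a three-way split. It assumes both sizes are fractions of the full dataset and it does not stratify; `three_way_split` is a hypothetical helper and may not match how `setup_ml_experiment` divides the data internally.

```python
import random


def three_way_split(n_rows, test_size=0.2, val_size=0.15, random_state=42):
    """Shuffle row indices and cut them into train/validation/test lists."""
    indices = list(range(n_rows))
    random.Random(random_state).shuffle(indices)  # seeded for reproducibility

    n_test = round(n_rows * test_size)
    n_val = round(n_rows * val_size)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(1000)
print(f"Training: {len(train_idx)}, Validation: {len(val_idx)}, Test: {len(test_idx)}")
```

Keeping the validation rows separate from the test rows lets model comparison and tuning use the validation set while the test set stays untouched until the final evaluation.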
### 🤖 **ML Preprocessing with Smart Encoding** ⭐ *Introduced in v0.12.0*
```python
import edaflow
import pandas as pd
# Load your data
df = pd.read_csv('your_data.csv')
# Step 1: Analyze encoding needs (with or without target)
encoding_analysis = edaflow.analyze_encoding_needs(
    df,
    target_column=None,  # Optional: specify target if you have one
    max_cardinality_onehot=15,  # Optional: max categories for one-hot encoding
    max_cardinality_target=50,  # Optional: max categories for target encoding
    ordinal_columns=None  # Optional: specify ordinal columns if known
)
# Step 2: Apply intelligent encoding transformations
df_encoded = edaflow.apply_smart_encoding(
    df,  # Use your full dataset (or df.drop('target_col', axis=1) if needed)
    encoding_analysis=encoding_analysis,  # Optional: use previous analysis
    handle_unknown='ignore'  # Optional: how to handle unknown categories
)
# The encoding pipeline automatically:
# ✅ One-hot encodes low cardinality categoricals
# ✅ Target encodes high cardinality with target correlation
# ✅ Binary encodes medium cardinality features
# ✅ TF-IDF vectorizes text columns
# ✅ Preserves numeric columns unchanged
# ✅ Handles memory efficiently for large datasets
print(f"Shape transformation: {df.shape} → {df_encoded.shape}")
print(f"Encoding methods applied: {len(encoding_analysis['encoding_methods'])} different strategies")
```
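The "shape transformation" printed above is easiest to picture with one-hot encoding, where a single categorical column becomes one indicator column per category. The following plain-Python sketch illustrates the idea only; the `one_hot` helper and the `color` column are made up for this example and are not part of edaflow.

```python
def one_hot(values, prefix):
    """Turn one categorical column into a list of indicator-column rows."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Hypothetical low-cardinality column with three distinct values
colors = ["red", "green", "red", "blue"]
encoded = one_hot(colors, prefix="color")

for row in encoded:
    print(row)

n_rows, n_cols = len(encoded), len(encoded[0])
print(f"Shape transformation: ({len(colors)}, 1) -> ({n_rows}, {n_cols})")
```

Because the column count grows with the number of distinct values, one-hot encoding is usually reserved for low-cardinality columns, which is the role the `max_cardinality_onehot` threshold plays in the analysis step.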
## Project Structure
```
edaflow/
├── edaflow/
│   ├── __init__.py
│   ├── analysis/
│   ├── visualization/
│   └── preprocessing/
├── tests/
├── docs/
├── examples/
├── setup.py
├── requirements.txt
├── README.md
└── LICENSE
```
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/new-feature`)
3. Commit your changes (`git commit -m 'Add new feature'`)
4. Push to the branch (`git push origin feature/new-feature`)
5. Open a Pull Request
## Development
### Setup Development Environment
```bash
# Clone the repository
git clone https://github.com/evanlow/edaflow.git
cd edaflow
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
flake8 edaflow/
black edaflow/
isort edaflow/
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Changelog
> **Latest Updates**: This changelog covers recent releases, including the v0.12.32 critical input validation fix, the v0.12.31 KeyError hotfix, and the v0.12.30 universal display optimization.
### v0.12.32 (2025-08-11) - Critical Input Validation Fix 🐛
- **CRITICAL**: Fixed AttributeError: 'tuple' object has no attribute 'empty' in visualization functions
- **ROOT CAUSE**: Users passing tuple result from `apply_smart_encoding(..., return_encoders=True)` directly to visualization functions
- **ENHANCED**: Added intelligent input validation with helpful error messages for common usage mistakes
- **IMPROVED**: Better error handling in `visualize_scatter_matrix` and other visualization functions
- **DOCUMENTED**: Clear examples showing correct vs incorrect usage patterns for `apply_smart_encoding`
- **STABILITY**: Prevents crashes in step 14 of EDA workflows when encoding functions are misused
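The kind of input validation described in this release can be sketched in a few lines. This is a hedged illustration, not the code shipped in edaflow: `validate_dataframe_input` is a hypothetical helper, and plain dictionaries stand in for the DataFrame and encoders.

```python
def validate_dataframe_input(data, func_name="visualize_scatter_matrix"):
    """Raise a helpful TypeError when a (df, encoders) tuple is passed by mistake."""
    if isinstance(data, tuple):
        raise TypeError(
            f"{func_name}() received a tuple, not a DataFrame. "
            "If it came from apply_smart_encoding(..., return_encoders=True), "
            "unpack it first: df_encoded, encoders = ..."
        )
    return data


# Stand-in for the (DataFrame, encoders) tuple returned with return_encoders=True
result = ({"col": [1, 2]}, {"col": "encoder"})

message = ""
try:
    validate_dataframe_input(result)  # wrong: tuple passed directly
except TypeError as exc:
    message = str(exc)
    print(message)

df_like, encoders = result  # correct: unpack first
checked = validate_dataframe_input(df_like)
```

Failing early with a message that names the fix is friendlier than letting the call reach `.empty` and crash with an `AttributeError`.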
### v0.12.31 (2025-01-05) - Critical KeyError Hotfix 🚨
- **CRITICAL**: Fixed KeyError: 'type' in `summarize_eda_insights()` function during Google Colab usage
- **RESOLVED**: Exception handling when target analysis dictionary missing expected keys
- **IMPROVED**: Enhanced error handling with safe dictionary access using `.get()` method
- **MAINTAINED**: All existing functionality preserved - pure stability fix
- **TESTED**: Verified fix works across all notebook platforms (Colab, JupyterLab, VS Code)
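Safe dictionary access with `.get()` is the pattern behind this fix. Below is a minimal sketch of the idea; the `describe_target` helper and the `n_classes` key are hypothetical and only serve to show how a missing `'type'` key no longer raises.

```python
def describe_target(target_analysis):
    """Build a one-line summary without assuming any key is present."""
    target_type = target_analysis.get("type", "unknown")
    n_classes = target_analysis.get("n_classes")  # hypothetical key
    if n_classes is None:
        return f"target type: {target_type}"
    return f"target type: {target_type} ({n_classes} classes)"


full = describe_target({"type": "classification", "n_classes": 3})
empty = describe_target({})  # no KeyError even though 'type' is missing

print(full)
print(empty)
```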
### v0.12.30 (2025-01-05) - Universal Display Optimization Breakthrough 🎨
- **BREAKTHROUGH**: Introduced `optimize_display()` function for universal notebook compatibility
- **REVOLUTIONARY**: Automatic platform detection (Google Colab, JupyterLab, VS Code Notebooks, Classic Jupyter)
- **ENHANCED**: Dynamic CSS injection for perfect dark/light mode visibility across all platforms
- **NEW FEATURE**: Automatic matplotlib backend optimization for each notebook environment
- **ACCESSIBILITY**: Solves visibility issues in dark mode themes universally
- **SEAMLESS**: Zero configuration required - automatically detects and optimizes for your platform
- **COMPATIBILITY**: Works flawlessly across Google Colab, JupyterLab, VS Code, Classic Jupyter
- **EXAMPLE**: Simple usage: `from edaflow import optimize_display; optimize_display()`
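Platform detection of this kind usually relies on markers that each environment leaves behind. The sketch below uses three commonly seen heuristics (a loaded `google.colab` module, the `VSCODE_PID` environment variable, and the `JPY_PARENT_PID` environment variable). These checks are assumptions for illustration; they may misfire in some setups, and they are not necessarily what `optimize_display()` does internally.

```python
import os
import sys


def detect_notebook_platform():
    """Guess the notebook environment from well-known markers (heuristic)."""
    if "google.colab" in sys.modules:
        return "colab"  # Colab preloads the google.colab package
    if "VSCODE_PID" in os.environ:
        return "vscode"  # often set for processes started by VS Code
    if "JPY_PARENT_PID" in os.environ:
        return "jupyter"  # often set for kernels started by Jupyter
    return "unknown"


platform_name = detect_notebook_platform()
print(f"Detected platform: {platform_name}")
```

The order of the checks matters: the more specific markers are tested first so that a generic Jupyter marker does not mask them.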
### v0.12.3 (2025-08-06) - Complete Positional Argument Compatibility Fix 🔧
- **CRITICAL**: Fixed positional argument usage for `visualize_image_classes()` function
- **RESOLVED**: TypeError when calling `visualize_image_classes(image_paths, ...)` with positional arguments
- **ENHANCED**: Comprehensive backward compatibility supporting all three usage patterns:
  - Positional: `visualize_image_classes(path, ...)` (shows warning)
  - Deprecated keyword: `visualize_image_classes(image_paths=path, ...)` (shows warning)
  - Recommended: `visualize_image_classes(data_source=path, ...)` (no warning)
- **IMPROVED**: Clear deprecation warnings guiding users toward recommended syntax
- **SECURE**: Prevents using both parameters simultaneously to avoid confusion
- **RESOLVED**: TypeError for users calling with `image_paths=` parameter from v0.12.0 breaking change
- **ENHANCED**: Improved error messages for parameter validation in image visualization functions
- **DOCUMENTATION**: Added comprehensive parameter documentation including deprecation notices
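The three calling styles listed for this release can be supported with a small compatibility shim. The sketch below is an assumption about how such a shim might look, not edaflow's source: `resolve_data_source` is a hypothetical helper, and the use of `DeprecationWarning` as the warning category is also assumed.

```python
import warnings


def resolve_data_source(*args, data_source=None, image_paths=None):
    """Accept positional, deprecated-keyword, and recommended calling styles."""
    if data_source is not None and image_paths is not None:
        raise ValueError("Use either data_source or image_paths, not both.")
    if args:
        if len(args) > 1 or data_source is not None or image_paths is not None:
            raise TypeError("Pass the data source once, preferably as data_source=...")
        warnings.warn(
            "Positional use is deprecated; pass data_source=... instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return args[0]
    if image_paths is not None:
        warnings.warn(
            "image_paths is deprecated; use data_source=... instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return image_paths
    if data_source is None:
        raise TypeError("data_source is required.")
    return data_source


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    positional = resolve_data_source("images/")
    deprecated = resolve_data_source(image_paths="images/")
    recommended = resolve_data_source(data_source="images/")

print(f"Warnings raised: {len(caught)}")
```

Rejecting calls that supply both parameters avoids silently picking one, which is the ambiguity the release notes describe as a source of confusion.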
### v0.12.2 (2025-08-06) - Documentation Refresh
- **IMPROVED**: Enhanced README.md with updated timestamps and current version indicators
- **FIXED**: Ensured PyPI displays the most current changelog information including v0.12.1 fixes
- **ENHANCED**: Added latest updates indicator to changelog for better visibility
- **DOCUMENTATION**: Forced PyPI cache refresh to display current version information
## ✨ What's New in v0.16.2
**New Features:**
- Faceted visualizations with `display_facet_grid`
- Feature scaling with `scale_features`
- Grouping rare categories with `group_rare_categories`
- Exporting figures with `export_figure`
**Documentation Updates:**
- User Guide, Advanced Features, and Best Practices now reference all new APIs
- Visualization Guide includes external library requirements and troubleshooting
- Changelog documents all new features and documentation changes
**External Library Requirements:**
Some advanced features require additional libraries:
- matplotlib
- seaborn
- scikit-learn
- statsmodels
- pandas
See the Visualization Guide for installation instructions and troubleshooting tips.
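A quick way to check which of the libraries listed above are available in the current environment is to look them up with the standard library. This is a convenience sketch rather than an edaflow feature; note that scikit-learn is imported under the module name `sklearn`.

```python
from importlib.util import find_spec

# Distribution name -> import name (scikit-learn is imported as "sklearn")
REQUIRED = {
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "statsmodels": "statsmodels",
    "pandas": "pandas",
}

# find_spec returns None when a top-level module cannot be located
missing = [dist for dist, module in REQUIRED.items() if find_spec(module) is None]

if missing:
    print("Missing libraries:", ", ".join(missing))
    print("Install with: pip install " + " ".join(missing))
else:
    print("All listed libraries are importable.")
```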
'pending']\r\n})\r\n\r\n# Analyze categorical columns with default threshold (35%)\r\nedaflow.analyze_categorical_columns(df)\r\n\r\n# Use custom threshold (50%) to be more lenient about mixed data\r\nedaflow.analyze_categorical_columns(df, threshold=50)\r\n```\r\n\r\n**Output Interpretation:**\r\n- \ud83d\udd34\ud83d\udd35 **Highlighted in Red/Blue**: Potentially numeric columns that might need conversion\r\n- \ud83d\udfe1\u26ab **Highlighted in Yellow/Black**: Shows unique values for potential numeric columns\r\n- **Regular text**: Truly categorical columns with statistics\r\n- **\"not an object column\"**: Already properly typed numeric columns\r\n\r\n### Data Type Conversion with `convert_to_numeric`\r\n\r\nAfter analyzing your categorical columns, you can automatically convert appropriate columns to numeric:\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Create sample data with string numbers\r\ndf = pd.DataFrame({\r\n 'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],\r\n 'price_str': ['999', '25', '75', '450'], # Should convert\r\n 'mixed_ids': ['001', '002', 'ABC', '004'], # Mixed data\r\n 'category': ['Electronics', 'Accessories', 'Electronics', 'Electronics']\r\n})\r\n\r\n# Convert appropriate columns to numeric (threshold=35% by default)\r\ndf_converted = edaflow.convert_to_numeric(df, threshold=35)\r\n\r\n# Or modify the original DataFrame in place\r\nedaflow.convert_to_numeric(df, threshold=35, inplace=True)\r\n\r\n# Use a stricter threshold (only convert if <20% non-numeric values)\r\ndf_strict = edaflow.convert_to_numeric(df, threshold=20)\r\n```\r\n\r\n**Function Features:**\r\n- \u2705 **Smart Detection**: Only converts columns with few non-numeric values\r\n- \u2705 **Customizable Threshold**: Control conversion sensitivity \r\n- \u2705 **Safe Conversion**: Non-numeric values become NaN (not errors)\r\n- \u2705 **Inplace Option**: Modify original DataFrame or create new one\r\n- \u2705 **Detailed Output**: Shows 
exactly what was converted and why\r\n\r\n### Categorical Data Visualization with `visualize_categorical_values`\r\n\r\nAfter cleaning your data, explore categorical columns in detail to understand value distributions:\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Example DataFrame with categorical data\r\ndf = pd.DataFrame({\r\n 'department': ['Sales', 'Marketing', 'Sales', 'HR', 'Marketing', 'Sales', 'IT'],\r\n 'status': ['Active', 'Inactive', 'Active', 'Pending', 'Active', 'Active', 'Inactive'],\r\n 'priority': ['High', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low'],\r\n 'employee_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007], # Numeric (ignored)\r\n 'salary': [50000, 60000, 70000, 45000, 58000, 62000, 70000] # Numeric (ignored)\r\n})\r\n\r\n# Visualize all categorical columns\r\nedaflow.visualize_categorical_values(df)\r\n```\r\n\r\n**Advanced Usage Examples:**\r\n\r\n```python\r\n# Handle high-cardinality data (many unique values)\r\nlarge_df = pd.DataFrame({\r\n 'product_id': [f'PROD_{i:04d}' for i in range(100)], # 100 unique values\r\n 'category': ['Electronics'] * 40 + ['Clothing'] * 35 + ['Books'] * 25,\r\n 'status': ['Available'] * 80 + ['Out of Stock'] * 15 + ['Discontinued'] * 5\r\n})\r\n\r\n# Limit display for high-cardinality columns\r\nedaflow.visualize_categorical_values(large_df, max_unique_values=5)\r\n```\r\n\r\n```python\r\n# DataFrame with missing values for comprehensive analysis\r\ndf_with_nulls = pd.DataFrame({\r\n 'region': ['North', 'South', None, 'East', 'West', 'North', None],\r\n 'customer_type': ['Premium', 'Standard', 'Premium', None, 'Standard', 'Premium', 'Standard'],\r\n 'transaction_id': [f'TXN_{i}' for i in range(7)], # Mostly unique (ID-like)\r\n})\r\n\r\n# Get detailed insights including missing value analysis\r\nedaflow.visualize_categorical_values(df_with_nulls)\r\n```\r\n\r\n**Function Features:**\r\n- \ud83c\udfaf **Zero Breaking Changes**: All existing workflows continue working exactly the 
same\r\n- \ud83d\udee1\ufe0f **Better Error Messages**: Helpful guidance when mistakes are made \r\n- \ud83d\udd04 **Migration Path**: Multiple options for users who want cleaner APIs\r\n- \ud83d\udcda **Clear Documentation**: Explicit examples showing best practices\r\n\r\n### Column Type Classification with `display_column_types`\r\n\r\nThe `display_column_types` function provides a simple way to categorize DataFrame columns into categorical and numerical types:\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Create sample data with mixed types\r\ndata = {\r\n 'name': ['Alice', 'Bob', 'Charlie'],\r\n 'age': [25, 30, 35],\r\n 'city': ['NYC', 'LA', 'Chicago'],\r\n 'salary': [50000, 60000, 70000],\r\n 'is_active': [True, False, True]\r\n}\r\ndf = pd.DataFrame(data)\r\n\r\n# Display column type classification\r\nresult = edaflow.display_column_types(df)\r\n\r\n# Access the categorized column lists\r\ncategorical_cols = result['categorical'] # ['name', 'city']\r\nnumerical_cols = result['numerical'] # ['age', 'salary', 'is_active']\r\n```\r\n\r\n**Example Output:**\r\n```\r\n\ud83d\udcca Column Type Analysis\r\n==================================================\r\n\r\n\ud83d\udcdd Categorical Columns (2 total):\r\n 1. name (unique values: 3)\r\n 2. city (unique values: 3)\r\n\r\n\ud83d\udd22 Numerical Columns (3 total):\r\n 1. age (dtype: int64)\r\n 2. salary (dtype: int64)\r\n 3. 
is_active (dtype: bool)\r\n\r\n\ud83d\udcc8 Summary:\r\n Total columns: 5\r\n Categorical: 2 (40.0%)\r\n Numerical: 3 (60.0%)\r\n```\r\n\r\n**Function Features:**\r\n- \ud83d\udd0d **Simple Classification**: Separates columns into categorical (object dtype) and numerical (all other dtypes)\r\n- \ud83d\udcca **Detailed Information**: Shows unique value counts for categorical columns and data types for numerical columns\r\n- \ud83d\udcc8 **Summary Statistics**: Provides percentage breakdown of column types\r\n- \ud83c\udfaf **Return Values**: Returns dictionary with categorized column lists for programmatic use\r\n- \u26a1 **Fast Processing**: Efficient classification based on pandas data types\r\n- \ud83d\udee1\ufe0f **Error Handling**: Validates input and handles edge cases like empty DataFrames\r\n\r\n### Data Imputation with `impute_numerical_median` and `impute_categorical_mode`\r\n\r\nAfter analyzing your data, you often need to handle missing values. The edaflow package provides two specialized imputation functions for this purpose:\r\n\r\n#### Numerical Imputation with `impute_numerical_median`\r\n\r\nThe `impute_numerical_median` function fills missing values in numerical columns using the median value:\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Create sample data with missing numerical values\r\ndf = pd.DataFrame({\r\n 'age': [25, None, 35, None, 45],\r\n 'salary': [50000, 60000, None, 70000, None],\r\n 'score': [85.5, None, 92.0, 88.5, None],\r\n 'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']\r\n})\r\n\r\n# Impute all numerical columns with median values\r\ndf_imputed = edaflow.impute_numerical_median(df)\r\n\r\n# Impute specific columns only\r\ndf_imputed = edaflow.impute_numerical_median(df, columns=['age', 'salary'])\r\n\r\n# Impute in place (modifies original DataFrame)\r\nedaflow.impute_numerical_median(df, inplace=True)\r\n```\r\n\r\n**Function Features:**\r\n- \ud83d\udd22 **Smart Detection**: Automatically identifies 
numerical columns (int, float, etc.)\r\n- \ud83d\udcca **Median Imputation**: Uses median values which are robust to outliers\r\n- \ud83c\udfaf **Selective Imputation**: Option to specify which columns to impute\r\n- \ud83d\udd04 **Inplace Option**: Modify original DataFrame or create new one\r\n- \ud83d\udee1\ufe0f **Safe Handling**: Gracefully handles edge cases like all-missing columns\r\n- \ud83d\udccb **Detailed Reporting**: Shows exactly what was imputed and summary statistics\r\n\r\n#### Categorical Imputation with `impute_categorical_mode`\r\n\r\nThe `impute_categorical_mode` function fills missing values in categorical columns using the mode (most frequent value):\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Create sample data with missing categorical values\r\ndf = pd.DataFrame({\r\n 'category': ['A', 'B', 'A', None, 'A'],\r\n 'status': ['Active', None, 'Active', 'Inactive', None],\r\n 'priority': ['High', 'Medium', None, 'Low', 'High'],\r\n 'age': [25, 30, 35, 40, 45]\r\n})\r\n\r\n# Impute all categorical columns with mode values\r\ndf_imputed = edaflow.impute_categorical_mode(df)\r\n\r\n# Impute specific columns only\r\ndf_imputed = edaflow.impute_categorical_mode(df, columns=['category', 'status'])\r\n\r\n# Impute in place (modifies original DataFrame)\r\nedaflow.impute_categorical_mode(df, inplace=True)\r\n```\r\n\r\n**Function Features:**\r\n- \ud83d\udcdd **Smart Detection**: Automatically identifies categorical (object) columns\r\n- \ud83c\udfaf **Mode Imputation**: Uses most frequent value for each column\r\n- \u2696\ufe0f **Tie Handling**: Gracefully handles mode ties (multiple values with same frequency)\r\n- \ud83d\udd04 **Inplace Option**: Modify original DataFrame or create new one\r\n- \ud83d\udee1\ufe0f **Safe Handling**: Gracefully handles edge cases like all-missing columns\r\n- \ud83d\udccb **Detailed Reporting**: Shows exactly what was imputed and mode tie warnings\r\n\r\n#### Complete Imputation Workflow 
Example\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Sample data with both numerical and categorical missing values\r\ndf = pd.DataFrame({\r\n 'age': [25, None, 35, None, 45],\r\n 'salary': [50000, None, 70000, 80000, None],\r\n 'category': ['A', 'B', None, 'A', None],\r\n 'status': ['Active', None, 'Active', 'Inactive', None],\r\n 'score': [85.5, 92.0, None, 88.5, None]\r\n})\r\n\r\nprint(\"Original DataFrame:\")\r\nprint(df)\r\nprint(\"\\n\" + \"=\"*50)\r\n\r\n# Step 1: Impute numerical columns\r\nprint(\"STEP 1: Numerical Imputation\")\r\ndf_step1 = edaflow.impute_numerical_median(df)\r\n\r\n# Step 2: Impute categorical columns\r\nprint(\"\\nSTEP 2: Categorical Imputation\")\r\ndf_final = edaflow.impute_categorical_mode(df_step1)\r\n\r\nprint(\"\\nFinal DataFrame (all missing values imputed):\")\r\nprint(df_final)\r\n\r\n# Verify no missing values remain\r\nprint(f\"\\nMissing values remaining: {df_final.isnull().sum().sum()}\")\r\n```\r\n\r\n**Expected Output:**\r\n```\r\n\ud83d\udd22 Numerical Missing Value Imputation (Median)\r\n=======================================================\r\n\ud83d\udd04 age - Imputed 2 values with median: 35.0\r\n\ud83d\udd04 salary - Imputed 2 values with median: 70000.0\r\n\ud83d\udd04 score - Imputed 2 values with median: 88.5\r\n\r\n\ud83d\udcca Imputation Summary:\r\n Columns processed: 3\r\n Columns imputed: 3\r\n Total values imputed: 6\r\n\r\n\ud83d\udcdd Categorical Missing Value Imputation (Mode)\r\n=======================================================\r\n\ud83d\udd04 category - Imputed 2 values with mode: 'A'\r\n\ud83d\udd04 status - Imputed 2 values with mode: 'Active'\r\n\r\n\ud83d\udcca Imputation Summary:\r\n Columns processed: 2\r\n Columns imputed: 2\r\n Total values imputed: 4\r\n```\r\n\r\n### Numerical Distribution Analysis with `visualize_numerical_boxplots`\r\n\r\nAnalyze numerical columns to detect outliers, understand distributions, and assess skewness:\r\n\r\n```python\r\nimport 
pandas as pd\r\nimport edaflow\r\n\r\n# Create sample dataset with outliers\r\ndf = pd.DataFrame({\r\n 'age': [25, 30, 35, 40, 45, 28, 32, 38, 42, 100], # 100 is an outlier\r\n 'salary': [50000, 60000, 75000, 80000, 90000, 55000, 65000, 70000, 85000, 250000], # 250000 is outlier\r\n 'experience': [2, 5, 8, 12, 15, 3, 6, 9, 13, 30], # 30 might be an outlier\r\n 'score': [85, 92, 78, 88, 95, 82, 89, 91, 86, 20], # 20 is an outlier\r\n 'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'] # Non-numerical\r\n})\r\n\r\n# Basic boxplot analysis\r\nedaflow.visualize_numerical_boxplots(\r\n df, \r\n title=\"Employee Data Analysis - Outlier Detection\",\r\n show_skewness=True\r\n)\r\n\r\n# Custom layout and specific columns\r\nedaflow.visualize_numerical_boxplots(\r\n df, \r\n columns=['age', 'salary'],\r\n rows=1, \r\n cols=2,\r\n title=\"Age vs Salary Analysis\",\r\n orientation='vertical',\r\n color_palette='viridis'\r\n)\r\n```\r\n\r\n**Expected Output:**\r\n```\r\n\ud83d\udcca Creating boxplots for 4 numerical column(s): age, salary, experience, score\r\n\r\n\ud83d\udcc8 Summary Statistics:\r\n==================================================\r\n\ud83d\udcca age:\r\n Range: 25.00 to 100.00\r\n Median: 36.50\r\n IQR: 11.00 (Q1: 30.50, Q3: 41.50)\r\n Skewness: 2.66 (highly skewed)\r\n Outliers: 1 values outside [14.00, 58.00]\r\n Outlier values: [100]\r\n\r\n\ud83d\udcca salary:\r\n Range: 50000.00 to 250000.00\r\n Median: 72500.00\r\n IQR: 22500.00 (Q1: 61250.00, Q3: 83750.00)\r\n Skewness: 2.88 (highly skewed)\r\n Outliers: 1 values outside [27500.00, 117500.00]\r\n Outlier values: [250000]\r\n\r\n\ud83d\udcca experience:\r\n Range: 2.00 to 30.00\r\n Median: 8.50\r\n IQR: 7.50 (Q1: 5.25, Q3: 12.75)\r\n Skewness: 1.69 (highly skewed)\r\n Outliers: 1 values outside [-6.00, 24.00]\r\n Outlier values: [30]\r\n\r\n\ud83d\udcca score:\r\n Range: 20.00 to 95.00\r\n Median: 87.00\r\n IQR: 7.75 (Q1: 82.75, Q3: 90.50)\r\n Skewness: -2.87 (highly skewed)\r\n Outliers: 1 
values outside [71.12, 102.12]\r\n Outlier values: [20]\r\n```\r\n\r\n### Complete EDA Workflow Example\r\n\r\n```python\r\nimport edaflow\r\nimport pandas as pd\r\n\r\n# Test the installation\r\nprint(edaflow.hello())\r\n\r\n# Load your data\r\ndf = pd.read_csv('your_data.csv')\r\n\r\n# Complete EDA workflow with all core functions:\r\n# 1. Analyze missing data with styled output\r\nnull_analysis = edaflow.check_null_columns(df, threshold=10)\r\n\r\n# 2. Analyze categorical columns to identify data type issues\r\nedaflow.analyze_categorical_columns(df, threshold=35)\r\n\r\n# 3. Convert appropriate object columns to numeric automatically\r\ndf_cleaned = edaflow.convert_to_numeric(df, threshold=35)\r\n\r\n# 4. Visualize categorical column values\r\nedaflow.visualize_categorical_values(df_cleaned)\r\n\r\n# 5. Display column type classification\r\nedaflow.display_column_types(df_cleaned)\r\n\r\n# 6. Impute missing values\r\ndf_numeric_imputed = edaflow.impute_numerical_median(df_cleaned)\r\ndf_fully_imputed = edaflow.impute_categorical_mode(df_numeric_imputed)\r\n\r\n# 7. Statistical distribution analysis with advanced insights\r\nedaflow.visualize_histograms(df_fully_imputed, kde=True, show_normal_curve=True)\r\n\r\n# 8. Comprehensive relationship analysis\r\nedaflow.visualize_heatmap(df_fully_imputed, heatmap_type='correlation')\r\nedaflow.visualize_scatter_matrix(df_fully_imputed, show_regression=True)\r\n\r\n# 9. Generate comprehensive EDA insights and recommendations\r\ninsights = edaflow.summarize_eda_insights(df_fully_imputed, target_column='your_target_col')\r\nprint(insights) # View insights dictionary\r\n\r\n# 10. Outlier detection and visualization\r\nedaflow.visualize_numerical_boxplots(df_fully_imputed, show_skewness=True)\r\nedaflow.visualize_interactive_boxplots(df_fully_imputed)\r\n\r\n# 11. 
Advanced heatmap analysis\r\nedaflow.visualize_heatmap(df_fully_imputed, heatmap_type='missing')\r\nedaflow.visualize_heatmap(df_fully_imputed, heatmap_type='values')\r\n\r\n# 12. Final data cleaning with outlier handling\r\ndf_final = edaflow.handle_outliers_median(df_fully_imputed, method='iqr', verbose=True)\r\n\r\n# 13. Results verification\r\nedaflow.visualize_scatter_matrix(df_final, title=\"Clean Data Relationships\")\r\nedaflow.visualize_numerical_boxplots(df_final, title=\"Final Clean Distribution\")\r\n```\r\n\r\n### \ud83e\udd16 **Complete ML Workflow** \u2b50 *Enhanced in v0.14.0*\r\n```python\r\nimport edaflow.ml as ml\r\nfrom sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.svm import SVC\r\n\r\n# Continue from cleaned data above...\r\ndf_final['target'] = your_target_data # Add your target column\r\n\r\n# 1. Setup ML experiment \u2b50 NEW: Enhanced parameters in v0.14.0\r\nexperiment = ml.setup_ml_experiment(\r\n df_final, 'target',\r\n test_size=0.2, # Test set: 20%\r\n val_size=0.15, # \u2b50 NEW: Validation set: 15% \r\n experiment_name=\"production_ml_pipeline\", # \u2b50 NEW: Experiment tracking\r\n random_state=42,\r\n stratify=True\r\n)\r\n\r\n# Alternative: sklearn-style calling (also enhanced)\r\n# X = df_final.drop('target', axis=1)\r\n# y = df_final['target']\r\n# experiment = ml.setup_ml_experiment(X=X, y=y, val_size=0.15, experiment_name=\"sklearn_workflow\")\r\n\r\nprint(f\"Training: {len(experiment['X_train'])}, Validation: {len(experiment['X_val'])}, Test: {len(experiment['X_test'])}\")\r\n\r\n# 2. 
Compare multiple models \u2b50 Enhanced with validation set support\r\nmodels = {\r\n 'RandomForest': RandomForestClassifier(random_state=42),\r\n 'GradientBoosting': GradientBoostingClassifier(random_state=42),\r\n 'LogisticRegression': LogisticRegression(random_state=42),\r\n 'SVM': SVC(random_state=42, probability=True)\r\n}\r\n\r\n# Fit all models\r\nfor name, model in models.items():\r\n model.fit(experiment['X_train'], experiment['y_train'])\r\n\r\n# \u2b50 Enhanced compare_models with experiment_config support\r\ncomparison = ml.compare_models(\r\n models=models,\r\n experiment_config=experiment, # \u2b50 NEW: Automatically uses validation set\r\n verbose=True\r\n)\r\nprint(comparison) # Professional styled output\r\n\r\n# \u2b50 Enhanced rank_models with flexible return formats\r\n# Quick access to best model (list format - NEW)\r\nbest_model = ml.rank_models(comparison, 'accuracy', return_format='list')[0]['model_name']\r\nprint(f\"\ud83c\udfc6 Best model: {best_model}\")\r\n\r\n# Detailed ranking analysis (DataFrame format - traditional)\r\nranked_models = ml.rank_models(comparison, 'accuracy')\r\nprint(\"\ud83d\udcca Top 3 models:\")\r\nprint(ranked_models.head(3)[['model', 'accuracy', 'f1', 'rank']])\r\n\r\n# Advanced: Multi-metric weighted ranking\r\nweighted_ranking = ml.rank_models(\r\n comparison, \r\n 'accuracy',\r\n weights={'accuracy': 0.4, 'f1': 0.3, 'precision': 0.3},\r\n return_format='list'\r\n)\r\nprint(f\"\ud83c\udfaf Best by weighted score: {weighted_ranking[0]['model_name']}\")\r\n\r\n# 3. Hyperparameter optimization \u2b50 Enhanced with validation set\r\nparam_grid = {\r\n 'n_estimators': [100, 200, 300],\r\n 'max_depth': [5, 10, 15, None],\r\n 'min_samples_split': [2, 5, 10]\r\n}\r\n\r\nbest_results = ml.optimize_hyperparameters(\r\n RandomForestClassifier(random_state=42),\r\n param_distributions=param_grid,\r\n X_train=experiment['X_train'],\r\n y_train=experiment['y_train'],\r\n method='grid_search',\r\n cv=5\r\n)\r\n\r\n# 4. 
Generate comprehensive performance visualizations\r\nml.plot_learning_curves(best_results['best_model'], \r\n X_train=experiment['X_train'], y_train=experiment['y_train'])\r\nml.plot_roc_curves({'optimized_model': best_results['best_model']}, \r\n X_test=experiment['X_test'], y_test=experiment['y_test'])\r\nml.plot_feature_importance(best_results['best_model'], \r\n feature_names=experiment['feature_names'])\r\n\r\n# 5. Save complete model artifacts with experiment tracking\r\nml.save_model_artifacts(\r\n model=best_results['best_model'],\r\n model_name=f\"{experiment['experiment_name']}_optimized_model\", # \u2b50 NEW: Uses experiment name\r\n experiment_config=experiment,\r\n performance_metrics={\r\n 'cv_score': best_results['best_score'],\r\n 'test_score': best_results['best_model'].score(experiment['X_test'], experiment['y_test']),\r\n 'model_type': 'RandomForestClassifier'\r\n },\r\n metadata={\r\n 'experiment_name': experiment['experiment_name'], # \u2b50 NEW: Experiment tracking\r\n 'data_shape': df_final.shape,\r\n 'feature_count': len(experiment['feature_names'])\r\n }\r\n)\r\n\r\nprint(f\"\u2705 Complete ML pipeline finished! 
Experiment: {experiment['experiment_name']}\")\r\n```\r\n\r\n### \ud83e\udd16 **ML Preprocessing with Smart Encoding** \u2b50 *Introduced in v0.12.0*\r\n```python\r\nimport edaflow\r\nimport pandas as pd\r\n\r\n# Load your data\r\ndf = pd.read_csv('your_data.csv')\r\n\r\n# Step 1: Analyze encoding needs (with or without target)\r\nencoding_analysis = edaflow.analyze_encoding_needs(\r\n df, \r\n target_column=None, # Optional: specify target if you have one\r\n max_cardinality_onehot=15, # Optional: max categories for one-hot encoding\r\n max_cardinality_target=50, # Optional: max categories for target encoding\r\n ordinal_columns=None # Optional: specify ordinal columns if known\r\n)\r\n\r\n# Step 2: Apply intelligent encoding transformations \r\ndf_encoded = edaflow.apply_smart_encoding(\r\n df, # Use your full dataset (or df.drop('target_col', axis=1) if needed)\r\n encoding_analysis=encoding_analysis, # Optional: use previous analysis\r\n handle_unknown='ignore' # Optional: how to handle unknown categories\r\n)\r\n\r\n# The encoding pipeline automatically:\r\n# \u2705 One-hot encodes low cardinality categoricals\r\n# \u2705 Target encodes high cardinality with target correlation \r\n# \u2705 Binary encodes medium cardinality features\r\n# \u2705 TF-IDF vectorizes text columns\r\n# \u2705 Preserves numeric columns unchanged\r\n# \u2705 Handles memory efficiently for large datasets\r\n\r\nprint(f\"Shape transformation: {df.shape} \u2192 {df_encoded.shape}\")\r\nprint(f\"Encoding methods applied: {len(encoding_analysis['encoding_methods'])} different strategies\")\r\n```\r\n\r\n## Project Structure\r\n\r\n```\r\nedaflow/\r\n\u251c\u2500\u2500 edaflow/\r\n\u2502 \u251c\u2500\u2500 __init__.py\r\n\u2502 \u251c\u2500\u2500 analysis/\r\n\u2502 \u251c\u2500\u2500 visualization/\r\n\u2502 \u2514\u2500\u2500 preprocessing/\r\n\u251c\u2500\u2500 tests/\r\n\u251c\u2500\u2500 docs/\r\n\u251c\u2500\u2500 examples/\r\n\u251c\u2500\u2500 setup.py\r\n\u251c\u2500\u2500 
requirements.txt\r\n\u251c\u2500\u2500 README.md\r\n\u2514\u2500\u2500 LICENSE\r\n```\r\n\r\n## Contributing\r\n\r\n1. Fork the repository\r\n2. Create a feature branch (`git checkout -b feature/new-feature`)\r\n3. Commit your changes (`git commit -m 'Add new feature'`)\r\n4. Push to the branch (`git push origin feature/new-feature`)\r\n5. Open a Pull Request\r\n\r\n## Development\r\n\r\n### Setup Development Environment\r\n```bash\r\n# Clone the repository\r\ngit clone https://github.com/evanlow/edaflow.git\r\ncd edaflow\r\n\r\n# Create virtual environment\r\npython -m venv venv\r\nsource venv/bin/activate # On Windows: venv\\Scripts\\activate\r\n\r\n# Install in development mode\r\npip install -e \".[dev]\"\r\n\r\n# Run tests\r\npytest\r\n\r\n# Run linting\r\nflake8 edaflow/\r\nblack edaflow/\r\nisort edaflow/\r\n```\r\n\r\n## License\r\n\r\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\r\n\r\n## Changelog\r\n\r\n> **\ud83d\ude80 Latest Updates**: This changelog reflects the most current releases including v0.12.32 critical input validation fix, v0.12.31 hotfix with KeyError resolution and v0.12.30 universal display optimization breakthrough.\r\n\r\n### v0.12.32 (2025-08-11) - Critical Input Validation Fix \ud83d\udc1b\r\n- **CRITICAL**: Fixed AttributeError: 'tuple' object has no attribute 'empty' in visualization functions\r\n- **ROOT CAUSE**: Users passing tuple result from `apply_smart_encoding(..., return_encoders=True)` directly to visualization functions\r\n- **ENHANCED**: Added intelligent input validation with helpful error messages for common usage mistakes\r\n- **IMPROVED**: Better error handling in `visualize_scatter_matrix` and other visualization functions\r\n- **DOCUMENTED**: Clear examples showing correct vs incorrect usage patterns for `apply_smart_encoding`\r\n- **STABILITY**: Prevents crashes in step 14 of EDA workflows when encoding functions are misused\r\n\r\n### v0.12.31 (2025-01-05) - 
Critical KeyError Hotfix \ud83d\udea8\r\n- **CRITICAL**: Fixed KeyError: 'type' in `summarize_eda_insights()` function during Google Colab usage\r\n- **RESOLVED**: Exception handling when target analysis dictionary missing expected keys\r\n- **IMPROVED**: Enhanced error handling with safe dictionary access using `.get()` method\r\n- **MAINTAINED**: All existing functionality preserved - pure stability fix\r\n- **TESTED**: Verified fix works across all notebook platforms (Colab, JupyterLab, VS Code)\r\n\r\n### v0.12.30 (2025-01-05) - Universal Display Optimization Breakthrough \ud83c\udfa8\r\n- **BREAKTHROUGH**: Introduced `optimize_display()` function for universal notebook compatibility\r\n- **REVOLUTIONARY**: Automatic platform detection (Google Colab, JupyterLab, VS Code Notebooks, Classic Jupyter)\r\n- **ENHANCED**: Dynamic CSS injection for perfect dark/light mode visibility across all platforms\r\n- **NEW FEATURE**: Automatic matplotlib backend optimization for each notebook environment \r\n- **ACCESSIBILITY**: Solves visibility issues in dark mode themes universally\r\n- **SEAMLESS**: Zero configuration required - automatically detects and optimizes for your platform\r\n- **COMPATIBILITY**: Works flawlessly across Google Colab, JupyterLab, VS Code, Classic Jupyter\r\n- **EXAMPLE**: Simple usage: `from edaflow import optimize_display; optimize_display()`\r\n\r\n### v0.12.3 (2025-08-06) - Complete Positional Argument Compatibility Fix \ud83d\udd27\r\n- **CRITICAL**: Fixed positional argument usage for `visualize_image_classes()` function \r\n- **RESOLVED**: TypeError when calling `visualize_image_classes(image_paths, ...)` with positional arguments\r\n- **ENHANCED**: Comprehensive backward compatibility supporting all three usage patterns:\r\n - Positional: `visualize_image_classes(path, ...)` (shows warning)\r\n - Deprecated keyword: `visualize_image_classes(image_paths=path, ...)` (shows warning)\r\n - Recommended: `visualize_image_classes(data_source=path, 
...)` (no warning)\r\n- **IMPROVED**: Clear deprecation warnings guiding users toward recommended syntax\r\n- **SECURE**: Prevents using both parameters simultaneously to avoid confusion\r\n- **RESOLVED**: TypeError for users calling with `image_paths=` parameter from v0.12.0 breaking change\r\n- **ENHANCED**: Improved error messages for parameter validation in image visualization functions\r\n- **DOCUMENTATION**: Added comprehensive parameter documentation including deprecation notices\r\n\r\n### v0.12.2 (2025-08-06) - Documentation Refresh \ud83d\udcda\r\n- **IMPROVED**: Enhanced README.md with updated timestamps and current version indicators\r\n- **FIXED**: Ensured PyPI displays the most current changelog information including v0.12.1 fixes\r\n- **ENHANCED**: Added latest updates indicator to changelog for better visibility\r\n- **DOCUMENTATION**: Forced PyPI cache refresh to display current version information\r\n\r\n## \u2728 What's New in v0.16.2\r\n\r\n**New Features:**\r\n- Faceted visualizations with `display_facet_grid`\r\n- Feature scaling with `scale_features`\r\n- Grouping rare categories with `group_rare_categories`\r\n- Exporting figures with `export_figure`\r\n\r\n**Documentation Updates:**\r\n- User Guide, Advanced Features, and Best Practices now reference all new APIs\r\n- Visualization Guide includes external library requirements and troubleshooting\r\n- Changelog documents all new features and documentation changes\r\n\r\n**External Library Requirements:**\r\nSome advanced features require additional libraries:\r\n- matplotlib\r\n- seaborn\r\n- scikit-learn\r\n- statsmodels\r\n- pandas\r\n\r\nSee the Visualization Guide for installation instructions and troubleshooting tips.\r\n\r\n---\r\n",
"bugtrack_url": null,
"license": null,
"summary": "A Python package for exploratory data analysis workflows with universal dark mode compatibility",
"version": "0.17.1",
"project_urls": {
"Bug Tracker": "https://github.com/evanlow/edaflow/issues",
"Changelog": "https://github.com/evanlow/edaflow/blob/main/CHANGELOG.md",
"Documentation": "https://edaflow.readthedocs.io",
"Homepage": "https://github.com/evanlow/edaflow",
"Repository": "https://github.com/evanlow/edaflow.git",
"Source Code": "https://github.com/evanlow/edaflow"
},
"split_keywords": [
"data-analysis",
" eda",
" exploratory-data-analysis",
" data-science",
" visualization"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "77b47cf455193bf60dc6f0ce9fe85946dfe7a51736ef48b72cc7e376439f985b",
"md5": "76b02db2701556303e2370f9076b2f16",
"sha256": "9c9e2f6f597f51bcf440a9a43f60551a84a1969d3c5ab087dc388d8318a01a4f"
},
"downloads": -1,
"filename": "edaflow-0.17.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "76b02db2701556303e2370f9076b2f16",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 116717,
"upload_time": "2025-09-12T04:39:05",
"upload_time_iso_8601": "2025-09-12T04:39:05.710088Z",
"url": "https://files.pythonhosted.org/packages/77/b4/7cf455193bf60dc6f0ce9fe85946dfe7a51736ef48b72cc7e376439f985b/edaflow-0.17.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "bd61c76ac3dfde6698fdeebfa12275be32510587bffe48efe7ef523251a54e57",
"md5": "587d55c7149a9dbc7be9ddc33487fb2f",
"sha256": "d7b1b625daf245ab2bd11bb148dd1a9aca8cbf24098728df7442e7319231fa51"
},
"downloads": -1,
"filename": "edaflow-0.17.1.tar.gz",
"has_sig": false,
"md5_digest": "587d55c7149a9dbc7be9ddc33487fb2f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 781573,
"upload_time": "2025-09-12T04:39:07",
"upload_time_iso_8601": "2025-09-12T04:39:07.816097Z",
"url": "https://files.pythonhosted.org/packages/bd/61/c76ac3dfde6698fdeebfa12275be32510587bffe48efe7ef523251a54e57/edaflow-0.17.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-12 04:39:07",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "evanlow",
"github_project": "edaflow",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "pandas",
"specs": [
[
">=",
"1.5.0"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.21.0"
]
]
},
{
"name": "matplotlib",
"specs": [
[
">=",
"3.5.0"
]
]
},
{
"name": "seaborn",
"specs": [
[
">=",
"0.11.0"
]
]
},
{
"name": "scipy",
"specs": [
[
">=",
"1.7.0"
]
]
},
{
"name": "missingno",
"specs": [
[
">=",
"0.5.0"
]
]
},
{
"name": "jinja2",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "Pillow",
"specs": [
[
">=",
"8.0.0"
]
]
}
],
"lcname": "edaflow"
}