edaflow


- **Name:** edaflow
- **Version:** 0.17.1
- **Summary:** A Python package for exploratory data analysis workflows with universal dark mode compatibility
- **Upload time:** 2025-09-12 04:39:07
- **Requires Python:** >=3.8
- **Keywords:** data-analysis, eda, exploratory-data-analysis, data-science, visualization
- **Requirements:** pandas, numpy, matplotlib, seaborn, scipy, missingno, jinja2, Pillow

# 🚀 What's New in v0.17.1

**Notebook Fixes & Beginner Experience:**
- Fixed confusion matrix example in classification and advanced workflow notebooks to match API signature
- Audited all example notebooks for beginner-friendliness and error-free execution
- All notebooks now run without unnecessary errors for new users

**Major Examples & Documentation Update (v0.17.0):**
- Added interactive Jupyter notebooks for all major workflows
- All guides and onboarding instructions updated to reference new features and examples
- Verified and enhanced documentation for:
    - `highlight_anomalies`
    - `create_lag_features`
    - `display_facet_grid`
    - `scale_features`
    - `group_rare_categories`
    - `export_figure`
- Improved onboarding and user guidance in ReadTheDocs and README
- Minor bug fixes and consistency improvements across docs and codebase

**Why Upgrade?**
This release ensures all users have access to complete, copy-paste-ready examples and documentation. The onboarding experience is now smoother, and all advanced features are fully documented.

See the full documentation at [edaflow.readthedocs.io](https://edaflow.readthedocs.io)

**Major Documentation Overhaul for Education:**
- Added a dedicated Learning Path for new and aspiring data scientists
- Consolidated ML workflow steps into a single, copy-paste-safe guide
- Expanded examples: classification, regression, and computer vision
- Improved navigation: clear table of contents, user guide, API reference, and best practices
- Advanced features and troubleshooting tips for power users

**Why Upgrade?**
This release makes edaflow best-in-class for educational value, with a structured progression for learners and educators. All documentation is now easier to follow, with practical code and hands-on exercises.

# edaflow

[![Documentation Status](https://readthedocs.org/projects/edaflow/badge/?version=latest)](https://edaflow.readthedocs.io/en/latest/?badge=latest)
[![PyPI version](https://badge.fury.io/py/edaflow.svg)](https://badge.fury.io/py/edaflow)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Downloads](https://pepy.tech/badge/edaflow)](https://pepy.tech/project/edaflow)

**Quick Navigation**:
📚 [Documentation](https://edaflow.readthedocs.io) |
📦 [PyPI Package](https://pypi.org/project/edaflow/) |
🚀 [Quick Start](https://edaflow.readthedocs.io/en/latest/quickstart.html) |
📋 [Changelog](#-changelog) |
🐛 [Issues](https://github.com/evanlow/edaflow/issues)

A Python package for streamlined exploratory data analysis workflows.

> **📦 Current Version: v0.16.4** - [Latest Release](https://pypi.org/project/edaflow/0.16.4/) adds a complete examples directory, improved onboarding, and fully documented advanced features. *Updated: September 12, 2025*

## 📖 Table of Contents

- [Description](#description)
- [🚨 Critical Fixes in v0.15.0](#-critical-fixes-in-v0150)
- [✨ What's New](#-whats-new)
- [Features](#features)
- [🆕 Recent Updates](#-recent-updates)
- [📚 Documentation](#-documentation)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [📋 Changelog](#-changelog)
- [Support](#support)
- [Roadmap](#roadmap)

## Description

`edaflow` is designed to simplify and accelerate the exploratory data analysis (EDA) process by providing a collection of tools and utilities for data scientists and analysts. The package integrates popular data science libraries to create a cohesive workflow for data exploration, visualization, and preprocessing.

## 🚨 What's New in v0.15.1

**NEW:** `setup_ml_experiment` now supports a `primary_metric` argument, making metric selection robust and error-free for all ML workflows. All documentation, tests, and downstream code are updated for consistency. A new test ensures the metric is set and accessible throughout the workflow.

**Upgrade recommended for all users who want reliable, copy-paste-safe ML workflows with dynamic metric selection.**

---

## 🚨 Critical Fixes in v0.15.0
**(Previous release)**

### 🎯 **Issues Resolved**:
- ❌ **FIXED**: `RandomForestClassifier instance is not fitted yet` errors
- ❌ **FIXED**: `TypeError: unexpected keyword argument` errors
- ❌ **FIXED**: Missing imports and undefined variables in examples
- ❌ **FIXED**: Duplicate step numbering in documentation
- ✅ **RESULT**: All ML workflow examples now work perfectly!

### 📋 **What This Means For You**:
- 🎉 **Copy-paste examples that work immediately**
- 🎯 **No more confusing error messages**
- 📚 **Complete, beginner-friendly documentation**
- 🚀 **Smooth learning experience for new users**

**Upgrade recommended for all users following ML workflow documentation.**

## ✨ What's New

### 🚨 Critical ML Documentation Fixes (v0.15.0)
**MAJOR DOCUMENTATION UPDATE**: Fixed critical issues that were causing user errors when following ML workflow examples.

**Problems Resolved**:
- ✅ **Model Fitting**: Added missing `model.fit()` steps that were causing "not fitted" errors
- ✅ **Function Parameters**: Fixed incorrect parameter names in all examples
- ✅ **Missing Context**: Added imports and data preparation context
- ✅ **Step Numbering**: Corrected duplicate step numbers in documentation
- ✅ **Enhanced Warnings**: Added prominent warnings about critical requirements

**Result**: All ML workflow documentation now works perfectly out-of-the-box!

### 🎯 Enhanced rank_models Function (v0.14.x)
**DUAL RETURN FORMAT SUPPORT**: Major enhancement based on user requests.

```python
# Both formats now supported:
df_results = ml.rank_models(results, 'accuracy')  # DataFrame (default)
list_results = ml.rank_models(results, 'accuracy', return_format='list')  # List of dicts

# User-requested pattern now works:
best_model = ml.rank_models(results, 'accuracy', return_format='list')[0]["model_name"]
```
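
The dual return format can be pictured with a small stand-alone sketch. The helper `rank_by_metric` and the shape of `results` (a list of dicts with a `model_name` key plus metric keys) are assumptions made for illustration; this is not edaflow's implementation.

```python
import pandas as pd


def rank_by_metric(results, metric, return_format="dataframe"):
    """Sort model results by `metric` (higher is better).

    Hypothetical helper: returns a DataFrame by default, or a list of
    dicts when return_format='list'.
    """
    ranked = sorted(results, key=lambda row: row[metric], reverse=True)
    if return_format == "list":
        return ranked
    if return_format == "dataframe":
        return pd.DataFrame(ranked)
    raise ValueError(f"Unknown return_format: {return_format!r}")


# Illustrative scores only, not real benchmark results
results = [
    {"model_name": "LogisticRegression", "accuracy": 0.81},
    {"model_name": "RandomForest", "accuracy": 0.87},
]

best_name = rank_by_metric(results, "accuracy", return_format="list")[0]["model_name"]
ranked_df = rank_by_metric(results, "accuracy")
print(best_name)
print(ranked_df)
```

Keeping one sort and branching only on the output shape is one way to make sure both formats always agree on the ranking.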

### 🚀 ML Expansion (v0.13.0+)
**COMPLETE MACHINE LEARNING SUBPACKAGE**: Extended edaflow into full ML workflows.

**New ML Modules Added**:
- **`ml.config`**: ML experiment setup and data validation
- **`ml.leaderboard`**: Multi-model comparison and ranking
- **`ml.tuning`**: Advanced hyperparameter optimization
- **`ml.curves`**: Learning curves and performance visualization
- **`ml.artifacts`**: Model persistence and experiment tracking

**Key ML Features**:
```python
# Complete ML workflow in one package
import edaflow.ml as ml

# Setup experiment with flexible parameter support
# Both calling patterns work:
experiment = ml.setup_ml_experiment(df, 'target')  # DataFrame style
# OR
experiment = ml.setup_ml_experiment(X=X, y=y, val_size=0.15)  # sklearn style

# Compare multiple models
results = ml.compare_models(models, **experiment)

# Optimize hyperparameters with multiple strategies
best_model = ml.optimize_hyperparameters(model, params, **experiment)

# Generate comprehensive visualizations
ml.plot_learning_curves(model, **experiment)
```

### Previous: API Improvement (v0.12.33)
**NEW CLEAN APIs**: Introduced consistent, user-friendly encoding functions that eliminate confusion and crashes.

**Root Cause Solved**: The inconsistent return type of `apply_smart_encoding()` (sometimes DataFrame, sometimes tuple) was causing AttributeError crashes and user confusion.

**New Functions Added**:
```python
# ✅ NEW: Clean, consistent DataFrame return (RECOMMENDED)
df_encoded = edaflow.apply_encoding(df)  # Always returns DataFrame

# ✅ NEW: Explicit tuple return when encoders needed
df_encoded, encoders = edaflow.apply_encoding_with_encoders(df)  # Always returns tuple

# ⚠️ DEPRECATED: Inconsistent behavior (still works with warnings)
df_encoded = edaflow.apply_smart_encoding(df, return_encoders=True)  # Sometimes tuple!
```

**Benefits**:
- 🎯 **Zero Breaking Changes**: All existing workflows continue working exactly the same
- 🛡️ **Better Error Messages**: Helpful guidance when mistakes are made
- 🔄 **Migration Path**: Multiple options for users who want cleaner APIs
- 📚 **Clear Documentation**: Explicit examples showing best practices

### 🐛 Critical Input Validation Fix (v0.12.32)
**RESOLVED**: Fixed AttributeError: 'tuple' object has no attribute 'empty' in visualization functions when `apply_smart_encoding(..., return_encoders=True)` result is used incorrectly.

**Problem Solved**: Users who passed the tuple result from `apply_smart_encoding` directly to visualization functions without unpacking were experiencing crashes in step 14 of EDA workflows.

**Enhanced Error Messages**: Added intelligent input validation with helpful error messages guiding users to the correct usage pattern:
```python
# ❌ WRONG - This causes the AttributeError:
df_encoded = edaflow.apply_smart_encoding(df, return_encoders=True)  # Returns (df, encoders) tuple!
edaflow.visualize_scatter_matrix(df_encoded)  # Crashes with AttributeError

# ✅ CORRECT - Unpack the tuple:
df_encoded, encoders = edaflow.apply_smart_encoding(df, return_encoders=True)
edaflow.visualize_scatter_matrix(df_encoded)  # Should work well!
```
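
The kind of input validation described here can be sketched in a few lines. `ensure_dataframe` is a hypothetical helper written for this illustration; edaflow's actual check and message wording may differ.

```python
import pandas as pd


def ensure_dataframe(data, func_name="visualize_scatter_matrix"):
    """Return `data` unchanged if it is a DataFrame, else raise a helpful TypeError.

    Hypothetical helper, not part of edaflow's public API.
    """
    if isinstance(data, tuple):
        # Most likely the (df, encoders) pair was passed without unpacking
        raise TypeError(
            f"{func_name}() received a tuple. If it came from "
            "apply_smart_encoding(..., return_encoders=True), unpack it first: "
            "df_encoded, encoders = ..."
        )
    if not isinstance(data, pd.DataFrame):
        raise TypeError(
            f"{func_name}() expects a pandas DataFrame, got {type(data).__name__}"
        )
    return data


df = pd.DataFrame({"a": [1, 2]})

try:
    ensure_dataframe((df, {}))  # simulate passing the un-unpacked tuple
except TypeError as exc:
    message = str(exc)
    print(message)
```

Checking for a tuple before the generic type check lets the error name the likely cause instead of failing later on a missing `.empty` attribute.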

### 🎨 BREAKTHROUGH: Universal Dark Mode Compatibility (v0.12.30)
- **NEW FUNCTION**: `optimize_display()` - The **FIRST** EDA library with universal notebook compatibility!
- **Universal Platform Support**: Improved visibility across Google Colab, JupyterLab, VS Code, and Classic Jupyter
- **Automatic Detection**: Zero configuration needed - automatically detects your environment
- **Accessibility Support**: Built-in high contrast mode for improved accessibility
- **One-Line Solution**: `edaflow.optimize_display()` fixes all visibility issues instantly

### 🐛 Critical KeyError Hotfix (v0.12.31)
- **Fixed KeyError**: Resolved "KeyError: 'type'" in `summarize_eda_insights()` function
- **Enhanced Error Handling**: Added robust exception handling for target analysis edge cases
- **Improved Stability**: Function now handles missing or invalid target columns gracefully

### 🌟 Platform Benefits:
- ✅ **Google Colab**: Auto light/dark mode detection with improved text visibility
- ✅ **JupyterLab**: Dark theme compatibility with custom theme support
- ✅ **VS Code**: Native theme integration with seamless notebook experience
- ✅ **Classic Jupyter**: Full compatibility with enhanced readability options

```python
import edaflow
# ⭐ NEW: Improved visibility everywhere!
edaflow.optimize_display()  # Universal dark mode fix!

# All functions now display beautifully
edaflow.check_null_columns(df)
edaflow.visualize_histograms(df)
```

### ✨ NEW FUNCTION: `summarize_eda_insights()` (Added in v0.12.28)
- **Comprehensive Analysis**: Generate complete EDA insights and actionable recommendations after completing your analysis workflow
- **Smart Recommendations**: Provides intelligent next steps for modeling, preprocessing, and data quality improvements
- **Target-Aware Analysis**: Supports both classification and regression scenarios with specific insights
- **Function Tracking**: Knows which edaflow functions you've already used in your workflow
- **Structured Output**: Returns organized dictionary with dataset overview, data quality assessment, and recommendations

### 🎨 Display Formatting Excellence
- **Enhanced Visual Experience**: Refined Rich console styling with optimized panel borders and alignment
- **Google Colab Optimized**: Improved display formatting specifically tailored for notebook environments
- **Consistent Design**: Professional rounded borders, proper width constraints, and refined color schemes
- **Universal Compatibility**: Beautiful output rendering across all major Python environments and notebooks

### Recent Fixes (v0.12.24-0.12.26)
- **LBP Warning Resolution**: Fixed scikit-image UserWarning in texture analysis functions
- **Parameter Documentation**: Corrected `analyze_image_features` documentation mismatches
- **RTD Synchronization**: Updated Read the Docs changelog with all recent improvements

### 🌈 Rich Styling (v0.12.20-0.12.21)
- **Vibrant Output**: ALL major EDA functions now feature professional, color-coded styling
- **Smart Indicators**: Color-coded severity levels (✅ CLEAN, ⚠️ WARNING, 🚨 CRITICAL)
- **Professional Tables**: Beautiful formatted output with rich library integration
- **Actionable Insights**: Context-aware recommendations and visual status indicators

## Features

### 🔍 **Exploratory Data Analysis**
- **Missing Data Analysis**: Color-coded analysis of null values with customizable thresholds
- **Categorical Data Insights**: 🐛 *FIXED in v0.12.29* Identify object columns that might be numeric, detect data type issues (now handles unhashable types)
- **Automatic Data Type Conversion**: Smart conversion of object columns to numeric when appropriate
- **Categorical Values Visualization**: Detailed exploration of categorical column values with insights
- **Column Type Classification**: Simple categorization of DataFrame columns into categorical and numerical types
- **Data Type Detection**: Smart analysis to flag potential data conversion needs
- **EDA Insights Summary**: ⭐ *NEW in v0.12.28* Comprehensive EDA insights and actionable recommendations after completing analysis workflow

### 📊 **Advanced Visualizations**
- **Numerical Distribution Visualization**: Advanced boxplot analysis with outlier detection and statistical summaries
- **Interactive Boxplot Visualization**: Interactive Plotly Express boxplots with zoom, hover, and statistical tooltips
- **Comprehensive Heatmap Visualizations**: Correlation matrices, missing data patterns, values heatmaps, and cross-tabulations
- **Statistical Histogram Analysis**: Advanced histogram visualization with skewness detection, normality testing, and distribution analysis
- **Scatter Matrix Analysis**: Advanced pairwise relationship visualization with customizable matrix layouts, regression lines, and statistical insights

### 🤖 **Machine Learning Preprocessing** ⭐ *Introduced in v0.12.0*
- **Intelligent Encoding Analysis**: Automatic detection of optimal encoding strategies for categorical variables
- **Smart Encoding Application**: Automated categorical encoding with support for:
  - One-Hot Encoding for low cardinality categories
  - Target Encoding for high cardinality with target correlation
  - Ordinal Encoding for ordinal relationships
  - Binary Encoding for medium cardinality
  - Text Vectorization (TF-IDF) for text features
  - Leave Unchanged for numeric columns
- **Memory-Efficient Processing**: Intelligent handling of high-cardinality features to prevent memory issues
- **Comprehensive Encoding Pipeline**: End-to-end preprocessing solution for ML model preparation

### 🤖 **Machine Learning Workflows** ⭐ *NEW in v0.13.0*
The powerful `edaflow.ml` subpackage provides comprehensive machine learning workflow capabilities:

#### **ML Experiment Setup (`ml.config`)**
- **Smart Data Validation**: Automatic data quality assessment and problem type detection
- **Intelligent Data Splitting**: Train/validation/test splits with stratification support
- **ML Pipeline Configuration**: Automated preprocessing pipeline setup for ML workflows
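
As a rough picture of a train/validation/test split, the sketch below shuffles row indices and cuts them into three parts using only the standard library. It assumes `test_size` and `val_size` are fractions of the full dataset and it does no stratification; `split_indices` is a hypothetical name, and edaflow's own splitting logic may differ.

```python
import random


def split_indices(n_rows, test_size=0.2, val_size=0.15, random_state=42):
    """Shuffle row indices and cut them into train/validation/test lists.

    Assumption: both sizes are fractions of the full dataset.
    No stratification is applied in this sketch.
    """
    indices = list(range(n_rows))
    random.Random(random_state).shuffle(indices)  # seeded for reproducibility

    n_test = round(n_rows * test_size)
    n_val = round(n_rows * val_size)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

A stratified variant would perform the same cut separately within each target class so that class proportions are preserved in every part.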

#### **Model Comparison & Ranking (`ml.leaderboard`)**
- **Multi-Model Evaluation**: Compare multiple models with comprehensive metrics
- **Smart Leaderboards**: Automatically rank models by performance with visual displays
- **Export Capabilities**: Save comparison results for reporting and analysis

#### **Hyperparameter Optimization (`ml.tuning`)**
- **Multiple Search Strategies**: Grid search, random search, and Bayesian optimization
- **Cross-Validation Integration**: Built-in CV with customizable scoring metrics
- **Parallel Processing**: Multi-core hyperparameter optimization for faster results

#### **Learning & Performance Curves (`ml.curves`)**
- **Learning Curves**: Visualize model performance vs training size
- **Validation Curves**: Analyze hyperparameter impact on model performance
- **ROC & Precision-Recall Curves**: Comprehensive classification performance analysis
- **Feature Importance**: Visual analysis of model feature contributions

#### **Model Persistence & Tracking (`ml.artifacts`)**
- **Complete Model Artifacts**: Save models, configs, and metadata
- **Experiment Tracking**: Track multiple experiments with organized storage
- **Model Reports**: Generate comprehensive model performance reports
- **Version Management**: Organized model versioning and retrieval

**Quick ML Example:**
```python
import edaflow.ml as ml
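from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression  # used in the models dict and tuning example below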

# Setup ML experiment - Multiple parameter patterns supported
# Method 1: DataFrame + target column (recommended)
experiment = ml.setup_ml_experiment(df, target_column='target')

# Method 2: sklearn-style (also supported)
X = df.drop('target', axis=1)
y = df['target']
experiment = ml.setup_ml_experiment(
    X=X, y=y,
    test_size=0.2,
    val_size=0.15,  # Alternative to validation_size
    experiment_name="my_ml_project",
    stratify=True,
    random_state=42
)

# Compare multiple models
models = {
    'RandomForest': RandomForestClassifier(),
    'LogisticRegression': LogisticRegression()
}
comparison = ml.compare_models(models, **experiment)

# Rank models with flexible access patterns
# Method 1: Easy dictionary access (recommended for getting best model)
best_model_name = ml.rank_models(comparison, 'accuracy', return_format='list')[0]['model_name']

# Method 2: Traditional DataFrame format
ranked_df = ml.rank_models(comparison, 'accuracy')
best_model_traditional = ranked_df.iloc[0]['model']

# Both methods give the same result
print(f"Best model: {best_model_name}")  # Easy access
print(f"Best model: {best_model_traditional}")  # Traditional access

# Optimize hyperparameters

# --- Copy-paste-safe hyperparameter optimization example ---
model_name = 'LogisticRegression'  # or 'RandomForest' or 'GradientBoosting'

if model_name == 'RandomForest':
    param_distributions = {
        'n_estimators': [100, 200, 300],
        'max_depth': [5, 10, 15, None],
        'min_samples_split': [2, 5, 10]
    }
    model = RandomForestClassifier()
    method = 'grid'
elif model_name == 'GradientBoosting':
    param_distributions = {
        'n_estimators': (50, 200),
        'learning_rate': (0.01, 0.3),
        'max_depth': (3, 8)
    }
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier()
    method = 'bayesian'
elif model_name == 'LogisticRegression':
    param_distributions = {
        'C': [0.01, 0.1, 1, 10, 100],
        'penalty': ['l1', 'l2', 'elasticnet', 'none'],
        'solver': ['lbfgs', 'liblinear', 'saga']
    }
    model = LogisticRegression(max_iter=1000)
    method = 'grid'
else:
    raise ValueError(f"Unknown model_name: {model_name}")

results = ml.optimize_hyperparameters(
    model,
    param_distributions=param_distributions,
    **experiment
)

# Generate learning curves
ml.plot_learning_curves(results['best_model'], **experiment)

# Save complete artifacts
ml.save_model_artifacts(
    model=results['best_model'],
    model_name='optimized_rf',
    experiment_config=experiment,
    performance_metrics=results['cv_results']
)
```

### 🖼️ **Computer Vision Support**
- **Computer Vision EDA**: Class-wise image sample visualization and comprehensive quality assessment for image classification datasets
- **Image Quality Assessment**: Automated detection of corrupted images, quality issues, blur, artifacts, and dataset health metrics

## Usage Examples

### Basic Usage
```python
import edaflow

# Verify installation
message = edaflow.hello()
print(message)  # Output: "Hello from edaflow! Ready for exploratory data analysis."
```

### Missing Data Analysis with `check_null_columns`

The `check_null_columns` function provides a color-coded analysis of missing data in your DataFrame:

```python
import pandas as pd
import edaflow

# Create sample data with missing values
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'age': [25, None, 35, None, 45],
    'email': [None, None, None, None, None],  # All missing
    'purchase_amount': [100.5, 250.0, 75.25, None, 320.0]
})

# Analyze missing data with default threshold (10%)
styled_result = edaflow.check_null_columns(df)
styled_result  # Display in Jupyter notebook for color-coded styling

# Use custom threshold (20%) to change color coding sensitivity
styled_result = edaflow.check_null_columns(df, threshold=20)
styled_result

# Access underlying data if needed
data = styled_result.data
print(data)
```

**Color Coding:**
- 🔴 **Red**: > 20% missing (high concern)
- 🟡 **Yellow**: 10-20% missing (medium concern)
- 🟨 **Light Yellow**: 1-10% missing (low concern)
- ⬜ **Gray**: 0% missing (no issues)
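
The buckets above can be reproduced with plain pandas. This is a minimal sketch, assuming the red band starts above twice the threshold and that any non-zero share below the threshold counts as low concern. `null_severity` is a hypothetical helper, not the library's code.

```python
import pandas as pd


def null_severity(df, threshold=10):
    """Label each column by its percentage of missing values.

    Assumed bucket edges: 'high' above 2 * threshold, 'medium' from
    threshold up to 2 * threshold, 'low' for any other non-zero share,
    'none' when nothing is missing.
    """
    null_pct = (df.isnull().mean() * 100).round(1)

    def label(pct):
        if pct > 2 * threshold:
            return "high"
        if pct >= threshold:
            return "medium"
        if pct > 0:
            return "low"
        return "none"

    return pd.DataFrame({"null_pct": null_pct, "severity": null_pct.map(label)})


df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", None, "Diana", "Eve"],
    "age": [25, None, 35, None, 45],
    "email": [None, None, None, None, None],
})

report = null_severity(df)
print(report)
```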

### Categorical Data Analysis with `analyze_categorical_columns`

The `analyze_categorical_columns` function helps identify data type issues and provides insights into object-type columns:

```python
import pandas as pd
import edaflow

# Create sample data with mixed categorical types
df = pd.DataFrame({
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price_str': ['999', '25', '75', '450'],  # Numbers stored as strings
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics'],
    'rating': [4.5, 3.8, 4.2, 4.7],  # Already numeric
    'mixed_ids': ['001', '002', 'ABC', '004'],  # Mixed format
    'status': ['active', 'inactive', 'active', 'pending']
})

# Analyze categorical columns with default threshold (35%)
edaflow.analyze_categorical_columns(df)

# Use custom threshold (50%) to be more lenient about mixed data
edaflow.analyze_categorical_columns(df, threshold=50)
```

**Output Interpretation:**
- 🔴🔵 **Highlighted in Red/Blue**: Potentially numeric columns that might need conversion
- 🟡⚫ **Highlighted in Yellow/Black**: Shows unique values for potential numeric columns
- **Regular text**: Truly categorical columns with statistics
- **"not an object column"**: Already properly typed numeric columns

### Data Type Conversion with `convert_to_numeric`

After analyzing your categorical columns, you can automatically convert appropriate columns to numeric:

```python
import pandas as pd
import edaflow

# Create sample data with string numbers
df = pd.DataFrame({
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price_str': ['999', '25', '75', '450'],      # Should convert
    'mixed_ids': ['001', '002', 'ABC', '004'],    # Mixed data
    'category': ['Electronics', 'Accessories', 'Electronics', 'Electronics']
})

# Convert appropriate columns to numeric (threshold=35% by default)
df_converted = edaflow.convert_to_numeric(df, threshold=35)

# Or modify the original DataFrame in place
edaflow.convert_to_numeric(df, threshold=35, inplace=True)

# Use a stricter threshold (only convert if <20% non-numeric values)
df_strict = edaflow.convert_to_numeric(df, threshold=20)
```

**Function Features:**
- ✅ **Smart Detection**: Only converts columns with few non-numeric values
- ✅ **Customizable Threshold**: Control conversion sensitivity
- ✅ **Safe Conversion**: Non-numeric values become NaN (not errors)
- ✅ **Inplace Option**: Modify original DataFrame or create new one
- ✅ **Detailed Output**: Shows exactly what was converted and why
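
The threshold rule can be illustrated with `pandas.to_numeric(..., errors="coerce")`. The sketch below is an assumption about how such a rule might work, not edaflow's source; `convert_mostly_numeric` is a hypothetical name.

```python
import pandas as pd


def convert_mostly_numeric(df, threshold=35):
    """Convert non-numeric columns whose share of unparseable values
    is below `threshold` percent. Returns a new DataFrame.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # already numeric, nothing to do
        parsed = pd.to_numeric(out[col], errors="coerce")
        # Values that were present but could not be parsed as numbers
        failed = parsed.isna() & out[col].notna()
        non_numeric_pct = failed.mean() * 100
        if non_numeric_pct < threshold:
            out[col] = parsed  # unparseable entries become NaN
    return out


df = pd.DataFrame({
    "product_name": ["Laptop", "Mouse", "Keyboard", "Monitor"],
    "price_str": ["999", "25", "75", "450"],
    "mixed_ids": ["001", "002", "ABC", "004"],
})

converted = convert_mostly_numeric(df, threshold=35)  # mixed_ids: 25% non-numeric, converted
strict = convert_mostly_numeric(df, threshold=20)     # mixed_ids: 25% non-numeric, left alone
print(converted.dtypes)
print(strict.dtypes)
```

In this sample, `mixed_ids` has one unparseable value out of four (25%), so it is converted at a 35% threshold but left as text at 20%.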

### Categorical Data Visualization with `visualize_categorical_values`

After cleaning your data, explore categorical columns in detail to understand value distributions:

```python
import pandas as pd
import edaflow

# Example DataFrame with categorical data
df = pd.DataFrame({
    'department': ['Sales', 'Marketing', 'Sales', 'HR', 'Marketing', 'Sales', 'IT'],
    'status': ['Active', 'Inactive', 'Active', 'Pending', 'Active', 'Active', 'Inactive'],
    'priority': ['High', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low'],
    'employee_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],  # Numeric (ignored)
    'salary': [50000, 60000, 70000, 45000, 58000, 62000, 70000]  # Numeric (ignored)
})

# Visualize all categorical columns
edaflow.visualize_categorical_values(df)
```

**Advanced Usage Examples:**

```python
# Handle high-cardinality data (many unique values)
large_df = pd.DataFrame({
    'product_id': [f'PROD_{i:04d}' for i in range(100)],  # 100 unique values
    'category': ['Electronics'] * 40 + ['Clothing'] * 35 + ['Books'] * 25,
    'status': ['Available'] * 80 + ['Out of Stock'] * 15 + ['Discontinued'] * 5
})

# Limit display for high-cardinality columns
edaflow.visualize_categorical_values(large_df, max_unique_values=5)
```

```python
# DataFrame with missing values for comprehensive analysis
df_with_nulls = pd.DataFrame({
    'region': ['North', 'South', None, 'East', 'West', 'North', None],
    'customer_type': ['Premium', 'Standard', 'Premium', None, 'Standard', 'Premium', 'Standard'],
    'transaction_id': [f'TXN_{i}' for i in range(7)],  # Mostly unique (ID-like)
})

# Get detailed insights including missing value analysis
edaflow.visualize_categorical_values(df_with_nulls)
```

**Function Features:**
- 🎯 **Categorical Focus**: Analyzes categorical columns; numeric columns such as IDs and salaries are ignored
- 📊 **High-Cardinality Handling**: `max_unique_values` limits how many values are displayed per column
- 🔍 **Missing Value Analysis**: Reports missing values alongside the value distribution
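
For readers who want the raw numbers behind this kind of report, a per-column value summary can be sketched with `value_counts`. The helper name `summarize_categories` and the returned structure are assumptions for illustration only.

```python
import pandas as pd


def summarize_categories(df, max_unique_values=10):
    """Summarize each non-numeric column: distinct-value count and the
    most frequent values (missing values counted as their own entry).

    Hypothetical helper, not edaflow's implementation.
    """
    summary = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # numeric columns are skipped
        counts = df[col].value_counts(dropna=False)
        summary[col] = {
            "n_unique": df[col].nunique(dropna=True),
            "n_missing": int(df[col].isna().sum()),
            "top_values": counts.head(max_unique_values),
        }
    return summary


df_with_nulls = pd.DataFrame({
    "region": ["North", "South", None, "East", "West", "North", None],
    "transaction_id": [f"TXN_{i}" for i in range(7)],
    "amount": [10, 20, 30, 40, 50, 60, 70],
})

summary = summarize_categories(df_with_nulls, max_unique_values=2)
for name, info in summary.items():
    print(name, info["n_unique"], info["n_missing"])
```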

### Column Type Classification with `display_column_types`

The `display_column_types` function provides a simple way to categorize DataFrame columns into categorical and numerical types:

```python
import pandas as pd
import edaflow

# Create sample data with mixed types
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago'],
    'salary': [50000, 60000, 70000],
    'is_active': [True, False, True]
}
df = pd.DataFrame(data)

# Display column type classification
result = edaflow.display_column_types(df)

# Access the categorized column lists
categorical_cols = result['categorical']  # ['name', 'city']
numerical_cols = result['numerical']      # ['age', 'salary', 'is_active']
```

**Example Output:**
```
📊 Column Type Analysis
==================================================

📝 Categorical Columns (2 total):
    1. name                 (unique values: 3)
    2. city                 (unique values: 3)

🔢 Numerical Columns (3 total):
    1. age                  (dtype: int64)
    2. salary               (dtype: int64)
    3. is_active            (dtype: bool)

📈 Summary:
   Total columns: 5
   Categorical: 2 (40.0%)
   Numerical: 3 (60.0%)
```

**Function Features:**
- 🔍 **Simple Classification**: Separates columns into categorical (object dtype) and numerical (all other dtypes)
- 📊 **Detailed Information**: Shows unique value counts for categorical columns and data types for numerical columns
- 📈 **Summary Statistics**: Provides percentage breakdown of column types
- 🎯 **Return Values**: Returns dictionary with categorized column lists for programmatic use
- ⚡ **Fast Processing**: Efficient classification based on pandas data types
- 🛡️ **Error Handling**: Validates input and handles edge cases like empty DataFrames
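
The classification rule (text-like dtype versus everything else) is simple enough to sketch directly. `classify_columns` is a hypothetical stand-in; it also treats pandas' dedicated string dtype as categorical, which is an assumption beyond the "object dtype" wording above.

```python
import pandas as pd


def classify_columns(df):
    """Split column names into 'categorical' (object/string dtype)
    and 'numerical' (every other dtype, including bool).

    Hypothetical helper for illustration.
    """
    categorical, numerical = [], []
    for col in df.columns:
        dtype = df[col].dtype
        if isinstance(dtype, pd.StringDtype) or dtype == object:
            categorical.append(col)
        else:
            numerical.append(col)
    return {"categorical": categorical, "numerical": numerical}


df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NYC", "LA", "Chicago"],
    "salary": [50000, 60000, 70000],
    "is_active": [True, False, True],
})

result = classify_columns(df)
print(result)
```

Note that a rule like this places boolean columns such as `is_active` on the numerical side, matching the example output above.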

### Data Imputation with `impute_numerical_median` and `impute_categorical_mode`

After analyzing your data, you often need to handle missing values. The edaflow package provides two specialized imputation functions for this purpose:

#### Numerical Imputation with `impute_numerical_median`

The `impute_numerical_median` function fills missing values in numerical columns using the median value:

```python
import pandas as pd
import edaflow

# Create sample data with missing numerical values
df = pd.DataFrame({
    'age': [25, None, 35, None, 45],
    'salary': [50000, 60000, None, 70000, None],
    'score': [85.5, None, 92.0, 88.5, None],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
})

# Impute all numerical columns with median values
df_imputed = edaflow.impute_numerical_median(df)

# Impute specific columns only
df_imputed = edaflow.impute_numerical_median(df, columns=['age', 'salary'])

# Impute in place (modifies original DataFrame)
edaflow.impute_numerical_median(df, inplace=True)
```

**Function Features:**
- 🔢 **Smart Detection**: Automatically identifies numerical columns (int, float, etc.)
- 📊 **Median Imputation**: Uses median values which are robust to outliers
- 🎯 **Selective Imputation**: Option to specify which columns to impute
- 🔄 **Inplace Option**: Modify original DataFrame or create new one
- 🛡️ **Safe Handling**: Gracefully handles edge cases like all-missing columns
- 📋 **Detailed Reporting**: Shows exactly what was imputed and summary statistics
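
Median imputation itself takes only a few lines of pandas. The sketch below shows the underlying technique under simple assumptions (numeric columns detected by dtype, all-missing columns left untouched); `fill_numeric_with_median` is a hypothetical name, not the edaflow function.

```python
import pandas as pd


def fill_numeric_with_median(df, columns=None):
    """Return a copy of `df` with missing numeric values replaced by
    each column's median. Hypothetical helper for illustration.
    """
    out = df.copy()
    if columns is None:
        columns = [c for c in out.columns if pd.api.types.is_numeric_dtype(out[c])]
    for col in columns:
        has_missing = out[col].isna().any()
        has_values = out[col].notna().any()
        if has_missing and has_values:  # skip complete and all-missing columns
            out[col] = out[col].fillna(out[col].median())
    return out


df = pd.DataFrame({
    "age": [25, None, 35, None, 45],
    "salary": [50000, 60000, None, 70000, None],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
})

imputed = fill_numeric_with_median(df)
print(imputed)
```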

#### Categorical Imputation with `impute_categorical_mode`

The `impute_categorical_mode` function fills missing values in categorical columns using the mode (most frequent value):

```python
import pandas as pd
import edaflow

# Create sample data with missing categorical values
df = pd.DataFrame({
    'category': ['A', 'B', 'A', None, 'A'],
    'status': ['Active', None, 'Active', 'Inactive', None],
    'priority': ['High', 'Medium', None, 'Low', 'High'],
    'age': [25, 30, 35, 40, 45]
})

# Impute all categorical columns with mode values
df_imputed = edaflow.impute_categorical_mode(df)

# Impute specific columns only
df_imputed = edaflow.impute_categorical_mode(df, columns=['category', 'status'])

# Impute in place (modifies original DataFrame)
edaflow.impute_categorical_mode(df, inplace=True)
```

**Function Features:**
- 📝 **Smart Detection**: Automatically identifies categorical (object) columns
- 🎯 **Mode Imputation**: Uses most frequent value for each column
- ⚖️ **Tie Handling**: Gracefully handles mode ties (multiple values with same frequency)
- 🔄 **Inplace Option**: Modify original DataFrame or create new one
- 🛡️ **Safe Handling**: Gracefully handles edge cases like all-missing columns
- 📋 **Detailed Reporting**: Shows exactly what was imputed and mode tie warnings
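
Mode imputation with tie handling can be sketched as follows. `Series.mode()` returns every tied value in sorted order, so this sketch takes the first one and records the tie. The helper name and the tie-breaking choice are assumptions, not necessarily what edaflow does.

```python
import pandas as pd


def fill_categorical_with_mode(df, columns=None):
    """Return (imputed copy, ties), where `ties` maps column names to
    the list of equally frequent values when the mode is not unique.

    Hypothetical helper for illustration.
    """
    out = df.copy()
    ties = {}
    if columns is None:
        columns = [c for c in out.columns if not pd.api.types.is_numeric_dtype(out[c])]
    for col in columns:
        modes = out[col].mode(dropna=True)
        if modes.empty:
            continue  # every value is missing, nothing to impute with
        if len(modes) > 1:
            ties[col] = modes.tolist()
        out[col] = out[col].fillna(modes.iloc[0])  # first mode in sorted order
    return out, ties


df = pd.DataFrame({
    "category": ["A", "B", "A", None, "A"],
    "grade": ["x", "y", None, "x", "y"],  # 'x' and 'y' are tied
    "age": [25, 30, 35, 40, 45],
})

imputed, ties = fill_categorical_with_mode(df)
print(imputed)
print(ties)
```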

#### Complete Imputation Workflow Example

```python
import pandas as pd
import edaflow

# Sample data with both numerical and categorical missing values
df = pd.DataFrame({
    'age': [25, None, 35, None, 45],
    'salary': [50000, None, 70000, 80000, None],
    'category': ['A', 'B', None, 'A', None],
    'status': ['Active', None, 'Active', 'Inactive', None],
    'score': [85.5, 92.0, None, 88.5, None]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50)

# Step 1: Impute numerical columns
print("STEP 1: Numerical Imputation")
df_step1 = edaflow.impute_numerical_median(df)

# Step 2: Impute categorical columns
print("\nSTEP 2: Categorical Imputation")
df_final = edaflow.impute_categorical_mode(df_step1)

print("\nFinal DataFrame (all missing values imputed):")
print(df_final)

# Verify no missing values remain
print(f"\nMissing values remaining: {df_final.isnull().sum().sum()}")
```

**Expected Output:**
```
🔢 Numerical Missing Value Imputation (Median)
=======================================================
🔄 age                  - Imputed 2 values with median: 35.0
🔄 salary               - Imputed 2 values with median: 70000.0
🔄 score                - Imputed 2 values with median: 88.5

📊 Imputation Summary:
   Columns processed: 3
   Columns imputed: 3
   Total values imputed: 6

📝 Categorical Missing Value Imputation (Mode)
=======================================================
🔄 category             - Imputed 2 values with mode: 'A'
🔄 status               - Imputed 2 values with mode: 'Active'

📊 Imputation Summary:
   Columns processed: 2
   Columns imputed: 2
   Total values imputed: 4
```
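
The per-column counts and medians for the sample frame can be checked by hand with the standard library alone. The sketch below recomputes them from the numerical columns defined in the workflow example; it does not call edaflow, and the `summary` dictionary layout is just an illustration.

```python
from statistics import median

# Numerical columns copied from the workflow example above.
columns = {
    'age': [25, None, 35, None, 45],
    'salary': [50000, None, 70000, 80000, None],
    'score': [85.5, 92.0, None, 88.5, None],
}

summary = {}
for name, values in columns.items():
    observed = [v for v in values if v is not None]
    summary[name] = {
        'missing': len(values) - len(observed),  # values that need imputing
        'median': median(observed),              # fill value for the column
    }

total_imputed = sum(item['missing'] for item in summary.values())
for name, item in summary.items():
    print(f"{name}: {item['missing']} missing, median {item['median']}")
print(f"total values to impute: {total_imputed}")
```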

### Numerical Distribution Analysis with `visualize_numerical_boxplots`

Analyze numerical columns to detect outliers, understand distributions, and assess skewness:

```python
import pandas as pd
import edaflow

# Create sample dataset with outliers
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 28, 32, 38, 42, 100],  # 100 is an outlier
    'salary': [50000, 60000, 75000, 80000, 90000, 55000, 65000, 70000, 85000, 250000],  # 250000 is outlier
    'experience': [2, 5, 8, 12, 15, 3, 6, 9, 13, 30],  # 30 might be an outlier
    'score': [85, 92, 78, 88, 95, 82, 89, 91, 86, 20],  # 20 is an outlier
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C']  # Non-numerical
})

# Basic boxplot analysis
edaflow.visualize_numerical_boxplots(
    df, 
    title="Employee Data Analysis - Outlier Detection",
    show_skewness=True
)

# Custom layout and specific columns
edaflow.visualize_numerical_boxplots(
    df, 
    columns=['age', 'salary'],
    rows=1, 
    cols=2,
    title="Age vs Salary Analysis",
    orientation='vertical',
    color_palette='viridis'
)
```

**Expected Output:**
```
📊 Creating boxplots for 4 numerical column(s): age, salary, experience, score

📈 Summary Statistics:
==================================================
📊 age:
   Range: 25.00 to 100.00
   Median: 36.50
   IQR: 11.00 (Q1: 30.50, Q3: 41.50)
   Skewness: 2.66 (highly skewed)
   Outliers: 1 values outside [14.00, 58.00]
   Outlier values: [100]

📊 salary:
   Range: 50000.00 to 250000.00
   Median: 72500.00
   IQR: 22500.00 (Q1: 61250.00, Q3: 83750.00)
   Skewness: 2.88 (highly skewed)
   Outliers: 1 values outside [27500.00, 117500.00]
   Outlier values: [250000]

📊 experience:
   Range: 2.00 to 30.00
   Median: 8.50
   IQR: 7.50 (Q1: 5.25, Q3: 12.75)
   Skewness: 1.69 (highly skewed)
   Outliers: 1 values outside [-6.00, 24.00]
   Outlier values: [30]

📊 score:
   Range: 20.00 to 95.00
   Median: 87.00
   IQR: 7.75 (Q1: 82.75, Q3: 90.50)
   Skewness: -2.87 (highly skewed)
   Outliers: 1 values outside [71.12, 102.12]
   Outlier values: [20]
```
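
The quartiles and outlier bounds in this report follow the usual 1.5 × IQR rule. As a rough sketch of that arithmetic (not edaflow's own code; the helper name `iqr_outlier_report` is hypothetical), the `age` figures can be reproduced with the standard library:

```python
from statistics import median, quantiles


def iqr_outlier_report(values, k=1.5):
    """Return quartiles, IQR fences, and the values outside the fences."""
    # method='inclusive' interpolates linearly between data points, which
    # reproduces the Q1/Q3 figures quoted for the sample `age` column.
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {
        'median': median(values),
        'q1': q1,
        'q3': q3,
        'iqr': iqr,
        'lower': lower,
        'upper': upper,
        'outliers': [v for v in values if v < lower or v > upper],
    }


age = [25, 30, 35, 40, 45, 28, 32, 38, 42, 100]
report = iqr_outlier_report(age)
print(report)
```

For `age` this gives Q1 = 30.5, Q3 = 41.5, fences of 14.0 and 58.0, and a single outlier (100).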

### Complete EDA Workflow Example

```python
import edaflow
import pandas as pd

# Test the installation
print(edaflow.hello())

# Load your data
df = pd.read_csv('your_data.csv')

# Complete EDA workflow with all core functions:
# 1. Analyze missing data with styled output
null_analysis = edaflow.check_null_columns(df, threshold=10)

# 2. Analyze categorical columns to identify data type issues
edaflow.analyze_categorical_columns(df, threshold=35)

# 3. Convert appropriate object columns to numeric automatically
df_cleaned = edaflow.convert_to_numeric(df, threshold=35)

# 4. Visualize categorical column values
edaflow.visualize_categorical_values(df_cleaned)

# 5. Display column type classification
edaflow.display_column_types(df_cleaned)

# 6. Impute missing values
df_numeric_imputed = edaflow.impute_numerical_median(df_cleaned)
df_fully_imputed = edaflow.impute_categorical_mode(df_numeric_imputed)

# 7. Statistical distribution analysis with advanced insights
edaflow.visualize_histograms(df_fully_imputed, kde=True, show_normal_curve=True)

# 8. Comprehensive relationship analysis
edaflow.visualize_heatmap(df_fully_imputed, heatmap_type='correlation')
edaflow.visualize_scatter_matrix(df_fully_imputed, show_regression=True)

# 9. Generate comprehensive EDA insights and recommendations
insights = edaflow.summarize_eda_insights(df_fully_imputed, target_column='your_target_col')
print(insights)  # View insights dictionary

# 10. Outlier detection and visualization
edaflow.visualize_numerical_boxplots(df_fully_imputed, show_skewness=True)
edaflow.visualize_interactive_boxplots(df_fully_imputed)

# 11. Advanced heatmap analysis
edaflow.visualize_heatmap(df_fully_imputed, heatmap_type='missing')
edaflow.visualize_heatmap(df_fully_imputed, heatmap_type='values')

# 12. Final data cleaning with outlier handling
df_final = edaflow.handle_outliers_median(df_fully_imputed, method='iqr', verbose=True)

# 13. Results verification
edaflow.visualize_scatter_matrix(df_final, title="Clean Data Relationships")
edaflow.visualize_numerical_boxplots(df_final, title="Final Clean Distribution")
```

### 🤖 **Complete ML Workflow** ⭐ *Enhanced in v0.14.0*
```python
import edaflow.ml as ml
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Continue from cleaned data above...
df_final['target'] = your_target_data  # Add your target column

# 1. Setup ML experiment ⭐ NEW: Enhanced parameters in v0.14.0
experiment = ml.setup_ml_experiment(
    df_final, 'target',
    test_size=0.2,               # Test set: 20%
    val_size=0.15,               # ⭐ NEW: Validation set: 15%
    experiment_name="production_ml_pipeline",  # ⭐ NEW: Experiment tracking
    random_state=42,
    stratify=True
)

# Alternative: sklearn-style calling (also enhanced)
# X = df_final.drop('target', axis=1)
# y = df_final['target']
# experiment = ml.setup_ml_experiment(X=X, y=y, val_size=0.15, experiment_name="sklearn_workflow")

print(f"Training: {len(experiment['X_train'])}, Validation: {len(experiment['X_val'])}, Test: {len(experiment['X_test'])}")

# 2. Compare multiple models ⭐ Enhanced with validation set support
models = {
    'RandomForest': RandomForestClassifier(random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
    'LogisticRegression': LogisticRegression(random_state=42),
    'SVM': SVC(random_state=42, probability=True)
}

# Fit all models
for name, model in models.items():
    model.fit(experiment['X_train'], experiment['y_train'])

# โญ Enhanced compare_models with experiment_config support
comparison = ml.compare_models(
    models=models,
    experiment_config=experiment,  # โญ NEW: Automatically uses validation set
    verbose=True
)
print(comparison)  # Professional styled output

# โญ Enhanced rank_models with flexible return formats
# Quick access to best model (list format - NEW)
best_model = ml.rank_models(comparison, 'accuracy', return_format='list')[0]['model_name']
print(f"๐Ÿ† Best model: {best_model}")

# Detailed ranking analysis (DataFrame format - traditional)
ranked_models = ml.rank_models(comparison, 'accuracy')
print("๐Ÿ“Š Top 3 models:")
print(ranked_models.head(3)[['model', 'accuracy', 'f1', 'rank']])

# Advanced: Multi-metric weighted ranking
weighted_ranking = ml.rank_models(
    comparison, 
    'accuracy',
    weights={'accuracy': 0.4, 'f1': 0.3, 'precision': 0.3},
    return_format='list'
)
print(f"🎯 Best by weighted score: {weighted_ranking[0]['model_name']}")

# 3. Hyperparameter optimization ⭐ Enhanced with validation set
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}

best_results = ml.optimize_hyperparameters(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    X_train=experiment['X_train'],
    y_train=experiment['y_train'],
    method='grid_search',
    cv=5
)

# 4. Generate comprehensive performance visualizations
ml.plot_learning_curves(best_results['best_model'], 
                       X_train=experiment['X_train'], y_train=experiment['y_train'])
ml.plot_roc_curves({'optimized_model': best_results['best_model']}, 
                   X_test=experiment['X_test'], y_test=experiment['y_test'])
ml.plot_feature_importance(best_results['best_model'], 
                          feature_names=experiment['feature_names'])

# 5. Save complete model artifacts with experiment tracking
ml.save_model_artifacts(
    model=best_results['best_model'],
    model_name=f"{experiment['experiment_name']}_optimized_model",  # ⭐ NEW: Uses experiment name
    experiment_config=experiment,
    performance_metrics={
        'cv_score': best_results['best_score'],
        'test_score': best_results['best_model'].score(experiment['X_test'], experiment['y_test']),
        'model_type': 'RandomForestClassifier'
    },
    metadata={
        'experiment_name': experiment['experiment_name'],  # ⭐ NEW: Experiment tracking
        'data_shape': df_final.shape,
        'feature_count': len(experiment['feature_names'])
    }
)

print(f"✅ Complete ML pipeline finished! Experiment: {experiment['experiment_name']}")
```
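
The multi-metric ranking in step 2 combines several metrics through a `weights` dictionary. How `rank_models` computes that score internally is not shown here, so the following is only a sketch of one plausible interpretation, a normalised weighted sum. The function `rank_by_weighted_score` and the metric values are made up for illustration.

```python
def rank_by_weighted_score(results, weights):
    """Rank models by a weighted sum of their metric values (highest first)."""
    total = sum(weights.values())  # normalise so weights need not sum to 1
    ranked = []
    for row in results:
        score = sum(row[metric] * w for metric, w in weights.items()) / total
        ranked.append({**row, 'weighted_score': round(score, 4)})
    ranked.sort(key=lambda r: r['weighted_score'], reverse=True)
    for position, row in enumerate(ranked, start=1):
        row['rank'] = position
    return ranked


# Made-up metric values, for illustration only.
results = [
    {'model_name': 'RandomForest', 'accuracy': 0.90, 'f1': 0.88, 'precision': 0.86},
    {'model_name': 'LogisticRegression', 'accuracy': 0.91, 'f1': 0.84, 'precision': 0.85},
]
weights = {'accuracy': 0.4, 'f1': 0.3, 'precision': 0.3}

ranking = rank_by_weighted_score(results, weights)
print(ranking[0]['model_name'], ranking[0]['weighted_score'])
```

With these numbers the model with the highest accuracy is not the one with the highest weighted score, which is the situation a weighted ranking is meant to surface.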

### 🤖 **ML Preprocessing with Smart Encoding** ⭐ *Introduced in v0.12.0*
```python
import edaflow
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')

# Step 1: Analyze encoding needs (with or without target)
encoding_analysis = edaflow.analyze_encoding_needs(
    df, 
    target_column=None,            # Optional: specify target if you have one
    max_cardinality_onehot=15,     # Optional: max categories for one-hot encoding
    max_cardinality_target=50,     # Optional: max categories for target encoding
    ordinal_columns=None           # Optional: specify ordinal columns if known
)

# Step 2: Apply intelligent encoding transformations  
df_encoded = edaflow.apply_smart_encoding(
    df,                            # Use your full dataset (or df.drop('target_col', axis=1) if needed)
    encoding_analysis=encoding_analysis,  # Optional: use previous analysis
    handle_unknown='ignore'        # Optional: how to handle unknown categories
)

# The encoding pipeline automatically:
# ✅ One-hot encodes low cardinality categoricals
# ✅ Target encodes high cardinality with target correlation
# ✅ Binary encodes medium cardinality features
# ✅ TF-IDF vectorizes text columns
# ✅ Preserves numeric columns unchanged
# ✅ Handles memory efficiently for large datasets

print(f"Shape transformation: {df.shape} → {df_encoded.shape}")
print(f"Encoding methods applied: {len(encoding_analysis['encoding_methods'])} different strategies")
```
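
The two cardinality thresholds drive which encoder a column receives. The exact decision logic inside `analyze_encoding_needs` is not documented in this section, so the sketch below is a guess at what such a rule could look like. The function `choose_encoding`, its return labels, and the order of the checks are all assumptions.

```python
def choose_encoding(values, has_target=False,
                    max_cardinality_onehot=15, max_cardinality_target=50):
    """Pick an encoding label for one column from its cardinality.

    Hypothetical rule for illustration only; edaflow's real logic may differ.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return 'leave_unchanged'  # nothing to encode
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in observed
    )
    if is_numeric:
        return 'leave_unchanged'  # numeric columns pass through
    n_unique = len(set(observed))
    if n_unique <= max_cardinality_onehot:
        return 'onehot'
    if has_target and n_unique <= max_cardinality_target:
        return 'target'
    return 'binary'


cities = [f'city_{i}' for i in range(30)]  # 30 distinct categories
print(choose_encoding(['a', 'b', 'a']))
print(choose_encoding([1, 2.5, None]))
print(choose_encoding(cities, has_target=True))
print(choose_encoding(cities))
```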

## Project Structure

```
edaflow/
├── edaflow/
│   ├── __init__.py
│   ├── analysis/
│   ├── visualization/
│   └── preprocessing/
├── tests/
├── docs/
├── examples/
├── setup.py
├── requirements.txt
├── README.md
└── LICENSE
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/new-feature`)
3. Commit your changes (`git commit -m 'Add new feature'`)
4. Push to the branch (`git push origin feature/new-feature`)
5. Open a Pull Request

## Development

### Setup Development Environment
```bash
# Clone the repository
git clone https://github.com/evanlow/edaflow.git
cd edaflow

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
flake8 edaflow/
black edaflow/
isort edaflow/
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Changelog

> **🚀 Latest Updates**: This changelog covers recent releases, including the v0.12.32 critical input validation fix, the v0.12.31 KeyError hotfix, and the v0.12.30 universal display optimization.

### v0.12.32 (2025-08-11) - Critical Input Validation Fix 🐛
- **CRITICAL**: Fixed AttributeError: 'tuple' object has no attribute 'empty' in visualization functions
- **ROOT CAUSE**: Users passing tuple result from `apply_smart_encoding(..., return_encoders=True)` directly to visualization functions
- **ENHANCED**: Added intelligent input validation with helpful error messages for common usage mistakes
- **IMPROVED**: Better error handling in `visualize_scatter_matrix` and other visualization functions
- **DOCUMENTED**: Clear examples showing correct vs incorrect usage patterns for `apply_smart_encoding`
- **STABILITY**: Prevents crashes in step 14 of EDA workflows when encoding functions are misused
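
As a rough illustration of the kind of input validation described in this entry, a guard can reject tuple input before any DataFrame attribute is touched. The helper name `ensure_not_tuple` and the message wording are hypothetical, not edaflow's actual code.

```python
def ensure_not_tuple(data, func_name):
    """Reject tuple input early with a hint about unpacking."""
    if isinstance(data, tuple):
        raise TypeError(
            f"{func_name}() received a tuple of length {len(data)}. "
            "If it came from a call that returns (DataFrame, encoders), "
            "unpack it first: df_encoded, encoders = ..."
        )
    return data


# Stand-in objects; a real call would pass a DataFrame and encoder mapping.
try:
    ensure_not_tuple(({'col': [1, 2]}, {'col': 'encoder'}),
                     'visualize_scatter_matrix')
except TypeError as exc:
    message = str(exc)
    print(message)
```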

### v0.12.31 (2025-01-05) - Critical KeyError Hotfix 🚨
- **CRITICAL**: Fixed KeyError: 'type' in `summarize_eda_insights()` function during Google Colab usage
- **RESOLVED**: Exception handling when target analysis dictionary missing expected keys
- **IMPROVED**: Enhanced error handling with safe dictionary access using `.get()` method
- **MAINTAINED**: All existing functionality preserved - pure stability fix
- **TESTED**: Verified fix works across all notebook platforms (Colab, JupyterLab, VS Code)

### v0.12.30 (2025-01-05) - Universal Display Optimization Breakthrough 🎨
- **BREAKTHROUGH**: Introduced `optimize_display()` function for universal notebook compatibility
- **REVOLUTIONARY**: Automatic platform detection (Google Colab, JupyterLab, VS Code Notebooks, Classic Jupyter)
- **ENHANCED**: Dynamic CSS injection for perfect dark/light mode visibility across all platforms
- **NEW FEATURE**: Automatic matplotlib backend optimization for each notebook environment  
- **ACCESSIBILITY**: Solves visibility issues in dark mode themes universally
- **SEAMLESS**: Zero configuration required - automatically detects and optimizes for your platform
- **COMPATIBILITY**: Works flawlessly across Google Colab, JupyterLab, VS Code, Classic Jupyter
- **EXAMPLE**: Simple usage: `from edaflow import optimize_display; optimize_display()`

### v0.12.3 (2025-08-06) - Complete Positional Argument Compatibility Fix 🔧
- **CRITICAL**: Fixed positional argument usage for `visualize_image_classes()` function  
- **RESOLVED**: TypeError when calling `visualize_image_classes(image_paths, ...)` with positional arguments
- **ENHANCED**: Comprehensive backward compatibility supporting all three usage patterns:
  - Positional: `visualize_image_classes(path, ...)` (shows warning)
  - Deprecated keyword: `visualize_image_classes(image_paths=path, ...)` (shows warning)
  - Recommended: `visualize_image_classes(data_source=path, ...)` (no warning)
- **IMPROVED**: Clear deprecation warnings guiding users toward recommended syntax
- **SECURE**: Prevents using both parameters simultaneously to avoid confusion
- **RESOLVED**: TypeError for users calling with `image_paths=` parameter from v0.12.0 breaking change
- **ENHANCED**: Improved error messages for parameter validation in image visualization functions
- **DOCUMENTATION**: Added comprehensive parameter documentation including deprecation notices
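
The keyword rename described above follows a common deprecation pattern. The sketch below shows that pattern in isolation, covering only the two keyword forms and not the positional one. The function `visualize_image_classes_sketch` is a stand-in, not the real `visualize_image_classes`.

```python
import warnings


def visualize_image_classes_sketch(data_source=None, image_paths=None):
    """Accept the new `data_source` keyword and the deprecated `image_paths`."""
    if data_source is not None and image_paths is not None:
        raise ValueError("Pass either data_source or image_paths, not both.")
    if image_paths is not None:
        warnings.warn(
            "image_paths is deprecated; use data_source instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        data_source = image_paths
    if data_source is None:
        raise ValueError("data_source is required.")
    return data_source


# Capture the warning so it can be inspected instead of printed to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = visualize_image_classes_sketch(image_paths="images/")

print(resolved, [str(w.message) for w in caught])
```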

### v0.12.2 (2025-08-06) - Documentation Refresh 📚
- **IMPROVED**: Enhanced README.md with updated timestamps and current version indicators
- **FIXED**: Ensured PyPI displays the most current changelog information including v0.12.1 fixes
- **ENHANCED**: Added latest updates indicator to changelog for better visibility
- **DOCUMENTATION**: Forced PyPI cache refresh to display current version information

## ✨ What's New in v0.16.2

**New Features:**
- Faceted visualizations with `display_facet_grid`
- Feature scaling with `scale_features`
- Grouping rare categories with `group_rare_categories`
- Exporting figures with `export_figure`

**Documentation Updates:**
- User Guide, Advanced Features, and Best Practices now reference all new APIs
- Visualization Guide includes external library requirements and troubleshooting
- Changelog documents all new features and documentation changes

**External Library Requirements:**
Some advanced features require additional libraries:
- matplotlib
- seaborn
- scikit-learn
- statsmodels
- pandas

See the Visualization Guide for installation instructions and troubleshooting tips.

---

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "edaflow",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Evan Low <evan.low@illumetechnology.com>",
    "keywords": "data-analysis, eda, exploratory-data-analysis, data-science, visualization",
    "author": null,
    "author_email": "Evan Low <evan.low@illumetechnology.com>",
    "download_url": "https://files.pythonhosted.org/packages/bd/61/c76ac3dfde6698fdeebfa12275be32510587bffe48efe7ef523251a54e57/edaflow-0.17.1.tar.gz",
    "platform": null,
    "description": "# \ud83d\ude80 What's New in v0.17.1\r\n\r\n**Notebook Fixes & Beginner Experience:**\r\n- Fixed confusion matrix example in classification and advanced workflow notebooks to match API signature\r\n- Audited all example notebooks for beginner-friendliness and error-free execution\r\n- All notebooks now run without unnecessary errors for new users\r\n\r\n**Major Examples & Documentation Update (v0.17.0):**\r\n- Added interactive Jupyter notebooks for all major workflows\r\n- All guides and onboarding instructions updated to reference new features and examples\r\n- Verified and enhanced documentation for:\r\n    - `highlight_anomalies`\r\n    - `create_lag_features`\r\n    - `display_facet_grid`\r\n    - `scale_features`\r\n    - `group_rare_categories`\r\n    - `export_figure`\r\n- Improved onboarding and user guidance in ReadTheDocs and README\r\n- Minor bug fixes and consistency improvements across docs and codebase\r\n\r\n**Why Upgrade?**\r\nThis release ensures all users have access to complete, copy-paste-ready examples and documentation. The onboarding experience is now smoother, and all advanced features are fully documented.\r\n\r\nSee the full documentation at [edaflow.readthedocs.io](https://edaflow.readthedocs.io)\r\n\r\n**Major Documentation Overhaul for Education:**\r\n- Added a dedicated Learning Path for new and aspiring data scientists\r\n- Consolidated ML workflow steps into a single, copy-paste-safe guide\r\n- Expanded examples: classification, regression, and computer vision\r\n- Improved navigation: clear table of contents, user guide, API reference, and best practices\r\n- Advanced features and troubleshooting tips for power users\r\n\r\n**Why Upgrade?**\r\nThis release makes edaflow best-in-class for educational value, with a structured progression for learners and educators. 
All documentation is now easier to follow, with practical code and hands-on exercises.\r\n\r\nSee the full documentation at [edaflow.readthedocs.io](https://edaflow.readthedocs.io)\r\n# edaflow\r\n\r\n[![Documentation Status](https://readthedocs.org/projects/edaflow/badge/?version=latest)](https://edaflow.readthedocs.io/en/latest/?badge=latest)\r\n[![PyPI version](https://badge.fury.io/py/edaflow.svg)](https://badge.fury.io/py/edaflow)\r\n[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\r\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\r\n[![Downloads](https://pepy.tech/badge/edaflow)](https://pepy.tech/project/edaflow)\r\n\r\n**Quick Navigation**: \r\n\ud83d\udcda [Documentation](https://edaflow.readthedocs.io) | \r\n\ud83d\udce6 [PyPI Package](https://pypi.org/project/edaflow/) | \r\n\ud83d\ude80 [Quick Start](https://edaflow.readthedocs.io/en/latest/quickstart.html) | \r\n\ud83d\udccb [Changelog](#-changelog) | \r\n\ud83d\udc1b [Issues](https://github.com/evanlow/edaflow/issues)\r\n\r\nA Python package for streamlined exploratory data analysis workflows.\r\n\r\n > **\ud83d\udce6 Current Version: v0.16.4** - [Latest Release](https://pypi.org/project/edaflow/0.16.4/) adds a complete examples directory, improved onboarding, and fully documented advanced features. 
*Updated: September 12, 2025*\r\n\r\n## \ud83d\udcd6 Table of Contents\r\n\r\n- [Description](#description)\r\n- [\ud83d\udea8 Critical Fixes in v0.15.0](#-critical-fixes-in-v0150)\r\n- [\u2728 What's New](#-whats-new)\r\n- [Features](#features)\r\n- [\ud83c\udd95 Recent Updates](#-recent-updates)\r\n- [\ud83d\udcda Documentation](#-documentation)\r\n- [Installation](#installation)\r\n- [Quick Start](#quick-start)\r\n- [\ud83d\udccb Changelog](#-changelog)\r\n- [Support](#support)\r\n- [Roadmap](#roadmap)\r\n\r\n## Description\r\n\r\n`edaflow` is designed to simplify and accelerate the exploratory data analysis (EDA) process by providing a collection of tools and utilities for data scientists and analysts. The package integrates popular data science libraries to create a cohesive workflow for data exploration, visualization, and preprocessing.\r\n\r\n## \ud83d\udea8 What's New in v0.15.1\r\n\r\n**NEW:** `setup_ml_experiment` now supports a `primary_metric` argument, making metric selection robust and error-free for all ML workflows. All documentation, tests, and downstream code are updated for consistency. 
A new test ensures the metric is set and accessible throughout the workflow.\r\n\r\n**Upgrade recommended for all users who want reliable, copy-paste-safe ML workflows with dynamic metric selection.**\r\n\r\n---\r\n\r\n## \ud83d\udea8 Critical Fixes in v0.15.0\r\n**(Previous release)**\r\n\r\n### \ud83c\udfaf **Issues Resolved**:\r\n- \u274c **FIXED**: `RandomForestClassifier instance is not fitted yet` errors\r\n- \u274c **FIXED**: `TypeError: unexpected keyword argument` errors  \r\n- \u274c **FIXED**: Missing imports and undefined variables in examples\r\n- \u274c **FIXED**: Duplicate step numbering in documentation\r\n- \u2705 **RESULT**: All ML workflow examples now work perfectly!\r\n\r\n### \ud83d\udccb **What This Means For You**:\r\n- \ud83c\udf89 **Copy-paste examples that work immediately**\r\n- \ud83c\udfaf **No more confusing error messages**\r\n- \ud83d\udcda **Complete, beginner-friendly documentation**\r\n- \ud83d\ude80 **Smooth learning experience for new users**\r\n\r\n**Upgrade recommended for all users following ML workflow documentation.**\r\n\r\n## \u2728 What's New\r\n\r\n### \ud83d\udea8 Critical ML Documentation Fixes (v0.15.0)\r\n**MAJOR DOCUMENTATION UPDATE**: Fixed critical issues that were causing user errors when following ML workflow examples.\r\n\r\n**Problems Resolved**:\r\n- \u2705 **Model Fitting**: Added missing `model.fit()` steps that were causing \"not fitted\" errors\r\n- \u2705 **Function Parameters**: Fixed incorrect parameter names in all examples\r\n- \u2705 **Missing Context**: Added imports and data preparation context  \r\n- \u2705 **Step Numbering**: Corrected duplicate step numbers in documentation\r\n- \u2705 **Enhanced Warnings**: Added prominent warnings about critical requirements\r\n\r\n**Result**: All ML workflow documentation now works perfectly out-of-the-box!\r\n\r\n### \ud83c\udfaf Enhanced rank_models Function (v0.14.x)\r\n**DUAL RETURN FORMAT SUPPORT**: Major enhancement based on user 
requests.\r\n\r\n```python\r\n# Both formats now supported:\r\ndf_results = ml.rank_models(results, 'accuracy')  # DataFrame (default)\r\nlist_results = ml.rank_models(results, 'accuracy', return_format='list')  # List of dicts\r\n\r\n# User-requested pattern now works:\r\nbest_model = ml.rank_models(results, 'accuracy', return_format='list')[0][\"model_name\"]\r\n```\r\n\r\n### \ud83d\ude80 ML Expansion (v0.13.0+)\r\n**COMPLETE MACHINE LEARNING SUBPACKAGE**: Extended edaflow into full ML workflows.\r\n\r\n**New ML Modules Added**:\r\n- **`ml.config`**: ML experiment setup and data validation\r\n- **`ml.leaderboard`**: Multi-model comparison and ranking\r\n- **`ml.tuning`**: Advanced hyperparameter optimization\r\n- **`ml.curves`**: Learning curves and performance visualization\r\n- **`ml.artifacts`**: Model persistence and experiment tracking\r\n\r\n**Key ML Features**:\r\n```python\r\n# Complete ML workflow in one package\r\nimport edaflow.ml as ml\r\n\r\n# Setup experiment with flexible parameter support\r\n# Both calling patterns work:\r\nexperiment = ml.setup_ml_experiment(df, 'target')  # DataFrame style\r\n# OR\r\nexperiment = ml.setup_ml_experiment(X=X, y=y, val_size=0.15)  # sklearn style\r\n\r\n# Compare multiple models\r\nresults = ml.compare_models(models, **experiment)\r\n\r\n# Optimize hyperparameters with multiple strategies\r\nbest_model = ml.optimize_hyperparameters(model, params, **experiment)\r\n\r\n# Generate comprehensive visualizations\r\nml.plot_learning_curves(model, **experiment)\r\n```\r\n\r\n### Previous: API Improvement (v0.12.33)\r\n**NEW CLEAN APIs**: Introduced consistent, user-friendly encoding functions that eliminate confusion and crashes.\r\n\r\n**Root Cause Solved**: The inconsistent return type of `apply_smart_encoding()` (sometimes DataFrame, sometimes tuple) was causing AttributeError crashes and user confusion.\r\n\r\n**New Functions Added**:\r\n```python\r\n# \u2705 NEW: Clean, consistent DataFrame return 
(RECOMMENDED)\r\ndf_encoded = edaflow.apply_encoding(df)  # Always returns DataFrame\r\n\r\n# \u2705 NEW: Explicit tuple return when encoders needed\r\ndf_encoded, encoders = edaflow.apply_encoding_with_encoders(df)  # Always returns tuple\r\n\r\n# \u26a0\ufe0f DEPRECATED: Inconsistent behavior (still works with warnings)\r\ndf_encoded = edaflow.apply_smart_encoding(df, return_encoders=True)  # Sometimes tuple!\r\n```\r\n\r\n**Benefits**:\r\n- \ud83c\udfaf **Zero Breaking Changes**: All existing workflows continue working exactly the same\r\n- \ud83d\udee1\ufe0f **Better Error Messages**: Helpful guidance when mistakes are made  \r\n- \ud83d\udd04 **Migration Path**: Multiple options for users who want cleaner APIs\r\n- \ud83d\udcda **Clear Documentation**: Explicit examples showing best practices\r\n\r\n### \ud83d\udc1b Critical Input Validation Fix (v0.12.32)\r\n**RESOLVED**: Fixed AttributeError: 'tuple' object has no attribute 'empty' in visualization functions when `apply_smart_encoding(..., return_encoders=True)` result is used incorrectly.\r\n\r\n**Problem Solved**: Users who passed the tuple result from `apply_smart_encoding` directly to visualization functions without unpacking were experiencing crashes in step 14 of EDA workflows.\r\n\r\n**Enhanced Error Messages**: Added intelligent input validation with helpful error messages guiding users to the correct usage pattern:\r\n```python\r\n# \u274c WRONG - This causes the AttributeError:\r\ndf_encoded = edaflow.apply_smart_encoding(df, return_encoders=True)  # Returns (df, encoders) tuple!\r\nedaflow.visualize_scatter_matrix(df_encoded)  # Crashes with AttributeError\r\n\r\n# \u2705 CORRECT - Unpack the tuple:  \r\ndf_encoded, encoders = edaflow.apply_smart_encoding(df, return_encoders=True)\r\nedaflow.visualize_scatter_matrix(df_encoded)  # Should work well!\r\n```\r\n\r\n### \ud83c\udfa8 BREAKTHROUGH: Universal Dark Mode Compatibility (v0.12.30)\r\n- **NEW FUNCTION**: `optimize_display()` - The **FIRST** 
EDA library with universal notebook compatibility!\r\n- **Universal Platform Support**: Improved visibility across Google Colab, JupyterLab, VS Code, and Classic Jupyter\r\n- **Automatic Detection**: Zero configuration needed - automatically detects your environment\r\n- **Accessibility Support**: Built-in high contrast mode for improved accessibility\r\n- **One-Line Solution**: `edaflow.optimize_display()` fixes all visibility issues instantly\r\n\r\n### \ud83d\udc1b Critical KeyError Hotfix (v0.12.31)\r\n- **Fixed KeyError**: Resolved \"KeyError: 'type'\" in `summarize_eda_insights()` function\r\n- **Enhanced Error Handling**: Added robust exception handling for target analysis edge cases\r\n- **Improved Stability**: Function now handles missing or invalid target columns gracefully\r\n\r\n### \ud83c\udf1f Platform Benefits:\r\n- \u2705 **Google Colab**: Auto light/dark mode detection with improved text visibility\r\n- \u2705 **JupyterLab**: Dark theme compatibility with custom theme support\r\n- \u2705 **VS Code**: Native theme integration with seamless notebook experience  \r\n- \u2705 **Classic Jupyter**: Full compatibility with enhanced readability options\r\n\r\n```python\r\nimport edaflow\r\n# \u2b50 NEW: Improved visibility everywhere!\r\nedaflow.optimize_display()  # Universal dark mode fix!\r\n\r\n# All functions now display beautifully\r\nedaflow.check_null_columns(df)\r\nedaflow.visualize_histograms(df)\r\n```\r\n\r\n### \u2728 NEW FUNCTION: `summarize_eda_insights()` (Added in v0.12.28)\r\n- **Comprehensive Analysis**: Generate complete EDA insights and actionable recommendations after completing your analysis workflow\r\n- **Smart Recommendations**: Provides intelligent next steps for modeling, preprocessing, and data quality improvements\r\n- **Target-Aware Analysis**: Supports both classification and regression scenarios with specific insights\r\n- **Function Tracking**: Knows which edaflow functions you've already used in your workflow\r\n- 
**Structured Output**: Returns organized dictionary with dataset overview, data quality assessment, and recommendations\r\n\r\n### \ud83c\udfa8 Display Formatting Excellence\r\n- **Enhanced Visual Experience**: Refined Rich console styling with optimized panel borders and alignment\r\n- **Google Colab Optimized**: Improved display formatting specifically tailored for notebook environments\r\n- **Consistent Design**: Professional rounded borders, proper width constraints, and refined color schemes\r\n- **Universal Compatibility**: Beautiful output rendering across all major Python environments and notebooks\r\n\r\n### \ufffd Recent Fixes (v0.12.24-0.12.26)\r\n- **LBP Warning Resolution**: Fixed scikit-image UserWarning in texture analysis functions\r\n- **Parameter Documentation**: Corrected `analyze_image_features` documentation mismatches\r\n- **RTD Synchronization**: Updated Read the Docs changelog with all recent improvements\r\n\r\n### \ud83c\udf08 Rich Styling (v0.12.20-0.12.21)\r\n- **Vibrant Output**: ALL major EDA functions now feature professional, color-coded styling\r\n- **Smart Indicators**: Color-coded severity levels (\u2705 CLEAN, \u26a0\ufe0f WARNING, \ud83d\udea8 CRITICAL)\r\n- **Professional Tables**: Beautiful formatted output with rich library integration\r\n- **Actionable Insights**: Context-aware recommendations and visual status indicators\r\n\r\n## Features\r\n\r\n### \ud83d\udd0d **Exploratory Data Analysis**\r\n- **Missing Data Analysis**: Color-coded analysis of null values with customizable thresholds\r\n- **Categorical Data Insights**: \ud83d\udc1b *FIXED in v0.12.29* Identify object columns that might be numeric, detect data type issues (now handles unhashable types)\r\n- **Automatic Data Type Conversion**: Smart conversion of object columns to numeric when appropriate\r\n- **Categorical Values Visualization**: Detailed exploration of categorical column values with insights\r\n- **Column Type Classification**: Simple categorization of 
DataFrame columns into categorical and numerical types\r\n- **Data Type Detection**: Smart analysis to flag potential data conversion needs\r\n- **EDA Insights Summary**: \u2b50 *NEW in v0.12.28* Comprehensive EDA insights and actionable recommendations after completing analysis workflow\r\n\r\n### \ud83d\udcca **Advanced Visualizations**\r\n- **Numerical Distribution Visualization**: Advanced boxplot analysis with outlier detection and statistical summaries\r\n- **Interactive Boxplot Visualization**: Interactive Plotly Express boxplots with zoom, hover, and statistical tooltips\r\n- **Comprehensive Heatmap Visualizations**: Correlation matrices, missing data patterns, values heatmaps, and cross-tabulations\r\n- **Statistical Histogram Analysis**: Advanced histogram visualization with skewness detection, normality testing, and distribution analysis\r\n- **Scatter Matrix Analysis**: Advanced pairwise relationship visualization with customizable matrix layouts, regression lines, and statistical insights\r\n\r\n### \ud83e\udd16 **Machine Learning Preprocessing** \u2b50 *Introduced in v0.12.0*\r\n- **Intelligent Encoding Analysis**: Automatic detection of optimal encoding strategies for categorical variables\r\n- **Smart Encoding Application**: Automated categorical encoding with support for:\r\n  - One-Hot Encoding for low cardinality categories\r\n  - Target Encoding for high cardinality with target correlation\r\n  - Ordinal Encoding for ordinal relationships\r\n  - Binary Encoding for medium cardinality\r\n  - Text Vectorization (TF-IDF) for text features\r\n  - Leave Unchanged for numeric columns\r\n- **Memory-Efficient Processing**: Intelligent handling of high-cardinality features to prevent memory issues\r\n- **Comprehensive Encoding Pipeline**: End-to-end preprocessing solution for ML model preparation\r\n\r\n### \ud83e\udd16 **Machine Learning Workflows** \u2b50 *NEW in v0.13.0*\r\nThe powerful `edaflow.ml` subpackage provides comprehensive machine learning 
workflow capabilities:\r\n\r\n#### **ML Experiment Setup (`ml.config`)**\r\n- **Smart Data Validation**: Automatic data quality assessment and problem type detection\r\n- **Intelligent Data Splitting**: Train/validation/test splits with stratification support\r\n- **ML Pipeline Configuration**: Automated preprocessing pipeline setup for ML workflows\r\n\r\n#### **Model Comparison & Ranking (`ml.leaderboard`)**\r\n- **Multi-Model Evaluation**: Compare multiple models with comprehensive metrics\r\n- **Smart Leaderboards**: Automatically rank models by performance with visual displays\r\n- **Export Capabilities**: Save comparison results for reporting and analysis\r\n\r\n#### **Hyperparameter Optimization (`ml.tuning`)**\r\n- **Multiple Search Strategies**: Grid search, random search, and Bayesian optimization\r\n- **Cross-Validation Integration**: Built-in CV with customizable scoring metrics\r\n- **Parallel Processing**: Multi-core hyperparameter optimization for faster results\r\n\r\n#### **Learning & Performance Curves (`ml.curves`)**\r\n- **Learning Curves**: Visualize model performance vs training size\r\n- **Validation Curves**: Analyze hyperparameter impact on model performance\r\n- **ROC & Precision-Recall Curves**: Comprehensive classification performance analysis\r\n- **Feature Importance**: Visual analysis of model feature contributions\r\n\r\n#### **Model Persistence & Tracking (`ml.artifacts`)**\r\n- **Complete Model Artifacts**: Save models, configs, and metadata\r\n- **Experiment Tracking**: Track multiple experiments with organized storage\r\n- **Model Reports**: Generate comprehensive model performance reports\r\n- **Version Management**: Organized model versioning and retrieval\r\n\r\n**Quick ML Example:**\r\n```python\r\nimport edaflow.ml as ml\r\nfrom sklearn.ensemble import RandomForestClassifier\r\nfrom sklearn.linear_model import LogisticRegression\r\n\r\n# Setup ML experiment - Multiple parameter patterns supported\r\n# Method 1: DataFrame + target column (recommended)\r\nexperiment = 
ml.setup_ml_experiment(df, target_column='target')\r\n\r\n# Method 2: sklearn-style (also supported)\r\nX = df.drop('target', axis=1)\r\ny = df['target']\r\nexperiment = ml.setup_ml_experiment(\r\n    X=X, y=y,\r\n    test_size=0.2,\r\n    val_size=0.15,  # Alternative to validation_size\r\n    experiment_name=\"my_ml_project\",\r\n    stratify=True,\r\n    random_state=42\r\n)\r\n\r\n# Compare multiple models\r\nmodels = {\r\n    'RandomForest': RandomForestClassifier(),\r\n    'LogisticRegression': LogisticRegression()\r\n}\r\ncomparison = ml.compare_models(models, **experiment)\r\n\r\n# Rank models with flexible access patterns\r\n# Method 1: Easy dictionary access (recommended for getting best model)\r\nbest_model_name = ml.rank_models(comparison, 'accuracy', return_format='list')[0]['model_name']\r\n\r\n# Method 2: Traditional DataFrame format\r\nranked_df = ml.rank_models(comparison, 'accuracy')\r\nbest_model_traditional = ranked_df.iloc[0]['model']\r\n\r\n# Both methods give the same result\r\nprint(f\"Best model: {best_model_name}\")  # Easy access\r\nprint(f\"Best model: {best_model_traditional}\")  # Traditional access\r\n\r\n# Optimize hyperparameters\r\n\r\n# --- Copy-paste-safe hyperparameter optimization example ---\r\nmodel_name = 'LogisticRegression'  # or 'RandomForest' or 'GradientBoosting'\r\n\r\nif model_name == 'RandomForest':\r\n    param_distributions = {\r\n        'n_estimators': [100, 200, 300],\r\n        'max_depth': [5, 10, 15, None],\r\n        'min_samples_split': [2, 5, 10]\r\n    }\r\n    model = RandomForestClassifier()\r\n    method = 'grid'\r\nelif model_name == 'GradientBoosting':\r\n    param_distributions = {\r\n        'n_estimators': (50, 200),\r\n        'learning_rate': (0.01, 0.3),\r\n        'max_depth': (3, 8)\r\n    }\r\n    from sklearn.ensemble import GradientBoostingClassifier\r\n    model = GradientBoostingClassifier()\r\n    method = 'bayesian'\r\nelif model_name == 'LogisticRegression':\r\n    param_distributions 
= {\r\n        'C': [0.01, 0.1, 1, 10, 100],\r\n        'penalty': ['l2'],  # 'l2' is supported by every solver listed below\r\n        'solver': ['lbfgs', 'liblinear', 'saga']\r\n    }\r\n    model = LogisticRegression(max_iter=1000)\r\n    method = 'grid'\r\nelse:\r\n    raise ValueError(f\"Unknown model_name: {model_name}\")\r\n\r\nresults = ml.optimize_hyperparameters(\r\n    model,\r\n    param_distributions=param_distributions,\r\n    **experiment\r\n)\r\n\r\n# Generate learning curves\r\nml.plot_learning_curves(results['best_model'], **experiment)\r\n\r\n# Save complete artifacts\r\nml.save_model_artifacts(\r\n    model=results['best_model'],\r\n    model_name=f'optimized_{model_name}',\r\n    experiment_config=experiment,\r\n    performance_metrics=results['cv_results']\r\n)\r\n```\r\n\r\n### \ud83d\uddbc\ufe0f **Computer Vision Support**\r\n- **Computer Vision EDA**: Class-wise image sample visualization and comprehensive quality assessment for image classification datasets\r\n- **Image Quality Assessment**: Automated detection of corrupted images, quality issues, blur, artifacts, and dataset health metrics\r\n\r\n## Usage Examples\r\n\r\n### Basic Usage\r\n```python\r\nimport edaflow\r\n\r\n# Verify installation\r\nmessage = edaflow.hello()\r\nprint(message)  # Output: \"Hello from edaflow! 
Ready for exploratory data analysis.\"\r\n```\r\n\r\n### Missing Data Analysis with `check_null_columns`\r\n\r\nThe `check_null_columns` function provides a color-coded analysis of missing data in your DataFrame:\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Create sample data with missing values\r\ndf = pd.DataFrame({\r\n    'customer_id': [1, 2, 3, 4, 5],\r\n    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],\r\n    'age': [25, None, 35, None, 45],\r\n    'email': [None, None, None, None, None],  # All missing\r\n    'purchase_amount': [100.5, 250.0, 75.25, None, 320.0]\r\n})\r\n\r\n# Analyze missing data with default threshold (10%)\r\nstyled_result = edaflow.check_null_columns(df)\r\nstyled_result  # Display in Jupyter notebook for color-coded styling\r\n\r\n# Use custom threshold (20%) to change color coding sensitivity\r\nstyled_result = edaflow.check_null_columns(df, threshold=20)\r\nstyled_result\r\n\r\n# Access underlying data if needed\r\ndata = styled_result.data\r\nprint(data)\r\n```\r\n\r\n**Color Coding:**\r\n- \ud83d\udd34 **Red**: > 20% missing (high concern)\r\n- \ud83d\udfe1 **Yellow**: 10-20% missing (medium concern)  \r\n- \ud83d\udfe8 **Light Yellow**: 1-10% missing (low concern)\r\n- \u2b1c **Gray**: 0% missing (no issues)\r\n\r\n### Categorical Data Analysis with `analyze_categorical_columns`\r\n\r\nThe `analyze_categorical_columns` function helps identify data type issues and provides insights into object-type columns:\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Create sample data with mixed categorical types\r\ndf = pd.DataFrame({\r\n    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],\r\n    'price_str': ['999', '25', '75', '450'],  # Numbers stored as strings\r\n    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics'],\r\n    'rating': [4.5, 3.8, 4.2, 4.7],  # Already numeric\r\n    'mixed_ids': ['001', '002', 'ABC', '004'],  # Mixed format\r\n    'status': 
['active', 'inactive', 'active', 'pending']\r\n})\r\n\r\n# Analyze categorical columns with default threshold (35%)\r\nedaflow.analyze_categorical_columns(df)\r\n\r\n# Use custom threshold (50%) to be more lenient about mixed data\r\nedaflow.analyze_categorical_columns(df, threshold=50)\r\n```\r\n\r\n**Output Interpretation:**\r\n- \ud83d\udd34\ud83d\udd35 **Highlighted in Red/Blue**: Potentially numeric columns that might need conversion\r\n- \ud83d\udfe1\u26ab **Highlighted in Yellow/Black**: Shows unique values for potential numeric columns\r\n- **Regular text**: Truly categorical columns with statistics\r\n- **\"not an object column\"**: Already properly typed numeric columns\r\n\r\n### Data Type Conversion with `convert_to_numeric`\r\n\r\nAfter analyzing your categorical columns, you can automatically convert appropriate columns to numeric:\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Create sample data with string numbers\r\ndf = pd.DataFrame({\r\n    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],\r\n    'price_str': ['999', '25', '75', '450'],      # Should convert\r\n    'mixed_ids': ['001', '002', 'ABC', '004'],    # Mixed data\r\n    'category': ['Electronics', 'Accessories', 'Electronics', 'Electronics']\r\n})\r\n\r\n# Convert appropriate columns to numeric (threshold=35% by default)\r\ndf_converted = edaflow.convert_to_numeric(df, threshold=35)\r\n\r\n# Or modify the original DataFrame in place\r\nedaflow.convert_to_numeric(df, threshold=35, inplace=True)\r\n\r\n# Use a stricter threshold (only convert if <20% non-numeric values)\r\ndf_strict = edaflow.convert_to_numeric(df, threshold=20)\r\n```\r\n\r\n**Function Features:**\r\n- \u2705 **Smart Detection**: Only converts columns with few non-numeric values\r\n- \u2705 **Customizable Threshold**: Control conversion sensitivity \r\n- \u2705 **Safe Conversion**: Non-numeric values become NaN (not errors)\r\n- \u2705 **Inplace Option**: Modify original DataFrame or create 
new one\r\n- \u2705 **Detailed Output**: Shows exactly what was converted and why\r\n\r\n### Categorical Data Visualization with `visualize_categorical_values`\r\n\r\nAfter cleaning your data, explore categorical columns in detail to understand value distributions:\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Example DataFrame with categorical data\r\ndf = pd.DataFrame({\r\n    'department': ['Sales', 'Marketing', 'Sales', 'HR', 'Marketing', 'Sales', 'IT'],\r\n    'status': ['Active', 'Inactive', 'Active', 'Pending', 'Active', 'Active', 'Inactive'],\r\n    'priority': ['High', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low'],\r\n    'employee_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],  # Numeric (ignored)\r\n    'salary': [50000, 60000, 70000, 45000, 58000, 62000, 70000]  # Numeric (ignored)\r\n})\r\n\r\n# Visualize all categorical columns\r\nedaflow.visualize_categorical_values(df)\r\n```\r\n\r\n**Advanced Usage Examples:**\r\n\r\n```python\r\n# Handle high-cardinality data (many unique values)\r\nlarge_df = pd.DataFrame({\r\n    'product_id': [f'PROD_{i:04d}' for i in range(100)],  # 100 unique values\r\n    'category': ['Electronics'] * 40 + ['Clothing'] * 35 + ['Books'] * 25,\r\n    'status': ['Available'] * 80 + ['Out of Stock'] * 15 + ['Discontinued'] * 5\r\n})\r\n\r\n# Limit display for high-cardinality columns\r\nedaflow.visualize_categorical_values(large_df, max_unique_values=5)\r\n```\r\n\r\n```python\r\n# DataFrame with missing values for comprehensive analysis\r\ndf_with_nulls = pd.DataFrame({\r\n    'region': ['North', 'South', None, 'East', 'West', 'North', None],\r\n    'customer_type': ['Premium', 'Standard', 'Premium', None, 'Standard', 'Premium', 'Standard'],\r\n    'transaction_id': [f'TXN_{i}' for i in range(7)],  # Mostly unique (ID-like)\r\n})\r\n\r\n# Get detailed insights including missing value analysis\r\nedaflow.visualize_categorical_values(df_with_nulls)\r\n```\r\n\r\n**Function Features:**\r\n- \ud83c\udfaf 
**Zero Breaking Changes**: All existing workflows continue working exactly the same\r\n- \ud83d\udee1\ufe0f **Better Error Messages**: Helpful guidance when mistakes are made  \r\n- \ud83d\udd04 **Migration Path**: Multiple options for users who want cleaner APIs\r\n- \ud83d\udcda **Clear Documentation**: Explicit examples showing best practices\r\n\r\n### Column Type Classification with `display_column_types`\r\n\r\nThe `display_column_types` function provides a simple way to categorize DataFrame columns into categorical and numerical types:\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Create sample data with mixed types\r\ndata = {\r\n    'name': ['Alice', 'Bob', 'Charlie'],\r\n    'age': [25, 30, 35],\r\n    'city': ['NYC', 'LA', 'Chicago'],\r\n    'salary': [50000, 60000, 70000],\r\n    'is_active': [True, False, True]\r\n}\r\ndf = pd.DataFrame(data)\r\n\r\n# Display column type classification\r\nresult = edaflow.display_column_types(df)\r\n\r\n# Access the categorized column lists\r\ncategorical_cols = result['categorical']  # ['name', 'city']\r\nnumerical_cols = result['numerical']      # ['age', 'salary', 'is_active']\r\n```\r\n\r\n**Example Output:**\r\n```\r\n\ud83d\udcca Column Type Analysis\r\n==================================================\r\n\r\n\ud83d\udcdd Categorical Columns (2 total):\r\n    1. name                 (unique values: 3)\r\n    2. city                 (unique values: 3)\r\n\r\n\ud83d\udd22 Numerical Columns (3 total):\r\n    1. age                  (dtype: int64)\r\n    2. salary               (dtype: int64)\r\n    3. 
is_active            (dtype: bool)\r\n\r\n\ud83d\udcc8 Summary:\r\n   Total columns: 5\r\n   Categorical: 2 (40.0%)\r\n   Numerical: 3 (60.0%)\r\n```\r\n\r\n**Function Features:**\r\n- \ud83d\udd0d **Simple Classification**: Separates columns into categorical (object dtype) and numerical (all other dtypes)\r\n- \ud83d\udcca **Detailed Information**: Shows unique value counts for categorical columns and data types for numerical columns\r\n- \ud83d\udcc8 **Summary Statistics**: Provides percentage breakdown of column types\r\n- \ud83c\udfaf **Return Values**: Returns dictionary with categorized column lists for programmatic use\r\n- \u26a1 **Fast Processing**: Efficient classification based on pandas data types\r\n- \ud83d\udee1\ufe0f **Error Handling**: Validates input and handles edge cases like empty DataFrames\r\n\r\n### Data Imputation with `impute_numerical_median` and `impute_categorical_mode`\r\n\r\nAfter analyzing your data, you often need to handle missing values. The edaflow package provides two specialized imputation functions for this purpose:\r\n\r\n#### Numerical Imputation with `impute_numerical_median`\r\n\r\nThe `impute_numerical_median` function fills missing values in numerical columns using the median value:\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Create sample data with missing numerical values\r\ndf = pd.DataFrame({\r\n    'age': [25, None, 35, None, 45],\r\n    'salary': [50000, 60000, None, 70000, None],\r\n    'score': [85.5, None, 92.0, 88.5, None],\r\n    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']\r\n})\r\n\r\n# Impute all numerical columns with median values\r\ndf_imputed = edaflow.impute_numerical_median(df)\r\n\r\n# Impute specific columns only\r\ndf_imputed = edaflow.impute_numerical_median(df, columns=['age', 'salary'])\r\n\r\n# Impute in place (modifies original DataFrame)\r\nedaflow.impute_numerical_median(df, inplace=True)\r\n```\r\n\r\n**Function Features:**\r\n- \ud83d\udd22 **Smart 
Detection**: Automatically identifies numerical columns (int, float, etc.)\r\n- \ud83d\udcca **Median Imputation**: Uses median values which are robust to outliers\r\n- \ud83c\udfaf **Selective Imputation**: Option to specify which columns to impute\r\n- \ud83d\udd04 **Inplace Option**: Modify original DataFrame or create new one\r\n- \ud83d\udee1\ufe0f **Safe Handling**: Gracefully handles edge cases like all-missing columns\r\n- \ud83d\udccb **Detailed Reporting**: Shows exactly what was imputed and summary statistics\r\n\r\n#### Categorical Imputation with `impute_categorical_mode`\r\n\r\nThe `impute_categorical_mode` function fills missing values in categorical columns using the mode (most frequent value):\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Create sample data with missing categorical values\r\ndf = pd.DataFrame({\r\n    'category': ['A', 'B', 'A', None, 'A'],\r\n    'status': ['Active', None, 'Active', 'Inactive', None],\r\n    'priority': ['High', 'Medium', None, 'Low', 'High'],\r\n    'age': [25, 30, 35, 40, 45]\r\n})\r\n\r\n# Impute all categorical columns with mode values\r\ndf_imputed = edaflow.impute_categorical_mode(df)\r\n\r\n# Impute specific columns only\r\ndf_imputed = edaflow.impute_categorical_mode(df, columns=['category', 'status'])\r\n\r\n# Impute in place (modifies original DataFrame)\r\nedaflow.impute_categorical_mode(df, inplace=True)\r\n```\r\n\r\n**Function Features:**\r\n- \ud83d\udcdd **Smart Detection**: Automatically identifies categorical (object) columns\r\n- \ud83c\udfaf **Mode Imputation**: Uses most frequent value for each column\r\n- \u2696\ufe0f **Tie Handling**: Gracefully handles mode ties (multiple values with same frequency)\r\n- \ud83d\udd04 **Inplace Option**: Modify original DataFrame or create new one\r\n- \ud83d\udee1\ufe0f **Safe Handling**: Gracefully handles edge cases like all-missing columns\r\n- \ud83d\udccb **Detailed Reporting**: Shows exactly what was imputed and mode tie 
warnings\r\n\r\n#### Complete Imputation Workflow Example\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Sample data with both numerical and categorical missing values\r\ndf = pd.DataFrame({\r\n    'age': [25, None, 35, None, 45],\r\n    'salary': [50000, None, 70000, 80000, None],\r\n    'category': ['A', 'B', None, 'A', None],\r\n    'status': ['Active', None, 'Active', 'Inactive', None],\r\n    'score': [85.5, 92.0, None, 88.5, None]\r\n})\r\n\r\nprint(\"Original DataFrame:\")\r\nprint(df)\r\nprint(\"\\n\" + \"=\"*50)\r\n\r\n# Step 1: Impute numerical columns\r\nprint(\"STEP 1: Numerical Imputation\")\r\ndf_step1 = edaflow.impute_numerical_median(df)\r\n\r\n# Step 2: Impute categorical columns\r\nprint(\"\\nSTEP 2: Categorical Imputation\")\r\ndf_final = edaflow.impute_categorical_mode(df_step1)\r\n\r\nprint(\"\\nFinal DataFrame (all missing values imputed):\")\r\nprint(df_final)\r\n\r\n# Verify no missing values remain\r\nprint(f\"\\nMissing values remaining: {df_final.isnull().sum().sum()}\")\r\n```\r\n\r\n**Expected Output:**\r\n```\r\n\ud83d\udd22 Numerical Missing Value Imputation (Median)\r\n=======================================================\r\n\ud83d\udd04 age                  - Imputed 2 values with median: 35.0\r\n\ud83d\udd04 salary               - Imputed 2 values with median: 70000.0\r\n\ud83d\udd04 score                - Imputed 2 values with median: 88.5\r\n\r\n\ud83d\udcca Imputation Summary:\r\n   Columns processed: 3\r\n   Columns imputed: 3\r\n   Total values imputed: 6\r\n\r\n\ud83d\udcdd Categorical Missing Value Imputation (Mode)\r\n=======================================================\r\n\ud83d\udd04 category             - Imputed 2 values with mode: 'A'\r\n\ud83d\udd04 status               - Imputed 2 values with mode: 'Active'\r\n\r\n\ud83d\udcca Imputation Summary:\r\n   Columns processed: 2\r\n   Columns imputed: 2\r\n   Total values imputed: 4\r\n```\r\n\r\n### Numerical Distribution Analysis with 
`visualize_numerical_boxplots`\r\n\r\nAnalyze numerical columns to detect outliers, understand distributions, and assess skewness:\r\n\r\n```python\r\nimport pandas as pd\r\nimport edaflow\r\n\r\n# Create sample dataset with outliers\r\ndf = pd.DataFrame({\r\n    'age': [25, 30, 35, 40, 45, 28, 32, 38, 42, 100],  # 100 is an outlier\r\n    'salary': [50000, 60000, 75000, 80000, 90000, 55000, 65000, 70000, 85000, 250000],  # 250000 is outlier\r\n    'experience': [2, 5, 8, 12, 15, 3, 6, 9, 13, 30],  # 30 might be an outlier\r\n    'score': [85, 92, 78, 88, 95, 82, 89, 91, 86, 20],  # 20 is an outlier\r\n    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C']  # Non-numerical\r\n})\r\n\r\n# Basic boxplot analysis\r\nedaflow.visualize_numerical_boxplots(\r\n    df, \r\n    title=\"Employee Data Analysis - Outlier Detection\",\r\n    show_skewness=True\r\n)\r\n\r\n# Custom layout and specific columns\r\nedaflow.visualize_numerical_boxplots(\r\n    df, \r\n    columns=['age', 'salary'],\r\n    rows=1, \r\n    cols=2,\r\n    title=\"Age vs Salary Analysis\",\r\n    orientation='vertical',\r\n    color_palette='viridis'\r\n)\r\n```\r\n\r\n**Expected Output:**\r\n```\r\n\ud83d\udcca Creating boxplots for 4 numerical column(s): age, salary, experience, score\r\n\r\n\ud83d\udcc8 Summary Statistics:\r\n==================================================\r\n\ud83d\udcca age:\r\n   Range: 25.00 to 100.00\r\n   Median: 36.50\r\n   IQR: 11.00 (Q1: 30.50, Q3: 41.50)\r\n   Skewness: 2.66 (highly skewed)\r\n   Outliers: 1 values outside [14.00, 58.00]\r\n   Outlier values: [100]\r\n\r\n\ud83d\udcca salary:\r\n   Range: 50000.00 to 250000.00\r\n   Median: 72500.00\r\n   IQR: 22500.00 (Q1: 61250.00, Q3: 83750.00)\r\n   Skewness: 2.88 (highly skewed)\r\n   Outliers: 1 values outside [27500.00, 117500.00]\r\n   Outlier values: [250000]\r\n\r\n\ud83d\udcca experience:\r\n   Range: 2.00 to 30.00\r\n   Median: 8.50\r\n   IQR: 7.50 (Q1: 5.25, Q3: 12.75)\r\n   Skewness: 1.69 
(highly skewed)\r\n   Outliers: 1 values outside [-6.00, 24.00]\r\n   Outlier values: [30]\r\n\r\n\ud83d\udcca score:\r\n   Range: 20.00 to 95.00\r\n   Median: 87.00\r\n   IQR: 7.75 (Q1: 82.75, Q3: 90.50)\r\n   Skewness: -2.87 (highly skewed)\r\n   Outliers: 1 values outside [71.12, 102.12]\r\n   Outlier values: [20]\r\n```\r\n\r\n### Complete EDA Workflow Example\r\n\r\n```python\r\nimport edaflow\r\nimport pandas as pd\r\n\r\n# Test the installation\r\nprint(edaflow.hello())\r\n\r\n# Load your data\r\ndf = pd.read_csv('your_data.csv')\r\n\r\n# Complete EDA workflow with all core functions:\r\n# 1. Analyze missing data with styled output\r\nnull_analysis = edaflow.check_null_columns(df, threshold=10)\r\n\r\n# 2. Analyze categorical columns to identify data type issues\r\nedaflow.analyze_categorical_columns(df, threshold=35)\r\n\r\n# 3. Convert appropriate object columns to numeric automatically\r\ndf_cleaned = edaflow.convert_to_numeric(df, threshold=35)\r\n\r\n# 4. Visualize categorical column values\r\nedaflow.visualize_categorical_values(df_cleaned)\r\n\r\n# 5. Display column type classification\r\nedaflow.display_column_types(df_cleaned)\r\n\r\n# 6. Impute missing values\r\ndf_numeric_imputed = edaflow.impute_numerical_median(df_cleaned)\r\ndf_fully_imputed = edaflow.impute_categorical_mode(df_numeric_imputed)\r\n\r\n# 7. Statistical distribution analysis with advanced insights\r\nedaflow.visualize_histograms(df_fully_imputed, kde=True, show_normal_curve=True)\r\n\r\n# 8. Comprehensive relationship analysis\r\nedaflow.visualize_heatmap(df_fully_imputed, heatmap_type='correlation')\r\nedaflow.visualize_scatter_matrix(df_fully_imputed, show_regression=True)\r\n\r\n# 9. Generate comprehensive EDA insights and recommendations\r\ninsights = edaflow.summarize_eda_insights(df_fully_imputed, target_column='your_target_col')\r\nprint(insights)  # View insights dictionary\r\n\r\n# 10. 
Outlier detection and visualization\r\nedaflow.visualize_numerical_boxplots(df_fully_imputed, show_skewness=True)\r\nedaflow.visualize_interactive_boxplots(df_fully_imputed)\r\n\r\n# 11. Advanced heatmap analysis\r\nedaflow.visualize_heatmap(df_fully_imputed, heatmap_type='missing')\r\nedaflow.visualize_heatmap(df_fully_imputed, heatmap_type='values')\r\n\r\n# 12. Final data cleaning with outlier handling\r\ndf_final = edaflow.handle_outliers_median(df_fully_imputed, method='iqr', verbose=True)\r\n\r\n# 13. Results verification\r\nedaflow.visualize_scatter_matrix(df_final, title=\"Clean Data Relationships\")\r\nedaflow.visualize_numerical_boxplots(df_final, title=\"Final Clean Distribution\")\r\n```\r\n\r\n### \ud83e\udd16 **Complete ML Workflow** \u2b50 *Enhanced in v0.14.0*\r\n```python\r\nimport edaflow.ml as ml\r\nfrom sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.svm import SVC\r\n\r\n# Continue from cleaned data above...\r\ndf_final['target'] = your_target_data  # Add your target column\r\n\r\n# 1. Setup ML experiment \u2b50 NEW: Enhanced parameters in v0.14.0\r\nexperiment = ml.setup_ml_experiment(\r\n    df_final, 'target',\r\n    test_size=0.2,               # Test set: 20%\r\n    val_size=0.15,               # \u2b50 NEW: Validation set: 15% \r\n    experiment_name=\"production_ml_pipeline\",  # \u2b50 NEW: Experiment tracking\r\n    random_state=42,\r\n    stratify=True\r\n)\r\n\r\n# Alternative: sklearn-style calling (also enhanced)\r\n# X = df_final.drop('target', axis=1)\r\n# y = df_final['target']\r\n# experiment = ml.setup_ml_experiment(X=X, y=y, val_size=0.15, experiment_name=\"sklearn_workflow\")\r\n\r\nprint(f\"Training: {len(experiment['X_train'])}, Validation: {len(experiment['X_val'])}, Test: {len(experiment['X_test'])}\")\r\n\r\n# 2. 
Compare multiple models \u2b50 Enhanced with validation set support\r\nmodels = {\r\n    'RandomForest': RandomForestClassifier(random_state=42),\r\n    'GradientBoosting': GradientBoostingClassifier(random_state=42),\r\n    'LogisticRegression': LogisticRegression(random_state=42),\r\n    'SVM': SVC(random_state=42, probability=True)\r\n}\r\n\r\n# Fit all models\r\nfor name, model in models.items():\r\n    model.fit(experiment['X_train'], experiment['y_train'])\r\n\r\n# \u2b50 Enhanced compare_models with experiment_config support\r\ncomparison = ml.compare_models(\r\n    models=models,\r\n    experiment_config=experiment,  # \u2b50 NEW: Automatically uses validation set\r\n    verbose=True\r\n)\r\nprint(comparison)  # Professional styled output\r\n\r\n# \u2b50 Enhanced rank_models with flexible return formats\r\n# Quick access to best model (list format - NEW)\r\nbest_model = ml.rank_models(comparison, 'accuracy', return_format='list')[0]['model_name']\r\nprint(f\"\ud83c\udfc6 Best model: {best_model}\")\r\n\r\n# Detailed ranking analysis (DataFrame format - traditional)\r\nranked_models = ml.rank_models(comparison, 'accuracy')\r\nprint(\"\ud83d\udcca Top 3 models:\")\r\nprint(ranked_models.head(3)[['model', 'accuracy', 'f1', 'rank']])\r\n\r\n# Advanced: Multi-metric weighted ranking\r\nweighted_ranking = ml.rank_models(\r\n    comparison, \r\n    'accuracy',\r\n    weights={'accuracy': 0.4, 'f1': 0.3, 'precision': 0.3},\r\n    return_format='list'\r\n)\r\nprint(f\"\ud83c\udfaf Best by weighted score: {weighted_ranking[0]['model_name']}\")\r\n\r\n# 3. 
Hyperparameter optimization \u2b50 Enhanced with validation set\r\nparam_grid = {\r\n    'n_estimators': [100, 200, 300],\r\n    'max_depth': [5, 10, 15, None],\r\n    'min_samples_split': [2, 5, 10]\r\n}\r\n\r\nbest_results = ml.optimize_hyperparameters(\r\n    RandomForestClassifier(random_state=42),\r\n    param_distributions=param_grid,\r\n    X_train=experiment['X_train'],\r\n    y_train=experiment['y_train'],\r\n    method='grid_search',\r\n    cv=5\r\n)\r\n\r\n# 4. Generate comprehensive performance visualizations\r\nml.plot_learning_curves(best_results['best_model'], \r\n                       X_train=experiment['X_train'], y_train=experiment['y_train'])\r\nml.plot_roc_curves({'optimized_model': best_results['best_model']}, \r\n                   X_test=experiment['X_test'], y_test=experiment['y_test'])\r\nml.plot_feature_importance(best_results['best_model'], \r\n                          feature_names=experiment['feature_names'])\r\n\r\n# 5. Save complete model artifacts with experiment tracking\r\nml.save_model_artifacts(\r\n    model=best_results['best_model'],\r\n    model_name=f\"{experiment['experiment_name']}_optimized_model\",  # \u2b50 NEW: Uses experiment name\r\n    experiment_config=experiment,\r\n    performance_metrics={\r\n        'cv_score': best_results['best_score'],\r\n        'test_score': best_results['best_model'].score(experiment['X_test'], experiment['y_test']),\r\n        'model_type': 'RandomForestClassifier'\r\n    },\r\n    metadata={\r\n        'experiment_name': experiment['experiment_name'],  # \u2b50 NEW: Experiment tracking\r\n        'data_shape': df_final.shape,\r\n        'feature_count': len(experiment['feature_names'])\r\n    }\r\n)\r\n\r\nprint(f\"\u2705 Complete ML pipeline finished! 
Experiment: {experiment['experiment_name']}\")\r\n```\r\n\r\n### \ud83e\udd16 **ML Preprocessing with Smart Encoding** \u2b50 *Introduced in v0.12.0*\r\n```python\r\nimport edaflow\r\nimport pandas as pd\r\n\r\n# Load your data\r\ndf = pd.read_csv('your_data.csv')\r\n\r\n# Step 1: Analyze encoding needs (with or without target)\r\nencoding_analysis = edaflow.analyze_encoding_needs(\r\n    df, \r\n    target_column=None,            # Optional: specify target if you have one\r\n    max_cardinality_onehot=15,     # Optional: max categories for one-hot encoding\r\n    max_cardinality_target=50,     # Optional: max categories for target encoding\r\n    ordinal_columns=None           # Optional: specify ordinal columns if known\r\n)\r\n\r\n# Step 2: Apply intelligent encoding transformations  \r\ndf_encoded = edaflow.apply_smart_encoding(\r\n    df,                            # Use your full dataset (or df.drop('target_col', axis=1) if needed)\r\n    encoding_analysis=encoding_analysis,  # Optional: use previous analysis\r\n    handle_unknown='ignore'        # Optional: how to handle unknown categories\r\n)\r\n\r\n# The encoding pipeline automatically:\r\n# \u2705 One-hot encodes low cardinality categoricals\r\n# \u2705 Target encodes high cardinality with target correlation  \r\n# \u2705 Binary encodes medium cardinality features\r\n# \u2705 TF-IDF vectorizes text columns\r\n# \u2705 Preserves numeric columns unchanged\r\n# \u2705 Handles memory efficiently for large datasets\r\n\r\nprint(f\"Shape transformation: {df.shape} \u2192 {df_encoded.shape}\")\r\nprint(f\"Encoding methods applied: {len(encoding_analysis['encoding_methods'])} different strategies\")\r\n```\r\n\r\n## Project Structure\r\n\r\n```\r\nedaflow/\r\n\u251c\u2500\u2500 edaflow/\r\n\u2502   \u251c\u2500\u2500 __init__.py\r\n\u2502   \u251c\u2500\u2500 analysis/\r\n\u2502   \u251c\u2500\u2500 visualization/\r\n\u2502   \u2514\u2500\u2500 preprocessing/\r\n\u251c\u2500\u2500 tests/\r\n\u251c\u2500\u2500 
docs/\r\n\u251c\u2500\u2500 examples/\r\n\u251c\u2500\u2500 setup.py\r\n\u251c\u2500\u2500 requirements.txt\r\n\u251c\u2500\u2500 README.md\r\n\u2514\u2500\u2500 LICENSE\r\n```\r\n\r\n## Contributing\r\n\r\n1. Fork the repository\r\n2. Create a feature branch (`git checkout -b feature/new-feature`)\r\n3. Commit your changes (`git commit -m 'Add new feature'`)\r\n4. Push to the branch (`git push origin feature/new-feature`)\r\n5. Open a Pull Request\r\n\r\n## Development\r\n\r\n### Setup Development Environment\r\n```bash\r\n# Clone the repository\r\ngit clone https://github.com/evanlow/edaflow.git\r\ncd edaflow\r\n\r\n# Create virtual environment\r\npython -m venv venv\r\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\r\n\r\n# Install in development mode\r\npip install -e \".[dev]\"\r\n\r\n# Run tests\r\npytest\r\n\r\n# Run linting\r\nflake8 edaflow/\r\nblack edaflow/\r\nisort edaflow/\r\n```\r\n\r\n## License\r\n\r\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\r\n\r\n## Changelog\r\n\r\n> **\ud83d\ude80 Latest Updates**: This changelog reflects the most current releases including v0.12.32 critical input validation fix, v0.12.31 hotfix with KeyError resolution and v0.12.30 universal display optimization breakthrough.\r\n\r\n### v0.12.32 (2025-08-11) - Critical Input Validation Fix \ud83d\udc1b\r\n- **CRITICAL**: Fixed AttributeError: 'tuple' object has no attribute 'empty' in visualization functions\r\n- **ROOT CAUSE**: Users passing tuple result from `apply_smart_encoding(..., return_encoders=True)` directly to visualization functions\r\n- **ENHANCED**: Added intelligent input validation with helpful error messages for common usage mistakes\r\n- **IMPROVED**: Better error handling in `visualize_scatter_matrix` and other visualization functions\r\n- **DOCUMENTED**: Clear examples showing correct vs incorrect usage patterns for `apply_smart_encoding`\r\n- **STABILITY**: Prevents crashes in step 14 of 
EDA workflows when encoding functions are misused\r\n\r\n### v0.12.31 (2025-01-05) - Critical KeyError Hotfix \ud83d\udea8\r\n- **CRITICAL**: Fixed KeyError: 'type' in `summarize_eda_insights()` function during Google Colab usage\r\n- **RESOLVED**: Exception handling when target analysis dictionary missing expected keys\r\n- **IMPROVED**: Enhanced error handling with safe dictionary access using `.get()` method\r\n- **MAINTAINED**: All existing functionality preserved - pure stability fix\r\n- **TESTED**: Verified fix works across all notebook platforms (Colab, JupyterLab, VS Code)\r\n\r\n### v0.12.30 (2025-01-05) - Universal Display Optimization Breakthrough \ud83c\udfa8\r\n- **BREAKTHROUGH**: Introduced `optimize_display()` function for universal notebook compatibility\r\n- **REVOLUTIONARY**: Automatic platform detection (Google Colab, JupyterLab, VS Code Notebooks, Classic Jupyter)\r\n- **ENHANCED**: Dynamic CSS injection for perfect dark/light mode visibility across all platforms\r\n- **NEW FEATURE**: Automatic matplotlib backend optimization for each notebook environment  \r\n- **ACCESSIBILITY**: Solves visibility issues in dark mode themes universally\r\n- **SEAMLESS**: Zero configuration required - automatically detects and optimizes for your platform\r\n- **COMPATIBILITY**: Works flawlessly across Google Colab, JupyterLab, VS Code, Classic Jupyter\r\n- **EXAMPLE**: Simple usage: `from edaflow import optimize_display; optimize_display()`\r\n\r\n### v0.12.3 (2025-08-06) - Complete Positional Argument Compatibility Fix \ud83d\udd27\r\n- **CRITICAL**: Fixed positional argument usage for `visualize_image_classes()` function  \r\n- **RESOLVED**: TypeError when calling `visualize_image_classes(image_paths, ...)` with positional arguments\r\n- **ENHANCED**: Comprehensive backward compatibility supporting all three usage patterns:\r\n  - Positional: `visualize_image_classes(path, ...)` (shows warning)\r\n  - Deprecated keyword: 
`visualize_image_classes(image_paths=path, ...)` (shows warning)\r\n  - Recommended: `visualize_image_classes(data_source=path, ...)` (no warning)\r\n- **IMPROVED**: Clear deprecation warnings guiding users toward recommended syntax\r\n- **SECURE**: Prevents using both parameters simultaneously to avoid confusion\r\n- **RESOLVED**: TypeError for users calling with `image_paths=` parameter from v0.12.0 breaking change\r\n- **ENHANCED**: Improved error messages for parameter validation in image visualization functions\r\n- **DOCUMENTATION**: Added comprehensive parameter documentation including deprecation notices\r\n\r\n### v0.12.2 (2025-08-06) - Documentation Refresh \ud83d\udcda\r\n- **IMPROVED**: Enhanced README.md with updated timestamps and current version indicators\r\n- **FIXED**: Ensured PyPI displays the most current changelog information including v0.12.1 fixes\r\n- **ENHANCED**: Added latest updates indicator to changelog for better visibility\r\n- **DOCUMENTATION**: Forced PyPI cache refresh to display current version information\r\n\r\n## \u2728 What's New in v0.16.2\r\n\r\n**New Features:**\r\n- Faceted visualizations with `display_facet_grid`\r\n- Feature scaling with `scale_features`\r\n- Grouping rare categories with `group_rare_categories`\r\n- Exporting figures with `export_figure`\r\n\r\n**Documentation Updates:**\r\n- User Guide, Advanced Features, and Best Practices now reference all new APIs\r\n- Visualization Guide includes external library requirements and troubleshooting\r\n- Changelog documents all new features and documentation changes\r\n\r\n**External Library Requirements:**\r\nSome advanced features require additional libraries:\r\n- matplotlib\r\n- seaborn\r\n- scikit-learn\r\n- statsmodels\r\n- pandas\r\n\r\nSee the Visualization Guide for installation instructions and troubleshooting tips.\r\n\r\n---\r\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A Python package for exploratory data analysis workflows with universal dark mode compatibility",
    "version": "0.17.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/evanlow/edaflow/issues",
        "Changelog": "https://github.com/evanlow/edaflow/blob/main/CHANGELOG.md",
        "Documentation": "https://edaflow.readthedocs.io",
        "Homepage": "https://github.com/evanlow/edaflow",
        "Repository": "https://github.com/evanlow/edaflow.git",
        "Source Code": "https://github.com/evanlow/edaflow"
    },
    "split_keywords": [
        "data-analysis",
        " eda",
        " exploratory-data-analysis",
        " data-science",
        " visualization"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "77b47cf455193bf60dc6f0ce9fe85946dfe7a51736ef48b72cc7e376439f985b",
                "md5": "76b02db2701556303e2370f9076b2f16",
                "sha256": "9c9e2f6f597f51bcf440a9a43f60551a84a1969d3c5ab087dc388d8318a01a4f"
            },
            "downloads": -1,
            "filename": "edaflow-0.17.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "76b02db2701556303e2370f9076b2f16",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 116717,
            "upload_time": "2025-09-12T04:39:05",
            "upload_time_iso_8601": "2025-09-12T04:39:05.710088Z",
            "url": "https://files.pythonhosted.org/packages/77/b4/7cf455193bf60dc6f0ce9fe85946dfe7a51736ef48b72cc7e376439f985b/edaflow-0.17.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "bd61c76ac3dfde6698fdeebfa12275be32510587bffe48efe7ef523251a54e57",
                "md5": "587d55c7149a9dbc7be9ddc33487fb2f",
                "sha256": "d7b1b625daf245ab2bd11bb148dd1a9aca8cbf24098728df7442e7319231fa51"
            },
            "downloads": -1,
            "filename": "edaflow-0.17.1.tar.gz",
            "has_sig": false,
            "md5_digest": "587d55c7149a9dbc7be9ddc33487fb2f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 781573,
            "upload_time": "2025-09-12T04:39:07",
            "upload_time_iso_8601": "2025-09-12T04:39:07.816097Z",
            "url": "https://files.pythonhosted.org/packages/bd/61/c76ac3dfde6698fdeebfa12275be32510587bffe48efe7ef523251a54e57/edaflow-0.17.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-12 04:39:07",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "evanlow",
    "github_project": "edaflow",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.5.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.21.0"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    ">=",
                    "3.5.0"
                ]
            ]
        },
        {
            "name": "seaborn",
            "specs": [
                [
                    ">=",
                    "0.11.0"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    ">=",
                    "1.7.0"
                ]
            ]
        },
        {
            "name": "missingno",
            "specs": [
                [
                    ">=",
                    "0.5.0"
                ]
            ]
        },
        {
            "name": "jinja2",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "Pillow",
            "specs": [
                [
                    ">=",
                    "8.0.0"
                ]
            ]
        }
    ],
    "lcname": "edaflow"
}
        