edaflow

Name: edaflow
Version: 0.4.0
Summary: A Python package for exploratory data analysis workflows
Home page: https://github.com/evanlow/edaflow
Author: Evan Low
Requires Python: >=3.8
Keywords: data-analysis, eda, exploratory-data-analysis, data-science, visualization
Upload time: 2025-08-04 04:40:32

# edaflow

A Python package for streamlined exploratory data analysis workflows.

## Description

`edaflow` is designed to simplify and accelerate the exploratory data analysis (EDA) process by providing a collection of tools and utilities for data scientists and analysts. The package integrates popular data science libraries to create a cohesive workflow for data exploration, visualization, and preprocessing.

## Features

- **Missing Data Analysis**: Color-coded analysis of null values with customizable thresholds
- **Categorical Data Insights**: Identify object columns that might be numeric, detect data type issues
- **Automatic Data Type Conversion**: Smart conversion of object columns to numeric when appropriate
- **Categorical Values Visualization**: Detailed exploration of categorical column values with insights
- **Column Type Classification**: Simple categorization of DataFrame columns into categorical and numerical types
- **Data Imputation**: Smart missing value imputation using median for numerical and mode for categorical columns
- **Data Type Detection**: Smart analysis to flag potential data conversion needs
- **Styled Output**: Beautiful, color-coded results for Jupyter notebooks and terminals
- **Easy Integration**: Works seamlessly with pandas, numpy, and other popular libraries

## Installation

### From PyPI
```bash
pip install edaflow
```

### From Source
```bash
git clone https://github.com/evanlow/edaflow.git
cd edaflow
pip install -e .
```

### Development Installation
```bash
git clone https://github.com/evanlow/edaflow.git
cd edaflow
pip install -e ".[dev]"
```

## Requirements

- Python 3.8+
- pandas >= 1.5.0
- numpy >= 1.21.0
- matplotlib >= 3.5.0
- seaborn >= 0.11.0
- scipy >= 1.7.0
- missingno >= 0.5.0

## Quick Start

```python
import edaflow

# Test the installation
print(edaflow.hello())

# Check null values in your dataset
import pandas as pd
df = pd.read_csv('your_data.csv')

# Analyze missing data with styled output
null_analysis = edaflow.check_null_columns(df, threshold=10)
print(null_analysis)

# Analyze categorical columns to identify data type issues
edaflow.analyze_categorical_columns(df, threshold=35)

# Convert appropriate object columns to numeric automatically
df_cleaned = edaflow.convert_to_numeric(df, threshold=35)
print("Data types after conversion:", df_cleaned.dtypes)
```

## Usage Examples

### Basic Usage
```python
import edaflow

# Verify installation
message = edaflow.hello()
print(message)  # Output: "Hello from edaflow! Ready for exploratory data analysis."
```

### Missing Data Analysis with `check_null_columns`

The `check_null_columns` function provides a color-coded analysis of missing data in your DataFrame:

```python
import pandas as pd
import edaflow

# Create sample data with missing values
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'age': [25, None, 35, None, 45],
    'email': [None, None, None, None, None],  # All missing
    'purchase_amount': [100.5, 250.0, 75.25, None, 320.0]
})

# Analyze missing data with default threshold (10%)
styled_result = edaflow.check_null_columns(df)
styled_result  # Display in Jupyter notebook for color-coded styling

# Use custom threshold (20%) to change color coding sensitivity
styled_result = edaflow.check_null_columns(df, threshold=20)
styled_result

# Access underlying data if needed
data = styled_result.data
print(data)
```

**Color Coding:**
- 🔴 **Red**: > 20% missing (high concern)
- 🟡 **Yellow**: 10-20% missing (medium concern)
- 🟨 **Light Yellow**: 1-10% missing (low concern)
- ⬜ **Gray**: 0% missing (no issues)
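
The percentages behind this color coding are just per-column null ratios, so they are easy to cross-check with plain pandas. The sketch below is an independent illustration, not edaflow's internal implementation:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'age': [25, None, 35, None, 45],
    'email': [None, None, None, None, None],
})

# Fraction of missing values per column, expressed as a percentage --
# the same quantity the color bands above are based on
null_pct = df.isnull().mean() * 100
print(null_pct.sort_values(ascending=False))
# email          100.0
# age             40.0
# customer_id      0.0
```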

### Categorical Data Analysis with `analyze_categorical_columns`

The `analyze_categorical_columns` function helps identify data type issues and provides insights into object-type columns:

```python
import pandas as pd
import edaflow

# Create sample data with mixed categorical types
df = pd.DataFrame({
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price_str': ['999', '25', '75', '450'],  # Numbers stored as strings
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics'],
    'rating': [4.5, 3.8, 4.2, 4.7],  # Already numeric
    'mixed_ids': ['001', '002', 'ABC', '004'],  # Mixed format
    'status': ['active', 'inactive', 'active', 'pending']
})

# Analyze categorical columns with default threshold (35%)
edaflow.analyze_categorical_columns(df)

# Use custom threshold (50%) to be more lenient about mixed data
edaflow.analyze_categorical_columns(df, threshold=50)
```

**Output Interpretation:**
- 🔴🔵 **Highlighted in Red/Blue**: Potentially numeric columns that might need conversion
- 🟡⚫ **Highlighted in Yellow/Black**: Shows unique values for potential numeric columns
- **Regular text**: Truly categorical columns with statistics
- **"not an object column"**: Already properly typed numeric columns

### Data Type Conversion with `convert_to_numeric`

After analyzing your categorical columns, you can automatically convert appropriate columns to numeric:

```python
import pandas as pd
import edaflow

# Create sample data with string numbers
df = pd.DataFrame({
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price_str': ['999', '25', '75', '450'],      # Should convert
    'mixed_ids': ['001', '002', 'ABC', '004'],    # Mixed data
    'category': ['Electronics', 'Accessories', 'Electronics', 'Electronics']
})

# Convert appropriate columns to numeric (threshold=35% by default)
df_converted = edaflow.convert_to_numeric(df, threshold=35)

# Or modify the original DataFrame in place
edaflow.convert_to_numeric(df, threshold=35, inplace=True)

# Use a stricter threshold (only convert if <20% non-numeric values)
df_strict = edaflow.convert_to_numeric(df, threshold=20)
```

**Function Features:**
- ✅ **Smart Detection**: Only converts columns with few non-numeric values
- ✅ **Customizable Threshold**: Control conversion sensitivity
- ✅ **Safe Conversion**: Non-numeric values become NaN (not errors)
- ✅ **Inplace Option**: Modify original DataFrame or create new one
- ✅ **Detailed Output**: Shows exactly what was converted and why
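
To see which columns `convert_to_numeric` actually changed, comparing dtypes before and after the call is a quick check. This snippet only uses the documented call shown above:

```python
import pandas as pd
import edaflow

df = pd.DataFrame({
    'price_str': ['999', '25', '75', '450'],
    'mixed_ids': ['001', '002', 'ABC', '004'],
    'category': ['Electronics', 'Accessories', 'Electronics', 'Electronics'],
})

before = df.dtypes
df_converted = edaflow.convert_to_numeric(df, threshold=35)

# Side-by-side dtype comparison: converted columns switch from object to a numeric dtype
print(pd.DataFrame({'before': before, 'after': df_converted.dtypes}))
```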

### Categorical Data Visualization with `visualize_categorical_values`

After cleaning your data, explore categorical columns in detail to understand value distributions:

```python
import pandas as pd
import edaflow

# Example DataFrame with categorical data
df = pd.DataFrame({
    'department': ['Sales', 'Marketing', 'Sales', 'HR', 'Marketing', 'Sales', 'IT'],
    'status': ['Active', 'Inactive', 'Active', 'Pending', 'Active', 'Active', 'Inactive'],
    'priority': ['High', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low'],
    'employee_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],  # Numeric (ignored)
    'salary': [50000, 60000, 55000, 45000, 58000, 62000, 70000]  # Numeric (ignored)
})

# Visualize all categorical columns
edaflow.visualize_categorical_values(df)
```

**Advanced Usage Examples:**

```python
# Handle high-cardinality data (many unique values)
large_df = pd.DataFrame({
    'product_id': [f'PROD_{i:04d}' for i in range(100)],  # 100 unique values
    'category': ['Electronics'] * 40 + ['Clothing'] * 35 + ['Books'] * 25,
    'status': ['Available'] * 80 + ['Out of Stock'] * 15 + ['Discontinued'] * 5
})

# Limit display for high-cardinality columns
edaflow.visualize_categorical_values(large_df, max_unique_values=5)
```

```python
# DataFrame with missing values for comprehensive analysis
df_with_nulls = pd.DataFrame({
    'region': ['North', 'South', None, 'East', 'West', 'North', None],
    'customer_type': ['Premium', 'Standard', 'Premium', None, 'Standard', 'Premium', 'Standard'],
    'transaction_id': [f'TXN_{i}' for i in range(7)],  # Mostly unique (ID-like)
})

# Get detailed insights including missing value analysis
edaflow.visualize_categorical_values(df_with_nulls)
```

**Function Features:**
- 🎯 **Smart Column Detection**: Automatically finds categorical (object-type) columns
- 📊 **Value Distribution**: Shows counts and percentages for each unique value
- 🔍 **Missing Value Analysis**: Tracks and reports NaN/missing values
- ⚡ **High-Cardinality Handling**: Truncates display for columns with many unique values
- 💡 **Actionable Insights**: Identifies ID-like columns and provides data quality recommendations
- 🎨 **Color-Coded Output**: Easy-to-read formatted results with highlighting
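
If you want to double-check the numbers behind the formatted display, the same per-column counts can be pulled directly from pandas. This is an independent cross-check, not what `visualize_categorical_values` does internally:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', None, 'East', 'West', 'North', None],
    'customer_type': ['Premium', 'Standard', 'Premium', None, 'Standard', 'Premium', 'Standard'],
})

for col in df.select_dtypes(include='object'):
    counts = df[col].value_counts(dropna=False)                    # include missing values
    pct = df[col].value_counts(dropna=False, normalize=True) * 100
    print(f"\n{col}: {df[col].nunique()} unique values, {df[col].isnull().sum()} missing")
    print(pd.DataFrame({'count': counts, 'percent': pct.round(1)}))
```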

### Column Type Classification with `display_column_types`

The `display_column_types` function provides a simple way to categorize DataFrame columns into categorical and numerical types:

```python
import pandas as pd
import edaflow

# Create sample data with mixed types
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago'],
    'salary': [50000, 60000, 70000],
    'is_active': [True, False, True]
}
df = pd.DataFrame(data)

# Display column type classification
result = edaflow.display_column_types(df)

# Access the categorized column lists
categorical_cols = result['categorical']  # ['name', 'city']
numerical_cols = result['numerical']      # ['age', 'salary', 'is_active']
```

**Example Output:**
```
📊 Column Type Analysis
==================================================

📝 Categorical Columns (2 total):
    1. name                 (unique values: 3)
    2. city                 (unique values: 3)

🔢 Numerical Columns (3 total):
    1. age                  (dtype: int64)
    2. salary               (dtype: int64)
    3. is_active            (dtype: bool)

📈 Summary:
   Total columns: 5
   Categorical: 2 (40.0%)
   Numerical: 3 (60.0%)
```

**Function Features:**
- 🔍 **Simple Classification**: Separates columns into categorical (object dtype) and numerical (all other dtypes)
- 📊 **Detailed Information**: Shows unique value counts for categorical columns and data types for numerical columns
- 📈 **Summary Statistics**: Provides percentage breakdown of column types
- 🎯 **Return Values**: Returns dictionary with categorized column lists for programmatic use
- ⚡ **Fast Processing**: Efficient classification based on pandas data types
- 🛡️ **Error Handling**: Validates input and handles edge cases like empty DataFrames
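
The same categorical/numerical split can be reproduced with `select_dtypes` if you only need the lists and not the formatted report. This is a plain-pandas sketch of the classification rule described above, not the function's own code:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago'],
    'is_active': [True, False, True],
})

# Object columns count as categorical; every other dtype (including bool) as numerical
categorical_cols = df.select_dtypes(include='object').columns.tolist()
numerical_cols = [col for col in df.columns if col not in categorical_cols]

print("Categorical:", categorical_cols)  # ['name', 'city']
print("Numerical:  ", numerical_cols)    # ['age', 'is_active']
```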

### Data Imputation with `impute_numerical_median` and `impute_categorical_mode`

After analyzing your data, you often need to handle missing values. The edaflow package provides two specialized imputation functions for this purpose:

#### Numerical Imputation with `impute_numerical_median`

The `impute_numerical_median` function fills missing values in numerical columns using the median value:

```python
import pandas as pd
import edaflow

# Create sample data with missing numerical values
df = pd.DataFrame({
    'age': [25, None, 35, None, 45],
    'salary': [50000, 60000, None, 70000, None],
    'score': [85.5, None, 92.0, 88.5, None],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
})

# Impute all numerical columns with median values
df_imputed = edaflow.impute_numerical_median(df)

# Impute specific columns only
df_imputed = edaflow.impute_numerical_median(df, columns=['age', 'salary'])

# Impute in place (modifies original DataFrame)
edaflow.impute_numerical_median(df, inplace=True)
```

**Function Features:**
- 🔢 **Smart Detection**: Automatically identifies numerical columns (int, float, etc.)
- 📊 **Median Imputation**: Uses median values, which are robust to outliers
- 🎯 **Selective Imputation**: Option to specify which columns to impute
- 🔄 **Inplace Option**: Modify original DataFrame or create new one
- 🛡️ **Safe Handling**: Gracefully handles edge cases like all-missing columns
- 📋 **Detailed Reporting**: Shows exactly what was imputed and summary statistics
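
Conceptually, median imputation boils down to a per-column `fillna` with the column median. A minimal plain-pandas sketch of that idea (not edaflow's implementation) looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, None, 35, None, 45],
    'salary': [50000, 60000, None, 70000, None],
})

for col in df.select_dtypes(include='number'):
    median = df[col].median()              # NaN values are ignored when computing the median
    n_missing = int(df[col].isnull().sum())
    df[col] = df[col].fillna(median)
    print(f"{col}: filled {n_missing} missing values with median {median}")
```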

#### Categorical Imputation with `impute_categorical_mode`

The `impute_categorical_mode` function fills missing values in categorical columns using the mode (most frequent value):

```python
import pandas as pd
import edaflow

# Create sample data with missing categorical values
df = pd.DataFrame({
    'category': ['A', 'B', 'A', None, 'A'],
    'status': ['Active', None, 'Active', 'Inactive', None],
    'priority': ['High', 'Medium', None, 'Low', 'High'],
    'age': [25, 30, 35, 40, 45]
})

# Impute all categorical columns with mode values
df_imputed = edaflow.impute_categorical_mode(df)

# Impute specific columns only
df_imputed = edaflow.impute_categorical_mode(df, columns=['category', 'status'])

# Impute in place (modifies original DataFrame)
edaflow.impute_categorical_mode(df, inplace=True)
```

**Function Features:**
- 📝 **Smart Detection**: Automatically identifies categorical (object) columns
- 🎯 **Mode Imputation**: Uses the most frequent value for each column
- ⚖️ **Tie Handling**: Gracefully handles mode ties (multiple values with same frequency)
- 🔄 **Inplace Option**: Modify original DataFrame or create new one
- 🛡️ **Safe Handling**: Gracefully handles edge cases like all-missing columns
- 📋 **Detailed Reporting**: Shows exactly what was imputed and mode tie warnings
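
A plain-pandas sketch of mode imputation is shown below. Note that `Series.mode()` returns all tied values in sorted order, so taking the first entry is one simple tie-breaking rule; edaflow's own tie handling may differ:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', None, 'A'],
    'status': ['Active', None, 'Active', 'Inactive', None],
})

for col in df.select_dtypes(include='object'):
    modes = df[col].mode()                 # most frequent values; missing entries are ignored
    if modes.empty:                        # an all-missing column has no mode to impute with
        continue
    fill_value = modes.iloc[0]             # break ties by taking the first (sorted) mode
    n_missing = int(df[col].isnull().sum())
    df[col] = df[col].fillna(fill_value)
    print(f"{col}: filled {n_missing} missing values with mode '{fill_value}'")
```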

#### Complete Imputation Workflow Example

```python
import pandas as pd
import edaflow

# Sample data with both numerical and categorical missing values
df = pd.DataFrame({
    'age': [25, None, 35, None, 45],
    'salary': [50000, None, 70000, 80000, None],
    'category': ['A', 'B', None, 'A', None],
    'status': ['Active', None, 'Active', 'Inactive', None],
    'score': [85.5, 92.0, None, 88.5, None]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50)

# Step 1: Impute numerical columns
print("STEP 1: Numerical Imputation")
df_step1 = edaflow.impute_numerical_median(df)

# Step 2: Impute categorical columns
print("\nSTEP 2: Categorical Imputation")
df_final = edaflow.impute_categorical_mode(df_step1)

print("\nFinal DataFrame (all missing values imputed):")
print(df_final)

# Verify no missing values remain
print(f"\nMissing values remaining: {df_final.isnull().sum().sum()}")
```

**Expected Output:**
```
🔢 Numerical Missing Value Imputation (Median)
=======================================================
🔄 age                  - Imputed 2 values with median: 35.0
🔄 salary               - Imputed 2 values with median: 70000.0
🔄 score                - Imputed 2 values with median: 88.5

📊 Imputation Summary:
   Columns processed: 3
   Columns imputed: 3
   Total values imputed: 6

📝 Categorical Missing Value Imputation (Mode)
=======================================================
🔄 category             - Imputed 2 values with mode: 'A'
🔄 status               - Imputed 2 values with mode: 'Active'

📊 Imputation Summary:
   Columns processed: 2
   Columns imputed: 2
   Total values imputed: 4
```

### Complete EDA Workflow Example

```python
import pandas as pd
import edaflow

# Load your dataset
df = pd.read_csv('customer_data.csv')

print("=== EXPLORATORY DATA ANALYSIS WITH EDAFLOW ===")
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Step 1: Check for missing data
print("\n1. MISSING DATA ANALYSIS")
print("-" * 40)
null_analysis = edaflow.check_null_columns(df, threshold=15)
null_analysis  # Shows color-coded missing data summary

# Step 2: Analyze categorical columns for data type issues
print("\n2. CATEGORICAL DATA ANALYSIS")  
print("-" * 40)
edaflow.analyze_categorical_columns(df, threshold=30)

# Step 3: Convert appropriate columns to numeric automatically
print("\n3. AUTOMATIC DATA TYPE CONVERSION")
print("-" * 40)
df_cleaned = edaflow.convert_to_numeric(df, threshold=30)

# Step 4: Visualize categorical column values in detail
print("\n4. CATEGORICAL VALUES EXPLORATION")
print("-" * 40)
edaflow.visualize_categorical_values(df_cleaned, max_unique_values=10)

# Step 5: Display column type classification
print("\n5. COLUMN TYPE CLASSIFICATION")
print("-" * 40)
column_types = edaflow.display_column_types(df_cleaned)

# Step 6: Handle missing values with imputation
print("\n6. MISSING VALUE IMPUTATION") 
print("-" * 40)
# Impute numerical columns with median
df_numeric_imputed = edaflow.impute_numerical_median(df_cleaned)
# Impute categorical columns with mode
df_fully_imputed = edaflow.impute_categorical_mode(df_numeric_imputed)

# Step 7: Final data review
print("\n7. DATA CLEANING SUMMARY")
print("-" * 40)
print("Original data types:")
print(df.dtypes)
print("\nCleaned data types:")
print(df_fully_imputed.dtypes)
print(f"\nFinal dataset shape: {df_fully_imputed.shape}")
print(f"Missing values remaining: {df_fully_imputed.isnull().sum().sum()}")

# Now your data is ready for further analysis!
# You can proceed with:
# - Statistical analysis
# - Machine learning preprocessing  
# - Visualization
# - Advanced EDA techniques
```

### Integration with Jupyter Notebooks

For the best experience, use these functions in Jupyter notebooks where:
- `check_null_columns()` displays beautiful color-coded tables
- `analyze_categorical_columns()` shows colored terminal output
- You can iterate quickly on data cleaning decisions

```python
# In Jupyter notebook cell
import pandas as pd
import edaflow

df = pd.read_csv('your_data.csv')

# This will display a nicely formatted, color-coded table
edaflow.check_null_columns(df)
```

You can run the categorical analysis the same way:

```python
# Load your dataset
df = pd.read_csv('data.csv')

# Analyze categorical columns to identify potential issues
edaflow.analyze_categorical_columns(df, threshold=35)

# This will identify:
# - Object columns that might actually be numeric (need conversion)
# - Truly categorical columns with their unique values
# - Mixed data type issues
```

### Working with Data (Future Implementation)
```python
import pandas as pd
import edaflow

# Load your dataset
df = pd.read_csv('data.csv')

# Perform EDA workflow
# summary = edaflow.quick_summary(df)
# edaflow.plot_overview(df)
# clean_df = edaflow.clean_data(df)
```

## Project Structure

```
edaflow/
├── edaflow/
│   ├── __init__.py
│   ├── analysis/
│   ├── visualization/
│   └── preprocessing/
├── tests/
├── docs/
├── examples/
├── setup.py
├── requirements.txt
├── README.md
└── LICENSE
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## Development

### Setup Development Environment
```bash
# Clone the repository
git clone https://github.com/evanlow/edaflow.git
cd edaflow

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
flake8 edaflow/
black edaflow/
isort edaflow/
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Changelog

### v0.4.0 (Data Imputation Release)
- **NEW**: `impute_numerical_median()` function for numerical missing value imputation using median
- **NEW**: `impute_categorical_mode()` function for categorical missing value imputation using mode
- **NEW**: Complete 7-function EDA workflow: analyze → convert → visualize → classify → impute
- **NEW**: Smart column detection and validation for imputation functions
- **NEW**: Inplace imputation option with detailed reporting and error handling
- **NEW**: Comprehensive edge case handling (empty DataFrames, all missing values, mode ties)
- Enhanced testing coverage with 54 comprehensive tests achieving 93% coverage

### v0.3.1 (Feature Enhancement)
- **NEW**: `display_column_types()` function for column type classification
- **NEW**: Complete 5-function EDA workflow: analyze → convert → visualize → classify
- **ENHANCED**: Updated comprehensive examples with full 5-function workflow
- Enhanced testing coverage with 32 comprehensive tests covering all functions

### v0.3.0 (Major Feature Release)
- **NEW**: `convert_to_numeric()` function for automatic data type conversion
- **NEW**: `visualize_categorical_values()` function for detailed categorical data exploration
- **NEW**: Smart threshold-based conversion with detailed reporting
- **NEW**: Inplace conversion option for flexible DataFrame modification
- **NEW**: Safe conversion with NaN handling for invalid values
- **NEW**: High-cardinality handling and data quality insights
- Enhanced testing coverage with comprehensive tests

### v0.2.1 (Documentation Enhancement)
- **ENHANCED**: Comprehensive README with detailed usage examples
- **NEW**: Step-by-step examples for both `check_null_columns()` and `analyze_categorical_columns()`
- **NEW**: Complete EDA workflow example showing real-world usage
- **NEW**: Jupyter notebook integration examples
- **IMPROVED**: Color-coding explanations and output interpretation guides

### v0.2.0 (Feature Release)
- **NEW**: `analyze_categorical_columns()` function for categorical data analysis
- **NEW**: Smart detection of object columns that might be numeric
- **NEW**: Color-coded terminal output for better readability
- Enhanced testing coverage with 12 comprehensive tests
- Improved documentation with detailed usage examples

### v0.1.1 (Documentation Update)
- Updated README with improved acknowledgments
- Fixed GitHub repository URLs
- Enhanced PyPI package presentation

### v0.1.0 (Initial Release)
- Basic package structure
- Sample hello() function
- `check_null_columns()` function for missing data analysis
- Core dependencies setup
- Documentation framework

## Support

If you encounter any issues or have questions, please file an issue on the [GitHub repository](https://github.com/evanlow/edaflow/issues).

## Roadmap

- [ ] Core analysis modules
- [ ] Visualization utilities
- [ ] Data preprocessing tools
- [ ] Missing data handling
- [ ] Statistical testing suite
- [ ] Interactive dashboards
- [ ] CLI interface
- [ ] Documentation website

## Acknowledgments

edaflow was developed during the AI/ML course conducted by NTUC LearningHub. I am grateful for the privilege of working alongside my coursemates from Cohort 15. A special thanks to our awesome instructor, Ms. Isha Sehgal, who not only inspired us but also instilled the data science discipline that we now possess.

            
