| Field | Value |
| --- | --- |
| Name | nullaxe |
| Version | 0.4.2 |
| Summary | A data cleaning library for Pandas and Polars DataFrames with a simple, chainable API. |
| Author email | John Tocci <john@johntocci.com> |
| Upload time | 2025-09-11 00:20:36 |
| Requires Python | >=3.8 |
| License | MIT License (Copyright (c) 2025 John Tocci) |
| Keywords | data cleaning, pandas, polars, data science, etl, data processing, nullaxe |
<h1 align="center">Nullaxe</h1>
<div align="center">
[PyPI](https://pypi.org/project/nullaxe/)
[Python](https://www.python.org/downloads/)
[License: MIT](https://opensource.org/licenses/MIT)
[Code style: black](https://github.com/psf/black)
</div>
**Nullaxe** is a comprehensive, high-performance data cleaning and preprocessing library for Python, designed to work seamlessly with both **pandas** and **polars** DataFrames. With its intuitive, chainable API, Nullaxe transforms the traditionally tedious process of data cleaning into an elegant, readable workflow.
---
## Key Features
- **Fluent, Chainable API**: Clean your data in a single, readable chain of commands
- **Dual Backend Support**: Works effortlessly with both pandas and polars DataFrames
- **Comprehensive Cleaning**: From basic cleaning to advanced data extraction and transformation
- **Display Formatting Pipeline**: Format columns for presentation (currency, percentages, thousands separators, date formatting, truncation, title-cased headers)
- **Intelligent Outlier Detection**: Multiple methods including IQR and Z-score analysis
- **Advanced Data Extraction**: Extract emails, phone numbers, and custom patterns with regex
- **Smart Type Handling**: Automatic type inference and standardization
- **Performance Optimized**: Designed for speed and memory efficiency
- **Extensible**: Easily add custom cleaning functions to your pipeline
---
## Installation
Install Nullaxe easily with pip:
```bash
pip install nullaxe
```
**Requirements:**
- Python 3.8+
- pandas >= 1.0
- polars >= 0.19
---
## Quick Start
Here's how to transform messy data into clean, analysis-ready datasets:
```python
import pandas as pd
import nullaxe as nlx
# Create a messy sample dataset
data = {
    'First Name': [' John ', 'Jane', ' Peter', 'JOHN', None],
    'Last Name': ['Smith', 'Doe', 'Jones', 'Smith', 'Brown'],
    'Age': [28, 34, None, 28, 45],
    'Email': ['john@email.com', 'invalid-email', 'peter@test.org', 'john@email.com', None],
    'Phone': ['123-456-7890', '(555) 123-4567', 'not-a-phone', '123.456.7890', '+1-800-555-0199'],
    'Salary': ['$70,000', '80000', '$65,000.50', '$70,000', '€75,000'],
    'Active': ['True', 'False', 'yes', 'TRUE', 'N'],
    'Notes': [' Important client ', '', ' Follow up ', None, 'VIP']
}
df = pd.DataFrame(data)

# Clean the entire dataset with a single chain
clean_df = (
    nlx(df)
    .clean_column_names()            # Standardize column names
    .fill_missing(value='Unknown')   # Fill missing values
    .remove_whitespace()             # Clean whitespace
    .remove_duplicates()             # Remove duplicate rows
    .standardize_booleans()          # Convert boolean-like values
    .extract_email()                 # Extract email addresses
    .extract_phone_numbers()         # Extract phone numbers
    .extract_and_clean_numeric()     # Extract numeric values from strings
    .drop_single_value_columns()     # Remove columns with only one value
    .remove_outliers(method='iqr')   # Handle outliers
    .format_for_display(             # NEW: Format for presentation
        rules={
            'salary': {'type': 'currency', 'symbol': '$', 'decimals': 2},
            'age': {'type': 'thousands'},
        },
        column_case='title'
    )
    .to_df()                         # Return the cleaned, formatted DataFrame
)

print(clean_df.head())
```
---
## Complete API Reference
### Initialization
```python
import nullaxe as nlx
# Initialize with any DataFrame
cleaner = nlx(df) # Works with pandas or polars DataFrames
```
### Column Name Standardization
Transform column names to consistent formats:
```python
# General column cleaning with case conversion
.clean_column_names(case='snake') # Options: 'snake', 'camel', 'pascal', 'kebab', 'title', 'lower', 'screaming_snake'
# Specific case conversions
.snakecase() # column_name
.camelcase() # columnName
.pascalcase() # ColumnName
.kebabcase() # column-name
.titlecase() # Column Name
.lowercase() # column name
.screaming_snakecase() # COLUMN_NAME
```
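The conversions above follow standard casing rules. As an illustration only (a hypothetical stdlib helper, not Nullaxe's internal implementation), snake_case conversion can be sketched like this:

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a column name like 'First Name' or 'columnName' to snake_case."""
    # Insert an underscore before a capital that follows a lowercase letter or digit.
    name = re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name.strip())
    # Replace runs of spaces/hyphens with underscores, then lowercase.
    return re.sub(r'[\s\-]+', '_', name).lower()

print(to_snake_case('First Name'))   # first_name
print(to_snake_case('columnName'))   # column_name
```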
### Data Deduplication
Remove duplicate data efficiently:
```python
.remove_duplicates() # Remove duplicate rows across all columns
```
### Missing Data Management
Handle missing values with precision:
```python
# Fill missing values
.fill_missing(value=0) # Fill all columns with 0
.fill_missing(value='Unknown', subset=['name']) # Fill specific columns
# Drop missing values
.drop_missing() # Drop rows with any missing values
.drop_missing(how='all') # Drop rows where all values are missing
.drop_missing(thresh=3) # Keep rows with at least 3 non-null values
.drop_missing(axis='columns') # Drop columns with missing values
.drop_missing(subset=['name', 'email']) # Consider only specific columns
```
### Text and Whitespace Cleaning
Clean and standardize text data:
```python
.remove_whitespace() # Remove leading/trailing whitespace
.replace_text('old', 'new') # Replace text in all columns
.replace_text('old', 'new', subset=['name']) # Replace in specific columns
.remove_punctuation() # Remove punctuation marks
.remove_punctuation(subset=['description']) # Remove from specific columns
```
### Column Management
Manage DataFrame structure:
```python
.drop_single_value_columns() # Remove columns with only one unique value
.remove_unwanted_rows_and_cols() # Remove rows/cols with unwanted values
.remove_unwanted_rows_and_cols(       # Custom unwanted values
    unwanted_values=['', 'N/A', 'NULL']
)
```
### Outlier Detection and Handling
Sophisticated outlier management:
```python
# General outlier handling
.handle_outliers() # Default: IQR method, factor=1.5
.handle_outliers(method='zscore', factor=2.0) # Z-score method
.handle_outliers(subset=['salary', 'age']) # Specific columns only
# Cap outliers (replace with threshold values)
.cap_outliers() # Cap using IQR method
.cap_outliers(method='zscore', factor=2.5) # Cap using Z-score
# Remove outlier rows entirely
.remove_outliers() # Remove rows with outliers
.remove_outliers(method='iqr', factor=1.5) # Custom parameters
```
**Outlier Detection Methods:**
- **IQR (Interquartile Range)**: `Q1 - factor*IQR` to `Q3 + factor*IQR`
- **Z-Score**: Values beyond `factor` standard deviations from the mean
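Both methods reduce to simple threshold computations. This stdlib-only sketch (illustrative, not Nullaxe's internal code) shows the bounds each method flags:

```python
import statistics

def iqr_bounds(values, factor=1.5):
    """IQR fences: values outside [Q1 - factor*IQR, Q3 + factor*IQR] are outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def zscore_outliers(values, factor=2.0):
    """Flag values more than `factor` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > factor * sd]

salaries = [52, 55, 53, 54, 51, 56, 250]  # 250 is an obvious outlier
low, high = iqr_bounds(salaries)
print([v for v in salaries if v < low or v > high])  # [250]
```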
### Data Type Standardization
Convert and standardize data types:
```python
# Boolean standardization
.standardize_booleans() # Convert 'yes/no', 'true/false', etc.
.standardize_booleans(
    true_values=['yes', 'y', '1', 'true'],    # Custom true values
    false_values=['no', 'n', '0', 'false'],   # Custom false values
    columns=['active', 'verified']            # Specific columns
)
```
**Default Boolean Mappings:**
- **True**: 'true', '1', 't', 'yes', 'y', 'on'
- **False**: 'false', '0', 'f', 'no', 'n', 'off'
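The mapping itself is a case-insensitive lookup; a minimal stdlib sketch of the behavior described above (an illustration, not the library's actual code):

```python
TRUE_VALUES = {'true', '1', 't', 'yes', 'y', 'on'}
FALSE_VALUES = {'false', '0', 'f', 'no', 'n', 'off'}

def standardize_boolean(value):
    """Map a boolean-like value to True/False; leave unrecognized values unchanged."""
    key = str(value).strip().lower()
    if key in TRUE_VALUES:
        return True
    if key in FALSE_VALUES:
        return False
    return value

print([standardize_boolean(v) for v in ['True', 'False', 'yes', 'TRUE', 'N']])
# [True, False, True, True, False]
```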
### Advanced Data Extraction
Extract structured data from unstructured text:
```python
# Email extraction
.extract_email() # Extract emails from all columns
.extract_email(subset=['contact_info']) # From specific columns
# Phone number extraction
.extract_phone_numbers() # Extract phone numbers
.extract_phone_numbers(subset=['contact']) # From specific columns
# Numeric data extraction and cleaning
.extract_and_clean_numeric() # Extract numbers from text
.extract_and_clean_numeric(subset=['prices']) # From specific columns
# Custom regex extraction (interactive)
.extract_with_regex() # Prompts for regex pattern
.extract_with_regex(subset=['text_column']) # From specific columns
# Combined numeric cleaning
.clean_numeric() # Extract + outlier handling
.clean_numeric(method='zscore', factor=2.0) # Custom outlier parameters
```
### Display / Presentation Formatting (NEW in 0.3.0)
Format cleaned data for reports, dashboards, and exports:
```python
.format_for_display(
    rules={
        'price': {'type': 'currency', 'symbol': '$', 'decimals': 2},
        'growth': {'type': 'percentage', 'decimals': 1},
        'volume': {'type': 'thousands'},
        'description': {'type': 'truncate', 'length': 30},
        'event_date': {'type': 'datetime', 'format': '%B %d, %Y'}
    },
    column_case='title'  # or None to preserve original column names
)
```
Supported rule types:
- `currency`: symbol + thousands + decimal precision
- `percentage`: multiplies by 100 + suffix `%`
- `thousands`: adds thousands separators, removes trailing `.0` for whole floats
- `truncate`: shortens long text and appends `...`
- `datetime`: parses and formats date/time strings
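For illustration, here is a stdlib-only sketch of what the currency, thousands, and truncate rules do (assumed behavior based on the descriptions above, not the library's actual implementation):

```python
def format_currency(value, symbol='$', decimals=2):
    """Currency rule: symbol + thousands separators + fixed decimals."""
    return f"{symbol}{value:,.{decimals}f}"

def format_thousands(value):
    """Thousands rule: add separators; drop the trailing .0 on whole floats."""
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return f"{value:,}"

def truncate(text, length=30):
    """Truncate rule: shorten long text and append '...'."""
    return text if len(text) <= length else text[:length] + '...'

print(format_currency(70000.5))   # $70,000.50
print(format_thousands(70000.0))  # 70,000
print(truncate('a very long product description here', 10))  # a very lon...
```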
You can also call the function directly:
```python
from nullaxe.functions import format_for_display
formatted = format_for_display(df, rules=..., column_case='title')
```
### Output
```python
.to_df() # Return the cleaned DataFrame
```
---
## Advanced Usage Examples
### Real-World Data Cleaning Pipeline
```python
import pandas as pd
import nullaxe as nlx
# Load messy customer data
df = pd.read_csv('messy_customer_data.csv')
# Comprehensive cleaning + formatting pipeline
clean_customers = (
    nlx(df)
    .clean_column_names(case='snake')
    .fill_missing(value='Not Provided')
    .remove_whitespace()
    .standardize_booleans(columns=['is_active', 'newsletter_opt_in'])
    .extract_email(subset=['contact_info'])
    .extract_phone_numbers(subset=['contact_info'])
    .extract_and_clean_numeric(subset=['revenue', 'age'])
    .remove_outliers(method='iqr', factor=2.0, subset=['revenue'])
    .drop_single_value_columns()
    .remove_duplicates()
    .format_for_display(
        rules={
            'revenue': {'type': 'currency', 'symbol': '$', 'decimals': 2},
            'age': {'type': 'thousands'},
            'signup_date': {'type': 'datetime', 'format': '%Y-%m-%d'}
        },
        column_case='title'
    )
    .to_df()
)
```
### Financial Data Processing
```python
financial_clean = (
    nlx(transactions_df)
    .clean_column_names(case='snake')
    .fill_missing(value=0, subset=['amount'])
    .extract_and_clean_numeric(subset=['amount', 'fee'])
    .standardize_booleans(columns=['is_recurring'])
    .cap_outliers(method='zscore', factor=3.0, subset=['amount'])
    .remove_whitespace()
    .format_for_display(
        rules={'amount': {'type': 'currency', 'symbol': '$', 'decimals': 2}},
        column_case='title'
    )
    .to_df()
)
```
### Survey Data Standardization
```python
survey_clean = (
    nlx(survey_df)
    .clean_column_names(case='snake')
    .standardize_booleans(
        true_values=['Yes', 'Y', 'Agree', 'True', '1'],
        false_values=['No', 'N', 'Disagree', 'False', '0']
    )
    .fill_missing(value='No Response')
    .remove_whitespace()
    .drop_single_value_columns()
    .format_for_display(
        rules={'age': {'type': 'thousands'}},
        column_case='title'
    )
    .to_df()
)
```
---
## Method Chaining Benefits
Nullaxe's chainable API provides several advantages:
1. **Readability**: Each step is clear and self-documenting
2. **Maintainability**: Easy to add, remove, or reorder operations
3. **Performance**: Optimized internal operations reduce memory overhead
4. **Flexibility**: Mix and match operations based on your data's needs
```python
# Traditional approach (verbose and hard to follow)
df = remove_duplicates(df)
df = fill_missing(df, value='Unknown')
df = standardize_booleans(df)
df = remove_outliers(df, method='iqr')
# Nullaxe approach (clean and readable)
df = (nlx(df)
      .remove_duplicates()
      .fill_missing(value='Unknown')
      .standardize_booleans()
      .remove_outliers(method='iqr')
      .format_for_display(rules={'value': {'type': 'currency'}}, column_case='title')
      .to_df())
```
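Under the hood, a fluent API like this is typically built by having each method update internal state and return the wrapper object itself. A minimal illustrative sketch of the pattern (not Nullaxe's actual code):

```python
class Pipeline:
    """Minimal fluent wrapper: each step transforms the data and returns self."""

    def __init__(self, data):
        self._data = list(data)

    def remove_duplicates(self):
        seen, out = set(), []
        for row in self._data:
            if row not in seen:
                seen.add(row)
                out.append(row)
        self._data = out
        return self  # returning self is what enables chaining

    def fill_missing(self, value):
        self._data = [value if row is None else row for row in self._data]
        return self

    def to_list(self):
        return self._data

print(Pipeline([1, 1, None, 2]).remove_duplicates().fill_missing(0).to_list())
# [1, 0, 2]
```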
---
## Performance Tips
1. **Use polars for large datasets** - Nullaxe automatically optimizes for polars' performance
2. **Chain operations efficiently** - Nullaxe minimizes intermediate copies
3. **Specify subsets** - Process only the columns you need
4. **Choose appropriate outlier methods** - IQR is faster, Z-score is more sensitive
```python
# Performance-optimized pipeline
result = (
    nlx(large_df)
    .remove_duplicates()
    .drop_single_value_columns()
    .fill_missing(value=0, subset=['numeric_cols'])
    .remove_outliers(method='iqr', subset=['revenue'])
    .format_for_display(rules={'revenue': {'type': 'currency'}}, column_case=None)
    .to_df()
)
```
---
## Testing and Quality Assurance
Nullaxe includes comprehensive test coverage with 118+ test cases covering:
- pandas and polars compatibility
- Edge cases and error handling
- Performance optimization
- Data integrity preservation
- Type safety and validation
- Presentation formatting (currency, percentage, thousands, truncation, datetime, column casing)
Run tests locally:
```bash
git clone https://github.com/johntocci/nullaxe
cd nullaxe
pip install -e .[dev]
pytest tests/
```
---
## Contributing
We welcome contributions! Nullaxe is designed to be extensible and community-driven.
### How to Contribute
1. **Fork the repository** on GitHub
2. **Create a feature branch**: `git checkout -b feature/amazing-feature`
3. **Add your changes** with comprehensive tests
4. **Follow the coding standards** (black formatting, type hints)
5. **Run the test suite**: `pytest tests/`
6. **Submit a pull request** with a clear description
### Development Setup
```bash
# Clone and setup development environment
git clone https://github.com/johntocci/nullaxe
cd nullaxe
pip install -e .[dev]
# Run tests
pytest tests/
# Format code
black src/ tests/
```
### Adding New Functions
Nullaxe's modular architecture makes it easy to add new cleaning functions:
1. Create your function in `src/nullaxe/functions/`
2. Add it to the imports in `src/nullaxe/functions/__init__.py`
3. Add a corresponding method to the `Nullaxe` class
4. Write comprehensive tests in `tests/`
---
## Changelog
- **Migration note**: this package was previously published as `sanex`; replace `import sanex as sx` with `import nullaxe as nlx` and `sx(` with `nlx(`.
### Version 0.3.0
- Added `format_for_display` function + chain method for presentation formatting
- Added support for currency, percentage, thousands, truncate, datetime formatting
- Title-case header option integrated into formatting step
- Refactored internal formatting for pandas + polars parity
- Expanded test suite (now 118+ tests) including display formatting
- Improved thousands formatting (no trailing .0 on whole floats)
### Version 0.2.0
- Added comprehensive data extraction capabilities
- Enhanced outlier detection with multiple methods
- Improved text processing and punctuation removal
- Fixed boolean standardization edge cases
- Resolved missing data handling in complex workflows
- Performance optimizations for large datasets
- Comprehensive documentation updates
### Version 0.1.0
- Initial release with core cleaning functionality
- Chainable API implementation
- pandas and polars support
---
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
## Acknowledgments
- Built with love for the data science community
- Inspired by the need for simple, powerful data cleaning tools
- Thanks to all contributors and users who help improve Nullaxe
---
<div align="center">
**Made with love by [John Tocci](https://github.com/johntocci)**
[Star us on GitHub](https://github.com/johntocci/nullaxe) | [Report Issues](https://github.com/johntocci/nullaxe/issues) | [Request Features](https://github.com/johntocci/nullaxe/issues)
</div>
"summary": "A data cleaning library for Pandas and Polars DataFrames with a simple, chainable API.",
"version": "0.4.2",
"project_urls": {
"Bug Tracker": "https://github.com/johntocci/nullaxe/issues",
"Homepage": "https://github.com/johntocci/nullaxe",
"Repository": "https://github.com/johntocci/nullaxe"
},
"split_keywords": [
"data cleaning",
" pandas",
" polars",
" data science",
" etl",
" data processing",
" nullaxe"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "94b658aef0a7cc0b0d4216a25b828cadc14e74a714c27c8495dd32d7ef4f0981",
"md5": "e400582fc99ba4489adef93db9cfb480",
"sha256": "0d0e73f29e822094ef51940fd9648a9ffd5cbf565dc248dc9161fcf8dea8226c"
},
"downloads": -1,
"filename": "nullaxe-0.4.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e400582fc99ba4489adef93db9cfb480",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 40011,
"upload_time": "2025-09-11T00:20:34",
"upload_time_iso_8601": "2025-09-11T00:20:34.814170Z",
"url": "https://files.pythonhosted.org/packages/94/b6/58aef0a7cc0b0d4216a25b828cadc14e74a714c27c8495dd32d7ef4f0981/nullaxe-0.4.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "648766876f9b60f4b751b8c0ca291e6616bce99be0848d7c5f376479c27e0a44",
"md5": "fe3883976cf3d468dd351d4da1182aef",
"sha256": "ef6f95685854a5ef22d62427b9cb06acce4b2b1ac290408da297a80482026701"
},
"downloads": -1,
"filename": "nullaxe-0.4.2.tar.gz",
"has_sig": false,
"md5_digest": "fe3883976cf3d468dd351d4da1182aef",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 47008,
"upload_time": "2025-09-11T00:20:36",
"upload_time_iso_8601": "2025-09-11T00:20:36.101692Z",
"url": "https://files.pythonhosted.org/packages/64/87/66876f9b60f4b751b8c0ca291e6616bce99be0848d7c5f376479c27e0a44/nullaxe-0.4.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-11 00:20:36",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "johntocci",
"github_project": "nullaxe",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "nullaxe"
}