aydie-dataset-cleaner

Name: aydie-dataset-cleaner
Version: 1.0.0
Summary: A Python library to validate, profile, and clean datasets (CSV, Excel, Parquet) for machine learning workflows.
Upload time: 2025-08-20 19:09:12
Requires Python: >=3.8
Keywords: dataset, cleaning, validation, ml, data-science, parquet, csv, excel
Requirements: pandas, numpy, openpyxl, pyarrow, scikit-learn, jinja2, pytest, black
![Aydie Banner](https://aydie.in/banner.jpg)

# aydie-dataset-cleaner

[![PyPI version](https://badge.fury.io/py/aydie-dataset-cleaner.svg)](https://badge.fury.io/py/aydie-dataset-cleaner)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python Versions](https://img.shields.io/pypi/pyversions/aydie-dataset-cleaner.svg)](https://pypi.org/project/aydie-dataset-cleaner/)

**A powerful Python toolkit for validating, cleaning, and preparing datasets for machine learning.**

---

`aydie-dataset-cleaner` is a comprehensive library designed to streamline the data preprocessing pipeline. It provides a structured, automated, and repeatable workflow for identifying and fixing common data quality issues in CSV, Excel, and Parquet files, ensuring your data is reliable and ready for analysis.

This library is part of the **Aydie** family of developer tools.

## Why `aydie-dataset-cleaner`?

In most data-driven projects, the bulk of the work is data preparation. Messy, inconsistent data leads to unreliable models and flawed insights. `aydie-dataset-cleaner` automates the tedious parts of this process, allowing you to focus on building great models.

- **Reliability & Consistency**: Apply a standardized set of validation rules to every dataset, ensuring consistent data quality across all your projects.
- **Increased Efficiency**: Stop writing boilerplate data cleaning code. With a few lines of code, you can load, validate, clean, and report on any dataset.
- **Clear Insights**: Generate beautiful HTML and machine-readable JSON reports to understand exactly what's wrong with your data and what was done to fix it.
- **Extensible & Controllable**: You have full control over the cleaning process, from choosing outlier detection methods to defining how missing values are handled.

## Core Features

-   **Versatile File Loading**: Seamlessly load data from **CSV**, **Excel (.xlsx, .xls)**, and **Parquet** files into pandas DataFrames.
-   **Comprehensive Validation**: Run a full suite of checks for common issues like **missing values**, **duplicate rows**, and **inconsistent data types**.
-   **Advanced Outlier Detection**: Identify outliers in your numerical data using multiple statistical methods:
    -   `iqr` (Interquartile Range)
    -   `zscore` (Standard Score)
    -   `mad` (Median Absolute Deviation)
    -   `percentile` (Trimming extreme values)
-   **Automated Cleaning**: Systematically fix issues found during validation, including filling missing data, removing duplicates, and correcting data types.
-   **Rich Reporting**: Automatically generate a detailed **HTML report** for easy visual inspection or a **JSON report** for programmatic use.

## Installation

You can install `aydie-dataset-cleaner` directly from PyPI:

```bash
pip install aydie-dataset-cleaner
```

## Full Workflow Example

Here’s a quick example demonstrating the end-to-end workflow.

```python
import pandas as pd
import numpy as np
from aydie_dataset_cleaner import file_loader, validator, cleaner, reporter

# --- 1. SETUP: Create a messy DataFrame for demonstration ---
data = {
    'product_id': ['A101', 'A102', 'A103', 'A104', 'A101'],
    'price': [1200.50, 75.00, np.nan, 250.75, 1200.50],
    'stock_quantity': [15, 200, 30, np.nan, 15],
    'rating': [4.5, 4.0, 3.5, 4.8, 4.5],
    'region': ['USA', 'EU', 'USA', 'USA', 'UK_typo']
}
dirty_df = pd.DataFrame(data)
print("--- Original Messy DataFrame ---")
print(dirty_df)

# --- 2. VALIDATE: Run all validation checks ---
print("\n--- Validating Dataset ---")
v = validator.DatasetValidator(dirty_df)
validation_results = v.run_all_checks()

# --- 3. REPORT: Generate a human-readable HTML report ---
print("\n--- Generating Report ---")
r = reporter.ReportGenerator(validation_results)
r.to_html('validation_report.html')
print("Validation report saved to 'validation_report.html'")

# --- 4. CLEAN: Clean the dataset based on the validation results ---
print("\n--- Cleaning Dataset ---")
c = cleaner.DatasetCleaner(dirty_df, validation_results)
cleaned_df = c.clean_dataset(missing_value_strategy='median')

# --- 5. VERIFY: Display the cleaned data ---
print("\n--- Cleaned DataFrame ---")
print(cleaned_df)
```

## Module-by-Module Examples

### `file_loader`: Loading Your Data

The `load_dataset` function automatically detects the file type and loads it into a DataFrame.

```python
from aydie_dataset_cleaner import file_loader

# Load a CSV file with specific options
df_csv = file_loader.load_dataset('data.csv', sep=';')

# Load a specific sheet from an Excel file
df_excel = file_loader.load_dataset('data.xlsx', sheet_name='SalesData')

# Load a Parquet file
df_parquet = file_loader.load_dataset('data.parquet')
```
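Under the hood, this kind of loader boils down to dispatching on the file extension. The helper below is a hypothetical illustration of that pattern (the name `load_any` and the dispatch table are my own, not the library's actual implementation):

```python
from pathlib import Path

import pandas as pd

# Hypothetical extension-to-reader dispatch table; illustrative only,
# not taken from aydie-dataset-cleaner's source.
_READERS = {
    '.csv': pd.read_csv,
    '.xlsx': pd.read_excel,
    '.xls': pd.read_excel,
    '.parquet': pd.read_parquet,
}

def load_any(path, **kwargs):
    """Pick a pandas reader based on the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        reader = _READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    # Extra keyword arguments (sep=, sheet_name=, ...) pass through
    # to the underlying pandas reader.
    return reader(path, **kwargs)
```

This is also why pass-through options like `sep=';'` or `sheet_name='SalesData'` work: they are forwarded to the matching pandas reader.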

### `validator`: Finding Data Issues

The `DatasetValidator` is the core of the library, identifying all potential issues.

```python
from aydie_dataset_cleaner import validator

# Assume 'df' is your loaded DataFrame
v = validator.DatasetValidator(df)

# Check for missing values
missing_report = v.check_missing_values()
# {'price': {'count': 1, 'percentage': 20.0}}

# Check for duplicate rows
duplicate_report = v.check_duplicate_rows()
# {'count': 1}

# Check for outliers using different methods
# The run_all_checks() method runs 'iqr' by default
outliers_iqr = v.check_outliers(method='iqr', multiplier=1.5)
outliers_zscore = v.check_outliers(method='zscore', threshold=3)
outliers_mad = v.check_outliers(method='mad', threshold=3.5)

# Run all checks at once
full_report = v.run_all_checks()
```
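For intuition, here is roughly what the `iqr`, `zscore`, and `mad` methods compute, sketched in plain pandas/NumPy. This is an illustrative sketch of the statistics, not the library's actual implementation:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

# IQR: flag values outside [Q1 - k*IQR, Q3 + k*IQR] (k = multiplier)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Z-score: flag values more than `threshold` standard deviations from
# the mean. Note: in tiny samples the outlier inflates both the mean
# and the std, so this method can miss it.
z = (s - s.mean()) / s.std()
z_mask = z.abs() > 3

# MAD: robust variant using the median absolute deviation
med = s.median()
mad = (s - med).abs().median()
mad_mask = 0.6745 * (s - med).abs() / mad > 3.5

print(s[iqr_mask].tolist())  # [95]
```

The MAD method flags the same point here, while the z-score method does not, which is why having several methods available matters.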

### `cleaner`: Fixing Data Issues

The `DatasetCleaner` uses the report from the validator to apply fixes.

```python
from aydie_dataset_cleaner import cleaner

# Assume 'df' is your DataFrame and 'validation_results' is your report
c = cleaner.DatasetCleaner(df, validation_results)

# Handle missing values with different strategies
df_mean_filled = c.handle_missing_values(strategy='mean')
df_median_filled = c.handle_missing_values(strategy='median')
df_mode_filled = c.handle_missing_values(strategy='mode')

# Remove duplicate rows
df_no_duplicates = c.handle_duplicate_rows()

# Run the entire cleaning pipeline
cleaned_df = c.clean_dataset(missing_value_strategy='mean')
```
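In plain pandas terms, the median-fill and de-duplication steps above amount to something like the following. This is an illustrative sketch of the underlying operations, not the library's internals:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [1200.50, 75.00, np.nan, 250.75, 1200.50],
    'region': ['USA', 'EU', 'USA', 'USA', 'USA'],
})

# Fill NaNs in numeric columns with each column's median
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Drop exact duplicate rows and renumber the index
df = df.drop_duplicates().reset_index(drop=True)

print(df)
```

The `mean` and `mode` strategies swap `median()` for `mean()` or `mode().iloc[0]` respectively.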

### `reporter`: Generating Reports

The `ReportGenerator` creates clean, shareable reports from the validation results.

```python
from aydie_dataset_cleaner import reporter

# Assume 'validation_results' is your report
r = reporter.ReportGenerator(validation_results)

# Create a machine-readable JSON file
r.to_json('report.json')

# Create a beautiful, human-readable HTML file
r.to_html('report.html')
```

## Contributing

Contributions are welcome! If you have an idea for a new feature, find a bug, or want to improve the documentation, please open an issue or submit a pull request on our [GitHub repository](https://github.com/aydiegithub/aydie-dataset-cleaner).

## License

This project is licensed under the **MIT License**. See the `LICENSE` file for details.

---

## Connect with Me

[![GitHub](https://img.shields.io/badge/GitHub-Profile-181717?logo=github&logoColor=white)](https://github.com/aydiegithub)
[![Source Code](https://img.shields.io/badge/Source_Code-DataSetCleaner-2f80ed?logo=github&logoColor=white)](https://github.com/aydiegithub/aydie-dataset-cleaner)
[![Website](https://img.shields.io/badge/Website-aydie.in-2ea44f?logo=googlechrome&logoColor=white)](https://aydie.in)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-0a66c2?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/aydiemusic)
[![X](https://img.shields.io/badge/X-Profile-black?logo=x&logoColor=white)](https://x.com/aydiemusic)
[![Instagram](https://img.shields.io/badge/Instagram-Profile-e4405f?logo=instagram&logoColor=white)](https://instagram.com/aydiemusic)
[![YouTube](https://img.shields.io/badge/YouTube-Channel-ff0000?logo=youtube&logoColor=white)](https://youtube.com/@aydiemusic)
[![GitLab](https://img.shields.io/badge/GitLab-Profile-fca121?logo=gitlab&logoColor=white)](https://gitlab.com/aydie)
[![Email](https://img.shields.io/badge/Email-developer@aydie.in-d14836?logo=gmail&logoColor=white)](mailto:developer@aydie.in)

            
