autocleanss

*   Name: autocleanss
*   Version: 0.1.3
*   Summary: An advanced and automated data cleaning toolkit for Python.
*   Upload time: 2025-08-31 11:28:48
*   Requires Python: >=3.8
*   License: MIT License
*   Keywords: data cleaning, data preprocessing, machine learning, automation, pandas

# AutoCleanSS: Your Automated Data Cleaning Companion

[![Python Version](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/sudipta9749/autocleanss/blob/main/LICENSE)

An advanced and automated data cleaning toolkit for Python, designed to streamline your data preprocessing workflow with intelligent imputation, outlier handling, and comprehensive reporting.


## ✨ Key Features

*   **Automated Cleaning Pipeline**: Orchestrates a complete data cleaning process from duplicates to outliers.
*   **Duplicate Handling**: Automatically identifies and removes duplicate rows to ensure data uniqueness.
*   **Intelligent Missing Value Imputation**: Supports `mean`, `median`, `mode`, and `knn` imputation strategies, tailored for numerical and categorical data, with special handling for missing datetime values.
*   **Robust Outlier Treatment**: Uses the Interquartile Range (IQR) method to detect and cap outliers in numerical columns, limiting the influence of extreme values on downstream statistics and models.
*   **Automatic Data Type Inference**: Dynamically infers and converts appropriate data types (e.g., object to numeric, object to datetime) for better data integrity.
*   **Comprehensive HTML Reports**: Generates detailed, interactive HTML reports summarizing cleaning actions, before/after statistics, and visual distribution plots for key numerical features.
*   **Seamless Pandas Integration**: Built to work effortlessly with Pandas DataFrames, making it intuitive for data scientists and analysts.
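
This README does not show how the package implements these steps, so the following is only a sketch in plain pandas of what duplicate removal, median imputation, and IQR capping conventionally look like. The helper name `cap_outliers_iqr` and the multiplier `k=1.5` are assumptions made for illustration and are not part of the AutoCleanSS API.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds.

    Hypothetical helper for illustration; not taken from AutoCleanSS.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


df = pd.DataFrame({
    "x": [10.0, 11.0, None, 13.0, 14.0, 100.0, 10.0],
    "label": ["a", "b", "a", "c", "b", "d", "a"],
})

# 1. Drop exact duplicate rows (the last row repeats the first).
deduped = df.drop_duplicates().reset_index(drop=True)

# 2. Fill missing numeric values with the column median.
deduped["x"] = deduped["x"].fillna(deduped["x"].median())

# 3. Cap extreme values using the IQR rule.
deduped["x"] = cap_outliers_iqr(deduped["x"])

print(deduped)
```

Capping (rather than dropping) keeps every row, which matters when other columns in the same row still carry useful information.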

## 📦 Installation

To get started with AutoCleanSS, install it from PyPI using pip:

1.  **Install the package and its dependencies**:
    Installing into a virtual environment is recommended.
    ```bash
    pip install autocleanss
    ```
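
To confirm that the installation succeeded, one option is to query the installed distribution's version with the standard library. This is a small sketch; `installed_version` is a hypothetical helper name, and only the distribution name `autocleanss` comes from this page.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("autocleanss"))
```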

## 🚀 Usage

Using `autocleanss` is straightforward. Here's a quick example:

```python
import pandas as pd
from autoclean import AutoClean

# 1. Load your messy data into a Pandas DataFrame
#    (Replace with your actual data loading, e.g., pd.read_csv('your_data.csv'))
data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature_A': [10.1, 11.2, None, 13.4, 14.5, 100.0, 16.7, 17.8, 10.1, 19.9],
    'Feature_B': ['A', 'B', 'A', 'C', 'B', 'D', 'A', 'A', 'A', None],
    'Feature_C': ['01-01-2023', '02-01-2023', '03-01-2023', '04-01-2023', '05-01-2023', None, '07-01-2023', '08-01-2023', '01-01-2023', '09-01-2023'],
    'Feature_D': [5, 6, 5, 7, 8, 9, 10, 11, 5, 12]
}
df = pd.DataFrame(data)

print("--- Original DataFrame ---")
print(df)
print("\n" + "="*40 + "\n")

# 2. Initialize the AutoClean object with your DataFrame
#    You can specify an imputation strategy: 'mean', 'median', 'mode', or 'knn'
cleaner = AutoClean(df, imputation_strategy='knn')

# 3. Run the cleaning process
cleaned_df = cleaner.clean()

print("--- Cleaned DataFrame ---")
print(cleaned_df)
print("\n" + "="*40 + "\n")

# 4. Generate a comprehensive HTML report
#    The report is saved to the path passed as output_path.
cleaner.generate_report(output_path='my_cleaning_report.html')
print("✅ Cleaning report saved to 'my_cleaning_report.html'")
```

This produces a `my_cleaning_report.html` file in your current working directory, detailing all the cleaning steps and showing their impact on your data.
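
The sample data above stores `Feature_C` as date strings, which is the kind of column the type inference listed under Key Features is meant to convert. As a rough illustration of that idea in plain pandas (not the package's actual logic), the sketch below tries a numeric conversion first and a datetime conversion second. The helper name `infer_column` and the day-first format `%d-%m-%Y` are assumptions; the README does not say how the date strings should be read.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def infer_column(series: pd.Series, date_format: str = "%d-%m-%Y") -> pd.Series:
    """Convert a text column to numeric or datetime when every non-null value parses.

    Hypothetical helper for illustration; not taken from AutoCleanSS.
    """
    # Leave columns alone if they already have a numeric or datetime dtype.
    if is_numeric_dtype(series) or is_datetime64_any_dtype(series):
        return series
    non_null = series.dropna()
    if non_null.empty:
        return series
    # Try numeric first: accept only if no non-null value fails to parse.
    if pd.to_numeric(non_null, errors="coerce").notna().all():
        return pd.to_numeric(series, errors="coerce")
    # Then try datetime with an explicit format (assumed day-first here).
    if pd.to_datetime(non_null, format=date_format, errors="coerce").notna().all():
        return pd.to_datetime(series, format=date_format, errors="coerce")
    return series


raw = pd.DataFrame({
    "amount": ["1", "2", None],
    "when": ["01-01-2023", None, "03-01-2023"],
    "tag": ["A", "B", None],
})
typed = pd.DataFrame({col: infer_column(raw[col]) for col in raw.columns})
print(typed.dtypes)
```

Requiring every non-null value to parse keeps genuinely categorical columns such as `tag` untouched, at the cost of skipping columns that contain even one malformed entry.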

To view the report inside a Jupyter notebook, display it with IPython:

```python
from IPython.display import HTML
HTML(filename='my_cleaning_report.html')
```

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "autocleanss",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "data cleaning, data preprocessing, machine learning, automation, pandas",
    "author": null,
    "author_email": "SUDIPTA BISWAS <sudiptabiswas394@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/1b/bc/e2fc5cd7643440e86f0ca68184b1a4d7e1723dbf9d784c359f8635418a31/autocleanss-0.1.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "An advanced and automated data cleaning toolkit for Python.",
    "version": "0.1.3",
    "project_urls": {
        "Bug Tracker": "https://github.com/sudipta9749/autocleanss/issues",
        "Homepage": "https://github.com/sudipta9749/autocleanss"
    },
    "split_keywords": [
        "data cleaning",
        " data preprocessing",
        " machine learning",
        " automation",
        " pandas"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9ee23226034db5004c82089cca2d47ae0f35e489fcf0d2aa87bfe1cf8e2119d6",
                "md5": "0e91110ade348aacba64599397c2be11",
                "sha256": "0df64ec911bbfb78ecc5eeee0d8d401acb0cab2cd95d162581ec8b9cf9cc41be"
            },
            "downloads": -1,
            "filename": "autocleanss-0.1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0e91110ade348aacba64599397c2be11",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 9475,
            "upload_time": "2025-08-31T11:28:47",
            "upload_time_iso_8601": "2025-08-31T11:28:47.355250Z",
            "url": "https://files.pythonhosted.org/packages/9e/e2/3226034db5004c82089cca2d47ae0f35e489fcf0d2aa87bfe1cf8e2119d6/autocleanss-0.1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1bbce2fc5cd7643440e86f0ca68184b1a4d7e1723dbf9d784c359f8635418a31",
                "md5": "c0f4bee2ba4374d0af42a53d85b25155",
                "sha256": "48c2eb9ee5cc506b7823401895ddf2ff7b09acdee22dd339ef650b94358a543b"
            },
            "downloads": -1,
            "filename": "autocleanss-0.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "c0f4bee2ba4374d0af42a53d85b25155",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 9362,
            "upload_time": "2025-08-31T11:28:48",
            "upload_time_iso_8601": "2025-08-31T11:28:48.714906Z",
            "url": "https://files.pythonhosted.org/packages/1b/bc/e2fc5cd7643440e86f0ca68184b1a4d7e1723dbf9d784c359f8635418a31/autocleanss-0.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-31 11:28:48",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "sudipta9749",
    "github_project": "autocleanss",
    "github_not_found": true,
    "lcname": "autocleanss"
}
        