# AutoCleanSS: Your Automated Data Cleaning Companion
An advanced and automated data cleaning toolkit for Python, designed to streamline your data preprocessing workflow with intelligent imputation, outlier handling, and comprehensive reporting.
## ✨ Key Features
* **Automated Cleaning Pipeline**: Orchestrates a complete data cleaning process from duplicates to outliers.
* **Duplicate Handling**: Automatically identifies and removes duplicate rows to ensure data uniqueness.
* **Intelligent Missing Value Imputation**: Supports several strategies, including `mean`, `median`, `mode`, and `knn` imputation, tailored for numerical and categorical data, with special handling for missing datetime values.
* **Robust Outlier Treatment**: Employs the Interquartile Range (IQR) method to detect and cap outliers in numerical columns, limiting the influence of extreme values on downstream statistics and models.
* **Automatic Data Type Inference**: Dynamically infers and converts appropriate data types (e.g., object to numeric, object to datetime) for better data integrity.
* **Comprehensive HTML Reports**: Generates detailed, interactive HTML reports summarizing cleaning actions, before/after statistics, and visual distribution plots for key numerical features.
* **Seamless Pandas Integration**: Built to work effortlessly with Pandas DataFrames, making it intuitive for data scientists and analysts.
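To make the IQR-based outlier treatment listed above concrete, here is a minimal sketch in plain pandas. It assumes the common convention of capping at 1.5 × IQR beyond the first and third quartiles; the function name `cap_outliers_iqr` and its `factor` parameter are hypothetical and are not part of the AutoCleanSS API, whose internals may differ.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, factor: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - factor*IQR, Q3 + factor*IQR] to those bounds.

    Hypothetical helper for illustration; not AutoCleanSS's implementation.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    # clip() replaces anything beyond the bounds with the bound itself
    return series.clip(lower=lower, upper=upper)


values = pd.Series([10.1, 11.2, 13.4, 14.5, 100.0, 16.7, 17.8, 19.9])
capped = cap_outliers_iqr(values)
print(capped)
```

In this sample, the value `100.0` is pulled down to the upper bound while the other values are left unchanged.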
## 📦 Installation
To get started with AutoCleanSS, install it from PyPI using pip. Installing in a virtual environment is recommended.
```bash
pip install autocleanss
```
## 🚀 Usage
Using `autocleanss` is straightforward. Here's a quick example:
```python
import pandas as pd
from autoclean import AutoClean
# 1. Load your messy data into a Pandas DataFrame
# (Replace with your actual data loading, e.g., pd.read_csv('your_data.csv'))
data = {
'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Feature_A': [10.1, 11.2, None, 13.4, 14.5, 100.0, 16.7, 17.8, 10.1, 19.9],
'Feature_B': ['A', 'B', 'A', 'C', 'B', 'D', 'A', 'A', 'A', None],
'Feature_C': ['01-01-2023', '02-01-2023', '03-01-2023', '04-01-2023', '05-01-2023', None, '07-01-2023', '08-01-2023', '01-01-2023', '09-01-2023'],
'Feature_D': [5, 6, 5, 7, 8, 9, 10, 11, 5, 12]
}
df = pd.DataFrame(data)
print("--- Original DataFrame ---")
print(df)
print("\n" + "="*40 + "\n")
# 2. Initialize the AutoClean object with your DataFrame
# You can specify an imputation strategy: 'mean', 'median', 'mode', or 'knn'
cleaner = AutoClean(df, imputation_strategy='knn')
# 3. Run the cleaning process
cleaned_df = cleaner.clean()
print("--- Cleaned DataFrame ---")
print(cleaned_df)
print("\n" + "="*40 + "\n")
# 4. Generate a comprehensive HTML report
# The report is saved to the path passed as output_path (here, relative to your current directory).
cleaner.generate_report(output_path='my_cleaning_report.html')
print("✅ Cleaning report saved to 'my_cleaning_report.html'")
```
This will produce a `my_cleaning_report.html` file in your project directory, detailing all the cleaning steps and showing the impact on your data.
To view the report in a Jupyter notebook, you can use:
```python
from IPython.display import HTML
HTML(filename='my_cleaning_report.html')
```
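For a quick check of a cleaning run without opening the report, a few pandas calls can summarize the before/after state. The sketch below is an illustration only: `summarize_cleaning` is a hypothetical helper, not part of AutoCleanSS, and the hand-written `drop_duplicates`/`fillna` step stands in for a real cleaning pass so the example runs with pandas alone.

```python
import pandas as pd


def summarize_cleaning(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Compare row counts, missing cells, and duplicate rows of two frames.

    Hypothetical helper for illustration; not part of AutoCleanSS.
    """
    def stats(frame: pd.DataFrame) -> dict:
        return {
            "rows": len(frame),
            "missing_cells": int(frame.isna().sum().sum()),
            "duplicate_rows": int(frame.duplicated().sum()),
        }

    # One column per frame, one row per statistic
    return pd.DataFrame({"before": stats(before), "after": stats(after)})


raw = pd.DataFrame({"x": [1.0, 1.0, None, 4.0], "y": ["a", "a", "b", None]})
# Stand-in for a cleaning step: drop duplicate rows, then fill gaps by hand.
tidy = raw.drop_duplicates().fillna({"x": raw["x"].median(), "y": "a"})

summary = summarize_cleaning(raw, tidy)
print(summary)
```

With the Usage example above, the same helper could be called as `summarize_cleaning(df, cleaned_df)`.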
Raw data
{
"_id": null,
"home_page": null,
"name": "autocleanss",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "data cleaning, data preprocessing, machine learning, automation, pandas",
"author": null,
"author_email": "SUDIPTA BISWAS <sudiptabiswas394@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/1b/bc/e2fc5cd7643440e86f0ca68184b1a4d7e1723dbf9d784c359f8635418a31/autocleanss-0.1.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License",
"summary": "An advanced and automated data cleaning toolkit for Python.",
"version": "0.1.3",
"project_urls": {
"Bug Tracker": "https://github.com/sudipta9749/autocleanss/issues",
"Homepage": "https://github.com/sudipta9749/autocleanss"
},
"split_keywords": [
"data cleaning",
" data preprocessing",
" machine learning",
" automation",
" pandas"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "9ee23226034db5004c82089cca2d47ae0f35e489fcf0d2aa87bfe1cf8e2119d6",
"md5": "0e91110ade348aacba64599397c2be11",
"sha256": "0df64ec911bbfb78ecc5eeee0d8d401acb0cab2cd95d162581ec8b9cf9cc41be"
},
"downloads": -1,
"filename": "autocleanss-0.1.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0e91110ade348aacba64599397c2be11",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 9475,
"upload_time": "2025-08-31T11:28:47",
"upload_time_iso_8601": "2025-08-31T11:28:47.355250Z",
"url": "https://files.pythonhosted.org/packages/9e/e2/3226034db5004c82089cca2d47ae0f35e489fcf0d2aa87bfe1cf8e2119d6/autocleanss-0.1.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "1bbce2fc5cd7643440e86f0ca68184b1a4d7e1723dbf9d784c359f8635418a31",
"md5": "c0f4bee2ba4374d0af42a53d85b25155",
"sha256": "48c2eb9ee5cc506b7823401895ddf2ff7b09acdee22dd339ef650b94358a543b"
},
"downloads": -1,
"filename": "autocleanss-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "c0f4bee2ba4374d0af42a53d85b25155",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 9362,
"upload_time": "2025-08-31T11:28:48",
"upload_time_iso_8601": "2025-08-31T11:28:48.714906Z",
"url": "https://files.pythonhosted.org/packages/1b/bc/e2fc5cd7643440e86f0ca68184b1a4d7e1723dbf9d784c359f8635418a31/autocleanss-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-31 11:28:48",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "sudipta9749",
"github_project": "autocleanss",
"github_not_found": true,
"lcname": "autocleanss"
}