# 🧹 DataCleaner
A simple and flexible Python utility for **data cleaning and preprocessing**.
It helps you handle missing values, remove duplicates, treat outliers, reduce features, and encode categorical data, all in one place.
---
## ✨ Features
- **Missing Values Handling**
  - Drop, Mean, Median, Mode, KNN Imputation
- **Remove Duplicates**
- **Outlier Detection & Removal**
  - IQR, Z-score, Isolation Forest
- **Feature Reduction**
  - Low variance feature removal
  - High correlation feature removal
- **Encoding Categorical Data**
  - Label Encoding
  - One-Hot Encoding
  - Frequency Encoding
  - Auto Encoding (chooses based on cardinality)
- **Report Generation**: keeps track of all cleaning steps applied
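
To make "chooses based on cardinality" concrete, here is a minimal sketch of one way such a rule could work in plain pandas. It is not taken from the DataCleaner source: the function name `auto_encode`, the `max_onehot_levels` threshold, and the choice of frequency encoding as the fallback are all assumptions made for illustration.

```python
import pandas as pd


def auto_encode(df, max_onehot_levels=5):
    """Hypothetical helper: one-hot for few levels, frequency encoding otherwise."""
    out = df.copy()
    # Treat every non-numeric column as categorical (an assumption of this sketch).
    cat_cols = [c for c in out.columns if not pd.api.types.is_numeric_dtype(out[c])]
    for col in cat_cols:
        if out[col].nunique() <= max_onehot_levels:
            # Few distinct values: expand into 0/1 indicator columns.
            dummies = pd.get_dummies(out[col], prefix=col, dtype=int)
            out = pd.concat([out.drop(columns=col), dummies], axis=1)
        else:
            # Many distinct values: replace each value with its relative frequency.
            freq = out[col].value_counts(normalize=True)
            out[col] = out[col].map(freq)
    return out


cities = pd.DataFrame({"City": ["Lahore", "Karachi", "Karachi", "Lahore", "Islamabad"]})
encoded = auto_encode(cities)
print(encoded)
```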
---
## 📦 Installation
Clone the repo:
```bash
git clone https://github.com/ck-ahmad/datacleaner.git
```
Install dependencies:
```bash
pip install -r requirements.txt
```
## 🚀 Usage Example
```python
import pandas as pd
from cleaner import DataCleaner

# Sample dataset
df = pd.DataFrame({
    "Name": ["Ali", "Sara", "John", "Ali", None],
    "Age": [25, None, 30, 25, 22],
    "Salary": [50000, 60000, None, 50000, 45000],
    "City": ["Lahore", "Karachi", "Karachi", "Lahore", "Islamabad"]
})

# Initialize cleaner
cleaner = DataCleaner(df)

# Apply cleaning steps
cleaned_df = (
    cleaner
    .handle_missing(strategy="mean")              # fill missing values
    .remove_duplicates()                          # remove duplicate rows
    .handle_outliers(method="iqr")                # handle outliers
    .remove_low_variance(threshold=0.01)
    .remove_high_correlation(corr_threshold=0.9)
    .encode(method="auto")                        # encode categorical columns
    .get_data()
)

# Get report of cleaning steps
report = cleaner.get_report()

print("✅ Cleaned Data:")
print(cleaned_df)

print("\n📝 Cleaning Report:")
print(report)
```
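
For readers who want to see what the first three steps of the chain roughly correspond to, the sketch below reproduces mean imputation, duplicate removal, and IQR filtering with plain pandas on the same sample data. It is an approximation under stated assumptions, not DataCleaner's internals: the helper `drop_iqr_outliers` and the conventional `k=1.5` multiplier are hypothetical choices for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ali", "Sara", "John", "Ali", None],
    "Age": [25, None, 30, 25, 22],
    "Salary": [50000, 60000, None, 50000, 45000],
    "City": ["Lahore", "Karachi", "Karachi", "Lahore", "Islamabad"],
})

# Mean imputation: fill each numeric column with its own mean.
numeric_cols = df.select_dtypes(include="number").columns
filled = df.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

# Duplicate removal: keep the first occurrence of each identical row.
deduped = filled.drop_duplicates().reset_index(drop=True)


def drop_iqr_outliers(frame, column, k=1.5):
    """Hypothetical helper: keep rows within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]


kept = drop_iqr_outliers(deduped, "Age")
print(kept)
```

Categorical columns such as `Name` are left untouched here; how the package treats missing values in non-numeric columns is not specified beyond the sample report.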
## 📋 Example Output
### Cleaned Data
```text
   Age   Salary  City_Karachi  City_Lahore  City_Islamabad
0   25  50000.0             0            1               0
1   25  50000.0             1            0               0
2   22  45000.0             0            0               1
```
### Cleaning Report
```text
Step  Action
   1  Filled missing Age with mean
   2  Filled missing Salary with mean
   3  Dropped rows with missing in Name
   4  Removed 1 duplicate rows
   5  Removed outliers in Age using IQR
   6  Applied Auto Encoding
```
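
A report like the one above can be produced by having each chainable method append a description of what it did and then return the cleaner itself. The following is a minimal sketch of that pattern only. `MiniCleaner` and its `fill_mean` method are hypothetical names invented for this illustration and are not part of the DataCleaner API.

```python
import pandas as pd


class MiniCleaner:
    """Hypothetical stand-in showing method chaining with a step log."""

    def __init__(self, df):
        self._df = df.copy()
        self._steps = []  # one human-readable entry per applied step

    def fill_mean(self, column):
        # Replace missing values in one numeric column with that column's mean.
        self._df[column] = self._df[column].fillna(self._df[column].mean())
        self._steps.append(f"Filled missing {column} with mean")
        return self  # returning self is what makes chaining possible

    def remove_duplicates(self):
        before = len(self._df)
        self._df = self._df.drop_duplicates().reset_index(drop=True)
        self._steps.append(f"Removed {before - len(self._df)} duplicate rows")
        return self

    def get_data(self):
        return self._df

    def get_report(self):
        # Number the steps from 1 in the order they were applied.
        return pd.DataFrame({
            "Step": range(1, len(self._steps) + 1),
            "Action": self._steps,
        })


mini = MiniCleaner(pd.DataFrame({
    "Age": [25, None, 25],
    "City": ["Lahore", "Karachi", "Lahore"],
}))
result = mini.fill_mean("Age").remove_duplicates().get_data()
print(mini.get_report())
```

Keeping the log on the instance means the report stays available after `get_data()` ends the chain, which matches how `cleaner.get_report()` is called in the usage example.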
## 🛠️ Requirements
- pandas
- numpy
- scikit-learn

```bash
pip install pandas numpy scikit-learn
```
## 📖 License
This project is licensed under the MIT License. Feel free to use it; collaboration and contributions are welcome.
## 👨‍💻 Author
Ahmad Abdullah
- 🌐 LinkedIn: linkedin.com/in/ahmad0763
- 💻 GitHub: github.com/ck-ahmad
## Raw data
{
"_id": null,
"home_page": null,
"name": "datacleaner-ck",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "numpy, pandas, matplotlib, seaborn, machine-learning, data-science",
"author": null,
"author_email": "Ahmad Abdullah <ahmadleo498@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/40/7a/733a232f6ab3667d40e9411ecb8cf7bcb52611ceb28cb24e664553d8fcb3/datacleaner_ck-1.1.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A practice project combining NumPy, Pandas, Matplotlib, and Seaborn for learning and revision.",
"version": "1.1.1",
"project_urls": {
"Homepage": "https://github.com/ck-ahmad/DataCleaner",
"LinkedIn": "https://www.linkedin.com/in/ahmad0763"
},
"split_keywords": [
"numpy",
" pandas",
" matplotlib",
" seaborn",
" machine-learning",
" data-science"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "3c177c08104669d50bbc5279e97126ec4b06326f4c12657c1340abb6264fa225",
"md5": "3e415b23db06cc26491e9fd1371a29ee",
"sha256": "f64ab4ba74cc2f08af6ed9a2dbcfd873e7d0129469fa330d9e65ad71a7017623"
},
"downloads": -1,
"filename": "datacleaner_ck-1.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3e415b23db06cc26491e9fd1371a29ee",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 4446,
"upload_time": "2025-08-27T08:14:30",
"upload_time_iso_8601": "2025-08-27T08:14:30.906735Z",
"url": "https://files.pythonhosted.org/packages/3c/17/7c08104669d50bbc5279e97126ec4b06326f4c12657c1340abb6264fa225/datacleaner_ck-1.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "407a733a232f6ab3667d40e9411ecb8cf7bcb52611ceb28cb24e664553d8fcb3",
"md5": "04ea6afdb3256a5c2b7ae2ad07a8a557",
"sha256": "b7e5d348e52dcab4a549eada39eef3f579d6ebeb04c3c24ab887bb0f55e4e7ee"
},
"downloads": -1,
"filename": "datacleaner_ck-1.1.1.tar.gz",
"has_sig": false,
"md5_digest": "04ea6afdb3256a5c2b7ae2ad07a8a557",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 4065,
"upload_time": "2025-08-27T08:14:32",
"upload_time_iso_8601": "2025-08-27T08:14:32.672112Z",
"url": "https://files.pythonhosted.org/packages/40/7a/733a232f6ab3667d40e9411ecb8cf7bcb52611ceb28cb24e664553d8fcb3/datacleaner_ck-1.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-27 08:14:32",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ck-ahmad",
"github_project": "DataCleaner",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "datacleaner-ck"
}