datacleaner-ck

Name	datacleaner-ck
Version	1.1.1
home_page	None
Summary	A practice project combining NumPy, Pandas, Matplotlib, and Seaborn for learning and revision.
upload_time	2025-08-27 08:14:32
maintainer	None
docs_url	None
author	None
requires_python	>=3.8
license	None
keywords	numpy pandas matplotlib seaborn machine-learning data-science
requirements	No requirements were recorded.

# 🧹 DataCleaner

A simple and flexible Python utility for **data cleaning and preprocessing**.  
It helps you handle missing values, remove duplicates, treat outliers, reduce features, and encode categorical data — all in one place.  

---

## ✨ Features
- **Missing Values Handling**  
  - Drop, Mean, Median, Mode, KNN Imputation  
- **Remove Duplicates**  
- **Outlier Detection & Removal**  
  - IQR, Z-score, Isolation Forest  
- **Feature Reduction**  
  - Low variance feature removal  
  - High correlation feature removal  
- **Encoding Categorical Data**  
  - Label Encoding  
  - One-Hot Encoding  
  - Frequency Encoding  
  - Auto Encoding (chooses based on cardinality)  
- **Report Generation** – keeps track of all cleaning steps applied  
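
To make the missing-value strategies concrete, here is a minimal sketch of how they are commonly implemented with pandas and scikit-learn. It is not taken from the DataCleaner source: the helper name `impute_missing`, its `strategy` values, and `n_neighbors=2` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.impute import KNNImputer


def impute_missing(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Hypothetical helper: drop or fill missing values with one strategy."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns

    if strategy == "drop":
        # Remove every row that contains at least one missing value
        return out.dropna().reset_index(drop=True)
    if strategy == "mean":
        out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    elif strategy == "median":
        out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    elif strategy == "mode":
        # mode() may return several values on ties; use the first one
        for col in out.columns:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    elif strategy == "knn":
        # KNN imputation only applies to numeric columns
        imputer = KNNImputer(n_neighbors=2)
        out[numeric_cols] = imputer.fit_transform(out[numeric_cols])
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return out


demo = pd.DataFrame({
    "Age": [25, None, 30, 25, 22],
    "Salary": [50000, 60000, None, 50000, 45000],
})
filled = impute_missing(demo, strategy="mean")
print(filled)
```

Mean and median only make sense for numeric columns, which is why the sketch restricts them with `select_dtypes`; mode works for any column type.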

---

## 📦 Installation

Clone the repo:

```bash
git clone https://github.com/ck-ahmad/datacleaner.git
cd datacleaner
```

Install dependencies:

```bash
pip install -r requirements.txt
```

## 🚀 Usage Example

```python
import pandas as pd
from cleaner import DataCleaner

# Sample dataset
df = pd.DataFrame({
    "Name": ["Ali", "Sara", "John", "Ali", None],
    "Age": [25, None, 30, 25, 22],
    "Salary": [50000, 60000, None, 50000, 45000],
    "City": ["Lahore", "Karachi", "Karachi", "Lahore", "Islamabad"]
})

# Initialize cleaner
cleaner = DataCleaner(df)

# Apply cleaning steps
cleaned_df = (
    cleaner
    .handle_missing(strategy="mean")   # fill missing values
    .remove_duplicates()               # remove duplicate rows
    .handle_outliers(method="iqr")     # handle outliers
    .remove_low_variance(threshold=0.01)
    .remove_high_correlation(corr_threshold=0.9)
    .encode(method="auto")             # encode categorical columns
    .get_data()
)

# Get report of cleaning steps
report = cleaner.get_report()

print("✅ Cleaned Data:")
print(cleaned_df)

print("\n📝 Cleaning Report:")
print(report)
```
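
For readers unfamiliar with the IQR rule that `method="iqr"` refers to, the sketch below shows the conventional version of it in plain pandas. It is an illustration under assumptions, not DataCleaner's own code: the helper name `remove_outliers_iqr` and the multiplier `k=1.5` are hypothetical choices.

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: drop rows with a numeric value outside the IQR fences."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are explicitly
    # treated as "not an outlier" and left for the imputation step
    inside = ((numeric >= lower) & (numeric <= upper)) | numeric.isna()
    return df[inside.all(axis=1)].reset_index(drop=True)


ages = pd.DataFrame({"Age": [22, 25, 25, 27, 30, 95]})
filtered = remove_outliers_iqr(ages)
print(filtered)
```

Here the fences work out to 18.625 and 35.625, so only the row with `Age` 95 is dropped.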
## 📋 Example Output
### Cleaned Data

```text
   Age   Salary  City_Karachi  City_Lahore  City_Islamabad
0   25  50000.0             0            1               0
1   25  50000.0             1            0               0
2   22  45000.0             0            0               1
```
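
The `City_*` columns above are what one-hot encoding produces. As a rough sketch of what "auto" encoding based on cardinality could look like, the example below one-hot encodes columns with few distinct values and frequency-encodes the rest. The helper `auto_encode` and the `max_onehot` cutoff are assumptions for illustration; the actual rule DataCleaner uses is not documented here.

```python
import pandas as pd


def auto_encode(df: pd.DataFrame, max_onehot: int = 10) -> pd.DataFrame:
    """Hypothetical helper: pick an encoding per column from its cardinality."""
    out = df.copy()
    # Treat every non-numeric, non-boolean column as categorical
    cat_cols = out.select_dtypes(exclude=["number", "bool"]).columns
    for col in cat_cols:
        if out[col].nunique() <= max_onehot:
            # Low cardinality: one indicator column per category
            dummies = pd.get_dummies(out[col], prefix=col, dtype=int)
            out = pd.concat([out.drop(columns=col), dummies], axis=1)
        else:
            # High cardinality: replace each value with its relative frequency
            freq = out[col].value_counts(normalize=True)
            out[col] = out[col].map(freq)
    return out


cities = pd.DataFrame({
    "Age": [25, 28, 30, 25, 22],
    "City": ["Lahore", "Karachi", "Karachi", "Lahore", "Islamabad"],
})
encoded = auto_encode(cities)                      # 3 categories -> one-hot
freq_encoded = auto_encode(cities, max_onehot=2)   # forces frequency encoding
print(encoded)
print(freq_encoded)
```

Frequency encoding keeps the column count fixed, which is why it is a common fallback when one-hot encoding would create too many columns.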
### Cleaning Report

```text
 Step  Action
 1     Filled missing Age with mean
 2     Filled missing Salary with mean
 3     Dropped rows with missing in Name
 4     Removed 1 duplicate rows
 5     Removed outliers in Age using IQR
 6     Applied Auto Encoding
```
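
A report like this pairs naturally with method chaining: each step records what it did and returns the cleaner itself. The class below is a stripped-down sketch of that pattern, not the DataCleaner source. `MiniCleaner`, `fill_mean`, and the exact report wording are hypothetical and only mirror the example above.

```python
from typing import List

import pandas as pd


class MiniCleaner:
    """Hypothetical, minimal cleaner that logs every step it applies."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df.copy()          # never mutate the caller's DataFrame
        self._report: List[str] = []  # one entry per applied step

    def fill_mean(self, column: str) -> "MiniCleaner":
        self._df[column] = self._df[column].fillna(self._df[column].mean())
        self._report.append(f"Filled missing {column} with mean")
        return self  # returning self is what allows chaining

    def remove_duplicates(self) -> "MiniCleaner":
        before = len(self._df)
        self._df = self._df.drop_duplicates().reset_index(drop=True)
        self._report.append(f"Removed {before - len(self._df)} duplicate rows")
        return self

    def get_data(self) -> pd.DataFrame:
        return self._df

    def get_report(self) -> pd.DataFrame:
        steps = range(1, len(self._report) + 1)
        return pd.DataFrame({"Step": steps, "Action": self._report})


people = pd.DataFrame({"Name": ["Ali", "Sara", "Ali"], "Age": [25.0, None, 25.0]})
cleaner = MiniCleaner(people)
data = cleaner.fill_mean("Age").remove_duplicates().get_data()
report = cleaner.get_report()
print(report)
```

Because the report is stored on the instance, it stays available after `get_data()` ends the chain.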
## 🛠️ Requirements
- pandas
- numpy
- scikit-learn

```bash
pip install pandas numpy scikit-learn
```
## 📖 License
This project is licensed under the MIT License. Feel free to use it; contributions and collaboration are welcome.

## 👨‍💻 Author
Ahmad Abdullah
- 🌐 LinkedIn: linkedin.com/in/ahmad0763
- 💻 GitHub: github.com/ck-ahmad

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "datacleaner-ck",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "numpy, pandas, matplotlib, seaborn, machine-learning, data-science",
    "author": null,
    "author_email": "Ahmad Abdullah <ahmadleo498@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/40/7a/733a232f6ab3667d40e9411ecb8cf7bcb52611ceb28cb24e664553d8fcb3/datacleaner_ck-1.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A practice project combining NumPy, Pandas, Matplotlib, and Seaborn for learning and revision.",
    "version": "1.1.1",
    "project_urls": {
        "Homepage": "https://github.com/ck-ahmad/DataCleaner",
        "LinkedIn": "https://www.linkedin.com/in/ahmad0763"
    },
    "split_keywords": [
        "numpy",
        " pandas",
        " matplotlib",
        " seaborn",
        " machine-learning",
        " data-science"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3c177c08104669d50bbc5279e97126ec4b06326f4c12657c1340abb6264fa225",
                "md5": "3e415b23db06cc26491e9fd1371a29ee",
                "sha256": "f64ab4ba74cc2f08af6ed9a2dbcfd873e7d0129469fa330d9e65ad71a7017623"
            },
            "downloads": -1,
            "filename": "datacleaner_ck-1.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3e415b23db06cc26491e9fd1371a29ee",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 4446,
            "upload_time": "2025-08-27T08:14:30",
            "upload_time_iso_8601": "2025-08-27T08:14:30.906735Z",
            "url": "https://files.pythonhosted.org/packages/3c/17/7c08104669d50bbc5279e97126ec4b06326f4c12657c1340abb6264fa225/datacleaner_ck-1.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "407a733a232f6ab3667d40e9411ecb8cf7bcb52611ceb28cb24e664553d8fcb3",
                "md5": "04ea6afdb3256a5c2b7ae2ad07a8a557",
                "sha256": "b7e5d348e52dcab4a549eada39eef3f579d6ebeb04c3c24ab887bb0f55e4e7ee"
            },
            "downloads": -1,
            "filename": "datacleaner_ck-1.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "04ea6afdb3256a5c2b7ae2ad07a8a557",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 4065,
            "upload_time": "2025-08-27T08:14:32",
            "upload_time_iso_8601": "2025-08-27T08:14:32.672112Z",
            "url": "https://files.pythonhosted.org/packages/40/7a/733a232f6ab3667d40e9411ecb8cf7bcb52611ceb28cb24e664553d8fcb3/datacleaner_ck-1.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-27 08:14:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ck-ahmad",
    "github_project": "DataCleaner",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "datacleaner-ck"
}
