# Preprocessing Pipeline
## Overview
The **Preprocessing Pipeline** is a comprehensive Python-based tool designed to facilitate the preprocessing of data by performing initial data inspection, handling missing values, converting data types, managing outliers, scaling data, and transforming variables. This tool is modular, customizable, and suited for various data cleaning and preprocessing tasks essential for machine learning and data analysis.
## Table of Contents
- [Introduction](#introduction)
- [Problem Statement](#problem-statement)
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Initial Data Inspection](#initial-data-inspection)
- [Handling Missing Values](#handling-missing-values)
- [Data Type Conversion](#data-type-conversion)
- [Outlier Handling](#outlier-handling)
- [Scaling Data](#scaling-data)
- [Variable Transformation](#variable-transformation)
- [Classes and Methods](#classes-and-methods)
- [Report Generation](#report-generation)
- [Contributing](#contributing)
- [License](#license)
## Introduction
Data preprocessing is a crucial step in any data analysis or machine learning pipeline. This preprocessing tool provides a systematic approach to cleaning and preparing data by offering modules for inspecting data, handling missing values, managing outliers, scaling numerical features, and transforming variables to improve data quality and model performance.
## Problem Statement
Handling raw data can be challenging due to missing values, inconsistent data types, outliers, and other issues that can degrade the performance of predictive models. This library aims to streamline the preprocessing workflow, making it easier to clean and prepare data for further analysis.
## Features
- **Initial Inspection**: Provides insights into the dataset, including shape, size, summary statistics, and detailed analysis of missing and duplicated values.
- **Data Type Conversion**: Converts columns between object, string, integer, float, and datetime types, ensuring data consistency.
- **Missing Value Handling**: Offers multiple strategies to fill or remove missing values, including mean, median, mode, bfill, linear, and polynomial interpolation.
- **Outlier Handling**: Detects and caps outliers using IQR and Z-score methods.
- **Scaling**: Standardizes numerical data using Standard Scaler, Robust Scaler, and Normalizer techniques.
- **Variable Transformation**: Provides transformations like binning, log transformation, square root transformation, label encoding, and one-hot encoding.
## Installation
Clone this repository to your local machine and ensure you have Python installed along with the required dependencies:
```bash
git clone https://github.com/KaRtHiK-56/EDA_python_package_library
cd EDA_python_package_library
pip install -r requirements.txt
```
## Usage
### Initial Data Inspection
The inspection methods provide a detailed overview of your dataset to identify data quality issues upfront.
Refer to `PackageTest.ipynb` in the [repository](https://github.com/KaRtHiK-56/EDA_python_package_library) for a complete sample run.
```python
!pip install eda-python-library==0.0.1.6
# Import the eda library
from eda.eda import Inspection
Inspection.inspect()
# This will prompt you for the path of the CSV file and then generate the initial inspection report.
```
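For reference, the same kinds of checks can be reproduced with plain pandas. The sketch below is illustrative (it uses a toy DataFrame rather than the CSV path the `inspect()` prompt would ask for), not the library's internal implementation:

```python
import pandas as pd

# Toy data standing in for a loaded CSV
df = pd.DataFrame({
    "age": [25, 30, None, 30],
    "city": ["NY", "LA", "NY", "NY"],
})

shape = df.shape                      # (rows, columns)
missing = df.isna().sum()             # missing values per column
duplicates = int(df.duplicated().sum())  # count of fully duplicated rows
numeric_cols = df.select_dtypes("number").columns.tolist()
summary = df.describe()               # summary statistics for numeric columns
```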
### Handling Missing Values
Handle missing values using different strategies like mean, median, mode, bfill, etc.
```python
# Import and use the MissingValueHandler class
from eda.eda import MissingValueHandler
MissingValueHandler.mean('col1')
MissingValueHandler.b_fill('col3')
```
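To make the strategies concrete, here is what mean imputation, backward fill, linear interpolation, and dropping look like in plain pandas. This is a hedged sketch of the underlying techniques, not the library's own code:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])

mean_filled = s.fillna(s.mean())        # mean imputation
bfilled = s.bfill()                     # backward fill
interpolated = s.interpolate("linear")  # linear interpolation
dropped = s.dropna()                    # drop missing entries
```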
### Data Type Conversion
Convert data types using the `DataTypeConverter` class, which supports conversions between object, string, integer, float, and datetime.
```python
# Import and use the DataTypeConverter class
from eda.eda import DataTypeConverter
converter = DataTypeConverter(df)
converter.to_string('col2')
converter.to_datetime('col3')
```
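The equivalent conversions in plain pandas (a sketch of the technique, not the library's internals) use `astype` and `pd.to_datetime`:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "when": ["2024-01-01", "2024-02-15", "2024-03-30"],
})

df["id"] = df["id"].astype(int)          # object -> integer
df["when"] = pd.to_datetime(df["when"])  # object -> datetime64
```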
### Outlier Handling
Handle outliers using IQR and Z-score methods.
```python
# Import and use the OutlierHandler class
from eda.eda import OutlierHandler
OutlierHandler.iqr_capping('col1')
```
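IQR capping clips values beyond 1.5 × IQR from the quartiles. The sketch below shows the standard technique in plain pandas; it is illustrative and not necessarily how `iqr_capping()` is implemented internally:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = s.clip(lower, upper)  # values outside the fences are capped
```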
### Scaling Data
Scale numerical data using Standard Scaler, Robust Scaler, or Normalizer.
```python
# Import and use the NumericalScaler class
from eda.eda import NumericalScaler
NumericalScaler.standardscaler('col1')
```
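For intuition, standard scaling centers on the mean and divides by the standard deviation, while robust scaling centers on the median and divides by the IQR. A minimal sketch of both formulas (illustrative, not the library's implementation):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

# Standard scaling: subtract the mean, divide by the (population) std
standardized = (s - s.mean()) / s.std(ddof=0)

# Robust scaling: subtract the median, divide by the IQR
iqr = s.quantile(0.75) - s.quantile(0.25)
robust = (s - s.median()) / iqr
```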
### Variable Transformation
Transform variables using binning, log transformation, and encoding methods.
```python
# Import and use the VariableTransformation class
from eda.eda import VariableTransformation
VariableTransformation.binner('col1', bins=[0, 1, 2, 3, 4])
VariableTransformation.label_encoding('col2')
```
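The transformations above map onto well-known pandas/NumPy operations. The sketch below shows log transformation, binning, label encoding, and one-hot encoding on toy data; it illustrates the techniques rather than the library's exact behavior:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [1, 10, 100, 1000], "grade": ["b", "a", "c", "a"]})

df["log_score"] = np.log(df["score"])                # log transformation
df["bin"] = pd.cut(df["score"], bins=[0, 10, 100, 1000],
                   labels=["low", "mid", "high"])    # binning
df["grade_code"] = df["grade"].astype("category").cat.codes  # label encoding
onehot = pd.get_dummies(df["grade"], prefix="grade")         # one-hot encoding
```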
## Classes and Methods
### 1. `Inspection`
- **Methods**:
- `inspect()`: Generates a comprehensive inspection report on dataset shape, size, dimensions, summary, missing values, duplicates, numerical and categorical columns, skewness, and kurtosis.
### 2. `MissingValueHandler`
- **Methods**:
- `mean()`, `median()`, `mode()`, `b_fill()`, `f_fill()`, `linear()`, `polynomial()`, `drop()`: Various techniques to handle missing values in the dataset.
### 3. `DataTypeConverter`
- **Methods**:
- `to_string()`, `to_int()`, `to_float()`, `to_datetime()`: Convert data types of specific columns.
### 4. `OutlierHandler`
- **Methods**:
- `iqr_capping()`, `zscore_capping()`: Detect and handle outliers using IQR and Z-score methods.
### 5. `NumericalScaler`
- **Methods**:
- `standardscaler()`, `robustscaler()`: Scale numerical data using different scaling techniques.
### 6. `VariableTransformation`
- **Methods**:
- `binning()`, `log_transformer()`, `sqrt_transformer()`, `label_encoding()`, `one_hot_encoding()`: Various transformations for numerical and categorical data.
## Report Generation
The `Inspection` class provides a detailed report summarizing the data, including counts and percentages of missing and duplicated values, column types, and descriptive statistics.
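A comparable summary can be assembled directly with pandas; the sketch below computes the same kinds of figures (missing counts and percentages, duplicates, column types) and is illustrative only, not the class's internal code:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, None, 2], "b": ["x", "y", "y", "y"]})

report = {
    "shape": df.shape,
    "missing_count": df.isna().sum().to_dict(),
    "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),
    "duplicate_rows": int(df.duplicated().sum()),
    "dtypes": df.dtypes.astype(str).to_dict(),
}
```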
## Contributing
Contributions are welcome! Please read the [CONTRIBUTING.md](./CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.
## License
This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.