eda-python-library


- **Name**: eda-python-library
- **Version**: 0.0.1.8
- **Home page**: https://github.com/KaRtHiK-56/EDA_python_package_library
- **Summary**: A library for making the exploratory data analysis process easy in a single line of code
- **Upload time**: 2024-09-08 17:00:44
- **Maintainer**: None
- **Docs URL**: None
- **Author**: Karthik
- **Requires Python**: None
- **License**: MIT
- **Keywords**: eda
- **Requirements**: No requirements were recorded.
- **Travis CI**: No Travis.
- **Coveralls test coverage**: No coveralls.

# Preprocessing Pipeline

## Overview

The **Preprocessing Pipeline** is a Python tool for preparing data: it inspects a dataset, handles missing values, converts data types, caps outliers, scales numerical columns, and transforms variables. It is modular and customizable, and it covers the cleaning and preprocessing steps that typically precede machine learning and data analysis.

## Table of Contents

- [Introduction](#introduction)
- [Problem Statement](#problem-statement)
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
  - [Initial Data Inspection](#initial-data-inspection)
  - [Handling Missing Values](#handling-missing-values)
  - [Data Type Conversion](#data-type-conversion)
  - [Outlier Handling](#outlier-handling)
  - [Scaling Data](#scaling-data)
  - [Variable Transformation](#variable-transformation)
- [Classes and Methods](#classes-and-methods)
- [Report Generation](#report-generation)
- [Contributing](#contributing)
- [License](#license)

## Introduction

Data preprocessing is a crucial step in any data analysis or machine learning pipeline. This preprocessing tool provides a systematic approach to cleaning and preparing data by offering modules for inspecting data, handling missing values, managing outliers, scaling numerical features, and transforming variables to improve data quality and model performance.

## Problem Statement

Raw data is hard to work with: missing values, inconsistent data types, outliers, and similar issues can degrade the performance of predictive models. This library aims to streamline the preprocessing workflow, making it easier to clean and prepare data for further analysis.

## Features

- **Initial Inspection**: Provides insights into the dataset, including shape, size, summary statistics, and detailed analysis of missing and duplicated values.
- **Data Type Conversion**: Converts object columns to string, integer, float, or datetime types to keep data consistent.
- **Missing Value Handling**: Offers multiple strategies to fill or remove missing values, including mean, median, mode, backward fill, forward fill, and linear and polynomial interpolation.
- **Outlier Handling**: Detects and caps outliers using IQR and Z-score methods.
- **Scaling**: Scales numerical data using Standard Scaler, Robust Scaler, and Normalizer techniques.
- **Variable Transformation**: Provides transformations like binning, log transformation, square root transformation, label encoding, and one-hot encoding.

## Installation

Clone this repository to your local machine, change into the project directory, and install the required dependencies (Python must already be installed):

```bash
git clone https://github.com/KaRtHiK-56/EDA_python_package_library
cd EDA_python_package_library
pip install -r requirements.txt
```

## Usage

### Initial Data Inspection

The inspection methods provide a detailed overview of your dataset to identify data quality issues upfront.

For a complete sample, refer to `PackageTest.ipynb` in the repository:
https://github.com/KaRtHiK-56/EDA_python_package_library


```python
# In a Jupyter notebook cell, the leading "!" runs a shell command
!pip install eda-python-library==0.0.1.6

# Import the Inspection class from the eda library
from eda.eda import Inspection

# Prompts for the path to a CSV file, then generates the initial inspection report
Inspection.inspect()
```
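To make the kind of checks in an inspection report concrete, here is a minimal standard-library sketch that counts rows, columns, missing cells, and duplicated rows. It is not the library's implementation: `inspect_rows` and `SAMPLE_CSV` are hypothetical names, and treating an empty cell as "missing" is an assumption of this example.

```python
import csv
import io

# Hypothetical sample data standing in for a CSV file on disk.
SAMPLE_CSV = """col1,col2,col3
1,a,2024-01-01
2,,2024-01-02
2,,2024-01-02
,b,2024-01-04
"""


def inspect_rows(rows):
    """Summarize shape, missing cells per column, and duplicated rows."""
    columns = list(rows[0].keys()) if rows else []
    # Assumption: an empty string marks a missing value.
    missing = {c: sum(1 for r in rows if r[c] == "") for c in columns}
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(r[c] for c in columns)
        if key in seen:
            duplicates += 1  # every repeat after the first occurrence
        seen.add(key)
    return {
        "shape": (len(rows), len(columns)),
        "missing": missing,
        "duplicates": duplicates,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = inspect_rows(rows)
print(report)
```

Running a check like this before any cleaning step shows which columns need a missing-value strategy and whether duplicates should be dropped first.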

### Handling Missing Values

Fill or drop missing values with a strategy such as mean, median, mode, or backward fill (`b_fill`).

```python
# Import and use the MissingValueHandler class
from eda.eda import MissingValueHandler

MissingValueHandler.mean('col1')
MissingValueHandler.b_fill('col3')
```
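The strategies differ in what they put into the gaps. The sketch below shows mean, median, and backward fill on a plain list, with `None` marking a missing value. The helper names `fill_with` and `backward_fill` are hypothetical and only illustrate the techniques; they are not the package's code.

```python
from statistics import mean, median


def fill_with(values, strategy):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


def backward_fill(values):
    """Replace each None with the next observed value; trailing gaps stay None."""
    filled, next_value = [], None
    for v in reversed(values):
        if v is not None:
            next_value = v
        filled.append(next_value)
    return filled[::-1]


col = [1.0, None, 3.0, None, 8.0]
print(fill_with(col, "mean"))    # [1.0, 4.0, 3.0, 4.0, 8.0]
print(fill_with(col, "median"))  # [1.0, 3.0, 3.0, 3.0, 8.0]
print(backward_fill(col))        # [1.0, 3.0, 3.0, 8.0, 8.0]
```

The mean is pulled toward the large value 8.0, while the median is not, which is why the median is often preferred for skewed columns.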

### Data Type Conversion

Convert data types using the `DataTypeConverter` class, which supports conversions between object, string, integer, float, and datetime.

```python
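# Import and use the DataTypeConverter class
import pandas as pd
from eda.eda import DataTypeConverter

# Assumption: df is a pandas DataFrame; "data.csv" is a placeholder path
df = pd.read_csv("data.csv")

converter = DataTypeConverter(df)
converter.to_string('col2')
converter.to_datetime('col3')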
```
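Conceptually, a type conversion applies one parsing function to every value in a column. The following sketch does that on a plain list. `convert_column`, its `target` names, and the default `date_format` are assumptions made for this example and do not describe the `DataTypeConverter` API.

```python
from datetime import datetime


def convert_column(values, target, date_format="%Y-%m-%d"):
    """Convert every non-missing value in a column to the target type."""
    converters = {
        "string": str,
        "int": int,
        "float": float,
        "datetime": lambda v: datetime.strptime(v, date_format),
    }
    convert = converters[target]  # KeyError for an unsupported target
    return [None if v is None else convert(v) for v in values]


print(convert_column(["1", "2", None], "int"))      # [1, 2, None]
print(convert_column(["3.5", "4"], "float"))        # [3.5, 4.0]
print(convert_column(["2024-09-08"], "datetime"))
```

A value that cannot be parsed raises `ValueError`, which surfaces inconsistent entries early instead of silently carrying them forward.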

### Outlier Handling

Handle outliers using IQR and Z-score methods.

```python
# Import and use the OutlierHandler class
from eda.eda import OutlierHandler

OutlierHandler.iqr_capping('col1')
```
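Capping replaces values outside a lower and upper bound with the bound itself. As a rough illustration of the two methods named above, the sketch below derives the bounds from the IQR and from the Z-score. The function names, the multiplier `k=1.5`, and the default `threshold=3.0` are conventional choices assumed here, not values confirmed by the library.

```python
from statistics import mean, pstdev, quantiles


def iqr_cap(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


def zscore_cap(values, threshold=3.0):
    """Clip values to mean +/- threshold * population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    lower, upper = mu - threshold * sigma, mu + threshold * sigma
    return [min(max(v, lower), upper) for v in values]


data = [1, 2, 3, 4, 100]
print(iqr_cap(data))  # [1, 2, 3, 4, 7.0]
print(zscore_cap(data, threshold=1.5))
```

On small samples a single extreme value inflates the standard deviation so much that a threshold of 3 may cap nothing, which is why the usage line passes a lower threshold. The IQR bounds are less sensitive to that effect.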

### Scaling Data

Scale numerical data using Standard Scaler, Robust Scaler, or Normalizer.

```python
# Import and use the NumericalScaler class
from eda.eda import NumericalScaler

NumericalScaler.standardscaler('col1')
```
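The arithmetic behind the two most common scalers is short enough to write out. This sketch assumes the usual definitions: standard scaling as `(x - mean) / std` with the population standard deviation, and robust scaling as `(x - median) / IQR`. `standard_scale` and `robust_scale` are hypothetical helpers, and whether the package computes these itself or delegates to another library is not stated in this README.

```python
from statistics import mean, median, pstdev, quantiles


def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]


data = [1, 2, 3, 4, 5]
print(standard_scale(data))
print(robust_scale(data))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

A constant column makes both denominators zero; the sketch leaves that case unhandled. Robust scaling is the safer choice when outliers have not been capped, because the median and IQR barely move when one value is extreme.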

### Variable Transformation

Transform variables using binning, log transformation, and encoding methods.

```python
# Import and use the VariableTransformation class
from eda.eda import VariableTransformation

VariableTransformation.binner('col1', bins=[0, 1, 2, 3, 4])
VariableTransformation.label_encoding('col2')
```
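The transformations listed above can be sketched in a few lines each. The helpers below are hypothetical and make some assumptions that the README does not confirm: bins are treated as right-closed intervals `(a, b]`, the log transform uses `log(1 + x)` so that zeros are allowed, and categories are numbered in sorted order.

```python
import math
from bisect import bisect_left


def bin_values(values, bins):
    """Return the index of the interval (bins[i], bins[i+1]] holding each value."""
    result = []
    for v in values:
        idx = bisect_left(bins, v) - 1
        # Values outside the outermost edges get no bin.
        result.append(idx if 0 <= idx < len(bins) - 1 else None)
    return result


def log_transform(values):
    """Apply log(1 + x), which compresses large values and accepts zero."""
    return [math.log1p(v) for v in values]


def label_encode(values):
    """Map each category to its position in the sorted list of categories."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]


def one_hot_encode(values):
    """Return one 0/1 indicator per sorted category for each value."""
    classes = sorted(set(values))
    return [[1 if v == c else 0 for c in classes] for v in values]


print(bin_values([0.5, 1, 2.5, 4, 5], bins=[0, 1, 2, 3, 4]))  # [0, 0, 2, 3, None]
print(label_encode(["b", "a", "b", "c"]))                     # [1, 0, 1, 2]
print(one_hot_encode(["b", "a"]))                             # [[0, 1], [1, 0]]
```

Label encoding implies an order between categories, so one-hot encoding is usually the better fit for nominal data fed to linear models.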

## Classes and Methods

### 1. `Inspection`
- **Methods**: 
  - `inspect()`: Generates a comprehensive inspection report on dataset shape, size, dimensions, summary, missing values, duplicates, numerical and categorical columns, skewness, and kurtosis.
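
For readers unfamiliar with the last two statistics, the sketch below computes skewness and excess kurtosis from central moments. It uses the simple population formulas. Libraries such as pandas apply bias corrections, so their numbers differ slightly on small samples, and which variant `inspect()` reports is not stated here. The function names are hypothetical.

```python
from statistics import mean


def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    mu = mean(values)
    return sum((v - mu) ** order for v in values) / len(values)


def skewness(values):
    """Third central moment divided by the variance to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3 (the normal value)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 4, 100]
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed))
```

A skewness near zero indicates a roughly symmetric column, while a large positive value flags a long right tail that a log or square root transformation may reduce.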

### 2. `MissingValueHandler`
- **Methods**:
  - `mean()`, `median()`, `mode()`, `b_fill()`, `f_fill()`, `linear()`, `polynomial()`, `drop()`: Various techniques to handle missing values in the dataset.
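
Of these techniques, linear interpolation is the least obvious, so here is a minimal sketch of the idea: each gap is filled along a straight line between its nearest observed neighbors. `linear_interpolate` is a hypothetical helper, and leaving leading and trailing gaps untouched is a choice made for this example rather than documented behavior of `linear()`.

```python
def linear_interpolate(values):
    """Fill interior None gaps on a straight line between observed neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


print(linear_interpolate([1.0, None, None, 4.0, None]))
# [1.0, 2.0, 3.0, 4.0, None]
```

This suits ordered data such as time series, where neighboring rows are related. For unordered rows, a mean or median fill is usually more defensible.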

### 3. `DataTypeConverter`
- **Methods**:
  - `to_string()`, `to_int()`, `to_float()`, `to_datetime()`: Convert data types of specific columns.

### 4. `OutlierHandler`
- **Methods**:
  - `iqr_capping()`, `zscore_capping()`: Detect and handle outliers using IQR and Z-score methods.

### 5. `NumericalScaler`
- **Methods**:
  - `standardscaler()`, `robustscaler()`: Scale numerical data using different scaling techniques.

### 6. `VariableTransformation`
- **Methods**:
  - `binner()`, `log_transformer()`, `sqrt_transformer()`, `label_encoding()`, `one_hot_encoding()`: Various transformations for numerical and categorical data.

## Report Generation

The `Inspection` class provides a detailed report summarizing the data, including counts and percentages of missing and duplicated values, column types, and descriptive statistics.

## Contributing

Contributions are welcome! Please read the [CONTRIBUTING.md](./CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.

## License

This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.

            

        