dataglass

Name: dataglass
Version: 0.8.1
Summary: dataglass is a Python library for data preprocessing, exploratory data analysis (EDA), and machine learning. It includes modules for handling missing values, detecting and resolving duplicates, managing outliers, feature encoding, type conversion, scaling, and pipeline integration. With its latest update, dataglass introduces intelligent automation that dynamically adapts preprocessing steps to dataset characteristics, minimizing manual configuration and accelerating your workflow.
Upload time: 2025-09-07 07:54:05
Author: Saman Teymouri
Requires-Python: >=3.10
License: BSD-3-Clause
Keywords: data preprocessing, EDA, machine learning, data cleaning, feature engineering, pipeline, pandas, scikit-learn
Requirements: pandas, numpy, pytest, rapidfuzz, scikit-learn, matplotlib, seaborn, category_encoders, build, setuptools, twine
# 🔮 dataglass

**A modular and lightweight library for preprocessing, analysis, and modeling structured datasets in Python.**

`dataglass` provides an easy-to-use yet powerful framework for essential preprocessing tasks such as missing value handling, duplicate removal, outlier detection and management, feature encoding, type conversion, and feature scaling, all designed to integrate with custom pipeline workflows. It also introduces intelligent automation that dynamically adapts preprocessing steps to dataset characteristics, minimizing manual configuration and accelerating your workflow.

---

## 🤖 Auto-Preprocessing (New!)

`dataglass` now features an intelligent auto-preprocessing module that dynamically constructs an appropriate pipeline based on your dataset's characteristics, so no manual configuration is required.

Just call a single function:

```python
import dataglass as dg

# df is any pandas DataFrame you have already loaded
df_cleaned = dg.auto_preprocess_for_analysis(
    data = df,
    verbose = True      # Show decisions and intermediate steps in a log file
)
```

---

## 🚀 Preprocessing Features

**โ“ Missing Value Handling**  
  Drop rows, imputation by datatype (mean, median, mode), imputation by adjacent values (forward/backward fill), and interpolation (linear, time-based)  

**๐Ÿ“‘ Duplicate Detection & Removal**  
  Detect and remove exact and fuzzy duplicates using full and partial similarity checks  

**โ— Outlier Detection & Handling**  
  Detect outliers using IQR, Z-Score, Isolation Forest, and Local Outlier Factor (LOF)  
  Handle them by dropping, replacing with median, or capping with boundaries  
  Includes visualization tools: before vs. after boxplots and histograms  

**๐Ÿ”ข Feature Encoding**  
  Supports label encoding, one-hot encoding, and hashing for categorical variables  

**๐Ÿ” Type Conversion**  
  Automatic datatype inference and user-defined type conversion support  

**๐Ÿ“ Feature Scaling**  
  Includes Min-Max scaling, Z-Score (standard) scaling, robust scaling, and L2 normalization  

**๐Ÿงฉ Pipeline Compatibility**  
  Custom lightweight pipeline interface for chaining reusable preprocessing steps  

**๐Ÿ’พ Non-destructive Processing**  
  All operations are applied on copies, and original data remains unchanged  
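
As a quick illustration of the non-destructive guarantee, the snippet below (a sketch; it assumes `handle_missing_values_drop` accepts `data` like the other module functions shown later) verifies that the input dataframe keeps its rows:

```python
import dataglass as dg
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [40, np.nan, 50]})
df_cleaned = dg.handle_missing_values_drop(data=df)

print(len(df), len(df_cleaned))  # 3 2: the original dataframe is untouched
```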
  
---

## 📦 Installation

```bash
pip install dataglass
```

---

## 📘 Usage Examples (Pipeline vs Functional)
There are two approaches to using the library features: the **pipeline architecture** and **standalone function** usage. The examples below demonstrate both methods.

<br>

### 🧩 Pipeline Architecture (Simplest Configuration)
Use this approach when you want a clean, modular, and reusable workflow for **end-to-end preprocessing**.

```python
# Importing the library and dependencies
import dataglass as dg
import pandas as pd
import numpy as np

# Creating a sample dataframe with a missing value and a categorical column
df = pd.DataFrame({
    "name": ["John", "Jane", "Jack"],
    "age": [40, np.nan, 50],
    "gender": ["male", "female", "male"]
})

# Step 1: Handle missing values by dropping rows that contain any missing value
handle_missing = dg.HandleMissingStep(dg.HandleMissingMethod.DROP)

# Step 2: Handle duplicates by removing exact duplicate rows
handle_duplicate = dg.HandleDuplicateStep(dg.HandleDuplicateMethod.EXACT)

# Step 3: Automatically detect and convert datatypes; verbose=True prints conversion logs
type_conversion = dg.TypeConversionStep(dg.ConvertDatatypeMethod.AUTO, verbose=True)

# Step 4: Detect outliers using IQR and remove them
handle_outlier = dg.HandleOutlierStep(dg.DetectOutlierMethod.IQR, dg.HandleOutlierMethod.DROP)

# Step 5: Scale the 'age' column using Min-Max scaling
scale_feature = dg.ScaleFeatureStep({"column": ["age"], "scaling_method": ["MINMAX_SCALING"]})

# Step 6: Encode the 'gender' column using label encoding
encode_feature = dg.EncodeFeatureStep(dg.FeatureEncodingMethod.LABEL_ENCODING, ["gender"])

# Create the pipeline by chaining all the preprocessing steps in the desired order
dp = dg.DataPipeline([
    handle_missing,
    handle_duplicate,
    type_conversion,
    handle_outlier,
    scale_feature,
    encode_feature,
])

# Apply the pipeline to the dataframe
df_cleaned = dp.apply(df)

# Display the cleaned and transformed dataframe
print(f"Preprocessed Data:\n{df_cleaned}")

# =========== Expected Terminal Output =============

# Before automatic datatype conversion, the datatype are as follows:
# name       object
# age       float64
# gender     object
# dtype: object

# After automatic datatype conversion, the datatype are as follows:
# name      object
# age        int64
# gender    object
# dtype: object

# Preprocessed Data:
#    name  age gender  gender_encoded
# 0  John  0.0   male               0
# 2  Jack  1.0   male               0
```

<br>

### ⚙️ Standalone Function Usage
Use this approach when you need fine-grained control or quick one-off transformations on specific parts of your data.

#### โ“ Missing Handling Module
This module provides multiple strategies to handle missing data through these functions:

- ***handle_missing_values_drop***: Drop-based strategy
    - `Eliminate` all rows that contain any NaN value.

- ***handle_missing_values_datatype_imputation***: Data type–aware imputation
    - Fill missing *numeric* values using the specified strategy: `mean`, `median`, or `mode`.
    - Fill missing *categorical* values with the first `mode` of each column.

- ***handle_missing_values_adjacent_value_imputation***: Value propagation or interpolation
    - `Forward fill (ffill)`
    - `Backward fill (bfill)`
    - `Linear interpolation`
    - `Time-based interpolation` (if datetime index is present)

```python
import dataglass as dg
import pandas as pd
import numpy as np

# Creating a sample dataframe with a missing value
df = pd.DataFrame({
    "name": ["John", "Jane", "Jack"],
    "age": [40, np.nan, 50],
    "gender": ["male", "female", np.nan]
})

# Impute numeric columns using the mean, and categorical columns using the first mode of each column
df_cleaned = dg.handle_missing_values_datatype_imputation(
    data = df,
    numeric_datatype_imputation_method = dg.NumericDatatypeImputationMethod.MEAN,
    verbose = True
)

print(f"Preprocessed Data:\n{df_cleaned}")

# =========== Expected Terminal Output =============

# Dataset has 3 rows before handling missing values.

# Missing values are:
# name      0
# age       1
# gender    1
# dtype: int64

# Dataset has 3 rows after handling missing values.

# Preprocessed Data:
#    name   age  gender
# 0  John  40.0    male
# 1  Jane  45.0  female
# 2  Jack  50.0  female
```
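
The adjacent-value strategies listed above correspond to standard pandas operations. For reference, here is a pandas-only sketch of what forward fill, backward fill, and linear interpolation produce on the same `age` column; it illustrates the underlying behavior, not the dataglass API:

```python
import pandas as pd
import numpy as np

age = pd.Series([40, np.nan, 50], name="age")

print(age.ffill())                        # forward fill: 40.0, 40.0, 50.0
print(age.bfill())                        # backward fill: 40.0, 50.0, 50.0
print(age.interpolate(method="linear"))   # linear interpolation: 40.0, 45.0, 50.0
```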
<br>

#### 📑 Duplicate Handling Module
This module provides two strategies to handle duplicate data through these functions:

- ***handle_duplicate_values_exact***: Remove `exact duplicate` rows
    - Optionally, a specific set of columns can be provided for duplicate analysis via `columns_subset`

- ***handle_duplicate_values_fuzzy***: Remove `approximate (fuzzy) duplicates` based on string similarity
    - Define the `similarity threshold` (e.g., 70–90%)
    - Limit the comparison to specific columns via `columns_subset`


```python
import dataglass as dg
import pandas as pd
import numpy as np

# Creating a sample dataframe with similar name values
df = pd.DataFrame({
    "name": ["John", "Johney", "Jack"],
    "age": [40, 45, 50],
})

# Only "name" column will be used to detect fuzzy duplicates
columns_subset = ["name"]

# Remove rows that are 70% or more similar in the "name" column (It keeps the first occurrence of each similarity group)
df_cleaned = dg.handle_duplicate_values_fuzzy(
    data = df, 
    columns_subset = columns_subset, 
    similarity_thresholds = (70,100), 
    verbose = True)

print(f"Preprocessed Data:\n{df_cleaned}")

# =========== Expected Terminal Output =============

# Dataset has 3 rows before handling duplicate values.

# Top 10 of duplicate values are (Totally 2 rows - including all duplicates, but from each group first one will remain and others will be removed):
#      name  age
# 0    John   40
# 1  Johney   45

# Dataset has 2 rows after handling duplicate values.

# Preprocessed Data:
#    name  age
# 0  John   40
# 2  Jack   50
```
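
`rapidfuzz` is among the declared requirements, so a similarity ratio of this kind presumably backs the fuzzy check. A minimal sketch of the scoring idea (not the dataglass internals):

```python
from rapidfuzz import fuzz

# Pairwise similarity scores range from 0 to 100
print(fuzz.ratio("John", "Johney"))  # 80.0, above the 70 threshold, so a fuzzy duplicate
print(fuzz.ratio("John", "Jack"))    # 25.0, well below the threshold, so kept as distinct
```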
<br>

#### โ— Outlier Handling Module
This module separates the detection and handling of outliers, giving you flexibility and control.

- ***detect_outliers***: Detects outliers using various statistical or model-based techniques:
    - `IQR`, `ZSCORE`, `ISOLATION_FOREST`, `LOCAL_OUTLIER_FACTOR`
    - An optional list of columns can be specified; otherwise, all numeric columns are used
    - Customization options like `contamination_rate` and `n_neighbors` available for model-based methods

- ***handle_outliers***: Applies the selected strategy to the detected outliers
    - `DROP`: Remove rows containing outliers
    - `REPLACE_WITH_MEDIAN`: Replace outlier values with their column median
    - `CAP_WITH_BOUNDARIES`: Clip outlier values to the inlier boundary limits (based on the detection method)


```python
import dataglass as dg
import pandas as pd
import numpy as np

# Sample dataset with an outlier in the "age" column
df = pd.DataFrame({
    "name": ["John", "Johney", "Jack", "Sara", "Chris"],
    "age": [40, 45, 30, 25, 200],
})

# Step 1: Detect outliers using the IQR method
outliers, boundaries = dg.detect_outliers(
    data = df, 
    detect_outlier_method = dg.DetectOutlierMethod.IQR)

print(f"Boundries:\n{boundaries}")

# Step 2: Cap outlier values with the calculated boundaries
df_cleaned = dg.handle_outliers(
    data = df,
    handle_outlier_method = dg.HandleOutlierMethod.CAP_WITH_BOUNDARIES,
    outliers = outliers,
    boundaries=boundaries,
    verbose=True)

# Visualize the outliers using boxplot and histograms before and after cleaning
dg.visualize_outliers(df, df_cleaned, "", dg.DetectOutlierMethod.IQR, dg.HandleOutlierMethod.CAP_WITH_BOUNDARIES)

print(f"Preprocessed Data:\n{df_cleaned}")

# =========== Expected Terminal Output =============

# Boundaries:
# {'age': (np.float64(7.5), np.float64(67.5))}

# Dataset has 5 rows before handling outliers values.

# Top 10 of rows containing outliers are (Totally 1 rows):
#     name  age
# 4  Chris  200

# Dataset has 5 rows after handling outliers.

# Preprocessed Data:
#      name   age
# 0    John  40.0
# 1  Johney  45.0
# 2    Jack  30.0
# 3    Sara  25.0
# 4   Chris  67.5

# Visualizations have been saved in the 'visualizations' folder inside the project root directory.
```
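
The reported boundaries follow from the standard IQR rule: for `age`, Q1 = 30 and Q3 = 45, so IQR = 15 and the fences are 30 - 1.5 × 15 = 7.5 and 45 + 1.5 × 15 = 67.5. A quick pandas check of that arithmetic:

```python
import pandas as pd

age = pd.Series([40, 45, 30, 25, 200])
q1, q3 = age.quantile(0.25), age.quantile(0.75)  # 30.0, 45.0
iqr = q3 - q1                                    # 15.0
print(q1 - 1.5 * iqr, q3 + 1.5 * iqr)            # 7.5 67.5
```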
<br>

#### 🔢 Feature Encoding Module
This module provides multiple methods to encode categorical features into numerical representations suitable for machine learning.

- ***encode_feature***: 
    - Supported methods: `LABEL_ENCODING`, `ONEHOT_ENCODING`, `HASHING`
    - Optionally specify columns; otherwise, all categorical columns will be encoded
    - To apply different methods to different columns, call the function multiple times with desired parameters

```python
import dataglass as dg
import pandas as pd
import numpy as np

# Sample dataset with a categorical "gender" column
df = pd.DataFrame({
    "name": ["John", "Jane", "Jack"],
    "age": [40, 45, 50],
    "gender": ["male", "female", "male"]
})

# Only "gender" column will be encoded
columns_subset = ["gender"]

# Convert "gender" to numerical labels (e.g., male=1, female=0)
df_cleaned = dg.encode_feature(
    data = df,
    feature_encoding_method = dg.FeatureEncodingMethod.LABEL_ENCODING,
    columns_subset = columns_subset)

print(f"Preprocessed Data:\n{df_cleaned}")

# =========== Expected Terminal Output =============

# Preprocessed Data:
#    name  age  gender  gender_encoded
# 0  John   40    male               1
# 1  Jane   45  female               0
# 2  Jack   50    male               1
```
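
For the other two methods, plain pandas and the declared dependency `category_encoders` can serve as standalone references; this sketch shows the equivalent transformations directly, without claiming how dataglass wires them internally:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"gender": ["male", "female", "male"]})

# One-hot encoding: one indicator column per category
print(pd.get_dummies(df, columns=["gender"]))

# Hashing encoding: categories hashed into a fixed number of columns
hasher = ce.HashingEncoder(cols=["gender"], n_components=4)
print(hasher.fit_transform(df))
```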
<br>

#### 🔁 Type Conversion Module
This module provides methods for converting column datatypes for better compatibility and precision.

- ***convert_datatype_auto***: 
    - Automatically infers and converts column datatypes based on heuristics.
- ***convert_datatype_userdefined***:
    - Converts column datatypes based on a user-defined mapping scenario (supports formats like datetime parsing).

```python
import dataglass as dg
import pandas as pd
import numpy as np

# Sample dataset with mixed types
df = pd.DataFrame({
    "name": ["John", "Jane", "Jack"],
    "age": [40.0, 45, 50.0],
    "signup_date": ["2023-01-01", "2023-01-01", "2023-03-01"]
})

# User-defined scenario specifying how to convert specific columns
convert_scenario = {
    "column": ["age", "signup_date"],
    "datatype": ["int", "datetime"],
    "format": ["", "%Y-%m-%d"]
}

# Apply type conversion using the user-defined configuration
df_cleaned = dg.convert_datatype_userdefined(
    data = df,
    convert_scenario = convert_scenario,
    verbose=True)

print(f"Preprocessed Data:\n{df_cleaned}")

# =========== Expected Terminal Output =============

# Before automatic datatype conversion, the datatype are as follows:
# name            object
# age            float64
# signup_date     object
# dtype: object

# After automatic datatype conversion, the datatype are as follows:
# name                   object
# age                     int64
# signup_date    datetime64[ns]
# dtype: object

# Preprocessed Data:
#    name  age signup_date
# 0  John   40  2023-01-01
# 1  Jane   45  2023-01-01
# 2  Jack   50  2023-03-01
```
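
For the automatic path, pandas itself provides the kind of inference heuristics that `convert_datatype_auto` is described as applying; a rough standalone sketch of the idea (not dataglass's exact rules):

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["40", "45", "50"],                             # numeric values stored as strings
    "signup_date": ["2023-01-01", "2023-01-01", "2023-03-01"],
})

df["age"] = pd.to_numeric(df["age"])                       # object -> int64
df["signup_date"] = pd.to_datetime(df["signup_date"])      # object -> datetime64[ns]
print(df.dtypes)
```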
<br>

#### 📏 Feature Scaling Module
This module allows feature scaling using different methods on selected columns, with an optional L2 normalization across all numeric columns.

- ***scale_feature***: 
    - Supported scaling methods: `MINMAX_SCALING`, `ZSCORE_STANDARDIZATION`, `ROBUST_SCALING`
    - L2 normalization can be optionally applied to all numeric columns after scaling
    - Scaling can be customized per column using the `scaling_scenario`

```python
import dataglass as dg
import pandas as pd
import numpy as np

# Sample dataset with numeric features
df = pd.DataFrame({
    "name": ["John", "Jane", "Jack"],
    "age": [40, 45, 50],
    "score": [60, 70, 180],
    "income": [5000, 4500, 3000]
})

# Define a scenario: scale "age" with Min-Max, "score" with robust scaling, and "income" with Z-score
scaling_scenario = {
    "column": ["age", "score", "income"],
    "scaling_method": ["MINMAX_SCALING", "ROBUST_SCALING", "ZSCORE_STANDARDIZATION"]
}

# Apply scaling and then L2 normalize all numeric features
df_cleaned = dg.scale_feature(
    data = df,
    scaling_scenario = scaling_scenario,
    apply_l2normalization = True)

print(f"Preprocessed Data:\n{df_cleaned}")

# =========== Expected Terminal Output =============

# Preprocessed Data:
#    name       age     score    income
# 0  John  0.000000 -0.167564  0.985861
# 1  Jane  0.786796  0.000000  0.617213
# 2  Jack  0.400137  0.733584 -0.549313
```
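
The expected output can be reproduced with scikit-learn (also a declared dependency) by scaling each column independently and then L2-normalizing each row of the numeric block. A verification sketch under that reading:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler, normalize

df = pd.DataFrame({
    "age": [40, 45, 50],
    "score": [60, 70, 180],
    "income": [5000, 4500, 3000],
})

# Per-column scaling, matching the scenario above
df[["age"]] = MinMaxScaler().fit_transform(df[["age"]])
df[["score"]] = RobustScaler().fit_transform(df[["score"]])
df[["income"]] = StandardScaler().fit_transform(df[["income"]])

# Row-wise L2 normalization across the numeric columns
df[["age", "score", "income"]] = normalize(df[["age", "score", "income"]], norm="l2")
print(df.round(6))  # matches the dataglass output above
```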

---

## ✅ Requirements

- Python ≥ 3.10

All other dependencies will be installed automatically via `pip install dataglass`.

---

## 🛣️ Roadmap

- ✅ Preprocessing Modules
- ✅ Custom Pipelines
- ✅ Automatic Preprocessing
- ⏳ Exploratory Data Analysis (EDA)
- ⏳ Machine Learning Modules

---

## 📄 License

This project is licensed under the [BSD 3-Clause License](https://opensource.org/license/BSD-3-Clause).  
See the [LICENSE](https://github.com/samantim/dataglass/blob/main/LICENSE) file in the repository for full details.

---

## 🤝 Contributing

Contributions, bug reports, and feature requests are welcome!  
Please open an issue or submit a pull request via [GitHub](https://github.com/samantim/dataglass).

---

## 👤 Author

**Saman Teymouri**  
*Data Scientist/Analyst & Python Developer*  
Berlin, Germany

            
