| Field | Value |
| --- | --- |
| Name | dataglass |
| Version | 0.8.1 |
| Summary | dataglass is a Python library for data preprocessing, exploratory data analysis (EDA), and machine learning. It includes modules for handling missing values, detecting and resolving duplicates, managing outliers, feature encoding, type conversion, scaling, and pipeline integration. With its latest update, dataglass introduces intelligent automation that dynamically adapts preprocessing steps based on dataset characteristics, minimizing manual configuration and accelerating your workflow. |
| Upload time | 2025-09-07 07:54:05 |
| Author | Saman Teymouri |
| Requires Python | >=3.10 |
| License | BSD-3-Clause |
| Keywords | data preprocessing, eda, machine learning, data cleaning, feature engineering, pipeline, pandas, scikit-learn |
| Requirements | pandas, numpy, pytest, rapidfuzz, scikit-learn, matplotlib, seaborn, category_encoders, build, setuptools, twine |
# 🔮 dataglass
**A modular and lightweight library for preprocessing, analysis, and modeling structured datasets in Python.**
`dataglass` provides an easy-to-use yet powerful framework for essential preprocessing tasks such as missing value handling, duplicate removal, outlier detection and management, feature encoding, type conversion, and feature scaling, all designed to integrate with custom pipeline workflows. It also introduces intelligent automation that dynamically adapts preprocessing steps based on dataset characteristics, minimizing manual configuration and accelerating your workflow.
---
## 🤖 Auto-Preprocessing (New!)
`dataglass` now features an intelligent auto-preprocessing module that dynamically constructs an appropriate pipeline based on your dataset's characteristics, so no manual configuration is required.
Just call a single function:
```python
import dataglass as dg
import pandas as pd

# df can be any pandas DataFrame; the path below is a placeholder for your own data
df = pd.read_csv("your_data.csv")

df_cleaned = dg.auto_preprocess_for_analysis(
    data = df,
    verbose = True  # Show decisions and intermediate steps in a log file
)
```
---
## 🚀 Preprocessing Features
**❓ Missing Value Handling**
Drop rows, imputation by datatype (mean, median, mode), imputation by adjacent values (forward/backward fill), and interpolation (linear, time-based)
**📑 Duplicate Detection & Removal**
Detect and remove exact and fuzzy duplicates using full and partial similarity checks
**❗ Outlier Detection & Handling**
Detect outliers using IQR, Z-Score, Isolation Forest, and Local Outlier Factor (LOF)
Handle them by dropping, replacing with the median, or capping at boundaries
Includes visualization tools: before vs. after boxplots and histograms
**🔢 Feature Encoding**
Supports label encoding, one-hot encoding, and hashing for categorical variables
**🔁 Type Conversion**
Automatic datatype inference and user-defined type conversion support
**📏 Feature Scaling**
Includes Min-Max scaling, Z-Score (standard) scaling, robust scaling, and L2 normalization
**🧩 Pipeline Compatibility**
Custom lightweight pipeline interface for chaining reusable preprocessing steps
**💾 Non-destructive Processing**
All operations are applied to copies; the original data remains unchanged
---
## 📦 Installation
```bash
pip install dataglass
```
---
## 📘 Usage Examples (Pipeline vs Functional)
There are two approaches to using the library features: the **pipeline architecture** and **standalone function** usage. The examples below demonstrate both methods.
<br>
### 🧩 Pipeline Architecture (Simplest Configuration)
Use this approach when you want a clean, modular, and reusable workflow for **end-to-end preprocessing**.
```python
# Importing the library and dependencies
import dataglass as dg
import pandas as pd
import numpy as np
# Creating a sample dataframe with a missing value and a categorical column
df = pd.DataFrame({
    "name": ["John", "Jane", "Jack"],
    "age": [40, np.nan, 50],
    "gender": ["male", "female", "male"]
})
# Step 1: Handle missing values by dropping rows that contain any missing value
handle_missing = dg.HandleMissingStep(dg.HandleMissingMethod.DROP)
# Step 2: Handle duplicates by removing exact duplicate rows
handle_duplicate = dg.HandleDuplicateStep(dg.HandleDuplicateMethod.EXACT)
# Step 3: Automatically detect and convert datatypes; verbose=True prints conversion logs
type_conversion = dg.TypeConversionStep(dg.ConvertDatatypeMethod.AUTO, verbose=True)
# Step 4: Detect outliers using IQR and remove them
handle_outlier = dg.HandleOutlierStep(dg.DetectOutlierMethod.IQR, dg.HandleOutlierMethod.DROP)
# Step 5: Scale the 'age' column using Min-Max scaling
scale_feature = dg.ScaleFeatureStep({"column": ["age"], "scaling_method": ["MINMAX_SCALING"]})
# Step 6: Encode the 'gender' column using label encoding
encode_feature = dg.EncodeFeatureStep(dg.FeatureEncodingMethod.LABEL_ENCODING, ["gender"])
# Create the pipeline by chaining all the preprocessing steps in the desired order
dp = dg.DataPipeline([
    handle_missing,
    handle_duplicate,
    type_conversion,
    handle_outlier,
    scale_feature,
    encode_feature,
])
# Apply the pipeline to the dataframe
df_cleaned = dp.apply(df)
# Display the cleaned and transformed dataframe
print(f"Preprocessed Data:\n{df_cleaned}")
# =========== Expected Terminal Output =============
# Before automatic datatype conversion, the datatype are as follows:
# name object
# age float64
# gender object
# dtype: object
# After automatic datatype conversion, the datatype are as follows:
# name object
# age int64
# gender object
# dtype: object
# Preprocessed Data:
# name age gender gender_encoded
# 0 John 0.0 male 0
# 2 Jack 1.0 male 0
```
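
As noted under Non-destructive Processing above, the pipeline operates on a copy and never mutates its input. A quick sanity check, reusing `df` and `dp` from the example above:

```python
# Non-destructive processing: dp.apply works on a copy,
# so the original dataframe is left unchanged.
df_before = df.copy()
df_cleaned = dp.apply(df)
assert df.equals(df_before)
```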
<br>
### ⚙️ Standalone Function Usage
Use this approach when you need fine-grained control or quick one-off transformations on specific parts of your data.
#### ❓ Missing Value Handling Module
This module provides multiple strategies to handle missing data through these functions:
- ***handle_missing_values_drop***: Drop-based strategy
- `Eliminate` all rows that contain any NaN value.
- ***handle_missing_values_datatype_imputation***: Data type-aware imputation
- Fill missing *numeric* values using the specified strategy: `mean`, `median`, or `mode`.
- Fill missing *categorical* values with the first `mode` of each column.
- ***handle_missing_values_adjacent_value_imputation***: Value propagation or interpolation
- `Forward fill (ffill)`
- `Backward fill (bfill)`
- `Linear interpolation`
- `Time-based interpolation` (if datetime index is present)
```python
import dataglass as dg
import pandas as pd
import numpy as np
# Creating a sample dataframe with a missing value
df = pd.DataFrame({
    "name": ["John", "Jane", "Jack"],
    "age": [40, np.nan, 50],
    "gender": ["male", "female", np.nan]
})
# Impute numeric columns using the mean and categorical columns using the first mode of each column
df_cleaned = dg.handle_missing_values_datatype_imputation(
    data = df,
    numeric_datatype_imputation_method = dg.NumericDatatypeImputationMethod.MEAN,
    verbose = True
)
print(f"Preprocessed Data:\n{df_cleaned}")
# =========== Expected Terminal Output =============
# Dataset has 3 rows before handling missing values.
# Missing values are:
# name 0
# age 1
# gender 1
# dtype: int64
# Dataset has 3 rows after handling missing values.
# Preprocessed Data:
# name age gender
# 0 John 40.0 male
# 1 Jane 45.0 female
# 2 Jack 50.0 female
```
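
For the adjacent-value strategies listed above, plain pandas illustrates what forward fill and time-based interpolation do conceptually (this sketch calls pandas directly rather than `handle_missing_values_adjacent_value_imputation`, whose full signature is not shown here):

```python
import pandas as pd
import numpy as np

# A small series with a gap and a datetime index (required for time-based interpolation)
s = pd.Series([40.0, np.nan, 50.0],
              index=pd.date_range("2023-01-01", periods=3, freq="D"))

print(s.ffill().tolist())                     # [40.0, 40.0, 50.0] - forward fill
print(s.interpolate(method="time").tolist())  # [40.0, 45.0, 50.0] - time-based interpolation
```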
<br>
#### 📑 Duplicate Handling Module
This module provides two strategies to handle duplicate data through these functions:
- ***handle_duplicate_values_exact***: Remove `exact duplicate` rows
- Optionally, a specific set of columns can be provided for duplicate analysis via `columns_subset`
- ***handle_duplicate_values_fuzzy***: Remove `approximate (fuzzy) duplicates` based on string similarity
- Define the `similarity threshold` (e.g., 70-90%)
- Limit the comparison to specific columns via `columns_subset`
```python
import dataglass as dg
import pandas as pd
import numpy as np
# Creating a sample dataframe with similar name values
df = pd.DataFrame({
    "name": ["John", "Johney", "Jack"],
    "age": [40, 45, 50],
})
# Only "name" column will be used to detect fuzzy duplicates
columns_subset = ["name"]
# Remove rows that are 70% or more similar in the "name" column (the first occurrence of each similarity group is kept)
df_cleaned = dg.handle_duplicate_values_fuzzy(
    data = df,
    columns_subset = columns_subset,
    similarity_thresholds = (70,100),
    verbose = True)
print(f"Preprocessed Data:\n{df_cleaned}")
# =========== Expected Terminal Output =============
# Dataset has 3 rows before handling duplicate values.
# Top 10 of duplicate values are (Totally 2 rows - including all duplicates, but from each group first one will remain and others will be removed):
# name age
# 0 John 40
# 1 Johney 45
# Dataset has 2 rows after handling duplicate values.
# Preprocessed Data:
# name age
# 0 John 40
# 2 Jack 50
```
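
The fuzzy strategy is driven by string-similarity scores; `rapidfuzz` (one of dataglass's dependencies) shows why "John" and "Johney" land in the same 70%+ similarity group while "Jack" does not. That dataglass uses this exact scorer internally is an assumption:

```python
from rapidfuzz import fuzz

print(fuzz.ratio("John", "Johney"))  # 80.0 - above the 70% threshold, same group
print(fuzz.ratio("John", "Jack"))    # 25.0 - below the threshold, kept separate
```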
<br>
#### ❗ Outlier Handling Module
This module separates the detection and handling of outliers, giving you flexibility and control.
- ***detect_outliers***: Detects outliers using various statistical or model-based techniques:
- `IQR`, `ZSCORE`, `ISOLATION_FOREST`, `LOCAL_OUTLIER_FACTOR`
- An optional list of columns can be specified; otherwise, all numeric columns are used
- Customization options like `contamination_rate` and `n_neighbors` available for model-based methods
- ***handle_outliers***: Applies the selected strategy to the detected outliers
- `DROP`: Remove rows containing outliers
- `REPLACE_WITH_MEDIAN`: Replace outlier values with their column median
- `CAP_WITH_BOUNDARIES`: Clip outlier values to the inlier boundary limits (based on the detection method)
```python
import dataglass as dg
import pandas as pd
import numpy as np
# Sample dataset with an outlier in the "age" column
df = pd.DataFrame({
    "name": ["John", "Johney", "Jack", "Sara", "Chris"],
    "age": [40, 45, 30, 25, 200],
})
# Step 1: Detect outliers using the IQR method
outliers, boundaries = dg.detect_outliers(
    data = df,
    detect_outlier_method = dg.DetectOutlierMethod.IQR)

print(f"Boundaries:\n{boundaries}")

# Step 2: Cap outlier values with the calculated boundaries
df_cleaned = dg.handle_outliers(
    data = df,
    handle_outlier_method = dg.HandleOutlierMethod.CAP_WITH_BOUNDARIES,
    outliers = outliers,
    boundaries = boundaries,
    verbose = True)
# Visualize the outliers using boxplot and histograms before and after cleaning
dg.visualize_outliers(df, df_cleaned, "", dg.DetectOutlierMethod.IQR, dg.HandleOutlierMethod.CAP_WITH_BOUNDARIES)
print(f"Preprocessed Data:\n{df_cleaned}")
# =========== Expected Terminal Output =============
# Boundaries:
# {'age': (np.float64(7.5), np.float64(67.5))}
# Dataset has 5 rows before handling outliers values.
# Top 10 of rows containing outliers are (Totally 1 rows):
# name age
# 4 Chris 200
# Dataset has 5 rows after handling outliers.
# Preprocessed Data:
# name age
# 0 John 40.0
# 1 Johney 45.0
# 2 Jack 30.0
# 3 Sara 25.0
# 4 Chris 67.5
# Visualizations have been saved in the 'visualizations' folder inside the project root directory.
```
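
The boundaries in the output follow from the standard 1.5 × IQR rule; recomputing them with pandas (assuming the default linear quantile interpolation) reproduces (7.5, 67.5):

```python
import pandas as pd

age = pd.Series([40, 45, 30, 25, 200])
q1, q3 = age.quantile(0.25), age.quantile(0.75)  # 30.0, 45.0
iqr = q3 - q1                                    # 15.0
print(q1 - 1.5 * iqr, q3 + 1.5 * iqr)            # 7.5 67.5
```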
<br>
#### 🔢 Feature Encoding Module
This module provides multiple methods to encode categorical features into numerical representations suitable for machine learning.
- ***encode_feature***:
- Supported methods: `LABEL_ENCODING`, `ONEHOT_ENCODING`, `HASHING`
- Optionally specify columns; otherwise, all categorical columns will be encoded
- To apply different methods to different columns, call the function multiple times with desired parameters
```python
import dataglass as dg
import pandas as pd
import numpy as np
# Sample dataset with a categorical "gender" column
df = pd.DataFrame({
    "name": ["John", "Jane", "Jack"],
    "age": [40, 45, 50],
    "gender": ["male", "female", "male"]
})
# Only "gender" column will be encoded
columns_subset = ["gender"]
# Convert "gender" to numerical labels (e.g., male=1, female=0)
df_cleaned = dg.encode_feature(
    data = df,
    feature_encoding_method = dg.FeatureEncodingMethod.LABEL_ENCODING,
    columns_subset = columns_subset)
print(f"Preprocessed Data:\n{df_cleaned}")
# =========== Expected Terminal Output =============
# Preprocessed Data:
# name age gender gender_encoded
# 0 John 40 male 1
# 1 Jane 45 female 0
# 2 Jack 50 male 1
```
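
By contrast, one-hot encoding expands a categorical column into one indicator column per category value. This plain-pandas sketch shows the concept; the exact columns produced by dataglass's `ONEHOT_ENCODING` may be named differently:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "male"]})
# One indicator column per category value (cast to int for a version-stable display)
print(pd.get_dummies(df, columns=["gender"]).astype(int))
#    gender_female  gender_male
# 0              0            1
# 1              1            0
# 2              0            1
```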
<br>
#### 🔁 Type Conversion Module
This module provides methods for converting column datatypes for better compatibility and precision.
- ***convert_datatype_auto***:
- Automatically infers and converts column datatypes based on heuristics.
- ***convert_datatype_userdefined***:
- Converts column datatypes based on a user-defined mapping scenario (supports formats like datetime parsing).
```python
import dataglass as dg
import pandas as pd
import numpy as np
# Sample dataset with mixed types
df = pd.DataFrame({
    "name": ["John", "Jane", "Jack"],
    "age": [40.0, 45, 50.0],
    "signup_date": ["2023-01-01", "2023-01-01", "2023-03-01"]
})
# User-defined scenario specifying how to convert specific columns
convert_scenario = {
    "column": ["age", "signup_date"],
    "datatype": ["int", "datetime"],
    "format": ["", "%Y-%m-%d"]
}

# Apply type conversion using the user-defined configuration
df_cleaned = dg.convert_datatype_userdefined(
    data = df,
    convert_scenario = convert_scenario,
    verbose = True)
print(f"Preprocessed Data:\n{df_cleaned}")
# =========== Expected Terminal Output =============
# Before automatic datatype conversion, the datatype are as follows:
# name object
# age float64
# signup_date object
# dtype: object
# After automatic datatype conversion, the datatype are as follows:
# name object
# age int64
# signup_date datetime64[ns]
# dtype: object
# Preprocessed Data:
# name age signup_date
# 0 John 40 2023-01-01
# 1 Jane 45 2023-01-01
# 2 Jack 50 2023-03-01
```
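
The scenario maps each listed column to a target datatype, with an optional format string for datetime parsing. In plain pandas, the requested conversions correspond roughly to:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [40.0, 45, 50.0],
    "signup_date": ["2023-01-01", "2023-01-01", "2023-03-01"]
})
df["age"] = df["age"].astype(int)                                         # "int"
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%Y-%m-%d")  # "datetime" with format
print(df.dtypes)  # age: int64, signup_date: datetime64[ns]
```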
<br>
#### 📏 Feature Scaling Module
This module allows feature scaling using different methods on selected columns, with an optional L2 normalization across all numeric columns.
- ***scale_feature***:
- Supported scaling methods: `MINMAX_SCALING`, `ZSCORE_STANDARDIZATION`, `ROBUST_SCALING`
- L2 normalization can be optionally applied to all numeric columns after scaling
- Scaling can be customized per column using the `scaling_scenario`
```python
import dataglass as dg
import pandas as pd
import numpy as np
# Sample dataset with numeric features
df = pd.DataFrame({
    "name": ["John", "Jane", "Jack"],
    "age": [40, 45, 50],
    "score": [60, 70, 180],
    "income": [5000, 4500, 3000]
})
# Define a scenario to scale "age" with Min-Max, "score" with robust scaling, and "income" with Z-score standardization
scaling_scenario = {
    "column": ["age", "score", "income"],
    "scaling_method": ["MINMAX_SCALING", "ROBUST_SCALING", "ZSCORE_STANDARDIZATION"]
}

# Apply scaling, then L2-normalize all numeric features
df_cleaned = dg.scale_feature(
    data = df,
    scaling_scenario = scaling_scenario,
    apply_l2normalization = True)
print(f"Preprocessed Data:\n{df_cleaned}")
# =========== Expected Terminal Output =============
# Preprocessed Data:
# name age score income
# 0 John 0.000000 -0.167564 0.985861
# 1 Jane 0.786796 0.000000 0.617213
# 2 Jack 0.400137 0.733584 -0.549313
```
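
The output can be reproduced by hand. Assuming sklearn-style definitions (Min-Max to [0, 1], robust scaling by median and IQR, Z-score with the population standard deviation) followed by row-wise L2 normalization, John's row works out as:

```python
import numpy as np

age = (40 - 40) / (50 - 40)                             # Min-Max scaling    -> 0.0
score = (60 - 70) / (125 - 65)                          # (x - median) / IQR -> -0.1667
income_col = np.array([5000, 4500, 3000])
income = (5000 - income_col.mean()) / income_col.std()  # Z-score (population std) -> 0.9806
row = np.array([age, score, income])
print(row / np.linalg.norm(row))  # [ 0.        -0.167564  0.985861]
```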
---
## ✅ Requirements

- Python ≥ 3.10

All other dependencies will be installed automatically via `pip install dataglass`.
---
## 🛣️ Roadmap

- ✅ Preprocessing Modules
- ✅ Custom Pipelines
- ✅ Automatic Preprocessing
- ⏳ Exploratory Data Analysis (EDA)
- ⏳ Machine Learning Modules
---
## 📄 License
This project is licensed under the [BSD 3-Clause License](https://opensource.org/license/BSD-3-Clause).
See the [LICENSE](https://github.com/samantim/dataglass/blob/main/LICENSE) file in the repository for full details.
---
## 🤝 Contributing
Contributions, bug reports, and feature requests are welcome!
Please open an issue or submit a pull request via [GitHub](https://github.com/samantim/dataglass).
---
## 👤 Author
**Saman Teymouri**
*Data Scientist/Analyst & Python Developer*
Berlin, Germany