# RobustPreprocessor
**RobustPreprocessor** is a Python package for comprehensive and flexible data preprocessing. It is designed to clean and prepare numeric datasets for machine learning and analysis by handling outliers, missing values, infinity values, and redundant features.
---
## Features
- **Outlier Handling**:
  - Interquartile Range (IQR) clipping.
  - Z-score filtering.
- **Infinity Value Handling**:
  - Replace with finite extremes.
  - Drop rows containing infinity.
  - Replace with a specific default value (e.g., 0).
- **Missing Value Imputation**:
  - Mean, median, or most frequent value strategies.
- **Feature Removal**:
  - Drop constant or near-constant features.
- **Visualization**:
  - Plot feature distributions with histograms.
- **Execution Logging**:
  - Outputs a JSON summary of preprocessing steps, including dropped columns and execution time.
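
The two outlier strategies listed above can be sketched in plain pandas. This is a minimal illustration, not the package's own implementation: the helper names `iqr_clip` and `zscore_filter`, the 1.5 × IQR fences, and the threshold of 3 standard deviations are conventional choices assumed here.

```python
import pandas as pd


def iqr_clip(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    values = series.astype(float)
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def zscore_filter(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Keep rows whose |z-score| is within the threshold in every column (hypothetical helper)."""
    z = (df - df.mean()) / df.std(ddof=0)
    # A NaN z-score (missing value or zero-variance column) is not treated as an outlier.
    within = (z.abs() <= threshold) | z.isna()
    return df[within.all(axis=1)]


clipped = iqr_clip(pd.Series([1, 2, 3, 1000, 5]))
print(clipped.tolist())  # [1.0, 2.0, 3.0, 9.5, 5.0]

frame = pd.DataFrame({"x": [10.0] * 19 + [1000.0]})
filtered = zscore_filter(frame)
print(len(frame), len(filtered))  # 20 19
```

Clipping keeps every row and pulls extreme values back to the fences, while filtering removes the offending rows, so the two methods differ in whether the row count is preserved.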
---
## Installation
Install via pip:
```bash
pip install robustpreprocessor
```
Or clone the repository and install locally:
```bash
git clone https://github.com/nqmn/robustpreprocessor.git
cd robustpreprocessor
pip install .
```
---
## Usage
### Import the Package
```python
import numpy as np
import pandas as pd

from robustpreprocessor import RobustPreprocessor

# Example dataset with an outlier, a missing value, a constant column, and infinities
data = pd.DataFrame({
    "feature_1": [1, 2, 3, 1000, 5],
    "feature_2": [1, 2, None, 4, 5],
    "feature_3": [0, 0, 0, 0, 0],
    "feature_4": [1, 2, np.inf, -np.inf, 5],
})

# Initialize the preprocessor
preprocessor = RobustPreprocessor(verbose=True)

# Preprocess the dataset
cleaned_data = preprocessor.preprocess(
    data,
    outlier_method="IQR",
    infinity_handling="set_value",
    missing_value_strategy="mean",
    feature_removal_criteria="constant",
)

# View the cleaned data
print(cleaned_data)
```
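
To make the infinity, imputation, and feature-removal options more concrete, the sketch below performs comparable steps with plain pandas and NumPy. It is an approximation under stated assumptions, not the package's code: the helper names are hypothetical, "finite extremes" is interpreted as each column's largest and smallest finite value, and the steps are shown on their own rather than in the package's internal order.

```python
import numpy as np
import pandas as pd


def replace_inf_with_value(df: pd.DataFrame, value: float = 0.0) -> pd.DataFrame:
    """Replace +inf and -inf with a fixed value (hypothetical helper)."""
    return df.replace([np.inf, -np.inf], value)


def drop_inf_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every row that contains +inf or -inf (hypothetical helper)."""
    return df[~np.isinf(df).any(axis=1)]


def replace_inf_with_finite_extremes(df: pd.DataFrame) -> pd.DataFrame:
    """Replace +inf/-inf with the column's finite max/min (assumed interpretation)."""
    out = df.astype(float)
    for col in out.columns:
        finite = out[col][np.isfinite(out[col])]
        if not finite.empty:
            out[col] = out[col].replace({np.inf: finite.max(), -np.inf: finite.min()})
    return out


data = pd.DataFrame({
    "feature_2": [1, 2, None, 4, 5],
    "feature_3": [0, 0, 0, 0, 0],
    "feature_4": [1, 2, np.inf, -np.inf, 5],
})

no_inf = replace_inf_with_value(data)            # the "set value" option with 0
imputed = no_inf.fillna(no_inf.mean())           # mean imputation per column
reduced = imputed.loc[:, imputed.nunique() > 1]  # drop constant columns

print(reduced)
```

Here `feature_3` is removed because it holds a single distinct value, the missing entry in `feature_2` becomes the column mean, and both infinities in `feature_4` become 0.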
---
## Visualization
Use the `plot_feature_distributions` method to visualize the distributions of numeric features:
```python
preprocessor.plot_feature_distributions(cleaned_data)
```
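
A comparable plot can be produced directly with matplotlib. The sketch below is an assumption about what such a method might do, not the package's implementation; the function name `plot_histograms`, the bin count, and the single-row layout are illustrative choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd


def plot_histograms(df: pd.DataFrame, bins: int = 10):
    """Draw one histogram per numeric column (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    n = len(numeric.columns)
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 3), squeeze=False)
    for ax, col in zip(axes[0], numeric.columns):
        ax.hist(numeric[col].dropna(), bins=bins)
        ax.set_title(col)
        ax.set_xlabel("value")
        ax.set_ylabel("count")
    fig.tight_layout()
    return fig


sample = pd.DataFrame({
    "feature_1": [1.0, 2.0, 3.0, 9.5, 5.0],
    "feature_2": [1.0, 2.0, 3.0, 4.0, 5.0],
    "label": list("abcde"),  # non-numeric column, skipped by the helper
})
fig = plot_histograms(sample)
```

Call `fig.savefig(...)` to write the figure to disk, or `plt.show()` under an interactive backend.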
---
## Logging and Execution Summary
After preprocessing, the class outputs a JSON log summarizing the steps taken, including execution time and dropped columns:
```json
{
    "process_type": "RobustPreprocessor",
    "user_selections": {
        "outlier_method": "IQR",
        "infinity_handling": "set_value",
        "missing_value_strategy": "mean",
        "feature_removal_criteria": "constant"
    },
    "steps_executed": {
        "select_numeric_columns": "Selected 4 numeric columns",
        "outlier_handling": "Handled outliers using IQR method.",
        "infinity_handling": "Replaced infinity values with a set value (0).",
        "missing_value_imputation": "Imputed missing values with mean strategy.",
        "feature_removal": "Dropped 1 constant columns."
    },
    "dropped_columns": 1,
    "execution_time_seconds": 0.1234
}
```
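
If the summary is captured as text, it can be inspected with the standard `json` module. How the log is exposed (printed, returned, or written to a file) is not specified here, so the sketch below assumes it is already available as a string; `summarize_log` is a hypothetical helper, and the sample text is an abridged copy of the log shown above.

```python
import json

# Abridged copy of the example log, assumed to be available as a string.
log_text = """
{
    "process_type": "RobustPreprocessor",
    "steps_executed": {
        "select_numeric_columns": "Selected 4 numeric columns",
        "outlier_handling": "Handled outliers using IQR method.",
        "feature_removal": "Dropped 1 constant columns."
    },
    "dropped_columns": 1,
    "execution_time_seconds": 0.1234
}
"""


def summarize_log(text: str) -> dict:
    """Extract the step names, dropped-column count, and runtime (hypothetical helper)."""
    log = json.loads(text)
    return {
        "steps": list(log["steps_executed"]),
        "dropped_columns": log["dropped_columns"],
        "seconds": log["execution_time_seconds"],
    }


summary = summarize_log(log_text)
print(summary["steps"])
print(summary["dropped_columns"], summary["seconds"])
```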
---
## Dependencies
- `numpy`
- `pandas`
- `scikit-learn`
- `matplotlib`
- `scipy`
From a clone of the repository, install them with:
```bash
pip install -r requirements.txt
```
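
To check an existing environment, the installed versions can be listed next to the minimums declared in the package metadata for version 1.0.0. This is a small sketch using only the standard library; `installed_versions` is a hypothetical helper, and it reports versions without comparing them.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions as declared in the 1.0.0 package metadata.
MINIMUMS = {
    "scikit-learn": "1.0",
    "numpy": "1.20",
    "pandas": "1.3",
    "matplotlib": "3.2.0",
    "scipy": "1.7",
}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(MINIMUMS)
for name, minimum in MINIMUMS.items():
    print(f"{name}: installed={report[name]}, declared minimum={minimum}")
```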
---
## Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Commit your changes with clear descriptions.
4. Submit a pull request.
---
## License
This project is licensed under the MIT License. See the `LICENSE` file for details.
---
## Support
If you encounter issues or have questions, feel free to open an issue on [GitHub](https://github.com/nqmn/robustpreprocessor/issues).
---
## Acknowledgments
- Inspired by common challenges in data preprocessing.
- Thanks to the contributors and the open-source community!
Raw data
{
"_id": null,
"home_page": "https://github.com/nqmn/robustpreprocessor",
"name": "robustpreprocessor",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "data preprocessing, outlier handling, missing data, data cleaning, feature engineering, robust preprocessing",
"author": "Mohd Adil",
"author_email": "mohdadil@live.com",
"download_url": "https://files.pythonhosted.org/packages/41/e4/329ceb49b1176a13a1f2f631fa48ca92bfdd113d743a37108fba69a19897/robustpreprocessor-1.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "RobustPreprocessor is designed to preprocess datasets effectively to ensure robust data preparation before further analysis or modeling.",
"version": "1.0.0",
"project_urls": {
"Homepage": "https://github.com/nqmn/robustpreprocessor"
},
"split_keywords": [
"data preprocessing",
" outlier handling",
" missing data",
" data cleaning",
" feature engineering",
" robust preprocessing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f257b6b01d00dd3a290e679b237ff1ff224822f2ad661b38bad6d869cae8192f",
"md5": "1d18f0d712ccfb19069746b528fb36ab",
"sha256": "a44fb3bb47994a1496e19db3031c259b2d81b24dbbc2589ca475f4d06052e041"
},
"downloads": -1,
"filename": "robustpreprocessor-1.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1d18f0d712ccfb19069746b528fb36ab",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 7047,
"upload_time": "2024-11-22T09:57:55",
"upload_time_iso_8601": "2024-11-22T09:57:55.422007Z",
"url": "https://files.pythonhosted.org/packages/f2/57/b6b01d00dd3a290e679b237ff1ff224822f2ad661b38bad6d869cae8192f/robustpreprocessor-1.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "41e4329ceb49b1176a13a1f2f631fa48ca92bfdd113d743a37108fba69a19897",
"md5": "1af201853f70f5d2a6022a5483d77bef",
"sha256": "13768a47a1528b89b669e4fc6f5cf02a8171a0b783a277eca0e6e2c012e7f2c2"
},
"downloads": -1,
"filename": "robustpreprocessor-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "1af201853f70f5d2a6022a5483d77bef",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 6450,
"upload_time": "2024-11-22T09:57:57",
"upload_time_iso_8601": "2024-11-22T09:57:57.598828Z",
"url": "https://files.pythonhosted.org/packages/41/e4/329ceb49b1176a13a1f2f631fa48ca92bfdd113d743a37108fba69a19897/robustpreprocessor-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-22 09:57:57",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "nqmn",
"github_project": "robustpreprocessor",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "scikit-learn",
"specs": [
[
">=",
"1.0"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.20"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"1.3"
]
]
},
{
"name": "matplotlib",
"specs": [
[
">=",
"3.2.0"
]
]
},
{
"name": "scipy",
"specs": [
[
">=",
"1.7"
]
]
}
],
"lcname": "robustpreprocessor"
}