robustpreprocessor


- Name: robustpreprocessor
- Version: 1.0.0
- Home page: https://github.com/nqmn/robustpreprocessor
- Summary: RobustPreprocessor is designed to preprocess datasets effectively to ensure robust data preparation before further analysis or modeling.
- Upload time: 2024-11-22 09:57:57
- Author: Mohd Adil
- Requires Python: >=3.8
- License (package metadata): None
- Keywords: data preprocessing, outlier handling, missing data, data cleaning, feature engineering, robust preprocessing
- Requirements: scikit-learn, numpy, pandas, matplotlib, scipy

# RobustPreprocessor

**RobustPreprocessor** is a Python package for comprehensive and flexible data preprocessing. It is designed to clean and prepare numeric datasets for machine learning and analysis by handling outliers, missing values, infinity values, and redundant features.

---

## Features

- **Outlier Handling**:
  - Interquartile Range (IQR) clipping.
  - Z-score filtering.
  
- **Infinity Value Handling**:
  - Replace with finite extremes.
  - Drop rows containing infinity.
  - Replace with a specific default value (e.g., 0).

- **Missing Value Imputation**:
  - Mean, median, or most frequent value strategies.

- **Feature Removal**:
  - Drop constant or near-constant features.
  
- **Visualization**:
  - Plot feature distributions with histograms.

- **Execution Logging**:
  - Outputs a JSON summary of preprocessing steps, including dropped columns and execution time.
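
The two outlier strategies listed above can be illustrated in plain pandas. This is a minimal sketch, not the package's internal code: the helper names `clip_iqr` and `filter_zscore`, the 1.5 × IQR fence, and the threshold of 3 standard deviations are conventional defaults assumed here, since this README does not state which values RobustPreprocessor uses.

```python
import pandas as pd


def clip_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Clip every column to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the per-column bounds with the DataFrame's columns.
    return df.clip(lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1)


def filter_zscore(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Keep only rows whose values all lie within `threshold` standard deviations."""
    z = (df - df.mean()) / df.std(ddof=0)
    # Constant columns give NaN z-scores (0 / 0); treat those as "not an outlier".
    z = z.fillna(0.0)
    return df[(z.abs() <= threshold).all(axis=1)]


# IQR clipping: Q1 = 2, Q3 = 5, so 1000 is clipped to 5 + 1.5 * 3 = 9.5.
small = pd.DataFrame({"feature_1": [1.0, 2.0, 3.0, 1000.0, 5.0]})
clipped = clip_iqr(small)

# Z-score filtering needs more rows before a single value can exceed 3 sigma.
larger = pd.DataFrame({"x": [10.0] * 19 + [1000.0]})
filtered = filter_zscore(larger)

print(clipped)
print(len(larger), "->", len(filtered), "rows after z-score filtering")
```

Clipping keeps every row and only pulls extreme values back to the fence, whereas z-score filtering removes rows, so the two options change the shape of the result differently.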

---

## Installation

Install via pip:

```bash
pip install robustpreprocessor
```

Or clone the repository and install locally:

```bash
git clone https://github.com/nqmn/robustpreprocessor.git
cd robustpreprocessor
pip install .
```

---

## Usage

### Import the Package

```python
from robustpreprocessor import RobustPreprocessor
import numpy as np  # needed for np.inf in the example dataset
import pandas as pd

# Example dataset
data = pd.DataFrame({
    "feature_1": [1, 2, 3, 1000, 5],
    "feature_2": [1, 2, None, 4, 5],
    "feature_3": [0, 0, 0, 0, 0],
    "feature_4": [1, 2, np.inf, -np.inf, 5],
})

# Initialize the preprocessor
preprocessor = RobustPreprocessor(verbose=True)

# Preprocess the dataset
cleaned_data = preprocessor.preprocess(
    data,
    outlier_method="IQR",
    infinity_handling="set_value",
    missing_value_strategy="mean",
    feature_removal_criteria="constant"
)

# View the cleaned data
print(cleaned_data)
```
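
To make the options in the call above more concrete, the following sketch reproduces roughly what `infinity_handling="set_value"`, `missing_value_strategy="mean"`, and `feature_removal_criteria="constant"` would mean in plain pandas. It is an illustration under assumptions, not the package's implementation: the replacement value 0 and the step order are taken from the example log in this README, and the variable names are hypothetical. Outlier handling is left out to keep the example short.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "feature_2": [1, 2, None, 4, 5],
    "feature_3": [0, 0, 0, 0, 0],
    "feature_4": [1, 2, np.inf, -np.inf, 5],
})

# Infinity handling, analogous to "set_value": swap +/-inf for a fixed value.
no_inf = data.replace([np.inf, -np.inf], 0)

# Missing-value imputation, analogous to "mean": fill NaN with the column mean.
imputed = no_inf.fillna(no_inf.mean())

# Feature removal, analogous to "constant": drop columns with one distinct value.
constant_cols = [col for col in imputed.columns if imputed[col].nunique() <= 1]
result = imputed.drop(columns=constant_cols)

print("Dropped:", constant_cols)
print(result)
```

Replacing infinities before imputing matters in this sketch: a column mean computed while `np.inf` is still present would itself be infinite or NaN.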

---

## Visualization

Use the `plot_feature_distributions` method to visualize the distributions of numeric features:

```python
preprocessor.plot_feature_distributions(cleaned_data)
```
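
For readers who want a comparable plot without the package, a histogram grid can be drawn directly with matplotlib. This is a hedged sketch: the function name `plot_histograms` and its `bins` and `ncols` parameters are hypothetical, and the layout produced by `plot_feature_distributions` may differ.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd


def plot_histograms(df: pd.DataFrame, bins: int = 20, ncols: int = 2):
    """Draw one histogram per numeric column and return the figure."""
    numeric = df.select_dtypes(include="number")
    nrows = max(1, math.ceil(len(numeric.columns) / ncols))
    fig, axes = plt.subplots(
        nrows, ncols, figsize=(5 * ncols, 3 * nrows), squeeze=False
    )
    flat = axes.ravel()
    for ax, col in zip(flat, numeric.columns):
        ax.hist(numeric[col].dropna(), bins=bins)
        ax.set_title(col)
    # Hide any unused cells in the grid.
    for ax in flat[len(numeric.columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


example = pd.DataFrame({
    "a": [1, 2, 2, 3],
    "b": [0.5, 1.5, 1.5, 4.0],
    "c": [1, 1, 2, 5],
})
fig = plot_histograms(example)
# plt.show()  # uncomment in an interactive session
```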

---

## Logging and Execution Summary

After preprocessing, the class outputs a JSON log summarizing the steps taken, including execution time and dropped columns:

```json
{
    "process_type": "RobustPreprocessor",
    "user_selections": {
        "outlier_method": "IQR",
        "infinity_handling": "set_value",
        "missing_value_strategy": "mean",
        "feature_removal_criteria": "constant"
    },
    "steps_executed": {
        "select_numeric_columns": "Selected 4 numeric columns",
        "outlier_handling": "Handled outliers using IQR method.",
        "infinity_handling": "Replaced infinity values with a set value (0).",
        "missing_value_imputation": "Imputed missing values with mean strategy.",
        "feature_removal": "Dropped 1 constant columns."
    },
    "dropped_columns": 1,
    "execution_time_seconds": 0.1234
}
```
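
If the summary is captured as text, it can be inspected with the standard `json` module. The sketch below assumes the log is available as a string (`log_text` is a hypothetical variable, shortened from the example above); this README does not say how the package exposes the log programmatically.

```python
import json

# Shortened copy of the example log shown above, held in a string.
log_text = """
{
    "process_type": "RobustPreprocessor",
    "steps_executed": {
        "select_numeric_columns": "Selected 4 numeric columns",
        "feature_removal": "Dropped 1 constant columns."
    },
    "dropped_columns": 1,
    "execution_time_seconds": 0.1234
}
"""

summary = json.loads(log_text)

# List each step with its message, then the headline numbers.
for step, message in summary["steps_executed"].items():
    print(f"{step}: {message}")
print("Dropped columns:", summary["dropped_columns"])
print("Seconds:", summary["execution_time_seconds"])
```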

---

## Dependencies

- `numpy` (>= 1.20)
- `pandas` (>= 1.3)
- `scikit-learn` (>= 1.0)
- `matplotlib` (>= 3.2.0)
- `scipy` (>= 1.7)

When working from a clone of the repository, install them with:

```bash
pip install -r requirements.txt
```

---

## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Commit your changes with clear descriptions.
4. Submit a pull request.

---

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.

---

## Support

If you encounter issues or have questions, feel free to open an issue on [GitHub](https://github.com/nqmn/robustpreprocessor/issues).

---

## Acknowledgments

- Inspired by common challenges in data preprocessing.
- Thanks to the contributors and the open-source community!

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/nqmn/robustpreprocessor",
    "name": "robustpreprocessor",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "data preprocessing, outlier handling, missing data, data cleaning, feature engineering, robust preprocessing",
    "author": "Mohd Adil",
    "author_email": "mohdadil@live.com",
    "download_url": "https://files.pythonhosted.org/packages/41/e4/329ceb49b1176a13a1f2f631fa48ca92bfdd113d743a37108fba69a19897/robustpreprocessor-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "RobustPreprocessor is designed to preprocess datasets effectively to ensure robust data preparation before further analysis or modeling.",
    "version": "1.0.0",
    "project_urls": {
        "Homepage": "https://github.com/nqmn/robustpreprocessor"
    },
    "split_keywords": [
        "data preprocessing",
        " outlier handling",
        " missing data",
        " data cleaning",
        " feature engineering",
        " robust preprocessing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f257b6b01d00dd3a290e679b237ff1ff224822f2ad661b38bad6d869cae8192f",
                "md5": "1d18f0d712ccfb19069746b528fb36ab",
                "sha256": "a44fb3bb47994a1496e19db3031c259b2d81b24dbbc2589ca475f4d06052e041"
            },
            "downloads": -1,
            "filename": "robustpreprocessor-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1d18f0d712ccfb19069746b528fb36ab",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 7047,
            "upload_time": "2024-11-22T09:57:55",
            "upload_time_iso_8601": "2024-11-22T09:57:55.422007Z",
            "url": "https://files.pythonhosted.org/packages/f2/57/b6b01d00dd3a290e679b237ff1ff224822f2ad661b38bad6d869cae8192f/robustpreprocessor-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "41e4329ceb49b1176a13a1f2f631fa48ca92bfdd113d743a37108fba69a19897",
                "md5": "1af201853f70f5d2a6022a5483d77bef",
                "sha256": "13768a47a1528b89b669e4fc6f5cf02a8171a0b783a277eca0e6e2c012e7f2c2"
            },
            "downloads": -1,
            "filename": "robustpreprocessor-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "1af201853f70f5d2a6022a5483d77bef",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 6450,
            "upload_time": "2024-11-22T09:57:57",
            "upload_time_iso_8601": "2024-11-22T09:57:57.598828Z",
            "url": "https://files.pythonhosted.org/packages/41/e4/329ceb49b1176a13a1f2f631fa48ca92bfdd113d743a37108fba69a19897/robustpreprocessor-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-22 09:57:57",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "nqmn",
    "github_project": "robustpreprocessor",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "1.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.20"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.3"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    ">=",
                    "3.2.0"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    ">=",
                    "1.7"
                ]
            ]
        }
    ],
    "lcname": "robustpreprocessor"
}
        