AutoPrep

Name	AutoPrep
Version	3.0.0
home_page	https://github.com/JAdelhelm/AutoPrep
Summary	AutoPrep is an automated preprocessing pipeline with univariate anomaly marking
upload_time	2024-09-30 17:39:27
maintainer	None
docs_url	None
author	Jörg Adelhelm
requires_python	>=3.8
license	MIT
keywords	anomaly-detection preprocessing automated automated-preprocessing cleaning

# AutoPrep - Automated Preprocessing Pipeline with Univariate Anomaly Indicators
[![PyPIv](https://img.shields.io/pypi/v/AutoPrep)](https://pypi.org/project/AutoPrep/)
![PyPI status](https://img.shields.io/pypi/status/AutoPrep)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/AutoPrep) ![PyPI - License](https://img.shields.io/pypi/l/AutoPrep)
<!-- [![Downloads](https://static.pepy.tech/badge/AutoPrep)](https://pepy.tech/project/AutoPrep) -->



This pipeline focuses on data preprocessing, standardization, and cleaning, with additional features to identify univariate anomalies.

- I used scikit-learn's Pipeline and Transformer concepts to build this preprocessing pipeline:
    - Pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
    - Transformer: https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html
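
As a rough illustration of the Pipeline and Transformer concept, the sketch below chains a single custom transformer inside a scikit-learn `Pipeline`. The class `MedianCenterer` and the demo data are hypothetical and are not part of AutoPrep; they only show the `fit`/`transform` contract that such a pipeline relies on.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class MedianCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the per-column median learned in fit."""

    def fit(self, X, y=None):
        # Learn the statistics from the training data only.
        self.medians_ = np.median(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        # Apply the learned statistics to any new data.
        return np.asarray(X, dtype=float) - self.medians_


demo_pipeline = Pipeline(steps=[("center", MedianCenterer())])

X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[4.0, 40.0]])

demo_pipeline.fit(X_fit)
centered = demo_pipeline.transform(X_new)
print(centered)
```

The split between `fit` and `transform` is what allows statistics learned on training data to be reused unchanged on test data.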


```bash
pip install AutoPrep
```
#### Dependencies
- scikit-learn
- category_encoders
- bitstring



## Basic Usage
To use the pipeline, import the required libraries, initialize `AutoPrep`, fit it on training data, and transform new data. Here is a basic example:

````python
import numpy as np
import pandas as pd

from AutoPrep import AutoPrep

# Training data: 'Name' is constant and the last 'Is Manager' entry is an empty string.
X_train = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Alice', 'Alice', 'Alice'],
    'Rank': ['A', 'B', 'C', 'D'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000.00, 60000.50, 75000.75, 8_000],
    'Hire Date': pd.to_datetime(['2020-01-15', '2019-05-22', '2018-08-30', '2021-04-12']),
    'Is Manager': [False, True, False, ''],
})

# Test data: contains an unseen name, a missing age, and an extreme salary.
X_test = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Alice', 'Alice', 'Bob'],
    'Rank': ['A', 'B', 'C', 'D'],
    'Age': [25, 30, 35, np.nan],
    'Salary': [50000.00, 60000.50, 75000.75, 8_000_000],
    'Hire Date': pd.to_datetime(['2020-01-15', '2019-05-22', '2018-08-30', '2021-04-12']),
    'Is Manager': [False, True, False, ''],
})

# As the parameter name suggests, columns without variance are kept here.
pipeline = AutoPrep(remove_columns_no_variance=False)

# Fit on the training data, then transform the test data.
pipeline.fit(X=X_train)
X_output = pipeline.transform(X=X_test)

X_output
````

## Highlights ⭐


#### 📌 Implementation of univariate methods / *Detection of univariate anomalies*
Both methods (the modified z-score, MOD Z-Value, and Tukey's method) rely on robust statistics, namely the median and the quartiles, so the measure of location is not biased by the very outliers being searched for. The resulting indicators also help multivariate anomaly detection algorithms identify univariate anomalies.
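
The two methods can be sketched with NumPy as follows. This is a minimal illustration and not AutoPrep's own code: the function names are hypothetical, and the thresholds (3.5 for the modified z-score, 1.5 times the interquartile range for Tukey's fences) are the conventional defaults, assumed here rather than taken from the package.

```python
import numpy as np


def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    if mad == 0:
        # No spread to measure against; this sketch then flags nothing.
        return np.zeros_like(values)
    return 0.6745 * (values - median) / mad


def tukey_outliers(values, k=1.5):
    """Flag values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Salary column of X_test from the Basic Usage example.
salaries = [50000.00, 60000.50, 75000.75, 8_000_000]

mod_z_flags = np.abs(modified_z_scores(salaries)) > 3.5
tukey_flags = tukey_outliers(salaries)

print(mod_z_flags)
print(tukey_flags)
```

Both checks flag only the salary of 8,000,000, because the median and the quartiles barely move when a single extreme value is present.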

#### 📌 BinaryEncoder instead of OneHotEncoder for nominal columns / *Big Data and Performance*
   Recent research shows that binary encoding of nominal columns achieves results similar to one-hot encoding with significantly fewer dimensions.
   - (John T. Hancock and Taghi M. Khoshgoftaar. "Survey on categorical data for neural networks." In: Journal of Big Data 7.1 (2020), pp. 1–41.), Tables 2, 4
   - (Diogo Seca and João Mendes-Moreira. "Benchmark of Encoders of Nominal Features for Regression." In: World Conference on Information Systems and Technologies. 2021, pp. 146–155.), P. 151
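
To make the dimensionality argument concrete, here is a small pure-Python sketch of the idea behind binary encoding. It is illustrative only: the helper names are hypothetical, and the `BinaryEncoder` in `category_encoders` may order categories and handle unknown values differently.

```python
def binary_width(n_levels):
    """Number of bit columns needed to represent the ordinals 1..n_levels."""
    return max(1, n_levels.bit_length())


def binary_encode(categories):
    """Map each category to the bits of its 1-based ordinal index."""
    levels = sorted(set(categories))
    # Ordinals start at 1, so the all-zero code stays unused in this sketch.
    index = {level: i + 1 for i, level in enumerate(levels)}
    width = binary_width(len(levels))
    return [[int(bit) for bit in format(index[c], f"0{width}b")] for c in categories]


ranks = ['A', 'B', 'C', 'D']
encoded = binary_encode(ranks)
print(encoded)  # one row of bits per input value

# One-hot encoding needs one column per category, binary encoding far fewer.
for n in (4, 100, 10_000):
    print(n, "categories:", n, "one-hot columns vs", binary_width(n), "binary columns")
```

The number of columns grows logarithmically with the number of categories, which is where the savings on high-cardinality columns come from.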

#### 📌 Transformation of time series data and standardization of data with RobustScaler / *Normalization for better prediction results*
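
The scaling part can be sketched in NumPy as below. The sketch mirrors the default behaviour of scikit-learn's `RobustScaler` (centering on the median, scaling by the interquartile range) and does not cover the time series transformation; the function name `robust_scale` is hypothetical.

```python
import numpy as np


def robust_scale(train, new):
    """Scale `new` using the median and IQR learned from `train`."""
    train = np.asarray(train, dtype=float)
    new = np.asarray(new, dtype=float)
    median = np.median(train)
    q1, q3 = np.percentile(train, [25, 75])
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # avoid division by zero for constant columns
    return (new - median) / iqr


# Age column of X_train from the Basic Usage example.
ages_train = [25, 30, 35, 40]
scaled = robust_scale(ages_train, ages_train)
print(scaled)
```

Because the median and the quartiles are hardly affected by a few extreme values, the scaled range of typical values stays stable even when outliers are present.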

#### 📌 Labeling of NaN values in an extra column instead of removing them / *No loss of information*
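
A minimal pandas sketch of this idea follows. The indicator column name `Age_missing` and the median fill are assumptions made for illustration; AutoPrep's actual column naming and imputation strategy may differ.

```python
import numpy as np
import pandas as pd

# Age column of X_test from the Basic Usage example.
df = pd.DataFrame({'Age': [25, 30, 35, np.nan]})

# Record where values were missing before filling them in.
df['Age_missing'] = df['Age'].isna().astype(int)

# Fill the gap (here with the median) so that the row can be kept.
df['Age'] = df['Age'].fillna(df['Age'].median())

print(df)
```

The row with the missing age stays in the data, and the indicator column preserves the fact that the value was originally absent.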



---





## Pipeline - Built-in Logic
<!-- ![Logic of Pipeline](./images/decision_rules.png) -->
![Logic of Pipeline](https://raw.githubusercontent.com/JAdelhelm/AutoPrep/main/AutoPrep/img/decision_rules.png) 





<!-- ## Abstract View (Code Structure) -->
<!-- ![Abstract view of the project](./images/project.png) -->
<!-- ![Abstract view of the project](https://raw.githubusercontent.com/JAdelhelm/AutoPrep/main/images/project.png) -->




---


### Reference
- https://www.researchgate.net/publication/379640146_Detektion_von_Anomalien_in_der_Datenqualitatskontrolle_mittels_unuberwachter_Ansatze (German Thesis)

Raw data

{
    "_id": null,
    "home_page": "https://github.com/JAdelhelm/AutoPrepn",
    "name": "AutoPrep",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "anomaly-detection, preprocessing, automated, automated-preprocessing, cleaning",
    "author": "J\u00f6rg Adelhelm",
    "author_email": "J\u00f6rg Adelhelm <adeljoe@gmx.de>",
    "download_url": "https://files.pythonhosted.org/packages/eb/90/eefe25bc91eb1498556e1830fbd7856d02d8c537ea93db6e358253b9afa7/autoprep-3.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "AutoPrep is an automated preprocessing pipeline with univariate anomaly marking",
    "version": "3.0.0",
    "project_urls": {
        "Homepage": "https://github.com/JAdelhelm/AutoPrep",
        "Issues": "https://github.com/JAdelhelm/AutoPrep/issues"
    },
    "split_keywords": [
        "anomaly-detection",
        " preprocessing",
        " automated",
        " automated-preprocessing",
        " cleaning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a9b1a1ae13d80fa779f0e4fa2fb9a308e8b24e722b3c050e8cdf8a97dfec913c",
                "md5": "04799e74e382315acc3303d9e376be0f",
                "sha256": "97044683460ed3a452fd45f72d5d35f92454cd974ec5a4aaa9a08343f9d22bd3"
            },
            "downloads": -1,
            "filename": "AutoPrep-3.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "04799e74e382315acc3303d9e376be0f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 22370,
            "upload_time": "2024-09-30T17:39:25",
            "upload_time_iso_8601": "2024-09-30T17:39:25.315704Z",
            "url": "https://files.pythonhosted.org/packages/a9/b1/a1ae13d80fa779f0e4fa2fb9a308e8b24e722b3c050e8cdf8a97dfec913c/AutoPrep-3.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "eb90eefe25bc91eb1498556e1830fbd7856d02d8c537ea93db6e358253b9afa7",
                "md5": "3ecb3937c2d70ae479395e9fdb4feb2d",
                "sha256": "70aa4189501bc21a91e6f302f4af6e29f30e9f270d7663136de1aee024a42f20"
            },
            "downloads": -1,
            "filename": "autoprep-3.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "3ecb3937c2d70ae479395e9fdb4feb2d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 14953,
            "upload_time": "2024-09-30T17:39:27",
            "upload_time_iso_8601": "2024-09-30T17:39:27.367427Z",
            "url": "https://files.pythonhosted.org/packages/eb/90/eefe25bc91eb1498556e1830fbd7856d02d8c537ea93db6e358253b9afa7/autoprep-3.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-30 17:39:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "JAdelhelm",
    "github_project": "AutoPrepn",
    "github_not_found": true,
    "lcname": "autoprep"
}
