segmentae


| Field | Value |
|---|---|
| Name | segmentae |
| Version | 1.0.0 |
| Home page | https://github.com/TsLu1s/SegmentAE |
| Summary | SegmentAE: A Python Library for Anomaly Detection Optimization |
| Upload time | 2024-06-19 22:27:18 |
| Maintainer | None |
| Docs URL | None |
| Author | Luís Santos |
| Requires Python | None |
| License | MIT |
| Keywords | python, data science, machine learning, deep learning, neural networks, autoencoder, clustering, anomaly detection, novelty detection, fraud detection, data preprocessing |
| Requirements | pandas, numpy, atlantic, mlimputer, tensorflow, ucimlrepo, scipy |
| Travis CI | No |
| Coveralls test coverage | No |

# SegmentAE: A Python Library for Anomaly Detection Optimization

## Framework Overview

`SegmentAE` is designed to enhance anomaly detection performance through the optimization of reconstruction error by integrating and intersecting clustering methods with tabular autoencoders. It provides a comprehensive, scalable and robust solution for anomaly detection applications in relevant domains such as financial fraud detection or network security, ensuring extensive customization and optimization capabilities.

## Key Features and Capabilities

### 1. General Applicability on Tabular Datasets

`SegmentAE` is engineered to handle a wide range of tabular datasets, making it suitable for anomaly detection tasks across different use cases. It can be integrated into diverse applications, which gives it broad utility and adaptability.

### 2. Optimization and Customization

The framework offers complete configurability for each component of the anomaly detection pipeline: data preprocessing, clustering algorithms, and either customized baseline autoencoders or fully developed models that you integrate yourself. Each component can therefore be fine-tuned to achieve the best performance for a specific use case.

### 3. Enhanced Detection Performance

By combining clustering algorithms with autoencoder-based anomaly detection, `SegmentAE` aims to improve the accuracy and reliability of anomaly detection. Integrating tabular autoencoders with clustering lets the framework capture distinct patterns in the input data and assess the reconstruction error separately for each cluster, which is intended to improve predictive performance.

### Main Development Tools <a name = "pre1"></a>

Major frameworks used to build this project:

* [TensorFlow](https://www.tensorflow.org/)
* [Keras](https://keras.io/)
* [Scikit-Learn](https://scikit-learn.org/stable/)
* [Atlantic](https://pypi.org/project/atlantic/)
* [MLimputer](https://pypi.org/project/mlimputer/)
    
## Where to get it <a name = "ta"></a>
    
The binary installer for the latest released version is available from the Python Package Index [(PyPI)](https://pypi.org/project/segmentae/).

The source code is currently hosted on GitHub at: https://github.com/TsLu1s/SegmentAE

## Installation  

To install this package from the PyPI repository, run the following command:

```
pip install segmentae
```
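
After installing, it can be useful to confirm which version is present in the active environment. The snippet below is a small sketch that relies only on the standard library; the helper name `installed_version` is introduced here for illustration and is not part of `SegmentAE`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("segmentae")
    print(f"segmentae version: {found}" if found else "segmentae is not installed")
```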

## SegmentAE - Technical Components and Pipeline Structure

The `SegmentAE` framework consists of several integrated components, each playing a critical role in optimizing anomaly detection through clustering and tabular autoencoders. The pipeline is structured for seamless data flow and modular customization, so each component can be adapted to the specific needs of a use case.

### 1. Data Preprocessing

Proper preprocessing is crucial for the quality and consistency of the data fed into the subsequent stages of the pipeline. The data preprocessing module prepares raw data for predictive applications and includes:

- **Missing Value Imputation**: Several supervised, model-based imputation options for filling in missing data points.
- **Normalization**: Scaling features to ensure they have comparable magnitudes, essential for the performance of many machine learning algorithms.
- **Categorical Encoding**: Transforming categorical variables into numerical representations suitable for machine learning algorithms, using methods such as label encoding, InverseFrequency encoding and one-hot encoding.
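
The three steps above share one rule: parameters are learned from the training data and then reused unchanged on new data. The sketch below illustrates that rule with plain `pandas`. It is not the `Preprocessing` class itself; the column names, the helper functions, and the choice of median imputation, min-max scaling and label encoding are assumptions made for this example.

```python
import pandas as pd

# Toy data: one numeric column with a missing value and one categorical column.
train = pd.DataFrame({
    "amount": [10.0, None, 30.0, 50.0],
    "channel": ["web", "store", "web", "phone"],
})
new = pd.DataFrame({"amount": [20.0, None], "channel": ["store", "web"]})


def fit_preprocessing(df: pd.DataFrame) -> dict:
    """Learn imputation, scaling and encoding parameters from training data only."""
    median = df["amount"].median()
    filled = df["amount"].fillna(median)
    categories = sorted(df["channel"].unique())
    return {
        "median": median,
        "min": filled.min(),
        "max": filled.max(),
        "codes": {cat: i for i, cat in enumerate(categories)},
    }


def apply_preprocessing(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Apply previously learned parameters; nothing is re-estimated here."""
    out = pd.DataFrame(index=df.index)
    amount = df["amount"].fillna(params["median"])           # imputation
    out["amount"] = (amount - params["min"]) / (params["max"] - params["min"])  # min-max scaling
    out["channel"] = df["channel"].map(params["codes"])      # label encoding (unseen values become NaN)
    return out


params = fit_preprocessing(train)
new_prepared = apply_preprocessing(new, params)
```

Fitting on the training split only mirrors the `pr.fit(X = X_train)` followed by `pr.transform(...)` pattern used in the predictive application script further down.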

### 2. Clustering

Clustering forms the backbone of the `SegmentAE` framework, segmenting the data into distinct, meaningful groups. This segmentation exposes the underlying structure of the input data and is the basis for the per-cluster reconstruction error improvements in anomaly detection.

- **Clustering Algorithms**: Configurable support for several algorithms, including `KMeans`, `MiniBatchKMeans`, `GaussianMixture` and `Agglomerative` clustering, so the framework can adapt to different data structures and distribution patterns.
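
To make the idea of segmentation concrete, here is a minimal sketch that uses scikit-learn's `KMeans` directly rather than the `Clustering` wrapper. The synthetic data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three synthetic, well-separated groups in two dimensions (illustrative data only).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_train = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])

# Learn the segments from the training data.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# New rows are routed to the nearest learned cluster.
X_new = np.array([[0.1, -0.2], [4.8, 5.1]])
new_labels = kmeans.predict(X_new)

# Group the training rows by segment so each segment can be analysed on its own.
segments = {k: X_train[kmeans.labels_ == k] for k in range(3)}
```

Each row ends up in exactly one segment, and that segment later determines which reconstruction error statistics the row is compared against.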

### 3. Anomaly Detection - Baseline Autoencoders

The core of the `SegmentAE` framework is its anomaly detection optimization module, which uses tabular autoencoders to identify anomalies. Autoencoders are neural networks that learn compact representations of the input data; rows that the network reconstructs poorly, i.e. with a high reconstruction error, are treated as candidate anomalies. The framework ships two baseline autoencoders (`Dense` and `Batch Norm`), each customizable in its network architecture, training epochs, activation functions and other settings.
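
The reconstruction error of a row can be measured in several ways. The sketch below computes four common per-row metrics with NumPy, using the same names that appear as `threshold_metric` options in the script further down (`mse`, `mae`, `rmse`, `max_error`). The function is written for this README as an illustration; the exact formulas used inside `SegmentAE` are assumed, not confirmed, and `x_hat` stands in for the output of any trained autoencoder.

```python
import numpy as np


def reconstruction_errors(x: np.ndarray, x_hat: np.ndarray) -> dict:
    """Per-row reconstruction error between inputs and their reconstructions."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)
    mse = np.mean(diff ** 2, axis=1)
    return {
        "mse": mse,                                # mean squared error per row
        "mae": np.mean(np.abs(diff), axis=1),      # mean absolute error per row
        "rmse": np.sqrt(mse),                      # root mean squared error per row
        "max_error": np.max(np.abs(diff), axis=1), # largest single-feature error per row
    }


x = np.array([[0.2, 0.4, 0.6],
              [0.9, 0.1, 0.5]])
x_hat = np.array([[0.2, 0.4, 0.6],    # perfectly reconstructed row
                  [0.5, 0.5, 0.5]])   # poorly reconstructed row
errors = reconstruction_errors(x, x_hat)
```

The second row has a clearly larger error under every metric, which is the signal an autoencoder-based detector relies on.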

You can also build your own `Keras`-based autoencoder model and integrate it into the `SegmentAE` pipeline -> 
<a href="https://github.com/TsLu1s/SegmentAE/blob/main/examples/basic_model.py" style="text-decoration:none;">
    <img src="https://img.shields.io/badge/Custom%20Model-blue?style=for-the-badge&logo=readme&logoColor=white" alt="Custom Model">
</a>

An application example for fully unlabeled data is available here -> 
<a href="https://github.com/TsLu1s/SegmentAE/blob/main/examples/unlabeled_application.py" style="text-decoration:none;">
    <img src="https://img.shields.io/badge/Unlabeled%20Example-blue?style=for-the-badge&logo=readme&logoColor=white" alt="Unlabeled Example">
</a>

## SegmentAE - Predictive Application

To demonstrate the usage of `SegmentAE`, a `DenseAutoencoder` is trained and integrated with `KMeans` clustering (with 3 clusters). The following script covers the entire process: data loading, preprocessing, clustering, autoencoder training, integration with clustering for anomaly detection, performance evaluation, and prediction of future anomalies.

```py
import pandas as pd
from segmentae.data_sources.examples import load_dataset
from segmentae.anomaly_detection import (SegmentAE,
                                         Preprocessing,
                                         Clustering,
                                         DenseAutoencoder, 
                                         #BatchNormAutoencoder
                                         )
from sklearn.model_selection import train_test_split

## Data Loading

train, test, target = load_dataset(dataset_selection = 'german_credit_card', # Options | 'german_credit_card', 'network_intrusions', 'default_credit_card'
                                   split_ratio = 0.75)                       

test, future_data = train_test_split(test, train_size = 0.9, random_state = 5)

# Resetting Index is Required
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
future_data = future_data.reset_index(drop=True)

X_train, y_train = train.drop(columns=[target]).copy(), train[target].astype(int) 
X_test, y_test = test.drop(columns=[target]).copy(), test[target].astype(int)
X_future_data = future_data.drop(columns=[target]).copy()

## Preprocessing

pr = Preprocessing(encoder = None,          # Options | "IFrequencyEncoder", "LabelEncoder", "OneHotEncoder", None
                   scaler = "MinMaxScaler", # Options | "MinMaxScaler", "StandardScaler", "RobustScaler", None
                   imputer = None)          # Options | "Simple","RandomForest","ExtraTrees","GBR","KNN",
                                            #         | "XGBoost","Lightgbm","Catboost", None

pr.fit(X = X_train)
X_train = pr.transform(X = X_train)
X_test = pr.transform(X = X_test)
X_future_data = pr.transform(X = X_future_data)

## Clustering Implementation

cl_model = Clustering(cluster_model = ["KMeans"], # Options | KMeans, MiniBatchKMeans, GMM, Agglomerative
                      n_clusters = 3)
cl_model.clustering_fit(X = X_train)

## Autoencoder Implementation

denseAutoencoder = DenseAutoencoder(hidden_dims = [16, 12, 8, 4],
                                    encoder_activation = 'relu',  
                                    decoder_activation = 'relu',
                                    optimizer = 'adam',
                                    learning_rate = 0.001,
                                    epochs = 150,
                                    val_size = 0.15,
                                    stopping_patient = 20,
                                    dropout_rate = 0.1,
                                    batch_size = None)
denseAutoencoder.fit(input_data = X_train)
denseAutoencoder.summary()

## Autoencoder + Clustering Integration

sg = SegmentAE(ae_model = denseAutoencoder, 
                cl_model = cl_model)

## Train Reconstruction

sg.reconstruction(input_data = X_train,
                  threshold_metric = 'mse')  # Options | mse, mae, rmse, max_error

## Reconstruction Performance (Assuming y_test existence)

results = sg.evaluation(input_data = X_test,
                        target_col = y_test, 
                        threshold_ratio = 2.0) # Selected Threshold Reconstruction Error Multiplier

preds_test, recon_metrics_test = sg.preds_test, sg.reconstruction_test # Test Metadata by Cluster

## Anomaly Detection Predictions

predictions = sg.detections(input_data = X_future_data,
                            threshold_ratio = 2.0)

```
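
The `threshold_ratio` argument above is described as a multiplier on the reconstruction error threshold. One plausible reading, sketched below with NumPy, is that each cluster gets its own threshold equal to the ratio times that cluster's mean training error. This is an assumption about the mechanism, not a description of the library's internals, and the function names are hypothetical.

```python
import numpy as np


def fit_thresholds(train_errors: np.ndarray, train_clusters: np.ndarray,
                   threshold_ratio: float) -> dict:
    """One threshold per cluster: ratio times that cluster's mean training error."""
    return {
        int(k): threshold_ratio * float(train_errors[train_clusters == k].mean())
        for k in np.unique(train_clusters)
    }


def flag_anomalies(errors: np.ndarray, clusters: np.ndarray, thresholds: dict) -> np.ndarray:
    """Flag rows whose error exceeds the threshold of their own cluster."""
    limits = np.array([thresholds[int(k)] for k in clusters])
    return errors > limits


# Cluster 0 reconstructs well on average, cluster 1 is harder to reconstruct.
train_errors = np.array([0.01, 0.03, 0.02, 0.20, 0.40, 0.30])
train_clusters = np.array([0, 0, 0, 1, 1, 1])
thresholds = fit_thresholds(train_errors, train_clusters, threshold_ratio=2.0)

# Two new rows with the same error but assigned to different clusters.
new_errors = np.array([0.10, 0.10])
new_clusters = np.array([0, 1])
flags = flag_anomalies(new_errors, new_clusters, thresholds)
```

In this toy case an error of 0.10 is unusual for cluster 0 (threshold 0.04) but normal for cluster 1 (threshold 0.60). A single global threshold of 2 × 0.16 = 0.32 would flag neither row, which is the motivation for segmenting before thresholding.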

### Performance Evaluation

`SegmentAE` includes an evaluation methodology for assessing its anomaly detection performance. It supports product-based experiments that test every pairing of autoencoder and clustering model in order to find the best combination. Performance at different reconstruction error threshold ratios is also analysed, giving a comprehensive view of the effectiveness and improvement of each tested combination -> 
<a href="https://github.com/TsLu1s/SegmentAE/blob/main/examples/evaluate_combinations.py" style="text-decoration:none;">
    <img src="https://img.shields.io/badge/Combinations%20Evaluation-blue?style=for-the-badge&logo=readme&logoColor=white" alt="Combinations Evaluation">
</a>

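
As a rough illustration of a product-based evaluation, the sketch below enumerates every combination of autoencoder, clustering model and threshold ratio with `itertools.product`, and scores one toy combination at each ratio. The labels in the lists, the toy errors and the base threshold are all assumptions for this example; the linked `evaluate_combinations.py` script is the reference for how the library does it.

```python
from itertools import product

import numpy as np


def precision_recall_f1(y_true, y_pred) -> tuple:
    """Precision, recall and F1 for binary anomaly labels (1 = anomaly)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels for the components being compared.
autoencoders = ["dense", "batch_norm"]
clusterers = ["kmeans", "gmm"]
ratios = [1.0, 2.0, 3.0]
grid = list(product(autoencoders, clusterers, ratios))  # every combination to test

# Toy scoring of a single combination across the threshold ratios.
y_true = np.array([0, 0, 0, 1, 1])
errors = np.array([0.02, 0.05, 0.09, 0.15, 0.30])
base_threshold = 0.04  # e.g. a mean training reconstruction error
scores = {r: precision_recall_f1(y_true, errors > r * base_threshold) for r in ratios}
best_ratio = max(scores, key=lambda r: scores[r][2])  # ratio with the highest F1
```

In a real experiment the inner scoring step would be repeated for every entry in `grid`, and the results compared across combinations.
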
## License

Distributed under the MIT License. See [LICENSE](https://github.com/TsLu1s/SegmentAE/blob/main/LICENSE) for more information.

## Contact 
 
Luis Santos - [LinkedIn](https://www.linkedin.com/in/lu%C3%ADsfssantos/)

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/TsLu1s/SegmentAE",
    "name": "segmentae",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "pythondata science, machine learning, deep learning, neural networks, autoencoder, clustering, anomaly detection, novelty detectionfraud detection, data preprocessing",
    "author": "Lu\u00eds Santos",
    "author_email": "luisf_ssantos@hotmail.com",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "SegmentAE: A Python Library for Anomaly Detection Optimization",
    "version": "1.0.0",
    "project_urls": {
        "Homepage": "https://github.com/TsLu1s/SegmentAE"
    },
    "split_keywords": [
        "pythondata science",
        " machine learning",
        " deep learning",
        " neural networks",
        " autoencoder",
        " clustering",
        " anomaly detection",
        " novelty detectionfraud detection",
        " data preprocessing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "49dff62393e43cab7ffeefc5d387a561134df48f78b0a402b103adf546280267",
                "md5": "b228973852c9df12161d0d6899c1ff19",
                "sha256": "b971e2230cbad51768ecfa0e0d373695dbb9dbb27d863e6e83142c9bb4be60eb"
            },
            "downloads": -1,
            "filename": "segmentae-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b228973852c9df12161d0d6899c1ff19",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 27714,
            "upload_time": "2024-06-19T22:27:18",
            "upload_time_iso_8601": "2024-06-19T22:27:18.303933Z",
            "url": "https://files.pythonhosted.org/packages/49/df/f62393e43cab7ffeefc5d387a561134df48f78b0a402b103adf546280267/segmentae-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-19 22:27:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "TsLu1s",
    "github_project": "SegmentAE",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.2.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.19.5"
                ]
            ]
        },
        {
            "name": "atlantic",
            "specs": [
                [
                    ">=",
                    "1.1.67"
                ]
            ]
        },
        {
            "name": "mlimputer",
            "specs": [
                [
                    ">=",
                    "1.0.67"
                ]
            ]
        },
        {
            "name": "tensorflow",
            "specs": [
                [
                    ">=",
                    "2.10.0"
                ]
            ]
        },
        {
            "name": "ucimlrepo",
            "specs": [
                [
                    ">=",
                    "0.0.7"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    ">=",
                    "1.11.4"
                ]
            ]
        }
    ],
    "lcname": "segmentae"
}
        