# ASPIRE (Adaptive Scaler and PCA with Intelligent REduction)
*Previously known as AdaptivePCA*
ASPIRE is an advanced implementation of Principal Component Analysis (PCA) that provides intelligent feature scaling, comprehensive preprocessing, and built-in validation capabilities. It automatically adapts to your data's characteristics to deliver optimal dimensionality reduction results.
## Core Functionality
AdaptivePCA employs a comprehensive preprocessing and analysis approach:
1. **Intelligent Preprocessing**
- Comprehensive data cleaning and preprocessing
- Handles outliers using IQR method
- Manages infinity values and missing data
- Feature-wise normality testing using Shapiro-Wilk test
- Automatic selection between StandardScaler and MinMaxScaler
- Class imbalance detection and handling via SMOTE
2. **Dynamic Dimensionality Reduction**
- Determines optimal number of PCA components based on variance threshold
- Considers eigenvalue thresholds for component selection
- Adapts to dataset characteristics
- Built-in validation framework
The algorithm's key innovation lies in its adaptive nature, particularly in:
- Automatic selection between StandardScaler and MinMaxScaler based on feature distributions
- Dynamic component selection based on cumulative variance threshold
- Integrated preprocessing pipeline with outlier handling and missing value imputation
- Automatic class imbalance detection and correction
- Comprehensive validation framework with efficiency metrics
This implementation provides an end-to-end solution for dimensionality reduction while handling common data challenges automatically.
## Overall Design Pattern
```text
Data → Preprocessing → Scaler Selection → PCA Optimization → Validation → Prediction
```
## Dependencies
- numpy>=1.19.0
- pandas>=1.2.0
- scikit-learn>=0.24.0
- lightgbm>=3.0.0
- imbalanced-learn>=0.8.0
- scipy>=1.6.0
## Installation
Install dependencies:
```bash
pip install scikit-learn numpy pandas lightgbm scipy imbalanced-learn
```
Install from the PyPI repository:
```bash
pip install adaptivepca
```
Clone this repository and install the package using `pip`:
```bash
git clone https://github.com/nqmn/adaptivepca.git
cd adaptivepca
pip install .
```
## Usage
### Basic Usage
```python
import pandas as pd
from adaptivepca import AdaptivePCA

# Load your data
data = pd.read_csv("your_dataset.csv")
X = data.drop(columns=['Label'])  # Features
y = data['Label']  # Target variable

# Initialize AdaptivePCA with default parameters
adaptive_pca = AdaptivePCA()
# Preprocess, fit, then validate, predict, and export
X_preprocessed, y_preprocessed, smote_applied = adaptive_pca.preprocess_data(X, y)
adaptive_pca.fit(X_preprocessed, y_preprocessed, smote_applied)
adaptive_pca.validate_with_classifier(X_preprocessed, y_preprocessed)
adaptive_pca.predict_with_classifier(X_preprocessed, y_preprocessed)
adaptive_pca.export_model('your_model_name.joblib')
```
### Advanced Usage
```python
import pandas as pd
from adaptivepca import AdaptivePCA
from sklearn.tree import DecisionTreeClassifier
# Load your data
data = pd.read_csv("your_dataset.csv")
X = data.drop(columns=['Label']) # Features
y = data['Label'] # Target variable
# Initialize AdaptivePCA
adaptive_pca = AdaptivePCA(
variance_threshold=0.95,
max_components=50,
min_eigenvalue_threshold=1e-4,
normality_ratio=0.05,
verbose=1
)
# Run Preprocessing
X_preprocessed, y_preprocessed, smote_applied = adaptive_pca.preprocess_data(X, y)
# Fit AdaptivePCA
adaptive_pca.fit(X_preprocessed, y_preprocessed, smote_applied)
# Optional - Validate with a classifier with full and reduced dataset performance
adaptive_pca.validate_with_classifier(X, y, classifier=DecisionTreeClassifier(), test_size=0.2, cv=5)
# Optional - Run prediction with classifier, show output of confusion matrix, classification report,
# inference time, fpr, far, specificity, auc-roc, mcc
adaptive_pca.predict_with_classifier(X, y)
# Optional - Export the model in joblib format
adaptive_pca.export_model("your_model_name.joblib")
```
# Key Components
## Initialization Parameters
- `variance_threshold`: Minimum cumulative explained variance (default: 0.95)
- `max_components`: Maximum PCA components to consider (default: 50)
- `min_eigenvalue_threshold`: Minimum eigenvalue cutoff (default: 1e-4)
- `normality_ratio`: P-value threshold for Shapiro-Wilk test (default: 0.05)
- `verbose`: Logging detail level (default: 0)
## Preprocessing Pipeline
### Data Cleaning
- Selection of numeric columns only
- Handles outliers using the IQR method (clips values that fall more than 1.5\*IQR outside the quartiles)
- Replaces infinity values with finite extremes
- Imputes missing values using `mean` strategy
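
The cleaning steps above can be sketched with pandas and NumPy. This is a minimal illustration, not the package's internal code: the function name `clean_numeric` is hypothetical, the fences are assumed to be the conventional `Q1 - 1.5*IQR` and `Q3 + 1.5*IQR`, and the order of the steps is an assumption.

```python
import numpy as np
import pandas as pd


def clean_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning steps listed above."""
    # Keep numeric columns only.
    out = df.select_dtypes(include=[np.number]).astype(float)

    for col in out.columns:
        finite = out[col][np.isfinite(out[col])]
        if finite.empty:
            continue
        # Replace +/-inf with the largest/smallest finite value in the column.
        out[col] = out[col].replace({np.inf: finite.max(), -np.inf: finite.min()})

    # Clip each column to the assumed fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = out.quantile(0.25), out.quantile(0.75)
    iqr = q3 - q1
    out = out.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)

    # Impute remaining missing values with the column mean.
    return out.fillna(out.mean())
```

Replacing infinities before computing the quartiles keeps the fences finite; a different ordering would give different clipped values.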
### Feature Scaling Selection
- Performs the Shapiro-Wilk test on each feature
- Counts features better suited for StandardScaler and MinMaxScaler
- Applies majority voting to select final scaler
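
A rough sketch of the voting idea, assuming that a feature which passes the Shapiro-Wilk test (p-value above the threshold) votes for `StandardScaler` and any other feature votes for `MinMaxScaler`. The function name `choose_scaler`, the tie-breaking rule, and the exact vote mapping are assumptions for illustration; the package may decide differently.

```python
import numpy as np
from scipy.stats import shapiro
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def choose_scaler(X: np.ndarray, alpha: float = 0.05):
    """Hypothetical majority vote between StandardScaler and MinMaxScaler."""
    standard_votes = 0
    minmax_votes = 0
    for j in range(X.shape[1]):
        _, p_value = shapiro(X[:, j])
        # Assumption: failing to reject normality (p > alpha) favours StandardScaler.
        if p_value > alpha:
            standard_votes += 1
        else:
            minmax_votes += 1
    # Assumption: ties go to StandardScaler.
    return StandardScaler() if standard_votes >= minmax_votes else MinMaxScaler()
```

Here `alpha` plays the role that the `normality_ratio` parameter is described as having (a p-value threshold).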
### Class Balance Handling
- Performs a chi-squared test for class imbalance
- Applies SMOTE if significant imbalance detected `(p<0.05)`
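
One plausible way to run such a check is a chi-squared goodness-of-fit test of the observed class counts against an equal split. The sketch below assumes that formulation; the function name `is_imbalanced` is hypothetical and the package's actual test setup is not documented here.

```python
import numpy as np
from scipy.stats import chisquare


def is_imbalanced(y, alpha: float = 0.05) -> bool:
    """Hypothetical check: are class counts far from a uniform split?"""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    # chisquare defaults to equal expected frequencies across classes.
    _, p_value = chisquare(counts)
    return bool(p_value < alpha)
```

When the check returns `True`, oversampling could follow with `SMOTE().fit_resample(X, y)` from imbalanced-learn. Note that with large sample sizes even a mild imbalance produces a small p-value under this test.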
## PCA Optimization Algorithm
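- Finds the smallest number of components whose cumulative explained variance reaches `variance_threshold`, considering at most `max_components` components and only those with eigenvalues of at least `min_eigenvalue_threshold`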
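
The selection rule can be illustrated with scikit-learn's `PCA`. This is a sketch under assumptions, not the package's implementation: the function name `select_n_components` is hypothetical, and the way the three parameters are combined (take the smallest count that satisfies all limits) is one reasonable reading of the description.

```python
import numpy as np
from sklearn.decomposition import PCA


def select_n_components(X_scaled, variance_threshold=0.95,
                        max_components=50, min_eigenvalue_threshold=1e-4):
    """Hypothetical selection rule combining the three parameters."""
    # PCA cannot return more components than samples or features.
    limit = min(max_components, *X_scaled.shape)
    pca = PCA(n_components=limit).fit(X_scaled)

    # explained_variance_ holds the eigenvalues of the covariance matrix.
    n_valid = int(np.sum(pca.explained_variance_ >= min_eigenvalue_threshold))
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Number of components needed for the cumulative ratio to reach the threshold.
    n_for_variance = int(np.searchsorted(cumulative, variance_threshold) + 1)
    return max(1, min(n_for_variance, n_valid, limit))
```

Capping `limit` at the data's shape also covers the case where `max_components` exceeds the available number of features or samples.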
## Validation Framework
- Classification validation on full and reduced dataset
- Performance metrics: Accuracy comparison, time efficiency, ROC-AUC score, detailed classification report
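
To make the full-versus-reduced comparison concrete, here is a small standalone sketch using scikit-learn directly. It does not call `validate_with_classifier`; the synthetic data, the helper `fit_and_score`, and the choice of `DecisionTreeClassifier` are assumptions for illustration only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(X_train, X_test, y_train, y_test):
    """Train a classifier and return (accuracy, elapsed seconds)."""
    start = time.perf_counter()
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    return accuracy, time.perf_counter() - start


# Synthetic stand-in data; any labelled numeric dataset would do.
X, y = make_classification(n_samples=1000, n_features=40,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scaler and PCA on the training split only to avoid leakage.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.95).fit(scaler.transform(X_train))
X_train_red = pca.transform(scaler.transform(X_train))
X_test_red = pca.transform(scaler.transform(X_test))

full_acc, full_time = fit_and_score(X_train, X_test, y_train, y_test)
red_acc, red_time = fit_and_score(X_train_red, X_test_red, y_train, y_test)
print(f"full:    acc={full_acc:.4f} time={full_time:.4f}s")
print(f"reduced: acc={red_acc:.4f} time={red_time:.4f}s")
```

Timings from a single run are noisy, so repeated runs or cross-validation give a fairer comparison.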
## Key Mathematical Components
### Feature normality testing
- Shapiro-Wilk test for normality
### Class imbalance detection
- Chi-squared test for class balance
## Methods
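- `preprocess_data(X, y)`: Runs the preprocessing pipeline and returns `X_preprocessed`, `y_preprocessed`, and `smote_applied`.
- `fit(X, y, smote_applied)`: Fits the AdaptivePCA model to the preprocessed data.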
- `validate_with_classifier(X, y, classifier=None, cv=5, test_size=0.2)`: Tests model performance.
- `predict_with_classifier(X, y)`: Makes predictions using trained classifier.
- `export_model(model_name, classifier)`: Saves model to file.
## Use Cases
ASPIRE is particularly valuable for:
- Machine learning pipelines requiring automated preprocessing
- High-dimensional data analysis
- Feature engineering optimization
- Model performance enhancement
- Exploratory data analysis
## Technical Foundation
The system integrates:
- Statistical testing for data distribution analysis
- Adaptive scaling techniques
- Principal Component Analysis
- Machine learning validation frameworks
- Performance optimization methods
## Performance Comparison: AdaptivePCA vs. Traditional PCA Optimization (GridSearch)
### Speed
AdaptivePCA selects its configuration from data-driven rules, which is less computationally intensive than the exhaustive search performed by grid search. In our tests, AdaptivePCA achieved up to a 90% reduction in processing time compared to the traditional PCA method. This is especially useful for high-dimensional data, where a sequential grid search can take significantly longer.
### Explained Variance
Both AdaptivePCA and traditional PCA achieve similar levels of explained variance, with AdaptivePCA dynamically selecting the number of components based on a defined variance threshold. Traditional PCA, on the other hand, requires manual parameter tuning, which can be time-consuming.
## Performance on Different Datasets (Full & Reduced Dataset)
Most datasets maintain high accuracy. The reduced datasets match the full-dataset accuracy on four of the seven datasets, improve on it for one (Wednesday-workingHours.pcap_ISCX.csv), and lose roughly 6 to 7 percentage points on two (mendeley_ddos_sdn_binary_19.ds and dataset_sdn.csv). The reduced datasets also decrease processing time in every case, with time reductions ranging from 1.85% to 58.03%. This indicates that reduced datasets can offer substantial efficiency benefits, especially for larger datasets.
| Dataset | Accuracy | Time (s) | Time gain (%) |
|---------|----------|----------|---------------|
|insdn_ddos_binary_01.ds (full)| 1.000000 | 1.5492 | - |
|insdn_ddos_binary_01.ds (reduced)| 1.000000 | 0.6502 | 58.03 |
|hldddosdn_hlddos_combine_binary.ds (full)| 1.000000 | 30.3948 | - |
|hldddosdn_hlddos_combine_binary.ds (reduced)| 1.000000 | 14.4875 | 52.34 |
|cicddos2019_tcpudp_combine_d1_binary_rus.ds (full) | 1.000000 | 1.6453 | - |
|cicddos2019_tcpudp_combine_d1_binary_rus.ds (reduced) | 1.000000 | 0.7371 | 55.20 |
|mendeley_ddos_sdn_binary_19.ds (full) | 1.000000 | 0.9839 | - |
|mendeley_ddos_sdn_binary_19.ds (reduced) | 0.942738 | 0.9355 | 4.93 |
|Wednesday-workingHours.pcap_ISCX.csv (full) | 0.921126 | 39.7610 | - |
|Wednesday-workingHours.pcap_ISCX.csv (reduced) | 0.970010 | 28.8390 | 27.47 |
|LR-HR DDoS 2024 Dataset for SDN-Based Networks.csv (full) | 0.999982 | 0.7314 | - |
|LR-HR DDoS 2024 Dataset for SDN-Based Networks.csv (reduced) | 0.999982 | 0.5131 | 29.84 |
|dataset_sdn.csv (full) | 1.000000 | 1.0547 | - |
|dataset_sdn.csv (reduced) | 0.932359 | 1.0352 | 1.85 |
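
The gain percentages in the table are consistent with the relative time saved, `(t_full - t_reduced) / t_full * 100`. The source does not state the formula, so treat this as an inference that reproduces the listed values to within rounding; the function name is hypothetical.

```python
def time_gain_percent(full_seconds: float, reduced_seconds: float) -> float:
    """Relative time saved by the reduced dataset, in percent."""
    return (full_seconds - reduced_seconds) / full_seconds * 100


# Timings copied from the insdn_ddos_binary_01.ds rows of the table above.
print(round(time_gain_percent(1.5492, 0.6502), 2))  # 58.03
```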
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request to discuss your changes.
## Acknowledgments
This project makes use of the `scikit-learn`, `numpy`, and `pandas` libraries for data processing and machine learning.
## Version Update Log
- `1.0.3` - Added flexibility in scaling; fixed error handling when `max_components` exceeds the available number of features or samples.
- `1.0.6` - Added the `verbose` parameter as an argument to `__init__`, with a default value of 0.
- `1.1.0` - Added validation and prediction with a classifier; cleaned up the code.
- `1.1.3` - Revamped the code. Refer to description above.
Raw data
{
"_id": null,
"home_page": "https://github.com/nqmn/adaptivepca",
"name": "adaptivepca",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "machine learning, dimensionality reduction, pca, feature selection, data preprocessing, adaptive scaling, classification, data analysis, statistics",
"author": "Mohd Adil",
"author_email": "mohdadil@live.com",
"download_url": "https://files.pythonhosted.org/packages/9e/7e/5be9f87760316af236adfb445ab1eb33c4c8a0a72641dc3cc38a31691f9c/adaptivepca-1.1.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "An advanced PCA implementation with adaptive feature scaling and preprocessing",
"version": "1.1.3",
"project_urls": {
"Homepage": "https://github.com/nqmn/adaptivepca"
},
"split_keywords": [
"machine learning",
" dimensionality reduction",
" pca",
" feature selection",
" data preprocessing",
" adaptive scaling",
" classification",
" data analysis",
" statistics"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9491aa8190c2476eb883a5e754d5486a6b90b9ba79bc123709887a762a6b05e8",
"md5": "1c5b0a7e023db0d3b895f97aeed8400d",
"sha256": "0eaa8c26db49f7031843479aea18bb6d8ef38078c1ec0f19ed0394484e2f4160"
},
"downloads": -1,
"filename": "adaptivepca-1.1.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1c5b0a7e023db0d3b895f97aeed8400d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 11843,
"upload_time": "2024-10-31T03:51:42",
"upload_time_iso_8601": "2024-10-31T03:51:42.606070Z",
"url": "https://files.pythonhosted.org/packages/94/91/aa8190c2476eb883a5e754d5486a6b90b9ba79bc123709887a762a6b05e8/adaptivepca-1.1.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "9e7e5be9f87760316af236adfb445ab1eb33c4c8a0a72641dc3cc38a31691f9c",
"md5": "376f8cdeaca362e3386593ae9ec4a697",
"sha256": "ae9a9cf9a0d172461a57fdce4937db721b8ffc704c37f9aaf311e08f29d31b2d"
},
"downloads": -1,
"filename": "adaptivepca-1.1.3.tar.gz",
"has_sig": false,
"md5_digest": "376f8cdeaca362e3386593ae9ec4a697",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 14661,
"upload_time": "2024-10-31T03:51:45",
"upload_time_iso_8601": "2024-10-31T03:51:45.168085Z",
"url": "https://files.pythonhosted.org/packages/9e/7e/5be9f87760316af236adfb445ab1eb33c4c8a0a72641dc3cc38a31691f9c/adaptivepca-1.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-31 03:51:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "nqmn",
"github_project": "adaptivepca",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "scikit-learn",
"specs": [
[
">=",
"0.24"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.19"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"1.1"
]
]
}
],
"lcname": "adaptivepca"
}