<div align="center">
<img src="auxiliary/docs/logo.png" alt="BlitzML" width="400"/>

### **Automate machine learning pipelines rapidly**

<div align="left">

- [Install BlitzML](#install-blitzml)
- [Classification](#classification)
- [Regression](#regression)
- [Time-Series](#time-series)
- [Clustering](#clustering)

# Install BlitzML  

    pip install blitzml

# Classification

    from blitzml.tabular import Classification
    import pandas as pd

    # prepare your dataframes
    train_df = pd.read_csv("auxiliary/datasets/banknote/train.csv")
    test_df = pd.read_csv("auxiliary/datasets/banknote/test.csv")

    # create the pipeline
    auto = Classification(train_df, test_df, algorithm = 'RF', n_estimators = 50)

    # perform the entire process, step by step
    auto.preprocess()
    auto.train_the_model()
    auto.gen_pred_df(auto.test_df)
    auto.gen_metrics_dict()

    # the results are then available as attributes
    pred_df = auto.pred_df
    metrics_dict = auto.metrics_dict


## Available Classifiers

- Random Forest 'RF' 
- LinearDiscriminantAnalysis 'LDA' 
- Support Vector Classifier 'SVC' 
- KNeighborsClassifier 'KNN' 
- GaussianNB 'GNB' 
- LogisticRegression 'LR'
- AdaBoostClassifier 'AB'
- GradientBoostingClassifier 'GB'
- DecisionTreeClassifier 'DT'
- MLPClassifier 'MLP'

## **Parameters**
- Algorithm to use. Options: {'RF','LDA','SVC','KNN','GNB','LR','AB','GB','DT','MLP', 'auto', 'custom'}, default = 'RF'
    - `auto`: selects the best scoring classifier based on f1-score
    - `custom`: enables providing a custom classifier through the *file_path* and *class_name* parameters
- *file_path*: when using the 'custom' classifier, pass the path of the file containing the custom class, default = 'none'
- *class_name*: when using the 'custom' classifier, pass the class name through this parameter, default = 'none'
- Feature selection. Options: {'correlation', 'importance', 'none'}, default = 'none'
    - `correlation`: use the feature columns with the highest correlation with the target
    - `importance`: use the feature columns that are important for the model to predict the target
    - `none`: use all feature columns
- Validation split: the fraction of the data held out for validation (a value from 0 to 1), default = 0.1
- Average type: when performing multiclass classification, the averaging method for the resulting metrics, default = 'macro'
- *cross_validation_k_folds*: number of k-folds for cross validation; if 1, no cross validation is performed, default = 1
- Optional parameters for the chosen classifier. You can find the available parameters in the [sklearn docs](
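
To make the 'macro' average type above concrete, here is a minimal sketch of how a macro-averaged F1 score is conventionally defined: F1 is computed per class and the per-class scores are averaged without weighting. The function names are illustrative only, and whether BlitzML computes its metrics exactly this way is an assumption.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1, then the unweighted mean over classes."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred), 4))  # 0.6556
print(round(accuracy(y_true, y_pred), 4))  # 0.6667
```

With macro averaging every class counts equally, so a rare class that is predicted poorly pulls the score down as much as a frequent one.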
## **Attributes**  
- `train_df`: the preprocessed train dataset (after running `Classification.preprocess()`)
- `test_df`: the preprocessed test dataset (after running `Classification.preprocess()`)
- the trained model (after running `Classification.train_the_model()`)
- `pred_df`: the prediction dataframe (test_df + predicted target) (after running `Classification.gen_pred_df(Classification.test_df)`)
- `metrics_dict`: the validation metrics (after running `Classification.gen_metrics_dict()`), with the entries:
    - `"accuracy": acc`
    - `"f1": f1`
    - `"precision": pre`
    - `"recall": recall`
    - `"hamming_loss": h_loss`
    - `"cross_validation_score": cv_score` (`None` if `cross_validation_k_folds == 1`)
## **Methods**  
- a shortcut that runs the entire process:
    - preprocessing
    - model training
    - prediction
    - model evaluation
- the accuracy scores obtained when varying the sampling size of the train_df (after running `Classification.train_the_model()`)
- plots a line chart that visualizes the accuracy history
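
As a rough illustration of what an accuracy history over increasing training sizes looks like, the sketch below trains a toy model on growing slices of the training data and records the validation accuracy each time. The nearest-centroid classifier, the `fractions` argument and all function names are hypothetical stand-ins; how BlitzML samples the train_df internally is not specified here.

```python
def fit_centroids(X, y):
    """Toy 1-D nearest-centroid model: map each label to its mean feature value."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_centroids(centroids, X):
    """Assign each value to the label whose centroid is closest."""
    return [min(centroids, key=lambda c: abs(x - centroids[c])) for x in X]


def accuracy_history(X_train, y_train, X_val, y_val, fractions=(0.25, 0.5, 1.0)):
    """Return (train_size, validation_accuracy) pairs for growing train slices."""
    history = []
    for frac in fractions:
        n = max(1, int(len(X_train) * frac))
        centroids = fit_centroids(X_train[:n], y_train[:n])
        preds = predict_centroids(centroids, X_val)
        acc = sum(p == t for p, t in zip(preds, y_val)) / len(y_val)
        history.append((n, acc))
    return history


X_train = [1.0, 9.0, 2.0, 8.0, 1.5, 8.5, 0.5, 9.5]
y_train = [0, 1, 0, 1, 0, 1, 0, 1]
X_val = [0.0, 3.0, 7.0, 10.0]
y_val = [0, 0, 1, 1]

history = accuracy_history(X_train, y_train, X_val, y_val)
for size, acc in history:
    print(size, acc)
```

Plotting the second element of each pair against the first gives the kind of line chart described above.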

# Regression  

    from blitzml.tabular import Regression
    import pandas as pd

    # prepare your dataframes
    train_df = pd.read_csv("auxiliary/datasets/house prices/train.csv")
    test_df = pd.read_csv("auxiliary/datasets/house prices/test.csv")

    # create the pipeline
    auto = Regression(train_df, test_df, algorithm = 'RF')

    # perform the entire process, step by step
    auto.preprocess()
    auto.train_the_model()
    auto.gen_pred_df(auto.test_df)
    auto.gen_metrics_dict()

    # the results are then available as attributes
    pred_df = auto.pred_df
    metrics_dict = auto.metrics_dict


## Available Regressors

- Random Forest 'RF'
- Support Vector Regressor 'SVR'
- KNeighborsRegressor 'KNN'
- Lasso Regressor 'LSS'
- LinearRegression 'LR'
- Ridge Regressor 'RDG'
- GaussianProcessRegressor 'GPR'
- GradientBoostingRegressor 'GB'
- DecisionTreeRegressor 'DT'
- MLPRegressor 'MLP'

## **Parameters**
- Algorithm to use. Options: {'RF','SVR','KNN','LSS','LR','RDG','GPR','GB','DT','MLP', 'auto', 'custom'}, default = 'RF'
    - `auto`: selects the best scoring regressor based on r2 score
    - `custom`: enables providing a custom regressor through the *file_path* and *class_name* parameters
- *file_path*: when using the 'custom' regressor, pass the path of the file containing the custom class, default = 'none'
- *class_name*: when using the 'custom' regressor, pass the class name through this parameter, default = 'none'
- Feature selection. Options: {'correlation', 'importance', 'none'}, default = 'none'
    - `correlation`: use the feature columns with the highest correlation with the target
    - `importance`: use the feature columns that are important for the model to predict the target
    - `none`: use all feature columns
- Validation split: the fraction of the data held out for validation (a value from 0 to 1), default = 0.1
- *cross_validation_k_folds*: number of k-folds for cross validation; if 1, no cross validation is performed, default = 1
- Optional parameters for the chosen regressor. You can find the available parameters in the [sklearn docs](
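
The 'correlation' feature selection option can be pictured with the following sketch, which ranks columns by the absolute Pearson correlation with the target and keeps the strongest ones. The `top_k` cutoff, the function names and the toy data are hypothetical; the README does not state how many columns BlitzML keeps or which threshold it applies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant column has no defined correlation; treat it as 0.
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def select_by_correlation(features, target, top_k):
    """Return the names of the top_k columns most correlated with the target."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:top_k]


features = {
    "area": [50, 80, 120, 200],
    "rooms": [2, 3, 3, 5],
    "noise": [7, 1, 9, 3],
}
price = [100, 160, 240, 400]

selected = select_by_correlation(features, price, top_k=2)
print(selected)  # ['area', 'rooms']
```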
## **Attributes**  
- `train_df`: the preprocessed train dataset (after running `Regression.preprocess()`)
- `test_df`: the preprocessed test dataset (after running `Regression.preprocess()`)
- the trained model (after running `Regression.train_the_model()`)
- `pred_df`: the prediction dataframe (test_df + predicted target) (after running `Regression.gen_pred_df(Regression.test_df)`)
- `metrics_dict`: the validation metrics (after running `Regression.gen_metrics_dict()`), with the entries:
    - `"r2_score": r2`
    - `"mean_squared_error": mse`
    - `"root_mean_squared_error": rmse`
    - `"mean_absolute_error": mae`
    - `"cross_validation_score": cv_score` (`None` if `cross_validation_k_folds == 1`)
## **Methods**  

- a shortcut that runs the entire process:
    - preprocessing
    - model training
    - prediction
    - model evaluation
- plots a line chart that visualizes the RMSE history
- the RMSE scores obtained when varying the sampling size of the train_df (after running `Regression.train_the_model()`)
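
For reference, the regression metrics named in this section (R2, MSE, RMSE and MAE) follow the usual textbook definitions, sketched below in plain Python. The function name is illustrative, and the assumption is that BlitzML reports the standard definitions; the sketch also assumes the true values are not all identical, since R2 is undefined in that case.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R2, MSE, RMSE and MAE from true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2_score": 1 - ss_res / ss_tot,
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "mean_absolute_error": mae,
    }


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
for name, value in metrics.items():
    print(name, round(value, 4))
```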
# Time-series
Time series forecasting is a special case of regression, with some additional functions:
- stationarity test (`IsStationary()`).
- conversion to a stationary series.
- reversing that conversion on the predicted values.

The dataset must have a DateTime column, even if the dtype of this column is object.

    from blitzml.tabular import TimeSeries
    import pandas as pd

    # prepare your dataframes
    train_df = pd.read_csv("train_dataset.csv")
    test_df = pd.read_csv("test_dataset.csv")

    # create the pipeline
    auto = TimeSeries(train_df, test_df, algorithm = 'RF')

    # Perform the entire process:

    # We can get their values using:
    pred_df = auto.pred_df
    metrics_dict = auto.metrics_dict
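
One common way to convert a series to a stationary one and to reverse the conversion afterwards is first-order differencing followed by cumulative summation. The sketch below shows that technique on a plain list. It is an assumption that BlitzML uses differencing for this step, and the function names are illustrative.

```python
def difference(series):
    """First-order differencing: each value minus the previous one."""
    return [current - previous for previous, current in zip(series, series[1:])]


def undifference(first_value, diffs):
    """Invert first-order differencing by cumulative summation."""
    restored = [first_value]
    for d in diffs:
        restored.append(restored[-1] + d)
    return restored


sales = [10, 12, 15, 19, 24]
diffs = difference(sales)
restored = undifference(sales[0], diffs)

print(diffs)     # [2, 3, 4, 5]
print(restored)  # [10, 12, 15, 19, 24]
```

A model trained on the differenced values predicts changes, so the same cumulative sum, started from the last known value, turns its predictions back into the original scale.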

# Clustering 

    from blitzml.unsupervised import Clustering
    import pandas as pd

    # prepare your dataframe
    train_df = pd.read_csv("auxiliary/datasets/customer personality/train.csv")

    # create the pipeline
    auto = Clustering(train_df, clustering_algorithm = 'KM')

    # first perform data preprocessing
    auto.preprocess()
    # second train the model
    auto.train_the_model()

    # After training the model we can generate:
    auto.gen_pred_df()
    auto.gen_metrics_dict()

    # We can get their values using:
    pred_df = auto.pred_df
    metrics_dict = auto.metrics_dict

## Available Clustering Algorithms 

- K-Means 'KM' 
- Affinity Propagation 'AP' 
- Agglomerative Clustering 'AC' 
- Mean Shift 'MS' 
- Spectral Clustering 'SC' 
- Birch 'Birch' 
- Bisecting K-Means 'BKM'
- OPTICS 'OPTICS'
- DBSCAN 'DBSCAN'

## **Parameters** 
- *clustering_algorithm*: options: {"KM", "AP", "AC", "MS", "SC", "Birch", "BKM", "OPTICS", "DBSCAN", 'auto', 'custom'}, default = 'KM'
    - `auto`: selects the best scoring clustering algorithm based on silhouette score
    - `custom`: enables providing a custom clustering algorithm through the *file_path* and *class_name* parameters
- *file_path*: when using the 'custom' clustering_algorithm, pass the path of the file containing the custom class, default = 'none'
- *class_name*: when using the 'custom' clustering_algorithm, pass the class name through this parameter, default = 'none'
- Feature selection. Options: {'importance', 'none'}, default = 'none'
    - `importance`: use the feature columns that are important for the model to predict the target
    - `none`: use all feature columns
- Optional parameters for the chosen clustering_algorithm. You can find the available parameters in the [sklearn docs](
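
Since the 'auto' option ranks algorithms by silhouette score, a small sketch of how that score is defined may help. For each point, `a` is the mean distance to the other points in its own cluster and `b` is the lowest mean distance to the points of any other cluster; the point's silhouette is `(b - a) / max(a, b)`, and the score is the mean over all points. The sketch uses 1-D points with absolute distance for brevity and assumes at least two clusters and at least two distinct points; the function name mirrors the metric but is a local helper, not a BlitzML API.

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points using absolute distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # Mean distance to the other members of the same cluster.
        a = sum(abs(point - q) for q in own) / (len(own) - 1)
        # Lowest mean distance to the members of any other cluster.
        b = min(
            sum(abs(point - q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(round(good, 4), round(bad, 4))
```

Scores close to 1 indicate compact, well separated clusters, which is why the first labeling scores far higher than the second.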
## **Attributes** 
- `train_df`: the preprocessed train dataset (after running `Clustering.preprocess()`)
- the trained model (after running `Clustering.train_the_model()`)
- `pred_df`: the prediction dataframe (train_df + predicted cluster labels) (after running `Clustering.gen_pred_df()`)
- `metrics_dict`: the validation metrics (after running `Clustering.gen_metrics_dict()`), with the entries:
    - `"silhouette_score": sil_score`
    - `"calinski_harabasz_score": cal_har_score`
    - `"davies_bouldin_score": dav_boul_score`
    - `"n_clusters": n`
## **Methods** 
- `preprocess()`: performs preprocessing on the train_df
- `train_the_model()`: trains the chosen clustering algorithm on the train_df
- 2-D visualization of the data points with their corresponding labels (after dimensionality reduction using Principal Component Analysis)
- `gen_pred_df()`: generates the prediction dataframe and assigns it to the `pred_df` attribute
- `gen_metrics_dict()`: generates the clustering metrics and assigns them to the `metrics_dict` attribute
- a shortcut that runs the following methods:
## Development  

- Clone the repo  
- run `pip install virtualenv`
- run `python -m virtualenv venv`
- run `. ./venv/bin/activate` on UNIX-based systems or `. ./venv/Scripts/activate.ps1` on Windows
- run `pip install -r requirements.txt`
- run `pre-commit install`


