# blitzml

- **Name:** blitzml
- **Version:** 0.20.0
- **Summary:** A low-code library for machine learning pipelines
- **Homepage:** https://github.com/AhmedMohamed25/blitzml
- **Author:** AI Team
- **Requires Python:** >=3.8
- **Keywords:** ml, machine learning, classification
- **Uploaded:** 2023-08-25 12:11:05
            
<div align="center">
<img src="auxiliary/docs/logo.png" alt="BlitzML" width="400"/>

### **Automate machine learning pipelines rapidly**


<div align="left">

- [Install BlitzML](#install-blitzml)
- [Classification](#classification)
- [Regression](#regression)
- [Time-Series](#time-series)
- [Clustering](#clustering)



# Install BlitzML  


```bash
pip install blitzml
```


# Classification

```python
from blitzml.tabular import Classification
import pandas as pd

# prepare your dataframes
train_df = pd.read_csv("auxiliary/datasets/banknote/train.csv")
test_df = pd.read_csv("auxiliary/datasets/banknote/test.csv")

# create the pipeline
auto = Classification(train_df, test_df, algorithm = 'RF', n_estimators = 50)

# perform the entire process
auto.run()

# access the resulting prediction dataframe and metrics
pred_df = auto.pred_df
metrics_dict = auto.metrics_dict

print(pred_df.head())
print(metrics_dict)
```


## Available Classifiers

- Random Forest 'RF' 
- LinearDiscriminantAnalysis 'LDA' 
- Support Vector Classifier 'SVC' 
- KNeighborsClassifier 'KNN' 
- GaussianNB 'GNB' 
- LogisticRegression 'LR'
- AdaBoostClassifier 'AB'
- GradientBoostingClassifier 'GB'
- DecisionTreeClassifier 'DT'
- MLPClassifier 'MLP'


## **Parameters**
**classifier**  
options: {'RF','LDA','SVC','KNN','GNB','LR','AB','GB','DT','MLP', 'auto', 'custom'}, default = 'RF'  
`auto: selects the best scoring classifier based on f1-score`  
`custom: enables providing a custom classifier through *file_path* and *class_name* parameters`  
**file_path**  
when using 'custom' classifier, pass the path of the file containing the custom class, default = 'none'  
**class_name**  
when using 'custom' classifier, pass the class name through this parameter, default = 'none'  
**feature_selection**  
options: {'correlation', 'importance', 'none'}, default = 'none'  
`correlation: use feature columns with the highest correlation with the target`  
`importance: use feature columns that are important for the model to predict the target`  
`none: use all feature columns`  
**validation_percentage**  
value determining the validation split percentage (value from 0 to 1), default = 0.1  
**average_type**  
when performing multiclass classification, provide the average type for the resulting metrics, default = 'macro'  
**cross_validation_k_folds**  
number of k-folds for cross validation, if 1 then no cv will be performed, default = 1  
****kwargs**  
optional parameters for the chosen classifier. you can find available parameters in the [sklearn docs](https://scikit-learn.org/stable/user_guide.html)  
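For illustration, a minimal sketch that exercises the options above on the quickstart data. Note that the Parameters list names the option **classifier** while the quickstart code passes `algorithm`; the sketch follows the quickstart, and whether every combination below is supported depends on the installed blitzml version.

```python
from blitzml.tabular import Classification
import pandas as pd

train_df = pd.read_csv("auxiliary/datasets/banknote/train.csv")
test_df = pd.read_csv("auxiliary/datasets/banknote/test.csv")

# let blitzml pick the best classifier by f1-score, keep only the important
# feature columns, hold out 20% for validation, and run 5-fold cross validation
auto = Classification(
    train_df,
    test_df,
    algorithm="auto",
    feature_selection="importance",
    validation_percentage=0.2,
    cross_validation_k_folds=5,
)
auto.run()

# "cross_validation_score" is populated because cross_validation_k_folds > 1
print(auto.metrics_dict)
```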
## **Attributes**  
**train_df**  
the preprocessed train dataset (after running `Classification.preprocess()`)  
**test_df**  
the preprocessed test dataset (after running `Classification.preprocess()`)  
**model**  
the trained model (after running `Classification.train_the_model()`)  
**pred_df**  
the prediction dataframe (test_df + predicted target) (after running `Classification.gen_pred_df(Classification.test_df)`)  
**metrics_dict**  
the validation metrics (after running `Classification.gen_metrics_dict()`)  
{  
    "accuracy": acc,  
    "f1": f1,  
    "precision": pre,  
    "recall": recall,  
    "hamming_loss": h_loss,  
    "cross_validation_score":cv_score, `returns None if cross_validation_k_folds==1`  
}   
## **Methods**  
**run()**  
a shortcut that runs the entire process:  
- preprocessing
- model training  
- prediction  
- model evaluation  
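Based on the attribute descriptions above, `run()` appears to be equivalent to the stepwise calls sketched below (the call order is inferred from the list above, so treat it as an assumption):

```python
# `auto` is a Classification pipeline created as in the quickstart
auto.preprocess()                # fills auto.train_df / auto.test_df
auto.train_the_model()           # fills auto.model
auto.gen_pred_df(auto.test_df)   # fills auto.pred_df
auto.gen_metrics_dict()          # fills auto.metrics_dict
```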

**accuracy_history()**  
accuracy scores when varying the sampling size of the train_df (after running `Classification.train_the_model()`).  
*returns:*  
{  
    'x':train_df_sample_sizes,  
    'y1':train_scores_mean,  
    'y2':test_scores_mean,  
    'title':title  
}  
**plot()**

plots a line chart visualizing the accuracy history
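If more control over the chart is needed, the `accuracy_history()` dictionary can also be plotted directly; a sketch assuming matplotlib is installed, with the keys taken from the *returns* structure above:

```python
import matplotlib.pyplot as plt

# `auto` is a trained Classification pipeline (e.g. from the quickstart above)
history = auto.accuracy_history()

plt.plot(history["x"], history["y1"], label="train score")
plt.plot(history["x"], history["y2"], label="test score")
plt.xlabel("train_df sample size")
plt.ylabel("accuracy")
plt.title(history["title"])
plt.legend()
plt.show()
```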


# Regression  

```python
from blitzml.tabular import Regression
import pandas as pd

# prepare your dataframes
train_df = pd.read_csv("auxiliary/datasets/house prices/train.csv")
test_df = pd.read_csv("auxiliary/datasets/house prices/test.csv")

# create the pipeline
auto = Regression(train_df, test_df, algorithm = 'RF')

# perform the entire process
auto.run()

# access the resulting prediction dataframe and metrics
pred_df = auto.pred_df
metrics_dict = auto.metrics_dict

print(pred_df.head())
print(metrics_dict)
```


## Available Regressors

- Random Forest 'RF'
- Support Vector Regressor 'SVR'
- KNeighborsRegressor 'KNN'
- Lasso Regressor 'LSS'
- LinearRegression 'LR'
- Ridge Regressor 'RDG'
- GaussianProcessRegressor 'GPR'
- GradientBoostingRegressor 'GB'
- DecisionTreeRegressor 'DT'
- MLPRegressor 'MLP'

## **Parameters**
**regressor**  
options: {'RF','SVR','KNN','LSS','LR','RDG','GPR','GB','DT','MLP', 'auto', 'custom'}, default = 'RF'  
`auto: selects the best scoring regressor based on r2 score`  
`custom: enables providing a custom regressor through *file_path* and *class_name* parameters`  
**file_path**  
when using 'custom' regressor, pass the path of the file containing the custom class, default = 'none'  
**class_name**  
when using 'custom' regressor, pass the class name through this parameter, default = 'none'  
**feature_selection**  
options: {'correlation', 'importance', 'none'}, default = 'none'  
`correlation: use feature columns with the highest correlation with the target`  
`importance: use feature columns that are important for the model to predict the target`  
`none: use all feature columns`  
**validation_percentage**  
value determining the validation split percentage (value from 0 to 1), default = 0.1  
**cross_validation_k_folds**  
number of k-folds for cross validation, if 1 then no cv will be performed, default = 1  
****kwargs**  
optional parameters for the chosen regressor. you can find available parameters in the [sklearn docs](https://scikit-learn.org/stable/user_guide.html)  
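As a hypothetical sketch of the 'custom' option: the file name `my_regressor.py` and the class `MyRegressor` below are made up for the example, and it is assumed that blitzml only needs a fit/predict interface. The Parameters list names the option **regressor** while the quickstart passes `algorithm`; the sketch follows the quickstart.

```python
# my_regressor.py -- a hypothetical custom regressor with a fit/predict interface
import numpy as np

class MyRegressor:
    """Toy regressor that always predicts the mean of the training target."""
    def __init__(self):
        self.mean_ = 0.0

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)
```

```python
from blitzml.tabular import Regression
import pandas as pd

train_df = pd.read_csv("auxiliary/datasets/house prices/train.csv")
test_df = pd.read_csv("auxiliary/datasets/house prices/test.csv")

# point the pipeline at the custom class via file_path and class_name
auto = Regression(
    train_df,
    test_df,
    algorithm="custom",
    file_path="my_regressor.py",
    class_name="MyRegressor",
)
auto.run()
print(auto.metrics_dict)
```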
## **Attributes**  
**train_df**  
the preprocessed train dataset (after running `Regression.preprocess()`)  
**test_df**  
the preprocessed test dataset (after running `Regression.preprocess()`)   
**model**  
the trained model (after running `Regression.train_the_model()`)  
**pred_df**  
the prediction dataframe (test_df + predicted target) (after running `Regression.gen_pred_df(Regression.test_df)`)  
**metrics_dict**  
the validation metrics (after running `Regression.gen_metrics_dict()`)  
{  
    "r2_score": r2,  
    "mean_squared_error": mse,  
    "root_mean_squared_error": rmse,  
    "mean_absolute_error" : mae,  
    "cross_validation_score":cv_score, `returns None if cross_validation_k_folds==1`  
}  
## **Methods**  


**run()**  
a shortcut that runs the entire process:  
- preprocessing
- model training  
- prediction  
- model evaluation   

**plot()**

plots a line chart visualizing the RMSE history

**RMSE_history()**  
RMSE scores when varying the sampling size of the train_df (after running `Regression.train_the_model()`).  
*returns:*  
{  
    'x':train_df_sample_sizes,  
    'y1':train_scores_mean,  
    'y2':test_scores_mean,  
    'title':title  
}  
# Time-series
Time series is a special case of Regression, but the `TimeSeries` pipeline adds a few extra capabilities:
- a stationarity test (`IsStationary()`)
- converting the series to a stationary form
- reversing that transformation on the predicted values

The dataset must include a DateTime column, even if that column's dtype is Object.
```python
from blitzml.tabular import TimeSeries 
import pandas as pd

# prepare your dataframes
train_df = pd.read_csv("train_dataset.csv")
test_df = pd.read_csv("test_dataset.csv")

# create the pipeline
auto = TimeSeries(train_df, test_df, algorithm = 'RF')

# Perform the entire process:
auto.run()

# access the resulting prediction dataframe and metrics
pred_df = auto.pred_df
metrics_dict = auto.metrics_dict

print(pred_df.head())
print(metrics_dict)
```
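Only `IsStationary()` is named above and its signature isn't documented here, so the snippet below illustrates the underlying idea with an independent ADF stationarity test from statsmodels rather than blitzml's own method; the target-column choice is an assumption for the sketch.

```python
# independent stationarity check with statsmodels' ADF test -- not blitzml's
# internal implementation, just the idea behind IsStationary()
from statsmodels.tsa.stattools import adfuller
import pandas as pd

train_df = pd.read_csv("train_dataset.csv")

# assume the last column holds the target series (adjust for your dataset)
target = train_df.iloc[:, -1].dropna()

p_value = adfuller(target)[1]
print("stationary" if p_value < 0.05 else "not stationary", f"(ADF p-value={p_value:.3f})")
```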
# Clustering 

```python
from blitzml.unsupervised import Clustering
import pandas as pd

# prepare your dataframe
train_df = pd.read_csv("auxiliary/datasets/customer personality/train.csv")

# create the pipeline
auto = Clustering(train_df, clustering_algorithm = 'KM')

# first perform data preprocessing
auto.preprocess()
# second train the model
auto.train_the_model()

# After training the model we can generate:
auto.gen_pred_df()
auto.gen_metrics_dict()

# We can get their values using:
print(auto.pred_df.head())
print(auto.metrics_dict)
```


## Available Clustering Algorithms 

- K-Means 'KM' 
- Affinity Propagation 'AP' 
- Agglomerative Clustering 'AC' 
- Mean Shift 'MS' 
- Spectral Clustering 'SC' 
- Birch 'Birch' 
- Bisecting K-Means 'BKM' 
- OPTICS 'OPTICS' 
- DBSCAN 'DBSCAN' 

## **Parameters** 
**clustering_algorithm**  
options: {"KM", "AP", "AC", "MS", "SC", "Birch", "BKM", "OPTICS", "DBSCAN", 'auto', 'custom'}, default = 'KM' 
`auto: selects the best scoring clustering algorithm based on silhouette score` 
`custom: enables providing a custom clustering algorithm through *file_path* and *class_name* parameters` 
**file_path** 
when using 'custom' clustering_algorithm, pass the path of the file containing the custom class, default = 'none'   
**class_name**
when using 'custom' clustering_algorithm, pass the class name through this parameter, default = 'none' 
**feature_selection** 
options: {'importance', 'none'}, default = 'none' 
`importance: use feature columns that are important for the model to predict the target` 
`none: use all feature columns` 
****kwargs** 
optional parameters for the chosen clustering_algorithm. you can find available parameters in the [sklearn docs](https://scikit-learn.org/stable/user_guide.html) 
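A short sketch of the parameters above in use, passing a scikit-learn keyword (`n_clusters`) through `**kwargs`; the keyword is forwarded to the underlying sklearn estimator, so it has to match the chosen algorithm (here K-Means).

```python
from blitzml.unsupervised import Clustering
import pandas as pd

train_df = pd.read_csv("auxiliary/datasets/customer personality/train.csv")

# K-Means with a fixed number of clusters; n_clusters is passed through **kwargs
auto = Clustering(train_df, clustering_algorithm="KM", n_clusters=4)
auto.run()

print(auto.metrics_dict)  # includes "n_clusters"
```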
## **Attributes** 
**train_df** 
the preprocessed train dataset (after running `Clustering.preprocess()`)  
**model** 
the trained model (after running `Clustering.train_the_model()`) 
**pred_df** 
the prediction dataframe (train_df + assigned cluster labels) (after running `Clustering.gen_pred_df()`) 
**metrics_dict** 
the validation metrics (after running `Clustering.gen_metrics_dict()`) 
{ 
    "silhouette_score": sil_score, 
    "calinski_harabasz_score": cal_har_score, 
    "davies_bouldin_score": dav_boul_score, 
    "n_clusters" : n 
} 
## **Methods** 
**preprocess()** 
perform preprocessing on train_df  
**train_the_model()** 
train the chosen clustering algorithm on the train_df 
**clustering_visualization()** 
2-D visualization of the data points with their corresponding cluster labels (after dimensionality reduction using Principal Component Analysis); a plotting sketch is shown after this section. 
*returns:* 
{ 
    'principal_component_1':pc1, 
    'principal_component_2':pc2, 
    'cluster_labels':labels, 
    'title':title 
} 
**gen_pred_df()** 
generates the prediction dataframe and assigns it to the `pred_df` attribute 
**gen_metrics_dict()** 
generates the clustering metrics and assigns them to the `metrics_dict` attribute  
**run()** 
a shortcut that runs the following methods: 
- preprocess() 
- train_the_model() 
- gen_pred_df() 
- gen_metrics_dict() 
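The plotting sketch referenced under `clustering_visualization()`: a rough way to turn the returned dictionary into a scatter plot, assuming matplotlib is installed and using the keys listed above.

```python
import matplotlib.pyplot as plt

# `auto` is a trained Clustering pipeline (e.g. from the quickstart above)
viz = auto.clustering_visualization()

plt.scatter(
    viz["principal_component_1"],
    viz["principal_component_2"],
    c=viz["cluster_labels"],
    cmap="viridis",
    s=10,
)
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.title(viz["title"])
plt.show()
```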
## Development  

- Clone the repo  
- run `pip install virtualenv`
- run `python -m virtualenv venv`
- run `. ./venv/bin/activate` on UNIX-based systems, or `. ./venv/Scripts/activate.ps1` on Windows
- run `pip install -r requirements.txt`
- run `pre-commit install`

            
