mlimputer


* Name: mlimputer
* Version: 1.0.68
* Home page: https://github.com/TsLu1s/MLimputer
* Summary: MLimputer - Missing Data Imputation Framework for Supervised Machine Learning
* Upload time: 2024-07-31 00:17:25
* Author: Luís Santos
* License: MIT
* Keywords: data science, machine learning, null imputation, missing data imputation, predictive imputation, multivariate imputation, automated machine learning, data preprocessing

<br>
<p align="center">
  <h2 align="center"> MLimputer - Missing Data Imputation Framework for Supervised Machine Learning</h2>
</p>
<br>
  
## Framework Contextualization <a name = "ta"></a>

The `MLimputer` project is a complete, integrated pipeline that automates the handling of missing values in datasets through regression prediction. It aims to reduce bias and increase the precision of imputation results compared with more classic imputation methods.
This package provides multiple algorithm options to impute your data: each column that contains missing values is fitted with a robust preprocessing approach and then predicted.

The architecture comprises three main sections: missing data analysis, data preprocessing, and supervised model imputation, which are organized in a customizable pipeline structure.
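
To make the idea of regression-based imputation concrete, here is a minimal sketch that fills one numeric column from the other numeric columns. It is an illustration written with pandas and scikit-learn only, not MLimputer's internal implementation; the helper name `impute_column_by_regression`, the median fill for predictor gaps, and the synthetic data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_column_by_regression(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Fill NaNs in `column` with predictions from the other numeric columns.

    Hypothetical helper for illustration; assumes at least one other numeric
    column exists and that it has some observed values.
    """
    out = df.copy()
    missing = out[column].isna()
    if not missing.any() or missing.all():
        return out  # nothing to impute, or nothing to learn from

    # Predictors: every other numeric column. Their own gaps are filled with
    # the median of the rows used for training (a simple stand-in for a
    # fuller preprocessing step).
    predictors = out.drop(columns=[column]).select_dtypes(include="number")
    predictors = predictors.fillna(predictors[~missing].median())

    # Train on rows where the target column is observed, predict the rest.
    model = RandomForestRegressor(n_estimators=30, random_state=0)
    model.fit(predictors[~missing], out.loc[~missing, column])
    out.loc[missing, column] = model.predict(predictors[missing])
    return out


# Synthetic data: "b" depends on "a", and 20 values of "b" are removed.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=200)})
demo["b"] = 2.0 * demo["a"] + rng.normal(scale=0.1, size=200)
demo.loc[demo.sample(20, random_state=0).index, "b"] = np.nan

imputed = impute_column_by_regression(demo, "b")
print(imputed["b"].isna().sum())  # 0
```

Observed values are left untouched; only the rows where the target column was missing receive a prediction.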

This project aims at providing the following application capabilities:

* General applicability on tabular datasets: The imputation procedures can be applied to any data table used in a supervised ML task, based on the columns with missing data to be imputed.
    
* Robustness and improvement of predictive results: The MLimputer preprocessing aims to improve predictive performance by customizing and optimizing the imputation of missing values in the dataset's input columns.
   
#### Main Development Tools <a name = "pre1"></a>

Major frameworks used to build this project:

* [Pandas](https://pandas.pydata.org/)
* [Sklearn](https://scikit-learn.org/stable/)
* [CatBoost](https://catboost.ai/)
    
## Where to get it <a name = "ta"></a>
    
A binary installer for the latest released version is available at the Python Package Index [(PyPI)](https://pypi.org/project/mlimputer/).

The source code is currently hosted on GitHub at: https://github.com/TsLu1s/MLimputer

## Installation  

To install this package from the PyPI repository, run the following command:

```
pip install mlimputer
```

# Usage Examples
    
The first step after importing the package is to load a dataset, split it, and define your chosen imputation model.
The following imputation models are available for handling the missing data in your dataset:
* `RandomForest`
* `ExtraTrees`
* `GBR`
* `KNN`
* `XGBoost`
* `Lightgbm`
* `Catboost`

After creating an `MLimputer` object with your selected imputation model, you can fit it to the missing data through the `fit_imput` method. From there, you can impute future datasets (validation, test, ...) that have the same data properties with `transform_imput`. Note that, as the example below shows, you can also customize the imputer's model parameters by changing their configuration and passing them through the `imputer_configs` parameter.

Through the `cross_validation` function you can also compare the predictive performance of multiple imputations, allowing you to validate which imputation model best fits your future predictions.
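
The general idea of ranking imputation strategies by downstream predictive performance can be sketched with scikit-learn alone. This is not the package's `cross_validation` function; the two candidate imputers, the synthetic data, the mean-absolute-error metric, and the `leaderboard` layout are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression problem with roughly 10% of feature values removed.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f1", "f2", "f3"])
y = X["f1"] + 0.5 * X["f2"] - X["f3"] + rng.normal(scale=0.1, size=300)
X = X.mask(rng.random(X.shape) < 0.10)

# Candidate imputers to compare (stand-ins for different imputation models).
candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
}

rows = []
for name, imputer in candidates.items():
    # The imputer sits inside the pipeline, so it is refit on each training fold.
    pipeline = make_pipeline(imputer, LinearRegression())
    scores = cross_val_score(
        pipeline, X, y, cv=3, scoring="neg_mean_absolute_error"
    )
    rows.append({"imputer": name, "mean_mae": -scores.mean()})

# Lower error first: the top row is the imputer that served the model best.
leaderboard = pd.DataFrame(rows).sort_values("mean_mae").reset_index(drop=True)
print(leaderboard)
```

Keeping the imputer inside the pipeline means each fold's held-out rows are never seen during fitting, so the comparison is not inflated by leakage.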

Important Notes:

* The current version of this package does not impute categorical values; only the automatic handling of numeric missing values is implemented.
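
Because only numeric columns are imputed, it can help to check beforehand which columns with missing values fall outside that scope. The sketch below uses pandas only; the helper name `split_columns_with_missing` and the sample frame are hypothetical and not part of the package.

```python
import numpy as np
import pandas as pd


def split_columns_with_missing(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (numeric, non_numeric) names of columns that contain missing values."""
    with_missing = df.columns[df.isna().any()]
    numeric = df[with_missing].select_dtypes(include="number").columns.tolist()
    non_numeric = [c for c in with_missing if c not in numeric]
    return numeric, non_numeric


frame = pd.DataFrame({
    "age": [31.0, np.nan, 47.0],          # numeric, has a gap
    "city": ["Porto", None, "Lisboa"],    # non-numeric, has a gap
    "income": [1200.0, 1850.0, 2100.0],   # numeric, complete
})
numeric_cols, other_cols = split_columns_with_missing(frame)
print(numeric_cols, other_cols)  # ['age'] ['city']
```

Columns reported in the second list would need to be handled separately, for example by encoding or by a dedicated categorical strategy.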

```py

from mlimputer.imputation import MLimputer
import mlimputer.model_selection as ms
from mlimputer.parameters import imputer_parameters
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore", category=Warning) #-> For a clean console

data = pd.read_csv('csv_directory_path') # Dataframe Loading Example

train,test = train_test_split(data, train_size=0.8)
train,test = train.reset_index(drop=True), test.reset_index(drop=True) # <- Required

# All model imputation options ->  "RandomForest","ExtraTrees","GBR","KNN","XGBoost","Lightgbm","Catboost"

# Customizing Hyperparameters Example

hparameters = imputer_parameters()
print(hparameters)
hparameters["KNN"]["n_neighbors"] = 5
hparameters["RandomForest"]["n_estimators"] = 30
    
# Imputation Example 1 : KNN

mli_knn = MLimputer(imput_model = "KNN", imputer_configs = hparameters)
mli_knn.fit_imput(X = train)
train_knn = mli_knn.transform_imput(X = train)
test_knn = mli_knn.transform_imput(X = test)

# Imputation Example 2 : RandomForest

mli_rf = MLimputer(imput_model = "RandomForest", imputer_configs = hparameters)
mli_rf.fit_imput(X = train)
train_rf = mli_rf.transform_imput(X = train)
test_rf = mli_rf.transform_imput(X = test)
    
#(...)
    
## Performance Evaluation Regression - Imputation CrossValidation Example

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
        
leaderboard_rf_imp=ms.cross_validation(X = train_rf,
                                       target = "Target_Name_Col", 
                                       test_size = 0.2,
                                       n_splits = 3,
                                       models = [LinearRegression(), RandomForestRegressor(), CatBoostRegressor()])

## Export Imputation Metadata

# Imputation Metadata
import pickle
# Use a context manager so the file is flushed and closed after writing
with open("imputer_rf.pkl", "wb") as output:
    pickle.dump(mli_rf, output)

```  
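
An exported imputer can later be restored with `pickle.load`. The round trip below is a minimal sketch that uses a fitted scikit-learn `SimpleImputer` as a stand-in, on the assumption that a fitted `MLimputer` object pickles the same way; the temporary directory is also an assumption made so the example is self-contained.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.impute import SimpleImputer

# Stand-in for a fitted imputer object (assumption: not an MLimputer instance).
fitted = SimpleImputer(strategy="mean").fit(np.array([[1.0], [np.nan], [3.0]]))

path = Path(tempfile.mkdtemp()) / "imputer_rf.pkl"

# Write the object; the context manager closes the file before it is read back.
with open(path, "wb") as handle:
    pickle.dump(fitted, handle)

# Restore it, for example in a later scoring session.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored.transform(np.array([[np.nan]])))  # [[2.]]
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code. The restoring environment also needs the same package versions that were used when the object was saved.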
    
## License

Distributed under the MIT License. See [LICENSE](https://github.com/TsLu1s/MLimputer/blob/main/LICENSE) for more information.

## Contact 
 
Luis Santos - [LinkedIn](https://www.linkedin.com/in/lu%C3%ADsfssantos/)

            
