# XGBImputer - Extreme Gradient Boosting Imputer
<div class="termy">
XGBImputer is an effort to implement the concepts of the MissForest algorithm proposed by Daniel J. Stekhoven and Peter Bühlmann[1] in 2012, while leveraging the robustness and predictive power of the XGBoost[2] algorithm released in 2014.
The package also aims to simplify the process of imputing categorical values in a scikit-learn[3] compatible way.
</div>
## Installation
<div class="termy">
```console
$ pip install xgbimputer
```
</div>
## Approach
<div class="termy">
Given a 2D array X with missing values, the imputer:
* 1 - counts the missing values in each column;
* 2 - makes an initial guess for the missing values in X using the mean for numerical columns and the mode for categorical columns;
* 3 - sorts the columns by their amount of missing values, in ascending order;
* 4 - preprocesses all categorical columns with scikit-learn's OrdinalEncoder to get a purely numerical array;
* 5 - iterates over all columns with missing values in the order established in step 3;
* 5.1 - selects the current column of the iteration as the target;
* 5.2 - one-hot encodes all categorical columns other than the target;
* 5.3 - fits the XGBoost algorithm (XGBClassifier for categorical columns, XGBRegressor for numerical columns) on the rows where the target column has no missing values;
* 5.4 - predicts the missing values of the target column and replaces them in the X array;
* 5.5 - calculates the stopping criterion (gamma) for the numerical and categorical columns identified as having missing data;
* 6 - repeats step 5 until the stopping criterion is met; and
* 7 - returns X with the imputed values.
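The loop above can be sketched in a few lines. This is a simplified, numeric-only illustration, not the package's actual implementation; it uses scikit-learn's `DecisionTreeRegressor` as a stand-in for `XGBRegressor`, and all function and variable names here are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in for XGBRegressor


def impute_iteratively(X, max_iter=5, tol=1e-6):
    """Numeric-only sketch of the MissForest-style imputation loop."""
    X = X.astype(float).copy()
    missing = np.isnan(X)

    # Step 2: initial guess with column means.
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[missing[:, j], j] = col_means[j]

    # Step 3: visit columns with missing values, fewest missing first.
    order = [j for j in np.argsort(missing.sum(axis=0)) if missing[:, j].any()]

    prev_gamma = np.inf
    for _ in range(max_iter):
        X_old = X.copy()
        for j in order:  # step 5
            obs, mis = ~missing[:, j], missing[:, j]
            features = np.delete(X, j, axis=1)  # all other columns as predictors
            # Step 5.3: fit on rows where the target is observed.
            model = DecisionTreeRegressor().fit(features[obs], X[obs, j])
            # Step 5.4: predict and replace the missing entries.
            X[mis, j] = model.predict(features[mis])
        # Steps 5.5/6: stop when the change between epochs stops shrinking.
        gamma = np.sum((X - X_old) ** 2) / np.sum(X ** 2)
        if gamma >= prev_gamma or gamma < tol:
            break
        prev_gamma = gamma
    return X
```

The actual imputer additionally handles categorical columns (ordinal-encoding them up front, one-hot encoding the predictors, and fitting a classifier instead of a regressor), as described in steps 4, 5.2 and 5.3.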
</div>
## Example
<div class="termy">
```Python
import pandas as pd
from xgbimputer import XGBImputer
df = pd.read_csv('titanic.csv')
df.head()
```
```
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---:|--------------:|---------:|:---------------------------------------------|:-------|------:|--------:|--------:|---------:|--------:|--------:|:-----------|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | nan | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47 | 1 | 0 | 363272 | 7 | nan | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62 | 0 | 0 | 240276 | 9.6875 | nan | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27 | 0 | 0 | 315154 | 8.6625 | nan | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22 | 1 | 1 | 3101298 | 12.2875 | nan | S |
```
```Python
df = df.drop(columns=['PassengerId', 'Name', 'Ticket'])
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 418 non-null int64
1 Sex 418 non-null object
2 Age 332 non-null float64
3 SibSp 418 non-null int64
4 Parch 418 non-null int64
5 Fare 417 non-null float64
6 Cabin 91 non-null object
7 Embarked 418 non-null object
dtypes: float64(2), int64(3), object(3)
memory usage: 26.2+ KB
```
```Python
df_missing_data = pd.DataFrame(df.isna().sum().loc[df.isna().sum() > 0], columns=['missing_data_count'])
df_missing_data['missing_data_type'] = df.dtypes
df_missing_data['missing_data_percentage'] = df_missing_data['missing_data_count'] / len(df)
df_missing_data = df_missing_data.sort_values(by='missing_data_percentage', ascending=False)
df_missing_data
```
```
| | missing_data_count | missing_data_type | missing_data_percentage |
|:------|---------------------:|:--------------------|--------------------------:|
| Cabin | 327 | object | 0.782297 |
| Age | 86 | float64 | 0.205742 |
| Fare | 1 | float64 | 0.00239234 |
```
```Python
imputer = XGBImputer(categorical_features_index=[0,1,6,7], replace_categorical_values_back=True)
X = imputer.fit_transform(df)
```
```
XGBImputer - Epoch: 1 | Categorical gamma: inf/274. | Numerical gamma: inf/0.0020067522
XGBImputer - Epoch: 2 | Categorical gamma: 274./0. | Numerical gamma: 0.0020067522/0.0000494584
XGBImputer - Epoch: 3 | Categorical gamma: 0./0. | Numerical gamma: 0.0000494584/0.
XGBImputer - Epoch: 4 | Categorical gamma: 0./0. | Numerical gamma: 0./0.
```
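The gamma values logged above track how much the imputed values change between successive epochs; fitting stops once they stop shrinking. As a rough sketch of the criteria described in the MissForest paper[1] (the package's exact formulas are an assumption here, and these helper names are illustrative):

```python
import numpy as np


def numerical_gamma(X_new, X_old, num_cols):
    """Sum of squared changes over the numerical columns between two
    epochs, normalised by the magnitude of the new values."""
    new, old = X_new[:, num_cols], X_old[:, num_cols]
    return np.sum((new - old) ** 2) / np.sum(new ** 2)


def categorical_gamma(X_new, X_old, cat_cols, n_missing_cat):
    """Number of category flips between two epochs, divided by the
    total count of missing categorical entries."""
    changed = np.sum(X_new[:, cat_cols] != X_old[:, cat_cols])
    return changed / n_missing_cat
```

Under this reading, a categorical gamma of `274.` in epoch 1 would mean 274 imputed category labels changed relative to the initial mode-based guess, and `0.` means the categorical imputations have converged.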
```Python
type(X)
```
```
numpy.ndarray
```
```Python
pd.DataFrame(X).head(15)
```
```
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---:|----:|:-------|--------:|----:|----:|--------:|:----------------|:----|
| 0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | C78 | Q |
| 1 | 3 | female | 47 | 1 | 0 | 7 | C23 C25 C27 | S |
| 2 | 2 | male | 62 | 0 | 0 | 9.6875 | C78 | Q |
| 3 | 3 | male | 27 | 0 | 0 | 8.6625 | C31 | S |
| 4 | 3 | female | 22 | 1 | 1 | 12.2875 | C23 C25 C27 | S |
| 5 | 3 | male | 14 | 0 | 0 | 9.225 | C31 | S |
| 6 | 3 | female | 30 | 0 | 0 | 7.6292 | C78 | Q |
| 7 | 2 | male | 26 | 1 | 1 | 29 | C31 | S |
| 8 | 3 | female | 18 | 0 | 0 | 7.2292 | B57 B59 B63 B66 | C |
| 9 | 3 | male | 21 | 2 | 0 | 24.15 | C31 | S |
| 10 | 3 | male | 24.7614 | 0 | 0 | 7.8958 | C31 | S |
| 11 | 1 | male | 46 | 0 | 0 | 26 | C31 | S |
| 12 | 1 | female | 23 | 1 | 0 | 82.2667 | B45 | S |
| 13 | 2 | male | 63 | 1 | 0 | 26 | C31 | S |
| 14 | 1 | female | 47 | 1 | 0 | 61.175 | E31 | S |
```
```Python
imputer2 = XGBImputer(categorical_features_index=[0,1,6,7], replace_categorical_values_back=False)
X2 = imputer2.fit_transform(df)
```
```
XGBImputer - Epoch: 1 | Categorical gamma: inf/274. | Numerical gamma: inf/0.0020067522
XGBImputer - Epoch: 2 | Categorical gamma: 274./0. | Numerical gamma: 0.0020067522/0.0000494584
XGBImputer - Epoch: 3 | Categorical gamma: 0./0. | Numerical gamma: 0.0000494584/0.
XGBImputer - Epoch: 4 | Categorical gamma: 0./0. | Numerical gamma: 0./0.
```
```Python
pd.DataFrame(X2).head(15)
```
```
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---:|----:|----:|--------:|----:|----:|--------:|----:|----:|
| 0 | 2 | 1 | 34.5 | 0 | 0 | 7.8292 | 41 | 1 |
| 1 | 2 | 0 | 47 | 1 | 0 | 7 | 28 | 2 |
| 2 | 1 | 1 | 62 | 0 | 0 | 9.6875 | 41 | 1 |
| 3 | 2 | 1 | 27 | 0 | 0 | 8.6625 | 30 | 2 |
| 4 | 2 | 0 | 22 | 1 | 1 | 12.2875 | 28 | 2 |
| 5 | 2 | 1 | 14 | 0 | 0 | 9.225 | 30 | 2 |
| 6 | 2 | 0 | 30 | 0 | 0 | 7.6292 | 41 | 1 |
| 7 | 1 | 1 | 26 | 1 | 1 | 29 | 30 | 2 |
| 8 | 2 | 0 | 18 | 0 | 0 | 7.2292 | 15 | 0 |
| 9 | 2 | 1 | 21 | 2 | 0 | 24.15 | 30 | 2 |
| 10 | 2 | 1 | 24.7614 | 0 | 0 | 7.8958 | 30 | 2 |
| 11 | 0 | 1 | 46 | 0 | 0 | 26 | 30 | 2 |
| 12 | 0 | 0 | 23 | 1 | 0 | 82.2667 | 12 | 2 |
| 13 | 1 | 1 | 63 | 1 | 0 | 26 | 30 | 2 |
| 14 | 0 | 0 | 47 | 1 | 0 | 61.175 | 60 | 2 |
```
</div>
## License
<div class="termy">
Licensed under an [Apache-2](https://github.com/leonardodepaula/xgbimputer/blob/master/LICENSE) license.
</div>
## References
<div class="termy">
* [1] [Daniel J. Stekhoven and Peter Bühlmann. "MissForest—non-parametric missing value imputation for mixed-type data."](https://academic.oup.com/bioinformatics/article/28/1/112/219101)
* [2] [XGBoost](https://xgboost.ai/)
* [3] [scikit-learn](https://scikit-learn.org/)
</div>