xgbimputer


Namexgbimputer JSON
Version 0.2.0 PyPI version JSON
download
home_pagehttps://github.com/leonardodepaula/xgbimputer
SummaryExtreme Gradient Boosting imputer for Machine Learning.
upload_time2022-03-14 18:42:00
maintainer
docs_urlNone
authorLeonardo de Paula Liebscher
requires_python
license
keywords python machine learning missing values imputation
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # XGBImputer - Extreme Gradient Boosting Imputer

<div class="termy">

XGBImputer is an effort to implement the concepts of the MissForest algorithm proposed by Daniel J. Stekhoven and Peter Bühlmann[1] in 2012, but leveraging the robustness and predictive power of the XGBoost[2] algorithm released in 2014.

The package also aims to simplify the process of imputing categorical values in a scikit-learn[3] compatible way.

</div>

## Installation

<div class="termy">

```console
$ pip install xgbimputer
```

</div>

## Approach

<div class="termy">

Given a 2D array X with missing values, the imputer:

* 1 - counts the missing values in each column and arranges them in the ascending order;

* 2 - makes an initial guess for the missing values in X using the mean for numerical columns and the mode for the categorical columns;

* 3 - sorts the columns according to the amount of missing values, starting with the lowest amount;

* 4 - preprocesses all categorical columns with scikit-learn's OrdinalEncoder to get a purely numerical array;

* 5 - iterates over all columns with missing values in the order established on step 1;

  * 5.1 - selects the column in context on the iteration as the target;

  * 5.2 - one hot encodes all categorical columns other than the target;

  * 5.3 - fits the XGBoost algorithm (XGBClassifier for the categorical columns and XGBRegressor for the numeric columns) where the target column has no missing values;

  * 5.4 - predicts the missing values of the target column and replaces them on the X array;

  * 5.5 - calculates the stopping criterion (gamma) for the numerical and categorical columns identified as having missing data;

* 6 - repeats the process described in step 5 until the stopping criterion is met; and

* 7 - returns X with the imputed values.

</div>

## Example

<div class="termy">

```Python
import pandas as pd
from xgbimputer import XGBImputer

df = pd.read_csv('titanic.csv')
df.head()
```
```
|    |   PassengerId |   Pclass | Name                                         | Sex    |   Age |   SibSp |   Parch |   Ticket |    Fare |   Cabin | Embarked   |
|---:|--------------:|---------:|:---------------------------------------------|:-------|------:|--------:|--------:|---------:|--------:|--------:|:-----------|
|  0 |           892 |        3 | Kelly, Mr. James                             | male   |  34.5 |       0 |       0 |   330911 |  7.8292 |     nan | Q          |
|  1 |           893 |        3 | Wilkes, Mrs. James (Ellen Needs)             | female |  47   |       1 |       0 |   363272 |  7      |     nan | S          |
|  2 |           894 |        2 | Myles, Mr. Thomas Francis                    | male   |  62   |       0 |       0 |   240276 |  9.6875 |     nan | Q          |
|  3 |           895 |        3 | Wirz, Mr. Albert                             | male   |  27   |       0 |       0 |   315154 |  8.6625 |     nan | S          |
|  4 |           896 |        3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female |  22   |       1 |       1 |  3101298 | 12.2875 |     nan | S          |
```
```Python
df = df.drop(columns=['PassengerId', 'Name', 'Ticket'])
df.info()
```
```class 'pandas.core.frame.DataFrame'
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
#   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Sex       418 non-null    object 
 2   Age       332 non-null    float64
 3   SibSp     418 non-null    int64  
 4   Parch     418 non-null    int64  
 5   Fare      417 non-null    float64
 6   Cabin     91 non-null     object 
 7   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(3)
memory usage: 26.2+ KB
```
```Python
df_missing_data = pd.DataFrame(df.isna().sum().loc[df.isna().sum() > 0], columns=['missing_data_count'])
df_missing_data['missing_data_type'] = df.dtypes
df_missing_data['missing_data_percentage'] = df_missing_data['missing_data_count'] / len(df)
df_missing_data = df_missing_data.sort_values(by='missing_data_percentage', ascending=False)
df_missing_data
```
```
|       |   missing_data_count | missing_data_type   |   missing_data_percentage |
|:------|---------------------:|:--------------------|--------------------------:|
| Cabin |                  327 | object              |                0.782297   |
| Age   |                   86 | float64             |                0.205742   |
| Fare  |                    1 | float64             |                0.00239234 |
```
```Python
imputer = XGBImputer(categorical_features_index=[0,1,6,7], replace_categorical_values_back=True)
X = imputer.fit_transform(df)
```
```
XGBImputer - Epoch: 1 | Categorical gamma: inf/274. | Numerical gamma: inf/0.0020067522
XGBImputer - Epoch: 2 | Categorical gamma: 274./0. | Numerical gamma: 0.0020067522/0.0000494584
XGBImputer - Epoch: 3 | Categorical gamma: 0./0. | Numerical gamma: 0.0000494584/0.
XGBImputer - Epoch: 4 | Categorical gamma: 0./0. | Numerical gamma: 0./0.
```
```Python
type(X)
```
```
numpy.ndarray
```
```Python
pd.DataFrame(X).head(15)
```
```
|    |   0 | 1      |       2 |   3 |   4 |       5 | 6               | 7   |
|---:|----:|:-------|--------:|----:|----:|--------:|:----------------|:----|
|  0 |   3 | male   | 34.5    |   0 |   0 |  7.8292 | C78             | Q   |
|  1 |   3 | female | 47      |   1 |   0 |  7      | C23 C25 C27     | S   |
|  2 |   2 | male   | 62      |   0 |   0 |  9.6875 | C78             | Q   |
|  3 |   3 | male   | 27      |   0 |   0 |  8.6625 | C31             | S   |
|  4 |   3 | female | 22      |   1 |   1 | 12.2875 | C23 C25 C27     | S   |
|  5 |   3 | male   | 14      |   0 |   0 |  9.225  | C31             | S   |
|  6 |   3 | female | 30      |   0 |   0 |  7.6292 | C78             | Q   |
|  7 |   2 | male   | 26      |   1 |   1 | 29      | C31             | S   |
|  8 |   3 | female | 18      |   0 |   0 |  7.2292 | B57 B59 B63 B66 | C   |
|  9 |   3 | male   | 21      |   2 |   0 | 24.15   | C31             | S   |
| 10 |   3 | male   | 24.7614 |   0 |   0 |  7.8958 | C31             | S   |
| 11 |   1 | male   | 46      |   0 |   0 | 26      | C31             | S   |
| 12 |   1 | female | 23      |   1 |   0 | 82.2667 | B45             | S   |
| 13 |   2 | male   | 63      |   1 |   0 | 26      | C31             | S   |
| 14 |   1 | female | 47      |   1 |   0 | 61.175  | E31             | S   |
```
```Python
imputer2 = XGBImputer(categorical_features_index=[0,1,6,7], replace_categorical_values_back=False)
X2 = imputer2.fit_transform(df)
```
```
XGBImputer - Epoch: 1 | Categorical gamma: inf/274. | Numerical gamma: inf/0.0020067522
XGBImputer - Epoch: 2 | Categorical gamma: 274./0. | Numerical gamma: 0.0020067522/0.0000494584
XGBImputer - Epoch: 3 | Categorical gamma: 0./0. | Numerical gamma: 0.0000494584/0.
XGBImputer - Epoch: 4 | Categorical gamma: 0./0. | Numerical gamma: 0./0.
```
```
pd.DataFrame(X2).head(15)
```
```
|    |   0 |   1 |       2 |   3 |   4 |       5 |   6 |   7 |
|---:|----:|----:|--------:|----:|----:|--------:|----:|----:|
|  0 |   2 |   1 | 34.5    |   0 |   0 |  7.8292 |  41 |   1 |
|  1 |   2 |   0 | 47      |   1 |   0 |  7      |  28 |   2 |
|  2 |   1 |   1 | 62      |   0 |   0 |  9.6875 |  41 |   1 |
|  3 |   2 |   1 | 27      |   0 |   0 |  8.6625 |  30 |   2 |
|  4 |   2 |   0 | 22      |   1 |   1 | 12.2875 |  28 |   2 |
|  5 |   2 |   1 | 14      |   0 |   0 |  9.225  |  30 |   2 |
|  6 |   2 |   0 | 30      |   0 |   0 |  7.6292 |  41 |   1 |
|  7 |   1 |   1 | 26      |   1 |   1 | 29      |  30 |   2 |
|  8 |   2 |   0 | 18      |   0 |   0 |  7.2292 |  15 |   0 |
|  9 |   2 |   1 | 21      |   2 |   0 | 24.15   |  30 |   2 |
| 10 |   2 |   1 | 24.7614 |   0 |   0 |  7.8958 |  30 |   2 |
| 11 |   0 |   1 | 46      |   0 |   0 | 26      |  30 |   2 |
| 12 |   0 |   0 | 23      |   1 |   0 | 82.2667 |  12 |   2 |
| 13 |   1 |   1 | 63      |   1 |   0 | 26      |  30 |   2 |
| 14 |   0 |   0 | 47      |   1 |   0 | 61.175  |  60 |   2 |
```

</div>

## License

<div class="termy">

Licensed under an [Apache-2](https://github.com/leonardodepaula/xgbimputer/blob/master/LICENSE) license.

</div>

## References

<div class="termy">

* [1] [Daniel J. Stekhoven and Peter Bühlmann. "MissForest—non-parametric missing value imputation for mixed-type data."](https://academic.oup.com/bioinformatics/article/28/1/112/219101)

* [2] [XGBoost](https://xgboost.ai/)

* [3] [scikit-learn](https://scikit-learn.org/)

</div>



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/leonardodepaula/xgbimputer",
    "name": "xgbimputer",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "python,machine learning,missing values,imputation",
    "author": "Leonardo de Paula Liebscher",
    "author_email": "<leonardopx@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/0a/24/f1e2dfee553204a4455c589ed2a6ae142a080a4ea5ddf89f9ce2853a2af3/xgbimputer-0.2.0.tar.gz",
    "platform": null,
    "description": "# XGBImputer - Extreme Gradient Boosting Imputer\n\n<div class=\"termy\">\n\nXGBImputer is an effort to implement the concepts of the MissForest algorithm proposed by Daniel J. Stekhoven and Peter B\u00fchlmann[1] in 2012, but leveraging the robustness and predictive power of the XGBoost[2] algorithm released in 2014.\n\nThe package also aims to simplify the process of imputing categorical values in a scikit-learn[3] compatible way.\n\n</div>\n\n## Installation\n\n<div class=\"termy\">\n\n```console\n$ pip install xgbimputer\n```\n\n</div>\n\n## Approach\n\n<div class=\"termy\">\n\nGiven a 2D array X with missing values, the imputer:\n\n* 1 - counts the missing values in each column and arranges them in the ascending order;\n\n* 2 - makes an initial guess for the missing values in X using the mean for numerical columns and the mode for the categorical columns;\n\n* 3 - sorts the columns according to the amount of missing values, starting with the lowest amount;\n\n* 4 - preprocesses all categorical columns with scikit-learn's OrdinalEncoder to get a purely numerical array;\n\n* 5 - iterates over all columns with missing values in the order established on step 1;\n\n  * 5.1 - selects the column in context on the iteration as the target;\n\n  * 5.2 - one hot encodes all categorical columns other than the target;\n\n  * 5.3 - fits the XGBoost algorithm (XGBClassifier for the categorical columns and XGBRegressor for the numeric columns) where the target column has no missing values;\n\n  * 5.4 - predicts the missing values of the target column and replaces them on the X array;\n\n  * 5.5 - calculates the stopping criterion (gamma) for the numerical and categorical columns identified as having missing data;\n\n* 6 - repeats the process described in step 5 until the stopping criterion is met; and\n\n* 7 - returns X with the imputed values.\n\n</div>\n\n## Example\n\n<div class=\"termy\">\n\n```Python\nimport pandas as pd\nfrom xgbimputer import XGBImputer\n\ndf = pd.read_csv('titanic.csv')\ndf.head()\n```\n```\n|    |   PassengerId |   Pclass | Name                                         | Sex    |   Age |   SibSp |   Parch |   Ticket |    Fare |   Cabin | Embarked   |\n|---:|--------------:|---------:|:---------------------------------------------|:-------|------:|--------:|--------:|---------:|--------:|--------:|:-----------|\n|  0 |           892 |        3 | Kelly, Mr. James                             | male   |  34.5 |       0 |       0 |   330911 |  7.8292 |     nan | Q          |\n|  1 |           893 |        3 | Wilkes, Mrs. James (Ellen Needs)             | female |  47   |       1 |       0 |   363272 |  7      |     nan | S          |\n|  2 |           894 |        2 | Myles, Mr. Thomas Francis                    | male   |  62   |       0 |       0 |   240276 |  9.6875 |     nan | Q          |\n|  3 |           895 |        3 | Wirz, Mr. Albert                             | male   |  27   |       0 |       0 |   315154 |  8.6625 |     nan | S          |\n|  4 |           896 |        3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female |  22   |       1 |       1 |  3101298 | 12.2875 |     nan | S          |\n```\n```Python\ndf = df.drop(columns=['PassengerId', 'Name', 'Ticket'])\ndf.info()\n```\n```class 'pandas.core.frame.DataFrame'\nRangeIndex: 418 entries, 0 to 417\nData columns (total 8 columns):\n#   Column    Non-Null Count  Dtype  \n---  ------    --------------  -----  \n 0   Pclass    418 non-null    int64  \n 1   Sex       418 non-null    object \n 2   Age       332 non-null    float64\n 3   SibSp     418 non-null    int64  \n 4   Parch     418 non-null    int64  \n 5   Fare      417 non-null    float64\n 6   Cabin     91 non-null     object \n 7   Embarked  418 non-null    object \ndtypes: float64(2), int64(3), object(3)\nmemory usage: 26.2+ KB\n```\n```Python\ndf_missing_data = pd.DataFrame(df.isna().sum().loc[df.isna().sum() > 0], columns=['missing_data_count'])\ndf_missing_data['missing_data_type'] = df.dtypes\ndf_missing_data['missing_data_percentage'] = df_missing_data['missing_data_count'] / len(df)\ndf_missing_data = df_missing_data.sort_values(by='missing_data_percentage', ascending=False)\ndf_missing_data\n```\n```\n|       |   missing_data_count | missing_data_type   |   missing_data_percentage |\n|:------|---------------------:|:--------------------|--------------------------:|\n| Cabin |                  327 | object              |                0.782297   |\n| Age   |                   86 | float64             |                0.205742   |\n| Fare  |                    1 | float64             |                0.00239234 |\n```\n```Python\nimputer = XGBImputer(categorical_features_index=[0,1,6,7], replace_categorical_values_back=True)\nX = imputer.fit_transform(df)\n```\n```\nXGBImputer - Epoch: 1 | Categorical gamma: inf/274. | Numerical gamma: inf/0.0020067522\nXGBImputer - Epoch: 2 | Categorical gamma: 274./0. | Numerical gamma: 0.0020067522/0.0000494584\nXGBImputer - Epoch: 3 | Categorical gamma: 0./0. | Numerical gamma: 0.0000494584/0.\nXGBImputer - Epoch: 4 | Categorical gamma: 0./0. | Numerical gamma: 0./0.\n```\n```Python\ntype(X)\n```\n```\nnumpy.ndarray\n```\n```Python\npd.DataFrame(X).head(15)\n```\n```\n|    |   0 | 1      |       2 |   3 |   4 |       5 | 6               | 7   |\n|---:|----:|:-------|--------:|----:|----:|--------:|:----------------|:----|\n|  0 |   3 | male   | 34.5    |   0 |   0 |  7.8292 | C78             | Q   |\n|  1 |   3 | female | 47      |   1 |   0 |  7      | C23 C25 C27     | S   |\n|  2 |   2 | male   | 62      |   0 |   0 |  9.6875 | C78             | Q   |\n|  3 |   3 | male   | 27      |   0 |   0 |  8.6625 | C31             | S   |\n|  4 |   3 | female | 22      |   1 |   1 | 12.2875 | C23 C25 C27     | S   |\n|  5 |   3 | male   | 14      |   0 |   0 |  9.225  | C31             | S   |\n|  6 |   3 | female | 30      |   0 |   0 |  7.6292 | C78             | Q   |\n|  7 |   2 | male   | 26      |   1 |   1 | 29      | C31             | S   |\n|  8 |   3 | female | 18      |   0 |   0 |  7.2292 | B57 B59 B63 B66 | C   |\n|  9 |   3 | male   | 21      |   2 |   0 | 24.15   | C31             | S   |\n| 10 |   3 | male   | 24.7614 |   0 |   0 |  7.8958 | C31             | S   |\n| 11 |   1 | male   | 46      |   0 |   0 | 26      | C31             | S   |\n| 12 |   1 | female | 23      |   1 |   0 | 82.2667 | B45             | S   |\n| 13 |   2 | male   | 63      |   1 |   0 | 26      | C31             | S   |\n| 14 |   1 | female | 47      |   1 |   0 | 61.175  | E31             | S   |\n```\n```Python\nimputer2 = XGBImputer(categorical_features_index=[0,1,6,7], replace_categorical_values_back=False)\nX2 = imputer2.fit_transform(df)\n```\n```\nXGBImputer - Epoch: 1 | Categorical gamma: inf/274. | Numerical gamma: inf/0.0020067522\nXGBImputer - Epoch: 2 | Categorical gamma: 274./0. | Numerical gamma: 0.0020067522/0.0000494584\nXGBImputer - Epoch: 3 | Categorical gamma: 0./0. | Numerical gamma: 0.0000494584/0.\nXGBImputer - Epoch: 4 | Categorical gamma: 0./0. | Numerical gamma: 0./0.\n```\n```\npd.DataFrame(X2).head(15)\n```\n```\n|    |   0 |   1 |       2 |   3 |   4 |       5 |   6 |   7 |\n|---:|----:|----:|--------:|----:|----:|--------:|----:|----:|\n|  0 |   2 |   1 | 34.5    |   0 |   0 |  7.8292 |  41 |   1 |\n|  1 |   2 |   0 | 47      |   1 |   0 |  7      |  28 |   2 |\n|  2 |   1 |   1 | 62      |   0 |   0 |  9.6875 |  41 |   1 |\n|  3 |   2 |   1 | 27      |   0 |   0 |  8.6625 |  30 |   2 |\n|  4 |   2 |   0 | 22      |   1 |   1 | 12.2875 |  28 |   2 |\n|  5 |   2 |   1 | 14      |   0 |   0 |  9.225  |  30 |   2 |\n|  6 |   2 |   0 | 30      |   0 |   0 |  7.6292 |  41 |   1 |\n|  7 |   1 |   1 | 26      |   1 |   1 | 29      |  30 |   2 |\n|  8 |   2 |   0 | 18      |   0 |   0 |  7.2292 |  15 |   0 |\n|  9 |   2 |   1 | 21      |   2 |   0 | 24.15   |  30 |   2 |\n| 10 |   2 |   1 | 24.7614 |   0 |   0 |  7.8958 |  30 |   2 |\n| 11 |   0 |   1 | 46      |   0 |   0 | 26      |  30 |   2 |\n| 12 |   0 |   0 | 23      |   1 |   0 | 82.2667 |  12 |   2 |\n| 13 |   1 |   1 | 63      |   1 |   0 | 26      |  30 |   2 |\n| 14 |   0 |   0 | 47      |   1 |   0 | 61.175  |  60 |   2 |\n```\n\n</div>\n\n## License\n\n<div class=\"termy\">\n\nLicensed under an [Apache-2](https://github.com/leonardodepaula/xgbimputer/blob/master/LICENSE) license.\n\n</div>\n\n## References\n\n<div class=\"termy\">\n\n* [1] [Daniel J. Stekhoven and Peter B\u00fchlmann. \"MissForest\u2014non-parametric missing value imputation for mixed-type data.\"](https://academic.oup.com/bioinformatics/article/28/1/112/219101)\n\n* [2] [XGBoost](https://xgboost.ai/)\n\n* [3] [scikit-learn](https://scikit-learn.org/)\n\n</div>\n\n\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Extreme Gradient Boosting imputer for Machine Learning.",
    "version": "0.2.0",
    "project_urls": {
        "Homepage": "https://github.com/leonardodepaula/xgbimputer"
    },
    "split_keywords": [
        "python",
        "machine learning",
        "missing values",
        "imputation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d289b72db749978d0fe38ca57a01a81cbfee8443e7b9c94b1166b2465b574176",
                "md5": "acc674b3b00294446dd498af9dc5c293",
                "sha256": "7bc396f38848f5ab15893e7a69c717b1f551c23d8b66c40930f2875cf924da70"
            },
            "downloads": -1,
            "filename": "xgbimputer-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "acc674b3b00294446dd498af9dc5c293",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 7961,
            "upload_time": "2022-03-14T18:41:57",
            "upload_time_iso_8601": "2022-03-14T18:41:57.597282Z",
            "url": "https://files.pythonhosted.org/packages/d2/89/b72db749978d0fe38ca57a01a81cbfee8443e7b9c94b1166b2465b574176/xgbimputer-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0a24f1e2dfee553204a4455c589ed2a6ae142a080a4ea5ddf89f9ce2853a2af3",
                "md5": "38d71d3a01f93049db37d26ea9f7cb96",
                "sha256": "78d3b8b0c85350e47d793033987a3b5805f8e5441fe7efb584449063964c62a9"
            },
            "downloads": -1,
            "filename": "xgbimputer-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "38d71d3a01f93049db37d26ea9f7cb96",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 9259,
            "upload_time": "2022-03-14T18:42:00",
            "upload_time_iso_8601": "2022-03-14T18:42:00.166582Z",
            "url": "https://files.pythonhosted.org/packages/0a/24/f1e2dfee553204a4455c589ed2a6ae142a080a4ea5ddf89f9ce2853a2af3/xgbimputer-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-03-14 18:42:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "leonardodepaula",
    "github_project": "xgbimputer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "xgbimputer"
}
        
Elapsed time: 0.25385s