## Overview
`shap-select` implements a heuristic for fast feature selection for tabular regression and classification models.

The basic idea is to run a linear or logistic regression of the target on the Shapley values of
the original features, computed on the validation set,
discard the features with negative coefficients, and rank/filter the rest according to their
statistical significance. For motivation and details, see the [example notebook](https://github.com/transferwise/shap-select/blob/main/docs/Quick%20feature%20selection%20through%20regression%20on%20Shapley%20values.ipynb).

Earlier packages that use Shapley values for feature selection exist; the advantages of this one are:
* Regression on the **validation set** to combat overfitting
* Only a single fit of the original model needed
* A single intuitive hyperparameter for feature selection: statistical significance
* Bonferroni correction for multiclass classification
* Addresses collinearity of (Shapley value) features by repeated (linear/logistic) regression
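
The regression step described above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the package's implementation: it covers a single ordinary-least-squares pass for the regression task only (no logistic regression, Bonferroni correction, or repeated regression), it uses a normal approximation for the p-values, and the Shapley values are those of a hand-made linear model with independent features, for which the Shapley value of feature `j` is `w_j * (x_j - mean(x_j))`. The function name `rank_by_shap_regression` and the synthetic data are hypothetical.

```python
import math
import numpy as np


def rank_by_shap_regression(shap_values, y, feature_names, threshold=0.05):
    """Regress y on Shapley-value columns (OLS) and flag each feature.

    Flags follow the convention visible in the example output table:
    1 = kept, 0 = not significant, -1 = negative coefficient.
    """
    n, k = shap_values.shape
    design = np.column_stack([np.ones(n), shap_values])  # add an intercept
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - k - 1)  # unbiased residual variance
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    t_values = beta / std_err

    rows = []
    for name, coef, t in zip(feature_names, beta[1:], t_values[1:]):
        # Assumption: two-sided p-value from a normal approximation to the t distribution
        p = math.erfc(abs(t) / math.sqrt(2))
        selected = -1 if coef < 0 else int(p < threshold)
        rows.append({"feature name": name, "t-value": float(t),
                     "stat.significance": p, "coefficient": float(coef),
                     "selected": selected})
    return sorted(rows, key=lambda r: r["t-value"], reverse=True)


# Synthetic validation set (hypothetical): x_a and x_b carry signal,
# x_flip has a negative true effect, x_noise has none.
rng = np.random.default_rng(0)
n = 2000
X_val = rng.normal(size=(n, 4))
y_val = 2.0 * X_val[:, 0] + 1.0 * X_val[:, 1] - 0.5 * X_val[:, 2] + rng.normal(size=n)

# Stand-in for a fitted model: a linear model that learned the wrong sign
# for x_flip and a small spurious weight for x_noise.
model_weights = np.array([2.0, 1.0, 0.5, 0.1])
shap_vals = model_weights * (X_val - X_val.mean(axis=0))

rows = rank_by_shap_regression(shap_vals, y_val, ["x_a", "x_b", "x_flip", "x_noise"])
for row in rows:
    print(row)
```

In this setup the Shapley values of `x_flip` move against the target on the validation set, so its coefficient comes out negative and it is flagged `-1`, while `x_a` and `x_b` receive coefficients close to 1 and are kept.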
## Usage
```python
from shap_select import shap_select
# Here model is any model supported by the shap library, fitted on a different (train) dataset
# Task can be regression, binary, or multiclass
selected_features_df = shap_select(model, X_val, y_val, task="multiclass", threshold=0.05)
```
<table id="T_694ab">
<thead>
<tr>
<th class="blank level0" > </th>
<th id="T_694ab_level0_col0" class="col_heading level0 col0" >feature name</th>
<th id="T_694ab_level0_col1" class="col_heading level0 col1" >t-value</th>
<th id="T_694ab_level0_col2" class="col_heading level0 col2" >stat.significance</th>
<th id="T_694ab_level0_col3" class="col_heading level0 col3" >coefficient</th>
<th id="T_694ab_level0_col4" class="col_heading level0 col4" >selected</th>
</tr>
</thead>
<tbody>
<tr>
<th id="T_694ab_level0_row0" class="row_heading level0 row0" >0</th>
<td id="T_694ab_row0_col0" class="data row0 col0" >x5</td>
<td id="T_694ab_row0_col1" class="data row0 col1" >20.211299</td>
<td id="T_694ab_row0_col2" class="data row0 col2" >0.000000</td>
<td id="T_694ab_row0_col3" class="data row0 col3" >1.052030</td>
<td id="T_694ab_row0_col4" class="data row0 col4" >1</td>
</tr>
<tr>
<th id="T_694ab_level0_row1" class="row_heading level0 row1" >1</th>
<td id="T_694ab_row1_col0" class="data row1 col0" >x4</td>
<td id="T_694ab_row1_col1" class="data row1 col1" >18.315144</td>
<td id="T_694ab_row1_col2" class="data row1 col2" >0.000000</td>
<td id="T_694ab_row1_col3" class="data row1 col3" >0.952416</td>
<td id="T_694ab_row1_col4" class="data row1 col4" >1</td>
</tr>
<tr>
<th id="T_694ab_level0_row2" class="row_heading level0 row2" >2</th>
<td id="T_694ab_row2_col0" class="data row2 col0" >x3</td>
<td id="T_694ab_row2_col1" class="data row2 col1" >6.835690</td>
<td id="T_694ab_row2_col2" class="data row2 col2" >0.000000</td>
<td id="T_694ab_row2_col3" class="data row2 col3" >1.098154</td>
<td id="T_694ab_row2_col4" class="data row2 col4" >1</td>
</tr>
<tr>
<th id="T_694ab_level0_row3" class="row_heading level0 row3" >3</th>
<td id="T_694ab_row3_col0" class="data row3 col0" >x2</td>
<td id="T_694ab_row3_col1" class="data row3 col1" >6.457140</td>
<td id="T_694ab_row3_col2" class="data row3 col2" >0.000000</td>
<td id="T_694ab_row3_col3" class="data row3 col3" >1.044842</td>
<td id="T_694ab_row3_col4" class="data row3 col4" >1</td>
</tr>
<tr>
<th id="T_694ab_level0_row4" class="row_heading level0 row4" >4</th>
<td id="T_694ab_row4_col0" class="data row4 col0" >x1</td>
<td id="T_694ab_row4_col1" class="data row4 col1" >5.530556</td>
<td id="T_694ab_row4_col2" class="data row4 col2" >0.000000</td>
<td id="T_694ab_row4_col3" class="data row4 col3" >0.917242</td>
<td id="T_694ab_row4_col4" class="data row4 col4" >1</td>
</tr>
<tr>
<th id="T_694ab_level0_row5" class="row_heading level0 row5" >5</th>
<td id="T_694ab_row5_col0" class="data row5 col0" >x6</td>
<td id="T_694ab_row5_col1" class="data row5 col1" >2.390868</td>
<td id="T_694ab_row5_col2" class="data row5 col2" >0.016827</td>
<td id="T_694ab_row5_col3" class="data row5 col3" >1.497983</td>
<td id="T_694ab_row5_col4" class="data row5 col4" >1</td>
</tr>
<tr>
<th id="T_694ab_level0_row6" class="row_heading level0 row6" >6</th>
<td id="T_694ab_row6_col0" class="data row6 col0" >x7</td>
<td id="T_694ab_row6_col1" class="data row6 col1" >0.901098</td>
<td id="T_694ab_row6_col2" class="data row6 col2" >0.367558</td>
<td id="T_694ab_row6_col3" class="data row6 col3" >2.865508</td>
<td id="T_694ab_row6_col4" class="data row6 col4" >0</td>
</tr>
<tr>
<th id="T_694ab_level0_row7" class="row_heading level0 row7" >7</th>
<td id="T_694ab_row7_col0" class="data row7 col0" >x8</td>
<td id="T_694ab_row7_col1" class="data row7 col1" >0.563214</td>
<td id="T_694ab_row7_col2" class="data row7 col2" >0.573302</td>
<td id="T_694ab_row7_col3" class="data row7 col3" >1.933632</td>
<td id="T_694ab_row7_col4" class="data row7 col4" >0</td>
</tr>
<tr>
<th id="T_694ab_level0_row8" class="row_heading level0 row8" >8</th>
<td id="T_694ab_row8_col0" class="data row8 col0" >x9</td>
<td id="T_694ab_row8_col1" class="data row8 col1" >-1.607814</td>
<td id="T_694ab_row8_col2" class="data row8 col2" >0.107908</td>
<td id="T_694ab_row8_col3" class="data row8 col3" >-4.537098</td>
<td id="T_694ab_row8_col4" class="data row8 col4" >-1</td>
</tr>
</tbody>
</table>
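
A typical next step is to pull the kept feature names out of the result. The sketch below assumes the returned object is a pandas DataFrame with the column names shown in the table above, and that `selected` uses `1` for kept, `0` for not significant, and `-1` for a negative coefficient; that reading is inferred from the table and the overview rather than from documented API. The frame is rebuilt by hand from the table values so the snippet runs without a fitted model.

```python
import pandas as pd

# Values copied from the example table above; in practice this frame would be
# the return value of shap_select(...).
selected_features_df = pd.DataFrame({
    "feature name": ["x5", "x4", "x3", "x2", "x1", "x6", "x7", "x8", "x9"],
    "t-value": [20.211299, 18.315144, 6.835690, 6.457140, 5.530556,
                2.390868, 0.901098, 0.563214, -1.607814],
    "stat.significance": [0.0, 0.0, 0.0, 0.0, 0.0,
                          0.016827, 0.367558, 0.573302, 0.107908],
    "coefficient": [1.052030, 0.952416, 1.098154, 1.044842, 0.917242,
                    1.497983, 2.865508, 1.933632, -4.537098],
    "selected": [1, 1, 1, 1, 1, 1, 0, 0, -1],
})

# Features that passed the significance threshold
kept = selected_features_df.loc[
    selected_features_df["selected"] == 1, "feature name"
].tolist()

# Features discarded because their coefficient was negative
negative = selected_features_df.loc[
    selected_features_df["selected"] == -1, "feature name"
].tolist()

print(kept)
print(negative)
```

The `kept` list can then be used to subset the training columns before refitting the model.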
Raw data
{
"_id": null,
"home_page": "https://github.com/transferwise/shap-select",
"name": "shap-select",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "shap-select",
"author": "Wise Plc",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/ba/19/c45eee82dfa35673533501330b044007346605026ab917038cbd9702ecc9/shap-select-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Heuristic for quick feature selection for tabular regression/classification using shapley values",
"version": "0.1.0",
"project_urls": {
"Homepage": "https://github.com/transferwise/shap-select"
},
"split_keywords": [
"shap-select"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ba19c45eee82dfa35673533501330b044007346605026ab917038cbd9702ecc9",
"md5": "3bb40dc1450362572cc254124d897528",
"sha256": "76f72cb564f60a3422af3dac1432b319e381901bd65c96e062d58ca707f91b6d"
},
"downloads": -1,
"filename": "shap-select-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "3bb40dc1450362572cc254124d897528",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 11551,
"upload_time": "2024-10-03T12:22:05",
"upload_time_iso_8601": "2024-10-03T12:22:05.455378Z",
"url": "https://files.pythonhosted.org/packages/ba/19/c45eee82dfa35673533501330b044007346605026ab917038cbd9702ecc9/shap-select-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-03 12:22:05",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "transferwise",
"github_project": "shap-select",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "shap-select"
}