shap-select


* Name: shap-select
* Version: 0.1.0
* Home page: https://github.com/transferwise/shap-select
* Summary: Heuristic for quick feature selection for tabular regression/classification using Shapley values
* Upload time: 2024-10-03 12:22:05
* Maintainer: None
* Docs URL: None
* Author: Wise Plc
* Requires Python: None
* License: None
* Keywords: shap-select
* Requirements: No requirements were recorded.

## Overview
`shap-select` implements a heuristic for fast feature selection for tabular regression and classification models.

The basic idea is to run a linear or logistic regression of the target on the Shapley values of
the original features, computed on the validation set,
then discard the features with negative coefficients and rank/filter the rest according to their
statistical significance. For motivation and details, see the [example notebook](https://github.com/transferwise/shap-select/blob/main/docs/Quick%20feature%20selection%20through%20regression%20on%20Shapley%20values.ipynb).
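
To make the idea concrete, here is a minimal, standard-library-only sketch of the regression step for a regression task. It is not the package's implementation: the Shapley values are replaced by synthetic stand-in columns, the helper names (`solve`, `ols_t_values`, `rank_features`) are hypothetical, and the p-values use a large-sample normal approximation instead of the exact t-distribution.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_t_values(columns, y):
    """Regress y on the columns plus an intercept; return (coefficients, t-values) for the columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    beta = solve(xtx, xty)
    residuals = [y[i] - sum(b * x for b, x in zip(beta, design[i])) for i in range(n)]
    sigma2 = sum(r * r for r in residuals) / (n - k)
    # Diagonal of (X'X)^-1, obtained with one unit-vector solve per coefficient.
    inv_diag = [solve(xtx, [1.0 if i == j else 0.0 for i in range(k)])[j] for j in range(k)]
    t_values = [beta[j] / math.sqrt(sigma2 * inv_diag[j]) for j in range(k)]
    return beta[1:], t_values[1:]  # drop the intercept


def rank_features(shap_columns, y, threshold=0.05):
    """Flag negative coefficients with -1, then mark the rest as significant (1) or not (0)."""
    coefs, t_values = ols_t_values(list(shap_columns.values()), y)
    rows = []
    for name, coef, t in zip(shap_columns, coefs, t_values):
        p = math.erfc(abs(t) / math.sqrt(2))  # two-sided p-value, normal approximation
        selected = -1 if coef < 0 else int(p < threshold)
        rows.append({"feature": name, "coefficient": coef, "t_value": t,
                     "p_value": p, "selected": selected})
    return sorted(rows, key=lambda r: r["t_value"], reverse=True)


# Synthetic stand-ins for per-feature Shapley values on a validation set.
rng = random.Random(0)
n = 200
shap_columns = {
    "x_a": [rng.gauss(0, 1) for _ in range(n)],
    "x_b": [rng.gauss(0, 1) for _ in range(n)],
    "x_noise": [rng.gauss(0, 1) for _ in range(n)],
}
# The target depends on x_a and x_b only; x_noise is unrelated to it.
y = [a + b + rng.gauss(0, 0.5) for a, b in zip(shap_columns["x_a"], shap_columns["x_b"])]

for row in rank_features(shap_columns, y):
    print(row["feature"], round(row["t_value"], 2), row["selected"])
```

In this toy setup the two informative columns get large positive t-values and are kept, while the unrelated column ranks last. For binary or multiclass tasks the same logic would use logistic regression instead of least squares.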

Earlier packages that use Shapley values for feature selection exist; the advantages of this one are:
* Regression on the **validation set** to combat overfitting
* Only a single fit of the original model needed
* A single intuitive hyperparameter for feature selection: statistical significance
* Bonferroni correction for multiclass classification
* Collinearity of the (Shapley value) features is addressed by repeated (linear/logistic) regression
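
As an illustration of the Bonferroni bullet, the sketch below shows one common way such a correction can be applied. The details are assumptions, not taken from the package: it supposes one regression per class, so that each feature has one p-value per class, and it combines them as the smallest p-value multiplied by the number of classes, capped at 1. The function name `bonferroni_min_p` and the p-values are hypothetical.

```python
def bonferroni_min_p(p_values_per_class):
    """Combine one feature's per-class p-values: smallest p times the number of tests, capped at 1."""
    n_tests = len(p_values_per_class)
    return min(1.0, min(p_values_per_class) * n_tests)


# Hypothetical p-values for two features across three classes.
per_class = {
    "x1": [0.30, 0.004, 0.50],
    "x2": [0.03, 0.20, 0.60],
}
threshold = 0.05

for name, p_values in per_class.items():
    corrected = bonferroni_min_p(p_values)
    print(name, round(corrected, 3), corrected < threshold)
```

Here `x2` would pass the threshold on its best uncorrected p-value (0.03) but not after correction (0.09), which is the point of correcting for having tested several classes.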

## Usage
```python
from shap_select import shap_select
# `model` is any model supported by the shap library, already fitted on a separate (train) dataset
# `task` can be "regression", "binary", or "multiclass"
selected_features_df = shap_select(model, X_val, y_val, task="multiclass", threshold=0.05)
```

<table id="T_694ab">
  <thead>
    <tr>
      <th class="blank level0" >&nbsp;</th>
      <th id="T_694ab_level0_col0" class="col_heading level0 col0" >feature name</th>
      <th id="T_694ab_level0_col1" class="col_heading level0 col1" >t-value</th>
      <th id="T_694ab_level0_col2" class="col_heading level0 col2" >stat.significance</th>
      <th id="T_694ab_level0_col3" class="col_heading level0 col3" >coefficient</th>
      <th id="T_694ab_level0_col4" class="col_heading level0 col4" >selected</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_694ab_level0_row0" class="row_heading level0 row0" >0</th>
      <td id="T_694ab_row0_col0" class="data row0 col0" >x5</td>
      <td id="T_694ab_row0_col1" class="data row0 col1" >20.211299</td>
      <td id="T_694ab_row0_col2" class="data row0 col2" >0.000000</td>
      <td id="T_694ab_row0_col3" class="data row0 col3" >1.052030</td>
      <td id="T_694ab_row0_col4" class="data row0 col4" >1</td>
    </tr>
    <tr>
      <th id="T_694ab_level0_row1" class="row_heading level0 row1" >1</th>
      <td id="T_694ab_row1_col0" class="data row1 col0" >x4</td>
      <td id="T_694ab_row1_col1" class="data row1 col1" >18.315144</td>
      <td id="T_694ab_row1_col2" class="data row1 col2" >0.000000</td>
      <td id="T_694ab_row1_col3" class="data row1 col3" >0.952416</td>
      <td id="T_694ab_row1_col4" class="data row1 col4" >1</td>
    </tr>
    <tr>
      <th id="T_694ab_level0_row2" class="row_heading level0 row2" >2</th>
      <td id="T_694ab_row2_col0" class="data row2 col0" >x3</td>
      <td id="T_694ab_row2_col1" class="data row2 col1" >6.835690</td>
      <td id="T_694ab_row2_col2" class="data row2 col2" >0.000000</td>
      <td id="T_694ab_row2_col3" class="data row2 col3" >1.098154</td>
      <td id="T_694ab_row2_col4" class="data row2 col4" >1</td>
    </tr>
    <tr>
      <th id="T_694ab_level0_row3" class="row_heading level0 row3" >3</th>
      <td id="T_694ab_row3_col0" class="data row3 col0" >x2</td>
      <td id="T_694ab_row3_col1" class="data row3 col1" >6.457140</td>
      <td id="T_694ab_row3_col2" class="data row3 col2" >0.000000</td>
      <td id="T_694ab_row3_col3" class="data row3 col3" >1.044842</td>
      <td id="T_694ab_row3_col4" class="data row3 col4" >1</td>
    </tr>
    <tr>
      <th id="T_694ab_level0_row4" class="row_heading level0 row4" >4</th>
      <td id="T_694ab_row4_col0" class="data row4 col0" >x1</td>
      <td id="T_694ab_row4_col1" class="data row4 col1" >5.530556</td>
      <td id="T_694ab_row4_col2" class="data row4 col2" >0.000000</td>
      <td id="T_694ab_row4_col3" class="data row4 col3" >0.917242</td>
      <td id="T_694ab_row4_col4" class="data row4 col4" >1</td>
    </tr>
    <tr>
      <th id="T_694ab_level0_row5" class="row_heading level0 row5" >5</th>
      <td id="T_694ab_row5_col0" class="data row5 col0" >x6</td>
      <td id="T_694ab_row5_col1" class="data row5 col1" >2.390868</td>
      <td id="T_694ab_row5_col2" class="data row5 col2" >0.016827</td>
      <td id="T_694ab_row5_col3" class="data row5 col3" >1.497983</td>
      <td id="T_694ab_row5_col4" class="data row5 col4" >1</td>
    </tr>
    <tr>
      <th id="T_694ab_level0_row6" class="row_heading level0 row6" >6</th>
      <td id="T_694ab_row6_col0" class="data row6 col0" >x7</td>
      <td id="T_694ab_row6_col1" class="data row6 col1" >0.901098</td>
      <td id="T_694ab_row6_col2" class="data row6 col2" >0.367558</td>
      <td id="T_694ab_row6_col3" class="data row6 col3" >2.865508</td>
      <td id="T_694ab_row6_col4" class="data row6 col4" >0</td>
    </tr>
    <tr>
      <th id="T_694ab_level0_row7" class="row_heading level0 row7" >7</th>
      <td id="T_694ab_row7_col0" class="data row7 col0" >x8</td>
      <td id="T_694ab_row7_col1" class="data row7 col1" >0.563214</td>
      <td id="T_694ab_row7_col2" class="data row7 col2" >0.573302</td>
      <td id="T_694ab_row7_col3" class="data row7 col3" >1.933632</td>
      <td id="T_694ab_row7_col4" class="data row7 col4" >0</td>
    </tr>
    <tr>
      <th id="T_694ab_level0_row8" class="row_heading level0 row8" >8</th>
      <td id="T_694ab_row8_col0" class="data row8 col0" >x9</td>
      <td id="T_694ab_row8_col1" class="data row8 col1" >-1.607814</td>
      <td id="T_694ab_row8_col2" class="data row8 col2" >0.107908</td>
      <td id="T_694ab_row8_col3" class="data row8 col3" >-4.537098</td>
      <td id="T_694ab_row8_col4" class="data row8 col4" >-1</td>
    </tr>
  </tbody>
</table>
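
Reading the table: the `selected` column appears to encode three outcomes, `1` for a kept feature, `0` for a feature whose coefficient is not significant at the threshold, and `-1` for a feature with a negative coefficient. This reading is inferred from the example output and the Overview, not from the package's API documentation. The sketch below checks it against the rows above, using plain tuples and a hypothetical helper `selection_flag` instead of the returned DataFrame.

```python
# Rows copied from the example output above:
# (feature name, t-value, stat.significance, coefficient, selected)
rows = [
    ("x5", 20.211299, 0.000000, 1.052030, 1),
    ("x4", 18.315144, 0.000000, 0.952416, 1),
    ("x3", 6.835690, 0.000000, 1.098154, 1),
    ("x2", 6.457140, 0.000000, 1.044842, 1),
    ("x1", 5.530556, 0.000000, 0.917242, 1),
    ("x6", 2.390868, 0.016827, 1.497983, 1),
    ("x7", 0.901098, 0.367558, 2.865508, 0),
    ("x8", 0.563214, 0.573302, 1.933632, 0),
    ("x9", -1.607814, 0.107908, -4.537098, -1),
]


def selection_flag(coefficient, significance, threshold=0.05):
    """Assumed rule: -1 for a negative coefficient, else 1 if significant, else 0."""
    if coefficient < 0:
        return -1
    return 1 if significance < threshold else 0


# The assumed rule reproduces the `selected` column for every row of the example.
for name, _, significance, coefficient, selected in rows:
    assert selection_flag(coefficient, significance) == selected

# Features to keep for the next model fit.
kept = [name for name, *_, selected in rows if selected == 1]
print(kept)
```

With the example's threshold of 0.05 this keeps `x1` through `x6` and drops `x7`, `x8`, and `x9`.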



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/transferwise/shap-select",
    "name": "shap-select",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "shap-select",
    "author": "Wise Plc",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/ba/19/c45eee82dfa35673533501330b044007346605026ab917038cbd9702ecc9/shap-select-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Heuristic for quick feature selection for tabular regression/classification using shapley values",
    "version": "0.1.0",
    "project_urls": {
        "Homepage": "https://github.com/transferwise/shap-select"
    },
    "split_keywords": [
        "shap-select"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ba19c45eee82dfa35673533501330b044007346605026ab917038cbd9702ecc9",
                "md5": "3bb40dc1450362572cc254124d897528",
                "sha256": "76f72cb564f60a3422af3dac1432b319e381901bd65c96e062d58ca707f91b6d"
            },
            "downloads": -1,
            "filename": "shap-select-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "3bb40dc1450362572cc254124d897528",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 11551,
            "upload_time": "2024-10-03T12:22:05",
            "upload_time_iso_8601": "2024-10-03T12:22:05.455378Z",
            "url": "https://files.pythonhosted.org/packages/ba/19/c45eee82dfa35673533501330b044007346605026ab917038cbd9702ecc9/shap-select-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-03 12:22:05",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "transferwise",
    "github_project": "shap-select",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "shap-select"
}