# powershap

- **Version:** 0.0.9
- **Home page:** https://github.com/predict-idlab/powershap
- **Summary:** Feature selection using statistical significance of shap values
- **Upload time:** 2023-01-12 15:56:25
- **Author:** Jarne Verhaeghe, Jeroen Van Der Donckt
- **Requires Python:** >=3.7,<3.11
- **License:** MIT
- **Keywords:** feature selection, shap, data-science, machine learning
            	
<p align="center">
    <a href="#readme">
        <img alt="PowerShap logo" src="https://raw.githubusercontent.com/predict-idlab/powershap/main/powershap_full_scaled.png" width=70%>
    </a>
</p>

[![PyPI Latest Release](https://img.shields.io/pypi/v/powershap.svg)](https://pypi.org/project/powershap/)
[![support-version](https://img.shields.io/pypi/pyversions/powershap)](https://img.shields.io/pypi/pyversions/powershap)
[![codecov](https://img.shields.io/codecov/c/github/predict-idlab/powershap?logo=codecov)](https://codecov.io/gh/predict-idlab/powershap)
[![Downloads](https://pepy.tech/badge/powershap)](https://pepy.tech/project/powershap)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?)](http://makeapullrequest.com)
[![Testing](https://github.com/predict-idlab/powershap/actions/workflows/test.yml/badge.svg)](https://github.com/predict-idlab/powershap/actions/workflows/test.yml)
[![DOI](https://zenodo.org/badge/470633431.svg)](https://zenodo.org/badge/latestdoi/470633431)

> *powershap* is a **feature selection method** that uses statistical hypothesis testing and power calculations on **Shapley values**, enabling fast and intuitive wrapper-based feature selection.  

## Installation ⚙️

| [**pip**](https://pypi.org/project/powershap/) | `pip install powershap` | 
| ---| ----|

## Usage 🛠

*powershap* is built to be intuitive: it supports various models, including linear, tree-based, and even deep learning models, for both classification and regression tasks.  
<!-- It is also implemented as an sklearn `Transformer` component, allowing convenient integration in `sklearn` pipelines. -->

```py
from powershap import PowerShap
from catboost import CatBoostClassifier

X, y = ...  # your classification dataset

selector = PowerShap(
    model=CatBoostClassifier(n_estimators=250, verbose=0, use_best_model=True)
)

selector.fit(X, y)  # Fit the PowerShap feature selector
selector.transform(X)  # Reduce the dataset to the selected features

```
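
Because `PowerShap` follows the scikit-learn `fit`/`transform` interface, it can in principle be chained with a downstream estimator in a pipeline. A minimal sketch, reusing the toy data from the snippet above (the final estimator and its parameters are arbitrary, illustrative choices):

```py
from catboost import CatBoostClassifier
from sklearn.pipeline import make_pipeline

from powershap import PowerShap

pipe = make_pipeline(
    # Step 1: powershap selects the informative features
    PowerShap(model=CatBoostClassifier(n_estimators=250, verbose=0, use_best_model=True)),
    # Step 2: a fresh model is trained on the selected features only
    CatBoostClassifier(n_estimators=250, verbose=0),
)

pipe.fit(X, y)    # feature selection, then final model training
pipe.predict(X)   # predictions use only the selected features
```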

## Features ✨

* default automatic mode
* `scikit-learn` compatible
* supports various models
* insights into the feature selection method: inspect the `_processed_shaps_df` attribute of a fitted `PowerShap` feature selector (see the example below this list).
* tested code!
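
For example, reusing the fitted `selector` from the usage snippet above (the exact columns of this DataFrame depend on the installed powershap version):

```py
# Per-feature statistics gathered during the power analysis
# (e.g. the p-values used for the selection decision).
print(selector._processed_shaps_df)
```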

## Benchmarks ⏱

Check out our benchmark results [here](examples/results/).  

## How does it work ⁉️

Powershap is built on the core assumption that *an informative feature will have a larger impact on the prediction than a known random feature.*

* Powershap trains multiple models with different random seeds on different subsets of the data. In each iteration, it adds a uniformly random feature to the dataset before training.
* Within an iteration, after training the model, powershap calculates the absolute Shapley values of all features, including the random feature. If there are multiple outputs or multiple classes, powershap takes the maximum across these outputs. These values are then averaged per feature, representing the impact of that feature in this iteration.
* After all iterations, each feature thus has an array of impacts. The impact array of each feature is compared to the average of the random feature's impact array using the percentile formula, yielding a p-value. This tests whether the feature has a larger impact than the random feature: if so, the p-value is low (see the sketch after this list).
* Powershap then outputs all features with a p-value below the provided threshold, which defaults to 0.01.
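
The percentile comparison boils down to an empirical p-value. A minimal sketch of the idea (illustrative only; powershap's own implementation may differ in details such as percentile interpolation):

```py
import numpy as np

def empirical_p_value(feature_impacts, random_impacts):
    """Fraction of iterations in which the feature's mean |SHAP| impact does
    not exceed the average impact of the injected random feature."""
    feature_impacts = np.asarray(feature_impacts, dtype=float)
    mean_random_impact = float(np.mean(random_impacts))
    # Low p-value: the feature beats the random baseline in (almost) every iteration.
    return float(np.mean(feature_impacts <= mean_random_impact))

# Toy example: an informative feature vs. a useless one (10 iterations each)
informative = [0.30, 0.28, 0.35, 0.31, 0.29, 0.33, 0.27, 0.32, 0.30, 0.34]
useless     = [0.02, 0.05, 0.01, 0.04, 0.03, 0.02, 0.05, 0.01, 0.03, 0.02]
random_feat = [0.04, 0.03, 0.05, 0.02, 0.04, 0.03, 0.05, 0.02, 0.04, 0.03]

print(empirical_p_value(informative, random_feat))  # 0.0  -> selected at threshold 0.01
print(empirical_p_value(useless, random_feat))      # 0.7  -> rejected
```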


### Automatic mode 🤖

The required number of iterations and the threshold value are hyperparameters of powershap. However, to *avoid manual hyperparameter tuning*, powershap by default uses an automatic mode that determines these hyperparameters for you.

* Automatic mode starts by executing powershap for ten iterations.
* Then, for each feature, powershap calculates the effect size and the statistical power of the test using a Student's t power test.
* Using the calculated effect size, powershap then calculates the number of iterations required to reach a predefined power requirement. By default this requirement is 0.99, which corresponds to a false negative probability of 0.01 (a sketch of this calculation follows the list).
* If the required number of iterations exceeds the iterations already performed, powershap executes the additional iterations.
* Afterward, powershap recalculates the required iterations and keeps re-executing until the requirement is met.
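
A minimal sketch of such a power calculation using `statsmodels` (the effect-size definition and solver choice here are illustrative assumptions; powershap's internal implementation may differ):

```py
import numpy as np
from statsmodels.stats.power import TTestPower

def required_iterations(feature_impacts, random_impacts, power=0.99, alpha=0.01):
    """Estimate how many iterations are needed to reach the requested power,
    based on the effect size observed in the iterations performed so far."""
    diffs = np.asarray(feature_impacts, dtype=float) - np.mean(random_impacts)
    effect_size = np.mean(diffs) / (np.std(diffs, ddof=1) + 1e-12)  # Cohen's d-style
    if effect_size <= 0:
        # No evidence the feature beats the random baseline.
        return np.inf
    # Solve the one-sample t-test power equation for the number of observations.
    n = TTestPower().solve_power(
        effect_size=effect_size, alpha=alpha, power=power, alternative="larger"
    )
    return int(np.ceil(n))

# After e.g. the 10 initial iterations:
# required_iterations(feature_impacts, random_impacts) > 10  -> run the extra iterations
```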

## Referencing our package :memo:

If you use *powershap* in a scientific publication, we would highly appreciate it if you cite us as follows:

```bibtex
@misc{https://doi.org/10.48550/arxiv.2206.08394,
  doi = {10.48550/ARXIV.2206.08394},
  url = {https://arxiv.org/abs/2206.08394},
  author = {Verhaeghe, Jarne and Van Der Donckt, Jeroen and Ongenae, Femke and Van Hoecke, Sofie},
  keywords = {Machine Learning (cs.LG), Machine Learning (stat.ML), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Powershap: A Power-full Shapley Feature Selection Method},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

```

The paper has been accepted at ECML PKDD 2022. The preprint can be found on arXiv ([link](https://arxiv.org/abs/2206.08394)) and in the GitHub repository.

---

<p align="center">
👤 <i>Jarne Verhaeghe, Jeroen Van Der Donckt</i>
</p>

            
