<p align="center">
<img src="https://github.com/pievos101/TreeSmoothing/blob/main/blurredForest.jpg" width="400">
</p>
The picture above is from https://www.blackandwhite.ie/mononeil/blurred-forest
# Bayesian post-hoc regularization for random forests
## Method Description
Random forests are powerful ensemble learning algorithms used across many machine learning tasks. However, they tend to overfit noisy or irrelevant features, which can reduce generalization performance. Post-hoc regularization techniques aim to mitigate this by modifying the learned ensemble after training.
Here, we propose Bayesian post-hoc regularization, which leverages the reliable patterns captured by leaf nodes closer to the root while reducing the impact of more specific, potentially noisy leaf nodes deeper in the tree. This acts as a form of pruning that leaves the structure of the trees unchanged and instead adjusts the influence of leaf nodes according to their proximity to the root node. We have evaluated the method on various machine learning data sets. It performs competitively with state-of-the-art methods and, in certain cases, surpasses them in predictive accuracy and generalization.
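To give some intuition for why a Bayesian prior dampens deep, sparsely populated leaves, the sketch below shows generic Beta-Binomial conjugate updating. It is an illustration only and is not taken from the `treesmoothing` source: the function name `beta_posterior_mean` is hypothetical, and the update rule the package actually applies along a tree path may differ.

```python
def beta_posterior_mean(alpha, beta, n_pos, n_neg):
    """Posterior mean of a class-1 probability under a Beta(alpha, beta) prior,
    after observing n_pos positive and n_neg negative samples."""
    return (alpha + n_pos) / (alpha + beta + n_pos + n_neg)


# A deep leaf holding only three training samples, all of class 1.
raw_estimate = 3 / 3  # unregularized leaf frequency

# Weak prior: the estimate moves only slightly away from the raw frequency.
weak = beta_posterior_mean(alpha=1, beta=1, n_pos=3, n_neg=0)

# Strong prior centered on 0.5: the small leaf is pulled much closer to the prior mean.
strong = beta_posterior_mean(alpha=50, beta=50, n_pos=3, n_neg=0)

print(raw_estimate, weak, strong)
```

Larger `alpha` and `beta` values behave like more pseudo-observations, so a leaf with few samples contributes less to the final estimate.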
## Code Description
All classes inherit from `ShrinkageEstimator`, which extends `sklearn.base.BaseEstimator`.
`ShrinkageClassifier` and its regressor counterpart are used in the same way, and both work just like any other `sklearn` estimator:
- `__init__()` parameters:
    - `base_estimator`: the estimator around which we "wrap" hierarchical shrinkage. This should be a tree-based estimator: `DecisionTreeClassifier`, `RandomForestClassifier`, ... (analogous for `Regressor`s)
    - `shrink_mode`: 2 options:
        - `"hs"`: classical Hierarchical Shrinkage (from Agarwal et al. 2022)
        - `"beta"`: Bayesian post-hoc regularization (from Pfeifer 2023)
    - `lmb`: lambda hyperparameter (tuned for `"hs"` in the usage example)
    - `alpha`: alpha hyperparameter (tuned for `"beta"` in the usage example)
    - `beta`: beta hyperparameter (tuned for `"beta"` in the usage example)
    - `random_state`: random state for reproducibility
- Other functions: `fit(X, y)`, `predict(X)`, `predict_proba(X)`, `score(X, y)` work just like with any other `sklearn` estimator.
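As a rough illustration of what the `lmb` parameter controls in `"hs"` mode, the sketch below applies the hierarchical shrinkage formula of Agarwal et al. (2022) to a single root-to-leaf path. It is a standalone sketch, not the package's implementation: the function name `hierarchical_shrinkage` and the example node statistics are made up for illustration.

```python
def hierarchical_shrinkage(path_means, path_counts, lmb):
    """Shrunk prediction for one root-to-leaf path.

    path_means[i]  : mean response of the training samples in the i-th node on the path
    path_counts[i] : number of training samples in that node
    lmb            : regularization strength (lambda)
    """
    prediction = path_means[0]  # start from the root mean
    for depth in range(1, len(path_means)):
        step = path_means[depth] - path_means[depth - 1]
        # Each step is damped according to the sample count of the parent node.
        prediction += step / (1.0 + lmb / path_counts[depth - 1])
    return prediction


# Hypothetical path: root -> internal node -> leaf
path_means = [0.40, 0.60, 0.90]
path_counts = [100, 30, 5]

for lmb in (0.0, 10.0, 100.0):
    print(lmb, hierarchical_shrinkage(path_means, path_counts, lmb))
```

With `lmb = 0` the prediction equals the unregularized leaf mean; as `lmb` grows, the prediction moves toward the root mean.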
## Usage
Install the Python package `treesmoothing` via pip:
```bash
pip install treesmoothing
```
and import `ShrinkageClassifier`:
```python
from treesmoothing import ShrinkageClassifier
```
Alternatively, install from a local copy of the source and import the main class:
```python
# Run in a shell first: pip install ./treesmoothing
from treesmoothing import ShrinkageClassifier
```
Other imports
```python
from imodels.util.data_util import get_clean_dataset
import numpy as np
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import roc_auc_score
import sys
```
Example data set
```python
clf_datasets = [
("breast-cancer", "breast_cancer", "imodels")
]
# scoring
#sc = "balanced_accuracy"
sc = "roc_auc"
# number of trees
ntrees = 10
# Read in data set
X, y, feature_names = get_clean_dataset('breast_cancer', data_source='imodels')
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
scores = {}
scores["vanilla"] = []
scores["hs"] = []
scores["beta"] = []
```
### Vanilla Random Forest
```python
# vanilla RF ##########################################
print("Vanilla Mode")
shrink_mode="vanilla"
#######################################################
clf = RandomForestClassifier(n_estimators=ntrees)
clf.fit(X_train, y_train)
if sc == "balanced_accuracy":
    pred_vanilla = clf.predict(X_test)
    scores[shrink_mode].append(balanced_accuracy_score(y_test, pred_vanilla))
elif sc == "roc_auc":
    # use the predicted probability of the positive class
    pred_vanilla = clf.predict_proba(X_test)[:, 1]
    scores[shrink_mode].append(roc_auc_score(y_test, pred_vanilla))
```
### Hierarchical Shrinkage from Agarwal et al. 2022
```python
# hs - Hierarchical Shrinkage #########################
print("HS Mode")
shrink_mode="hs"
#######################################################
# candidate values for the lambda hyperparameter
param_grid = {
    "lmb": [0.001, 0.01, 0.1, 1, 10, 25, 50, 100, 200],
    "shrink_mode": ["hs"],
}
grid_search = GridSearchCV(
    ShrinkageClassifier(RandomForestClassifier(n_estimators=ntrees)),
    param_grid, cv=5, n_jobs=-1, scoring=sc)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print(best_params)
# refit on the training data with the selected lambda
clf = ShrinkageClassifier(
    RandomForestClassifier(n_estimators=ntrees),
    shrink_mode=shrink_mode, lmb=best_params.get("lmb"))
clf.fit(X_train, y_train)
if sc == "balanced_accuracy":
    pred_hs = clf.predict(X_test)
    scores[shrink_mode].append(balanced_accuracy_score(y_test, pred_hs))
elif sc == "roc_auc":
    pred_hs = clf.predict_proba(X_test)[:, 1]
    scores[shrink_mode].append(roc_auc_score(y_test, pred_hs))
```
### Bayesian post-hoc regularization from Pfeifer 2023
```python
# beta - Bayesian post-hoc regularization #########################
print("Beta Shrinkage")
shrink_mode="beta"
###################################################################
# candidate values for the alpha and beta hyperparameters
param_grid = {
    "alpha": [1500, 1000, 800, 500, 100, 50, 30, 10, 1],
    "beta": [1500, 1000, 800, 500, 100, 50, 30, 10, 1],
    "shrink_mode": ["beta"],
}
grid_search = GridSearchCV(
    ShrinkageClassifier(RandomForestClassifier(n_estimators=ntrees)),
    param_grid, cv=5, n_jobs=-1, scoring=sc)
# tune on the training split only, so the test split stays unseen (as in the HS example)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print(best_params)
# refit on the training data with the selected alpha and beta
clf = ShrinkageClassifier(
    RandomForestClassifier(n_estimators=ntrees),
    shrink_mode=shrink_mode,
    alpha=best_params.get("alpha"), beta=best_params.get("beta"))
clf.fit(X_train, y_train)
if sc == "balanced_accuracy":
    pred_beta = clf.predict(X_test)
    scores[shrink_mode].append(balanced_accuracy_score(y_test, pred_beta))
elif sc == "roc_auc":
    pred_beta = clf.predict_proba(X_test)[:, 1]
    scores[shrink_mode].append(roc_auc_score(y_test, pred_beta))
```
Print the results
```python
print(scores)
```
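If the blocks above are run over several train-test splits, each list in `scores` collects one value per run. A small helper such as the following could then report the mean score per mode. This is a sketch: the name `summarize` is hypothetical, and the numbers are placeholders for illustration, not measured results.

```python
from statistics import mean


def summarize(scores):
    """Return the mean score per mode, skipping modes with no recorded runs."""
    return {mode: mean(values) for mode, values in scores.items() if values}


# Placeholder values for illustration only, not measured results.
example_scores = {"vanilla": [0.5, 0.7], "hs": [0.6, 0.8], "beta": []}

for mode, avg in summarize(example_scores).items():
    print(f"{mode}: {avg:.3f}")
```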
## Acknowledgement
The TreeSmoothing Python code was written by Bastian Pfeifer and Arne Gevaert.
It is based on the Hierarchical Shrinkage implementation within the Python package imodels (https://github.com/csinva/imodels).
## Citation
If you find the Bayesian post-hoc method useful, please cite:
```
@misc{pfeifer2023bayesian,
title={Bayesian post-hoc regularization of random forests},
author={Bastian Pfeifer},
year={2023},
eprint={2306.03702},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
### BibTeX References
```
@inproceedings{agarwal2022hierarchical,
title={Hierarchical Shrinkage: Improving the accuracy and interpretability of tree-based models.},
author={Agarwal, Abhineet and Tan, Yan Shuo and Ronen, Omer and Singh, Chandan and Yu, Bin},
booktitle={International Conference on Machine Learning},
pages={111--135},
year={2022},
organization={PMLR}
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/pievos101/TreeSmoothing",
"name": "treesmoothing",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "Random Forest,Post-hoc Regularization",
"author": "Bastian Pfeifer and Arne Gevaert",
"author_email": "bastian.pfeifer@medunigraz.at",
"download_url": "https://files.pythonhosted.org/packages/fa/ae/44b5d7f4140a0e81903dbbfab0124974ebcaf3648349f03d52b37b6adc96/treesmoothing-0.0.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Bayesian post-hoc regularization for random forests",
"version": "0.0.3",
"project_urls": {
"Homepage": "https://github.com/pievos101/TreeSmoothing"
},
"split_keywords": [
"random forest",
"post-hoc regularization"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "53f2401f24e9fdf816a6349d5ebed86d00b75134f70058334543a22eb0736a9e",
"md5": "8d30e6b5e8638146130e71f51bb8ea22",
"sha256": "f6c395d3283deda5f66876d3140db4bd959adb8a99374d65bd28fd4446a4f30c"
},
"downloads": -1,
"filename": "treesmoothing-0.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "8d30e6b5e8638146130e71f51bb8ea22",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 7907,
"upload_time": "2023-12-20T11:43:45",
"upload_time_iso_8601": "2023-12-20T11:43:45.359484Z",
"url": "https://files.pythonhosted.org/packages/53/f2/401f24e9fdf816a6349d5ebed86d00b75134f70058334543a22eb0736a9e/treesmoothing-0.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "faae44b5d7f4140a0e81903dbbfab0124974ebcaf3648349f03d52b37b6adc96",
"md5": "945e6e84241ae9834204630c0eb41225",
"sha256": "961b4dec21d8b011009b9a6d521d9b41ad38d6f2afb51cacc3175a179a01a8ca"
},
"downloads": -1,
"filename": "treesmoothing-0.0.3.tar.gz",
"has_sig": false,
"md5_digest": "945e6e84241ae9834204630c0eb41225",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 9731,
"upload_time": "2023-12-20T11:43:46",
"upload_time_iso_8601": "2023-12-20T11:43:46.798678Z",
"url": "https://files.pythonhosted.org/packages/fa/ae/44b5d7f4140a0e81903dbbfab0124974ebcaf3648349f03d52b37b6adc96/treesmoothing-0.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-12-20 11:43:46",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "pievos101",
"github_project": "TreeSmoothing",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "treesmoothing"
}