fgclustering


- Name: fgclustering
- Version: 1.0.3
- Summary: Forest-Guided Clustering - Explainability method for Random Forest models.
- Upload time: 2023-04-12 16:38:13
- Author: Dominik Thalmeier
- Requires Python: <3.11,>=3.7
- Keywords: random forest, xai, explainable ai

<div align="center">

<img src="https://raw.githubusercontent.com/HelmholtzAI-Consultants-Munich/fg-clustering/main/docs/source/_figures/FGC_Logo.png" width="200">
	

# Forest-Guided Clustering - Explainability for Random Forest Models

[![test](https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering/actions/workflows/test.yml/badge.svg)](https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering/actions/workflows/test.yml)
[![PyPI](https://img.shields.io/pypi/v/fgclustering.svg)](https://pypi.org/project/fgclustering)
[![stars](https://img.shields.io/github/stars/HelmholtzAI-Consultants-Munich/forest_guided_clustering?logo=GitHub&color=yellow)](https://github.com/HelmholtzAI-Consultants-Munich/forest_guided_clustering/stargazers)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![cite](https://zenodo.org/badge/397931780.svg)](https://zenodo.org/badge/latestdoi/397931780)
	
[Docs] | [Tutorials]

[Docs]: https://forest-guided-clustering.readthedocs.io/en/latest/
[Tutorials]: https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering/tree/main/tutorials

</div>

Forest-Guided Clustering (FGC) is an explainability method for Random Forest models. Standard explainability methods (e.g. feature importance) assume that model features are independent and are therefore not well suited to data with correlated features. The Forest-Guided Clustering algorithm does not make this assumption, because it computes feature importance based on subgroups of instances that follow similar decision rules within the Random Forest model. This makes the method well suited for cases with high correlation among model features.

For a detailed comparison of FGC and Permutation Feature Importance, please have a look at the Notebook [Introduction to FGC: Comparison of Forest-Guided Clustering and Feature Importance](https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering/blob/main/tutorials/introduction_to_FGC_comparing_FGC_to_FI.ipynb).
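
As a rough illustration of what "following similar decision rules" can mean, the sketch below computes a Random Forest proximity: the fraction of trees in which two samples reach the same leaf. This is a toy example in plain Python with made-up leaf ids; the names ```proximity_matrix``` and ```leaf_ids``` are hypothetical, and the package's actual implementation may differ (see the documentation for the algorithm details).

```python
from itertools import combinations


def proximity_matrix(leaf_ids):
    """Return the pairwise proximity of samples.

    leaf_ids[i][t] is the id of the leaf that sample i reaches in tree t.
    The proximity of two samples is the fraction of trees in which both
    samples end up in the same leaf.
    """
    n_samples = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    # Every sample shares all leaves with itself, so the diagonal is 1.0.
    prox = [[1.0 if i == j else 0.0 for j in range(n_samples)]
            for i in range(n_samples)]
    for i, j in combinations(range(n_samples), 2):
        shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
        prox[i][j] = prox[j][i] = shared / n_trees
    return prox


# Hypothetical leaf assignments for 4 samples in a forest of 3 trees.
leaf_ids = [
    [1, 4, 2],
    [1, 4, 3],
    [5, 6, 7],
    [5, 6, 7],
]
prox = proximity_matrix(leaf_ids)

# Samples that rarely share a leaf are far apart.
distance = [[1.0 - p for p in row] for row in prox]
```

With scikit-learn, leaf ids of this shape could be obtained from a fitted forest via ```rf.apply(X)```. A clustering method such as K-Medoids can then group samples based on ```distance```.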

## Documentation

Please see [here](https://forest-guided-clustering.readthedocs.io/) for full documentation on:

- Getting Started (installation, basic usage)
- Theoretical Background (introduction, general algorithm, feature importance)
- Tutorials (simple use cases, special cases)
- API documentation

For a short introduction to Forest-Guided Clustering, click below:

<div align="center">

[![Video](http://i.vimeocdn.com/video/1501376117-3e402fde211d1a52080fb16b317efc3786a34d0be852a81cfe3a03aa89adc475-d_295x166)](https://vimeo.com/746443233/07ddf2290b)

</div>

## Installation

### Requirements

This package was tested with ```Python 3.7 - 3.11``` on Ubuntu, macOS and Windows. It depends on the ```kmedoids``` Python package. If you are using Windows or macOS, you may need to first install Rust/Cargo with:

```
conda install -c conda-forge rust
```

If this does not work, please try to install Cargo from source:

```
git clone https://github.com/rust-lang/cargo
cd cargo
cargo build --release
```

For further information on the kmedoids package, please visit [this page](https://pypi.org/project/kmedoids/).

All other required packages are automatically installed if installation is done via ```pip```.
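
To check up front whether an interpreter falls inside the range declared in the 1.0.3 release metadata (```requires_python``` of ```<3.11,>=3.7```), a small standard-library check along these lines could be used. The helper name ```is_supported``` is made up for this sketch.

```python
import sys


def is_supported(version_info=sys.version_info):
    """Check a Python version against the declared range >=3.7,<3.11."""
    major_minor = tuple(version_info[:2])
    return (3, 7) <= major_minor < (3, 11)


if not is_supported():
    print("This Python version is outside the range declared for fgclustering 1.0.3.")
```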


### Install Options

The package is installed via pip. Note: if you are using conda, first install pip with: ```conda install pip```.

PyPI install:

```
pip install fgclustering
```


Installation from source:

```
git clone https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering.git
```

- Installation as a Python package (run inside the cloned directory):

		pip install .

- Development installation as a Python package (run inside the cloned directory):

		pip install -e ".[dev]"


## Basic Usage

To explain your Random Forest model with Forest-Guided Clustering, run the following commands:

```python
from fgclustering import FgClustering
   
# initialize and run fgclustering object
fgc = FgClustering(model=rf, data=data, target_column='target')
fgc.run()
   
# visualize results
fgc.plot_global_feature_importance()
fgc.plot_local_feature_importance()
fgc.plot_decision_paths()
   
# obtain optimal number of clusters and vector that contains the cluster label of each data point
optimal_number_of_clusters = fgc.k
cluster_labels = fgc.cluster_labels
```

where 

- ```model=rf``` is a Random Forest Classifier or Regressor object,
- ```data=data``` is a dataset containing the same features as required by the Random Forest model, and
- ```target_column='target'``` is the name of the target column (i.e. *target*) in the provided dataset. 

For detailed instructions, please have a look at the Notebook [Introduction to FGC: Simple Use Cases](https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering/blob/main/tutorials/introduction_to_FGC_use_cases.ipynb).

**Usage on big datasets**

If you are working with a dataset containing a large number of samples, you can use one of the following strategies:

- Use the cores you have at your disposal to parallelize the optimization of the cluster number. You can do so by setting the parameter ```n_jobs``` to a value > 1 in the ```run()``` function.
- Use the faster implementation of the PAM method that the K-Medoids algorithm uses to find the clusters by setting the parameter ```method_clustering``` to *fasterpam* in the ```run()``` function.
- Use a subsampling technique.
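
The strategies above can be sketched as follows. The two helpers ```run_settings``` and ```stratified_subsample``` are hypothetical and not part of the package; only the parameter names ```n_jobs``` and ```method_clustering``` come from the description above. The subsampling shown here (a fixed fraction per class, for classification labels) is just one possible scheme and not necessarily the one used in the tutorial.

```python
import os
import random
from collections import defaultdict


def run_settings(n_cores=None):
    """Keyword arguments for FgClustering.run() on large datasets."""
    # Fall back to all available cores, or to 1 if the count is unknown.
    n_cores = n_cores or os.cpu_count() or 1
    return {"n_jobs": n_cores, "method_clustering": "fasterpam"}


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted row indices that keep roughly `fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for indices in by_class.values():
        # Keep at least one sample per class.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


labels = ["a"] * 80 + ["b"] * 20  # hypothetical class labels
rows = stratified_subsample(labels, fraction=0.1)
settings = run_settings(n_cores=4)

# With a trained model `rf` and a pandas DataFrame `data` (both assumed),
# these would be used roughly as:
#   fgc = FgClustering(model=rf, data=data.iloc[rows], target_column='target')
#   fgc.run(**settings)
```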

For detailed instructions, please have a look at the Notebook [Special Case: FGC for Big Datasets](https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering/blob/main/tutorials/special_case_big_data_with_FGC.ipynb).

## Contributing
 
Contributions are more than welcome! Everything from code to notebooks to examples and documentation is equally valuable, so please don't feel you can't contribute. To contribute, please fork the project, make your changes and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

## How to cite

If Forest-Guided Clustering is useful for your research, consider citing the package:

```
@software{lisa_sousa_2022_6445529,
    author       = {Lisa Barros de Andrade e Sousa and
                     Helena Pelin and
                     Dominik Thalmeier and
                     Marie Piraud},
    title        = {{Forest-Guided Clustering - Explainability for Random Forest Models}},
    month        = april,
    year         = 2022,
    publisher    = {Zenodo},
    version      = {v0.2.0},
    doi          = {10.5281/zenodo.7085465},
    url          = {https://doi.org/10.5281/zenodo.7085465}
}
```

## License

```fgclustering``` is released under the MIT license. See [LICENSE](https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering/blob/main/LICENSE) for additional details.

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "fgclustering",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "<3.11,>=3.7",
    "maintainer_email": "Lisa Barros de Andrade e Sousa <lisa.barros.andrade.sousa@gmail.com>, Helena Pelin <helena.pelin@helmholtz-muenchen.de>",
    "keywords": "random forest,xai,explainable ai",
    "author": "Dominik Thalmeier",
    "author_email": "Lisa Barros de Andrade e Sousa <lisa.barros.andrade.sousa@gmail.com>, Helena Pelin <helena.pelin@helmholtz-muenchen.de>, Marie Piraud <marie.piraud@helmholtz-muenchen.de>",
    "download_url": "https://files.pythonhosted.org/packages/ad/ce/7e0c791f063a423f749cd6ede2fe0d839b2f987ad1fd32db2f82cb1af113/fgclustering-1.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Forest-Guided Clustering - Explainability method for Random Forest models.",
    "version": "1.0.3",
    "split_keywords": [
        "random forest",
        "xai",
        "explainable ai"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5152c63eed67aa6d56d5ad1aa6505bc3ef8625fa00db7480f1fac3c58512b030",
                "md5": "fe8074c2f952c4f03e06a481bd735663",
                "sha256": "bc9c174b883f85f8c08fd46023ab6c2f2cebd55cd311355b344c971d76aa27c5"
            },
            "downloads": -1,
            "filename": "fgclustering-1.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "fe8074c2f952c4f03e06a481bd735663",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.11,>=3.7",
            "size": 19086,
            "upload_time": "2023-04-12T16:38:10",
            "upload_time_iso_8601": "2023-04-12T16:38:10.913016Z",
            "url": "https://files.pythonhosted.org/packages/51/52/c63eed67aa6d56d5ad1aa6505bc3ef8625fa00db7480f1fac3c58512b030/fgclustering-1.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "adce7e0c791f063a423f749cd6ede2fe0d839b2f987ad1fd32db2f82cb1af113",
                "md5": "2ce1c80ac971b8a3849e3bc7458888fd",
                "sha256": "f078464aeff80fc20781428c78ad8ddc35fe432495fc10123d3c2af07290a59d"
            },
            "downloads": -1,
            "filename": "fgclustering-1.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "2ce1c80ac971b8a3849e3bc7458888fd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.11,>=3.7",
            "size": 3467762,
            "upload_time": "2023-04-12T16:38:13",
            "upload_time_iso_8601": "2023-04-12T16:38:13.826708Z",
            "url": "https://files.pythonhosted.org/packages/ad/ce/7e0c791f063a423f749cd6ede2fe0d839b2f987ad1fd32db2f82cb1af113/fgclustering-1.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-12 16:38:13",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "fgclustering"
}
        