DivExplorer


- Name: DivExplorer
- Version: 0.2.6
- Summary: Analyze Pandas dataframes, and other tabular data (csv), to find subgroups of data with properties that diverge from those of the overall dataset
- Upload time: 2024-12-06 21:07:30
- Requires Python: >=3.7
- License: MIT License. Copyright (c) 2021-23 Eliana Pastor. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Keywords: pandas, fairness, subgroup analysis, data mining
- Requirements: matplotlib, numpy, mlxtend, pandas, plotly, python_igraph
[![PyPI](https://img.shields.io/pypi/v/divexplorer)](https://pypi.org/project/divexplorer/)
[![Downloads](https://pepy.tech/badge/divexplorer)](https://pepy.tech/project/divexplorer)

# DivExplorer

Machine learning models may perform differently on different data subgroups. We propose the notion of divergence over itemsets (i.e., conjunctions of simple predicates) as a measure of different classification behavior on data subgroups, and the use of frequent pattern mining techniques for their identification. We quantify the contribution of different attribute values to divergence with the notion of Shapley values to identify both critical and peculiar behaviors of attributes.
See our [paper](https://divexplorer.github.io/static/DivExplorer.pdf) and our [project page](https://divexplorer.github.io/) for all the details.

## Installation

Install using [pip](http://www.pip-installer.org/en/latest) with:

<pre>
pip install divexplorer
</pre>

Alternatively, download a wheel or source archive from [PyPI](https://pypi.org/project/divexplorer/).

## Example Notebooks

This [notebook](https://github.com/divexplorer/divexplorer/blob/main/notebooks/DivExplorerExample.ipynb) gives an example of how to use DivExplorer to find divergent subgroups in datasets and in the predictions of a classifier.

## Documentation

For the code details, see the [documentation](https://github.com/divexplorer/divexplorer/blob/main/Documentation.md). 

The original paper is:

> [Looking for Trouble: Analyzing Classifier Behavior via Pattern Divergence](https://divexplorer.github.io/static/DivExplorer.pdf). [Eliana Pastor](https://github.com/elianap), [Luca de Alfaro](https://luca.dealfaro.com/), [Elena Baralis](https://dbdmg.polito.it/wordpress/people/elena-baralis/). In Proceedings of the 2021 ACM SIGMOD Conference, 2021.

You can find more papers and information in the [DivExplorer project page](https://divexplorer.github.io/).


## Quick Start

DivExplorer works on pandas DataFrames. Here we load an example dataset and discretize one of its attributes, the age, into coarser ten-year ranges.

```python
import pandas as pd

df_census = pd.read_csv('https://raw.githubusercontent.com/divexplorer/divexplorer/main/datasets/census_income.csv')
df_census["AGE_RANGE"] = df_census.apply(lambda row : 10 * (row["A_AGE"] // 10), axis=1)
```
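To make the discretization step concrete, here is a minimal sketch on a toy DataFrame. The ages are made up for illustration, and the vectorized expression is an equivalent alternative to the row-wise `apply` above, not something taken from the library.

```python
import pandas as pd

# Toy ages standing in for the A_AGE column (values are made up for illustration).
ages = pd.DataFrame({"A_AGE": [7, 18, 25, 39, 40, 85]})

# Floor-divide by 10 and scale back: each age maps to the lower bound of its decade.
ages["AGE_RANGE"] = 10 * (ages["A_AGE"] // 10)

print(ages["AGE_RANGE"].tolist())  # [0, 10, 20, 30, 40, 80]
```

Operating on the whole column at once avoids calling a Python function for every row, which matters on larger datasets.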

We can then find the data subgroups with the highest income divergence using the `DivergenceExplorer` class:

```python
from divexplorer import DivergenceExplorer

fp_diver = DivergenceExplorer(df_census)
subgroups = fp_diver.get_pattern_divergence(
    min_support=0.001,
    attributes=["STATE", "SEX", "EDUCATION", "AGE_RANGE"], 
    quantitative_outcomes=["PTOTVAL"])
subgroups.sort_values(by="PTOTVAL_div", ascending=False).head(10)
```
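As a rough illustration of what a divergence value means, the sketch below computes it by hand for one subgroup: the mean outcome inside the subgroup minus the mean outcome over the whole dataset. The toy data, the column names, and the helper `divergence` are hypothetical; this is not the library's implementation, which mines all frequent patterns rather than evaluating one at a time.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "SEX": ["F", "F", "M", "M", "M", "F"],
    "EDUCATION": ["BS", "HS", "BS", "BS", "HS", "BS"],
    "INCOME": [60.0, 30.0, 80.0, 70.0, 40.0, 50.0],
})


def divergence(frame, pattern, outcome):
    """Return (support, divergence) of the subgroup matching `pattern`.

    `pattern` maps column names to the required value (a conjunction of predicates).
    """
    mask = pd.Series(True, index=frame.index)
    for column, value in pattern.items():
        mask &= frame[column] == value
    support = mask.mean()  # fraction of rows that belong to the subgroup
    div = frame.loc[mask, outcome].mean() - frame[outcome].mean()
    return support, div


support, div = divergence(df, {"SEX": "M", "EDUCATION": "BS"}, "INCOME")
print(support, div)
```

Here the subgroup mean is 75 and the overall mean is 55, so the divergence is 20 with a support of one third. A `min_support` threshold discards subgroups whose support is too small to be meaningful.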

You can also prune redundant subgroups by specifying:
* a threshold, so that attributes that do not increase the divergence by at least the threshold value are left out of subgroups,
* a minimum t-value, to keep only statistically significant subgroups.

```python
from divexplorer import DivergencePatternProcessor

processor = DivergencePatternProcessor(subgroups, "PTOTVAL")
pruned_subgroups = pd.DataFrame(processor.redundancy_pruning(th_redundancy=10000))
pruned_subgroups = pruned_subgroups[pruned_subgroups["PTOTVAL_t"] > 2]
pruned_subgroups.sort_values(by="PTOTVAL_div", ascending=False, ignore_index=True)
```

### Finding subgroups with divergent performance in classifiers

For classifiers, it is often useful to find the subgroups with the highest (or lowest) divergence in a performance metric such as the false-positive rate. Here is how to do it for the false-positive rate of a COMPAS-derived classifier.

```python
compas_df = pd.read_csv('https://raw.githubusercontent.com/divexplorer/divexplorer/main/datasets/compas_discretized.csv')
```

We generate an `fp` column whose average will give the false-positive rate, like so: 

```python
from divexplorer.outcomes import get_false_positive_rate_outcome

y_trues = compas_df["class"]
y_preds = compas_df["predicted"]

compas_df['fp'] =  get_false_positive_rate_outcome(y_trues, y_preds)
```

The `fp` column has values: 

* 1, if the instance is a false positive (`class` is 0 and `predicted` is 1),
* 0, if the instance is a true negative (`class` is 0 and `predicted` is 0),
* NaN, if the instance is a true-class positive (`class` is 1).
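The encoding in the list above can be reproduced by hand, as in the following sketch. It is a hand-written illustration on made-up labels, assuming 1 denotes the positive class; it is not the source of `get_false_positive_rate_outcome`.

```python
import numpy as np
import pandas as pd

# Toy labels and predictions (made up); 1 is assumed to be the positive class.
y_true = pd.Series([0, 0, 0, 1, 1, 0])
y_pred = pd.Series([1, 0, 0, 1, 0, 1])

# NaN where the true class is positive; otherwise 1.0 for a false positive
# and 0.0 for a true negative.
fp = pd.Series(np.where(y_true == 1, np.nan, (y_pred == 1).astype(float)))

# pandas skips NaN when averaging, so the mean is FP / (FP + TN).
fp_rate = fp.mean()
print(fp_rate)  # 0.5
```

Of the four true negatives-or-false-positives in this toy sample, two are false positives, so the column mean is 0.5.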

We use NaN for `class` 1 instances to exclude them from the average, so that the column average is the false-positive rate.
We can then find the most divergent groups as in the previous example, noting that here we use `boolean_outcomes` rather than `quantitative_outcomes` because `fp` is boolean: 

```python
fp_diver = DivergenceExplorer(compas_df)

attributes = ['race', '#prior', 'sex', 'age']
FP_fm = fp_diver.get_pattern_divergence(min_support=0.1, attributes=attributes, 
                                        boolean_outcomes=['fp'])
FP_fm.sort_values(by="fp_div", ascending=False).head(10)
```

Note how the `attributes` argument specifies which attributes can be used to define subgroups. 
For a quantitative outcome, use `quantitative_outcomes` instead of `boolean_outcomes`, 
as in the following example from the example notebook.

```python
df_census = pd.read_csv('https://raw.githubusercontent.com/divexplorer/divexplorer/main/datasets/census_income.csv')
explorer = DivergenceExplorer(df_census)
value_subgroups = explorer.get_pattern_divergence(
    min_support=0.001, quantitative_outcomes=["PTOTVAL"])
```

### Analyzing subgroups via Shapley values

Returning to our COMPAS example, if we want to analyze what factors 
contribute to the divergence of a particular subgroup, 
we can do so via Shapley values: 

```python
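fp_details = DivergencePatternProcessor(FP_fm, 'fp')

# Select one pattern by its row position in the results table.
pattern = fp_details.patterns['itemset'].iloc[37]
# Compute the contribution of each item in the pattern to its divergence.
fp_details.shapley_value(pattern)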
```
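To illustrate the idea behind the Shapley analysis, the sketch below splits the divergence of a two-item pattern among its items using the standard Shapley formula: each item's contribution is the weighted average of how much the divergence changes when the item is added to every subset of the other items. The item names, the divergence values, and the helper `shapley_values` are hypothetical, and the sketch assumes the divergence of every subset is known; it is not the library's code.

```python
from itertools import combinations
from math import factorial

# Hypothetical divergences for every subset of a two-item pattern.
# The empty pattern is the whole dataset, so its divergence is 0.
div = {
    frozenset(): 0.0,
    frozenset({"age=<25"}): 0.10,
    frozenset({"sex=Male"}): 0.04,
    frozenset({"age=<25", "sex=Male"}): 0.20,
}


def shapley_values(pattern, div):
    """Split the divergence of `pattern` among its items via the Shapley formula."""
    items = list(pattern)
    n = len(items)
    values = {}
    for item in items:
        others = [i for i in items if i != item]
        total = 0.0
        for size in range(n):
            # Weight of a subset of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal change in divergence when `item` joins the subset.
                total += weight * (div[s | {item}] - div[s])
        values[item] = total
    return values


contributions = shapley_values(frozenset({"age=<25", "sex=Male"}), div)
print(contributions)
```

With these numbers the contributions are about 0.13 and 0.07, which sum to the pattern's divergence of 0.20.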

### Pruning redundant subgroups

If you get too many subgroups, you can prune redundant ones via _redundancy pruning_. 
This prunes a pattern $\beta$ if there is a pattern $\alpha$, a subset of $\beta$, whose divergence differs from that of $\beta$ by less than a threshold. 

```python
df_pruned = fp_details.redundancy_pruning(th_redundancy=0.01)
df_pruned.sort_values("fp_div", ascending=False).head(5)
```
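The pruning rule can be sketched in a few lines. This version makes two assumptions that are not confirmed by the text above: it only compares a pattern with the subsets obtained by removing a single item, and it treats the empty pattern as having divergence 0. The pattern names, the values, and the helper `redundancy_prune` are hypothetical.

```python
# Hypothetical patterns and divergences; the empty pattern (the whole dataset)
# is assumed to have divergence 0.
div = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.15,
    frozenset({"b"}): 0.005,
    frozenset({"c"}): 0.08,
    frozenset({"a", "b"}): 0.152,
    frozenset({"a", "c"}): 0.30,
}


def redundancy_prune(div, threshold):
    """Keep a pattern only if dropping any single item changes its
    divergence by at least `threshold`."""
    kept = {}
    for pattern, value in div.items():
        if not pattern:
            continue  # the empty pattern is only a reference point
        redundant = any(
            abs(value - div[pattern - {item}]) < threshold
            for item in pattern
            if pattern - {item} in div
        )
        if not redundant:
            kept[pattern] = value
    return kept


kept = redundancy_prune(div, threshold=0.01)
print(sorted(sorted(p) for p in kept))
```

In this toy table, `{a, b}` is pruned because adding `b` to `{a}` changes the divergence by only 0.002, and `{b}` is pruned because it barely differs from the whole dataset.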

## Code Contributors

Project lead:

- [Eliana Pastor](https://github.com/elianap)

Other contributors: 

- [Luca de Alfaro](https://luca.dealfaro.com/)
- Harsh Dadhich

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "DivExplorer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "Pandas, Fairness, Subgroup Analysis, Data Mining",
    "author": null,
    "author_email": "Eliana Pastor <eliana.pastor@polito.it>, Luca de Alfaro <luca@ucsc.edu>",
    "download_url": "https://files.pythonhosted.org/packages/9b/33/b1ea040d7a90db9a5cc3634e7788817d2b419499fb130708dc4a61490d5a/divexplorer-0.2.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2021-23 Eliana Pastor  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Analyze Pandas dataframes, and other tabular data (csv), to find subgroups of data with properties that diverge from those of the overall dataset",
    "version": "0.2.6",
    "project_urls": {
        "Homepage": "https://divexplorer.github.io/",
        "Source": "https://github.com/DivExplorer/divexplorer"
    },
    "split_keywords": [
        "pandas",
        " fairness",
        " subgroup analysis",
        " data mining"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9f1da2a5195672f97555b0110b8b1658e606113228153f6395499e364ad687f4",
                "md5": "df1e11df0e28bf540b63d6564242ed4e",
                "sha256": "274c21c878affe6abef3d954ebc0697832409305d0b00a5bbcd09954b4c4c6cb"
            },
            "downloads": -1,
            "filename": "DivExplorer-0.2.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "df1e11df0e28bf540b63d6564242ed4e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 21171,
            "upload_time": "2024-12-06T21:07:29",
            "upload_time_iso_8601": "2024-12-06T21:07:29.062980Z",
            "url": "https://files.pythonhosted.org/packages/9f/1d/a2a5195672f97555b0110b8b1658e606113228153f6395499e364ad687f4/DivExplorer-0.2.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9b33b1ea040d7a90db9a5cc3634e7788817d2b419499fb130708dc4a61490d5a",
                "md5": "f8832fa58f74374e5b79e4705e917cc3",
                "sha256": "67703a761716a8abcad1d1f6bf971032e87dc05e914b41a4ae2eacb5cf49d759"
            },
            "downloads": -1,
            "filename": "divexplorer-0.2.6.tar.gz",
            "has_sig": false,
            "md5_digest": "f8832fa58f74374e5b79e4705e917cc3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 18491,
            "upload_time": "2024-12-06T21:07:30",
            "upload_time_iso_8601": "2024-12-06T21:07:30.663173Z",
            "url": "https://files.pythonhosted.org/packages/9b/33/b1ea040d7a90db9a5cc3634e7788817d2b419499fb130708dc4a61490d5a/divexplorer-0.2.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-06 21:07:30",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "DivExplorer",
    "github_project": "divexplorer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "matplotlib",
            "specs": [
                [
                    ">=",
                    "3.1.1"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.16.4"
                ]
            ]
        },
        {
            "name": "mlxtend",
            "specs": [
                [
                    ">=",
                    "0.17.1"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "0.24.2"
                ]
            ]
        },
        {
            "name": "plotly",
            "specs": [
                [
                    ">=",
                    "4.5.0"
                ]
            ]
        },
        {
            "name": "python_igraph",
            "specs": [
                [
                    ">=",
                    "0.8.3"
                ]
            ]
        }
    ],
    "lcname": "divexplorer"
}
        