pysubgroup

Name	pysubgroup
Version	0.8.0
home_page	https://github.com/flemmerich/pysubgroup
Summary	pysubgroup is a Python library for the data analysis task of subgroup discovery.
upload_time	2024-01-18 16:49:29
maintainer
docs_url	None
author	Florian Lemmerich, Felix Stamm, Martin Becker
requires_python	>=3.6
license	Apache 2.0
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage


![Build status](https://github.com/flemmerich/pysubgroup/actions/workflows/ci.yaml/badge.svg)
[![ReadTheDocs](https://readthedocs.org/projects/pysubgroup/badge/?version=latest)](https://pysubgroup.readthedocs.io/en/stable/)
[![Coveralls](https://img.shields.io/coveralls/github/flemmerich/pysubgroup/main.svg)](https://coveralls.io/r/flemmerich/pysubgroup)
[![PyPI-Server](https://img.shields.io/pypi/v/pysubgroup.svg)](https://pypi.org/project/pysubgroup/)
[![Conda-Forge](https://img.shields.io/conda/vn/conda-forge/pysubgroup.svg)](https://anaconda.org/conda-forge/pysubgroup)
[![Monthly Downloads](https://pepy.tech/badge/pysubgroup/month)](https://pepy.tech/project/pysubgroup)

# pysubgroup

**pysubgroup** is a Python package that enables subgroup discovery in the Python+pandas (scipy stack) data analysis environment. It provides a lightweight, easy-to-use, extensible, and freely available implementation of state-of-the-art algorithms, interestingness measures, and presentation options.

This library is still in a prototype phase. It has, however, already been used successfully in active application projects.

## Subgroup Discovery

Subgroup Discovery is a well-established data mining technique that allows you to identify patterns in your data.
More precisely, the goal of subgroup discovery is to identify descriptions of data subsets that show an interesting distribution with respect to a pre-specified target concept.
For example, given a dataset of patients in a hospital, we could be interested in subgroups of patients for which a certain treatment X was successful.
One example result could then be stated as:

_"While in general the operation is successful in only 60% of the cases, for the subgroup
of female patients under 50 who have also been treated with drug D, the success rate was 82%."_

Here, the variable _operation success_ is the target concept, and the identified subgroup has the interpretable description _female=True AND age<50 AND drug_D = True_. We call these single conditions (such as _female=True_) selection expressions or, for short, _selectors_.
The interesting behavior of this subgroup is that the distribution of the target concept differs significantly from its distribution in the overall dataset.
A discovered subgroup could also be seen as a rule:
```
female=True AND age<50 AND drug_D = True ==> Operation_outcome=SUCCESS
```
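The comparison behind such a statement can be sketched in plain Python. The records, field names, and numbers below are invented purely for illustration and do not come from any real dataset or from pysubgroup itself.

```python
# Hypothetical patient records; every value here is invented for illustration.
patients = [
    {"female": True,  "age": 42, "drug_d": True,  "success": True},
    {"female": True,  "age": 47, "drug_d": True,  "success": True},
    {"female": True,  "age": 35, "drug_d": True,  "success": False},
    {"female": True,  "age": 63, "drug_d": True,  "success": False},
    {"female": False, "age": 44, "drug_d": True,  "success": True},
    {"female": False, "age": 58, "drug_d": False, "success": False},
    {"female": True,  "age": 29, "drug_d": False, "success": True},
    {"female": False, "age": 71, "drug_d": False, "success": False},
]


def success_rate(rows):
    """Share of rows in which the target concept (success) holds."""
    return sum(r["success"] for r in rows) / len(rows)


# Subgroup description: female=True AND age<50 AND drug_D=True
subgroup = [r for r in patients if r["female"] and r["age"] < 50 and r["drug_d"]]

overall_rate = success_rate(patients)
subgroup_rate = success_rate(subgroup)
print(f"overall: {overall_rate:.2f}, subgroup: {subgroup_rate:.2f}")
```

A subgroup is a candidate for being interesting when its rate deviates clearly from the overall rate while still covering a reasonable share of the data.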
Computationally, subgroup discovery is challenging because a large number of such conjunctive subgroup descriptions has to be considered. Finding computable criteria for which subgroups are likely to be interesting to a user is also a persistent challenge.
Therefore, a lot of literature has been devoted to the topic of subgroup discovery (including some of my own work). Recent overviews of the topic include, for example:

* Herrera, Francisco, et al. ["An overview on subgroup discovery: foundations and applications."](https://scholar.google.de/scholar?q=Herrera%2C+Franciso%2C+et+al.+%E2%80%9CAn+overview+on+subgroup+discovery%3A+foundations+and+applications.%E2%80%9D+Knowledge+and+information+systems+29.3+(2011)%3A+495-525.) Knowledge and information systems 29.3 (2011): 495-525.
* Atzmueller, Martin. ["Subgroup discovery."](https://scholar.google.de/scholar?q=Atzmueller%2C+Martin.+%E2%80%9CSubgroup+discovery.%E2%80%9D+Wiley+Interdisciplinary+Reviews%3A+Data+Mining+and+Knowledge+Discovery+5.1+(2015)%3A+35-49.) Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5.1 (2015): 35-49.
* And of course, my point of view on the topic is [summarized in my dissertation](https://opus.bibliothek.uni-wuerzburg.de/files/9781/Dissertation-Lemmerich.pdf).

### Prerequisites and Installation
pysubgroup is built to fit into the standard Python data analysis environment of the scipy stack.
Thus, it can be used with just pandas, numpy, scipy, and matplotlib installed. Visualizations are carried out with the matplotlib library.

pysubgroup consists of pure Python code. Thus, you can simply download the code from the repository and copy it into your `site-packages` directory.
pysubgroup is also on PyPI and should be installable using:
`pip install pysubgroup`

**Note**: Some users have reported that the **pip installation does not work**.
If Python still cannot find the package after the installation, follow these steps:
 1. Find where the directory `site-packages` is.
 2. Copy the folder `pysubgroup`, which contains the source code, into the `site-packages` directory. (WARNING: This is not the main repository folder. The `pysubgroup` folder is inside the main repository folder, at the same level as `doc`)
 3. Now you can import the module with `import pysubgroup`.
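For step 1, the location of `site-packages` can be queried from the interpreter itself. The following is a minimal sketch that uses only the standard library; it reports the directory for the interpreter that runs the script, which may differ from the one your `pip` command belongs to.

```python
import importlib.util
import sysconfig

# Directory where pure-Python packages are installed for the running interpreter.
site_packages = sysconfig.get_paths()["purelib"]
print("site-packages:", site_packages)

# find_spec returns None when the package cannot be imported from this interpreter.
spec = importlib.util.find_spec("pysubgroup")
if spec is None:
    print("pysubgroup is not importable from this interpreter")
else:
    print("pysubgroup found at:", spec.origin)
```

If the package is reported as missing even though pip claimed success, one common cause is that pip installed into a different Python installation. Running `python -m pip install pysubgroup` with the same `python` you use for your analysis avoids that mismatch.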

## How to use:
A simple use case (here using the well-known _titanic_ data) can be created in just a few lines of code:

```python
import pysubgroup as ps

# Load the example dataset
from pysubgroup.datasets import get_titanic_data
data = get_titanic_data()

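# Target concept: passengers for whom 'Survived' is True
target = ps.BinaryTarget('Survived', True)
# Search space: basic selectors built from every column except the target
searchspace = ps.create_selectors(data, ignore=['Survived'])
task = ps.SubgroupDiscoveryTask(
    data,
    target,
    searchspace,
    result_set_size=5,  # number of subgroups to return
    depth=2,            # maximum number of selectors per description
    qf=ps.WRAccQF())    # quality function: Weighted Relative Accuracy
# Run a depth-first search over the task
result = ps.DFS().execute(task)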
```
The first line imports the _pysubgroup_ package.
The following lines load an example dataset (the popular Titanic dataset).

Thereafter, we define a target, i.e., the property we are mainly interested in (_'Survived'_).
Then, we define the searchspace as a list of basic selectors. Descriptions are built from this searchspace. We can create this list manually, or use a utility function.
Next, we create a SubgroupDiscoveryTask object that encapsulates what we want to find in our search.
In particular, that comprises the target, the search space, the size of the result set, the depth of the search (maximum number of selectors combined in a subgroup description), and the interestingness measure for candidate scoring (here, the Weighted Relative Accuracy measure).
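Weighted Relative Accuracy is commonly defined as the coverage of a subgroup multiplied by the difference between the target share inside the subgroup and the target share in the whole dataset. The function below is a stand-alone sketch of that formula, not pysubgroup's own code; the name `wracc`, its parameters, and the counts are chosen for illustration only.

```python
def wracc(n_subgroup, positives_subgroup, n_total, positives_total):
    """Weighted Relative Accuracy: (n / N) * (p_subgroup - p_overall)."""
    if n_subgroup == 0:
        return 0.0  # convention chosen for this sketch: empty subgroups score zero
    coverage = n_subgroup / n_total
    p_subgroup = positives_subgroup / n_subgroup
    p_overall = positives_total / n_total
    return coverage * (p_subgroup - p_overall)


# Invented counts: 100 instances, 40 of them positive overall;
# the subgroup covers 20 instances, 15 of them positive.
quality = wracc(n_subgroup=20, positives_subgroup=15, n_total=100, positives_total=40)
print(round(quality, 4))
```

The coverage factor is what keeps very small subgroups with extreme target shares from dominating the ranking.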

The last line executes the defined task by performing a search with an algorithm, in this case depth-first search. The result of this algorithm execution is stored in a SubgroupDiscoveryResult object.
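To illustrate how `depth` and `result_set_size` shape a search, here is a brute-force sketch in plain Python. It is not the DFS implementation shipped with pysubgroup: it simply enumerates every conjunction of up to `depth` selectors without any pruning, scores each one, and keeps the best few. The toy rows, the selector list, and the helper names `quality` and `search` are all assumptions made for this example.

```python
import heapq
from itertools import combinations

# Toy data, invented for this sketch.
rows = [
    {"sex": "female", "cls": 1, "survived": True},
    {"sex": "female", "cls": 3, "survived": True},
    {"sex": "female", "cls": 3, "survived": False},
    {"sex": "male",   "cls": 1, "survived": True},
    {"sex": "male",   "cls": 3, "survived": False},
    {"sex": "male",   "cls": 3, "survived": False},
]

# Search space: (description, predicate) pairs playing the role of selectors.
selectors = [
    ("sex==female", lambda r: r["sex"] == "female"),
    ("sex==male", lambda r: r["sex"] == "male"),
    ("cls==1", lambda r: r["cls"] == 1),
    ("cls==3", lambda r: r["cls"] == 3),
]


def target_share(subset):
    """Share of rows in which the target (survived) is True."""
    return sum(r["survived"] for r in subset) / len(subset)


def quality(covered, all_rows):
    """WRAcc-style score: coverage * (share in subgroup - overall share)."""
    if not covered:
        return 0.0
    coverage = len(covered) / len(all_rows)
    return coverage * (target_share(covered) - target_share(all_rows))


def search(all_rows, selectors, depth, result_set_size):
    """Score every conjunction of up to `depth` selectors and keep the best ones."""
    scored = []
    for k in range(1, depth + 1):
        for combo in combinations(selectors, k):
            covered = [r for r in all_rows if all(pred(r) for _, pred in combo)]
            description = " AND ".join(name for name, _ in combo)
            scored.append((quality(covered, all_rows), description))
    return heapq.nlargest(result_set_size, scored)


top = search(rows, selectors, depth=2, result_set_size=3)
for score, description in top:
    print(f"{score:.3f}  {description}")
```

The number of candidates grows quickly with `depth` and with the size of the search space, which is why practical algorithms avoid scoring every conjunction explicitly.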

To print the result, we could, for example, do:

```python
print(result.to_dataframe())
```

to get:

<table border="1" class="dataframe">
<thead>    <tr style="text-align: right;">      <th></th>      <th>quality</th>      <th>description</th>    </tr>  </thead>
<tbody>
    <tr>      <th>0</th>      <td>0.132150</td>      <td>Sex==female</td>    </tr>
    <tr>      <th>1</th>      <td>0.101331</td>      <td>Parch==0 AND Sex==female</td>    </tr>
    <tr>      <th>2</th>      <td>0.079142</td>      <td>Sex==female AND SibSp: [0:1[</td>    </tr>
    <tr>      <th>3</th>      <td>0.077663</td>      <td>Cabin.isnull() AND Sex==female</td>    </tr>
    <tr>      <th>4</th>      <td>0.071746</td>      <td>Embarked==S AND Sex==female</td>    </tr>
</tbody></table>


## Key classes
Here is an outline of the most important classes:
* Selector: A Selector represents an atomic condition over the data, e.g., _age < 50_. There are several subtypes of Selectors, namely NominalSelector (color==BLUE), NumericSelector (age < 50), and NegatedSelector (a wrapper such as not selector1).
* SubgroupDiscoveryTask: As mentioned before, it encapsulates the specification of how an algorithm should search for interesting subgroups.
* SubgroupDiscoveryResult: This is the main outcome of a subgroup discovery run. You can obtain a list of subgroups using `to_subgroups()` or a dataframe using `to_dataframe()`.
* Conjunction: A conjunction is the most widely used SubgroupDescription, and indicates which data instances are covered by the subgroup. It can be seen as the left-hand side of a rule.
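The relationship between selectors and conjunctions can be sketched with a few small classes. These are hypothetical stand-ins written for this example (`Equals`, `LessThan`, `Negated`, and a toy `Conjunction`); they mirror the roles described above but are not pysubgroup's classes and do not reproduce its API.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Equals:
    """Nominal condition such as color==BLUE (hypothetical class)."""
    attribute: str
    value: Any

    def covers(self, row):
        return row[self.attribute] == self.value

    def __str__(self):
        return f"{self.attribute}=={self.value}"


@dataclass(frozen=True)
class LessThan:
    """Numeric condition such as age<50 (hypothetical class)."""
    attribute: str
    bound: float

    def covers(self, row):
        return row[self.attribute] < self.bound

    def __str__(self):
        return f"{self.attribute}<{self.bound}"


@dataclass(frozen=True)
class Negated:
    """Wrapper that inverts another selector."""
    selector: Any

    def covers(self, row):
        return not self.selector.covers(row)

    def __str__(self):
        return f"NOT {self.selector}"


@dataclass(frozen=True)
class Conjunction:
    """A row is covered only if every selector covers it."""
    selectors: Tuple[Any, ...]

    def covers(self, row):
        return all(s.covers(row) for s in self.selectors)

    def __str__(self):
        return " AND ".join(str(s) for s in self.selectors)


description = Conjunction((Equals("color", "BLUE"), Negated(LessThan("age", 50))))
rows = [
    {"color": "BLUE", "age": 61},
    {"color": "BLUE", "age": 30},
    {"color": "RED", "age": 70},
]
covered = [row for row in rows if description.covers(row)]
print(description, "covers", len(covered), "of", len(rows), "rows")
```

Keeping each selector as a small immutable object makes descriptions easy to compare, print, and combine.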


## License
We are happy about anyone using this software. Thus, this work is released under an Apache license. However, if this constitutes
any hindrance to your application, please feel free to contact us; we are sure that we can work something out.

    Copyright 2016-2019 Florian Lemmerich

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.


## Warning
* GP-growth is in an experimental stage.

## Cite
If you are using pysubgroup for your research, please consider citing our demo paper:

    Lemmerich, F., & Becker, M. (2018, September). pysubgroup: Easy-to-use subgroup discovery in python. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECMLPKDD). pp. 658-662.

bibtex:

    @inproceedings{lemmerich2018pysubgroup,
      title={pysubgroup: Easy-to-use subgroup discovery in python},
      author={Lemmerich, Florian and Becker, Martin},
      booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
      pages={658--662},
      year={2018}
    }


## Note

This project has been set up using PyScaffold 4.5. For details and usage
information on PyScaffold see https://pyscaffold.org/.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/flemmerich/pysubgroup",
    "name": "pysubgroup",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "",
    "author": "Florian Lemmerich, Felix Stamm, Martin Becker",
    "author_email": "florian@lemmerich.net",
    "download_url": "https://files.pythonhosted.org/packages/10/81/9c1cbad472a587bd6ae5d2f98a5135efba6dfa64fad75d8edfad710cf266/pysubgroup-0.8.0.tar.gz",
    "platform": "any",
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "pysubgroup is a Python library for the data analysis task of subgroup discovery.",
    "version": "0.8.0",
    "project_urls": {
        "Documentation": "https://pysubgroup.readthedocs.io/",
        "Download": "https://pypi.org/project/pysubgroup/#files",
        "Homepage": "https://github.com/flemmerich/pysubgroup",
        "Source": "https://github.com/flemmerich/pysubgroup",
        "Tracker": "https://github.com/flemmerich/pysubgroup/issues"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c54325c7f35aa880c00b8662dd5a9fb588e0bcbce54a45231f567a45d182ddf9",
                "md5": "da00d4a0657b5b4fc75c01ae58b456b2",
                "sha256": "c91c3e6f785971b3a91596501c17642f22921f6ea70e9a571b36f82e916644cf"
            },
            "downloads": -1,
            "filename": "pysubgroup-0.8.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "da00d4a0657b5b4fc75c01ae58b456b2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 70542,
            "upload_time": "2024-01-18T16:49:28",
            "upload_time_iso_8601": "2024-01-18T16:49:28.397907Z",
            "url": "https://files.pythonhosted.org/packages/c5/43/25c7f35aa880c00b8662dd5a9fb588e0bcbce54a45231f567a45d182ddf9/pysubgroup-0.8.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "10819c1cbad472a587bd6ae5d2f98a5135efba6dfa64fad75d8edfad710cf266",
                "md5": "de46d1da8fcbd312f39d350ff4d187c9",
                "sha256": "1ee5c5c23b0f1c39db27773b3fe991a5a1d6157c253b520a6215a4d07a53c03f"
            },
            "downloads": -1,
            "filename": "pysubgroup-0.8.0.tar.gz",
            "has_sig": false,
            "md5_digest": "de46d1da8fcbd312f39d350ff4d187c9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 121519,
            "upload_time": "2024-01-18T16:49:29",
            "upload_time_iso_8601": "2024-01-18T16:49:29.695496Z",
            "url": "https://files.pythonhosted.org/packages/10/81/9c1cbad472a587bd6ae5d2f98a5135efba6dfa64fad75d8edfad710cf266/pysubgroup-0.8.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-18 16:49:29",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "flemmerich",
    "github_project": "pysubgroup",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "tox": true,
    "lcname": "pysubgroup"
}
