# `csd` -- A Python Package for (Constrained) Subgroup Discovery
The package `csd` contains several subgroup-discovery methods and evaluation metrics.
Besides traditional (unconstrained) subgroup discovery,
we allow searching subgroup descriptions that use only a limited number of features (a feature-cardinality constraint).
Additionally, some methods allow searching alternative subgroup descriptions,
which should replicate a given subgroup membership as well as possible while using different features in the description.
This document provides:
- Steps for [setting up](#setup) the package.
- A short [overview](#functionality) of the (subgroup-discovery) functionality.
- A [demo](#demo) for discovering subgroups and alternative subgroup descriptions.
- [Guidelines for developers](#developer-info) who want to modify or extend the code base.
If you use this package for a scientific publication, please cite [our paper](https://doi.org/10.48550/arXiv.2406.01411):
```
@misc{bach2024using,
  title={Using Constraints to Discover Sparse and Alternative Subgroup Descriptions},
  author={Bach, Jakob},
  howpublished={arXiv:2406.01411v1 [cs.LG]},
  year={2024},
  doi={10.48550/arXiv.2406.01411},
}
```
## Setup
You can install our package from [PyPI](https://pypi.org/):
```bash
python -m pip install csd
```
Alternatively, you can install the package from GitHub:
```bash
python -m pip install git+https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery.git#subdirectory=csd_package
```
If you already have the source code for the package (i.e., the directory in which this `README` resides)
as a local directory on your computer (e.g., after cloning the project), you can also perform a local install:
```bash
python -m pip install csd_package/
```
## Functionality
Currently, we provide seven subgroup-discovery methods as classes:
- solver-based search: `MIPSubgroupDiscoverer`, `SMTSubgroupDiscoverer`
- heuristic search: `BeamSearchSubgroupDiscoverer`, `BestIntervalSubgroupDiscoverer`, `PRIMSubgroupDiscoverer`
- baselines (simple heuristics): `MORSSubgroupDiscoverer`, `RandomSubgroupDiscoverer`
All methods expect datasets with numeric feature values (as `pd.DataFrame`) and a binary prediction target (as `pd.Series`).
The heuristics stem from the literature or are adapted from other packages,
while we conceived the solver-based methods and the baselines ourselves.
All subgroup-discovery methods can discover subgroups (who would have guessed?),
optionally using a limited number of features in the subgroup description (a feature-cardinality constraint),
while some of them can also find alternative subgroup descriptions.
Further, we provide four evaluation metrics for binary (`bool` or `int`) subgroup-membership vectors:
- `wracc()`: Weighted Relative Accuracy
- `nwracc()`: Weighted Relative Accuracy normalized to `[-1, 1]`
- `jaccard()`: Jaccard similarity (1 - Jaccard distance)
- `hamming()`: Normalized Hamming similarity (1 - Hamming distance normalized to `[0, 1]`; equals prediction accuracy)
For each of them, there is a general version (accepting `pd.Series`, `np.array`, or even simpler sequences like plain lists)
and a faster, `numpy`-specific version with the same functionality (suffixed `_np`).
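As a quick illustration of the metric interface (a minimal sketch; the `y_true`/`y_pred` keyword names follow the `wracc()` call in the demo below and are assumed to carry over to the other metrics):

```python
import numpy as np

import csd

y_true = np.array([1, 1, 0, 0, 1])  # actual subgroup membership
y_pred = np.array([1, 0, 0, 0, 1])  # predicted subgroup membership

print(csd.wracc(y_true=y_true, y_pred=y_pred))     # Weighted Relative Accuracy
print(csd.jaccard(y_true=y_true, y_pred=y_pred))   # overlap of the two positive sets
print(csd.hamming(y_true=y_true, y_pred=y_pred))   # fraction of matching entries (= accuracy)
print(csd.wracc_np(y_true=y_true, y_pred=y_pred))  # faster numpy-specific variant
```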
## Demo
We demonstrate subgroup discovery as well as alternative-subgroup-description discovery.
### Discovering (Original) Subgroups
Using a subgroup-discovery method from `csd` is similar to working with prediction models from `scikit-learn`.
All subgroup-discovery methods are implemented as subclasses of `SubgroupDiscoverer` and provide the following functionality:
- `fit(X, y)` and `predict(X)`, as in `scikit-learn`, working with `pandas.DataFrame` and `pandas.Series`
- `evaluate(X_train, y_train, X_test, y_test)`, which combines fitting and prediction on a train-test split of a dataset
- `get_box_lbs()` and `get_box_ubs()`, which return the lower and upper bounds on features in the subgroup description (as `pandas.Series`)
- `is_feature_selected()` and `get_selected_feature_idxs()`, which provide information on selected (= restricted, used) features in the subgroup description (as lists)
The actual subgroup discovery occurs during fitting.
The passed dataset has to be purely numeric (categorical columns should be encoded accordingly) and have a binary target,
where class label `1` is the class of interest and `0` is the other class.
Any parameters for the search have to be passed before fitting,
i.e., already when initializing the instance of the subgroup-discovery method.
While some parameters are specific to the subgroup-discovery method,
all existing methods have a parameter `k` to limit (upper bound) the number of features selected in the subgroup description.
The prediction routine classifies data objects as belonging to the subgroup (= 1) or not (= 0).
`csd` also contains functions to evaluate predictions,
though the evaluation metrics for binary classification from `sklearn.metrics` should work as well.
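For example, if your data still contains categorical columns, you could one-hot encode them with plain `pandas` before fitting (a minimal sketch; `csd` does not encode for you):

```python
import pandas as pd

X_raw = pd.DataFrame({'size': [1.0, 2.5, 3.0], 'color': ['red', 'blue', 'red']})
X = pd.get_dummies(X_raw, columns=['color']).astype(float)  # purely numeric now
```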
Here is a small example for initialization, fitting, prediction, and evaluation:
```python
import csd
import sklearn.datasets
import sklearn.model_selection

X, y = sklearn.datasets.load_iris(as_frame=True, return_X_y=True)
y = (y == 1).astype(int)  # binary classification (in original data, three classes)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, train_size=0.8, random_state=25)

model = csd.BeamSearchSubgroupDiscoverer(beam_width=10)
fitting_result = model.fit(X=X_train, y=y_train)
print('Fitting result:', fitting_result)
print('Train WRAcc:', round(csd.wracc(y_true=y_train, y_pred=model.predict(X=X_train)), 2))
print('Test WRAcc:', round(csd.wracc(y_true=y_test, y_pred=model.predict(X=X_test)), 2))
print('Lower bounds:', model.get_box_lbs().tolist())
print('Upper bounds:', model.get_box_ubs().tolist())
```
The output of this code snippet looks like this (optimization time may vary):
```
Fitting result: {'objective_value': 0.208125, 'optimization_status': None, 'optimization_time': 0.03125}
Train WRAcc: 0.21
Test WRAcc: 0.21
Lower bounds: [-inf, -inf, 3.0, -inf]
Upper bounds: [7.0, inf, 5.1, 1.7]
```
Beam search optimizes Weighted Relative Accuracy (WRAcc), so objective value and train WRAcc (without rounding) are equal.
The `optimization_status` is only relevant for solver-based subgroup-discovery methods.
If a feature is not restricted regarding lower or upper bound, the corresponding bounds are infinite.
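Continuing the example, the accessors for selected features report which features the bounds actually restrict (return formats assumed from the description above; the index list matches the finite bounds):

```python
print('Selected?', model.is_feature_selected())       # e.g., [True, False, True, True]
print('Indices:', model.get_selected_feature_idxs())  # e.g., [0, 2, 3]
```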
`evaluate()` combines fitting, prediction, and evaluation:
```python
import csd
import sklearn.datasets
import sklearn.model_selection

X, y = sklearn.datasets.load_iris(as_frame=True, return_X_y=True)
y = (y == 1).astype(int)  # binary classification (in original data, three classes)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, train_size=0.8, random_state=25)

model = csd.BeamSearchSubgroupDiscoverer(beam_width=10)
evaluation_result = model.evaluate(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
print(evaluation_result.transpose())
```
The output is a `pandas.DataFrame` (optimization time and fitting time may vary):
```
                                             0
objective_value                       0.208125
optimization_status                       None
optimization_time                      0.03125
fitting_time                          0.046875
train_wracc                           0.208125
test_wracc                            0.212222
train_nwracc                          0.975904
test_nwracc                           0.864253
box_lbs                [-inf, -inf, 3.0, -inf]
box_ubs                   [7.0, inf, 5.1, 1.7]
selected_feature_idxs               [0, 2, 3]
```
Fitting time may be higher than optimization time since it also includes preparing the optimization and assembling the results object.
However, except for solver-based optimization (where entering all constraints into the solver may take some time),
the difference between optimization time and fitting time should be marginal.
The evaluation routine computes WRAcc, whose minimum and maximum depend on the class imbalance in the dataset,
as well as nWRAcc, which is normalized to `[-1, 1]`.
### Discovering Alternative Subgroup Descriptions
Additionally, subgroup-discovery methods inheriting from `AlternativeSubgroupDiscoverer`
(currently only `BeamSearchSubgroupDiscoverer` and `SMTSubgroupDiscoverer`)
can search alternative subgroup descriptions with the method `search_alternative_descriptions()`.
Before starting this search, you should set `a` (number of alternatives) and
`tau_abs` (number of previously selected features that must not be selected again) in the initializer.
Also, we highly recommend limiting the number of selected features with the parameter `k` in the initializer,
so there are still enough unselected features that may be selected in the alternatives instead.
```python
import csd
import sklearn.datasets
import sklearn.model_selection

X, y = sklearn.datasets.load_iris(as_frame=True, return_X_y=True)
y = (y == 1).astype(int)  # binary classification (in original data, three classes)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, train_size=0.8, random_state=25)

model = csd.BeamSearchSubgroupDiscoverer(beam_width=10, k=2, tau_abs=1, a=4)
search_results = model.search_alternative_descriptions(X_train=X_train, y_train=y_train,
                                                       X_test=X_test, y_test=y_test)
print(search_results.drop(columns=['box_lbs', 'box_ubs', 'selected_feature_idxs']).round(2).transpose())
```
This code snippet outputs a `pandas.DataFrame` (optimization time and fitting time may vary):
```
objective_value      0.21  0.94  0.92  0.81  0.76
optimization_status  None  None  None  None  None
optimization_time    0.02  0.03  0.02  0.03  0.02
fitting_time         0.03  0.03  0.02  0.03  0.02
train_wracc          0.21  0.17  0.16   0.1   0.1
test_wracc           0.21  0.17  0.17  0.13  0.11
train_nwracc         0.98  0.82  0.77  0.46  0.45
test_nwracc          0.86  0.71  0.69  0.54  0.46
alt.hamming           1.0  0.94  0.92  0.81  0.76
alt.jaccard           1.0  0.83  0.77   0.5  0.48
alt.number              0     1     2     3     4
```
The first subgroup (with an `alt.number` of `0`) is the same as if using the conventional `fit()` routine.
The objective value corresponds to train WRAcc for this subgroup and normalized Hamming similarity for the subsequent subgroups.
You can see that the similarity of alternative subgroup descriptions to the original subgroup
(`alt.hamming` and `alt.jaccard`) decreases over the number of alternatives (`alt.number`),
as does the WRAcc on the training set and test set.
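Since the result is an ordinary `pandas.DataFrame`, you can post-process it as usual, for example to pick the alternative with the highest test WRAcc (a minimal sketch using the columns shown above):

```python
alternatives = search_results[search_results['alt.number'] > 0]  # exclude original subgroup
best_alternative = alternatives.loc[alternatives['test_wracc'].idxmax()]
print(best_alternative['selected_feature_idxs'])
```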
## Developer Info
If you want to add another subgroup-discovery method, make it a subclass of `SubgroupDiscoverer`.
As a minimum, you should
- add the initializer `__init__()` to set parameters of your subgroup-discovery method
  (also call `super().__init__()` for the initialization from the superclass) and
- override the `fit()` method to search for subgroups.
`fit()` needs to set the fields `_box_lbs` and `_box_ubs` (inherited from the superclass)
to contain the features' lower and upper bounds in the subgroup description (as `pandas.Series`).
If `_box_lbs` and `_box_ubs` are set as fields, `predict()` works automatically with these bounds and need not be overridden.
For integration into the experimental pipeline of our paper, `fit()` should also return a dictionary
with the keys `objective_value`, `optimization_status` (may be `None`), and `optimization_time`.
If your subgroup-discovery method is not tailored towards a specific objective,
you may use `wracc()` or `wracc_np()` from `csd` to guide the search
(the former is slower but supports more data types, while the latter is tailored to `numpy` arrays and faster).
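To make this concrete, here is a minimal sketch of a hypothetical toy method (not part of `csd`) that bounds each feature by the minimum and maximum over the positive instances:

```python
import time

import pandas as pd

import csd


class PositiveBoxDiscoverer(csd.SubgroupDiscoverer):
    """Hypothetical toy method: the tightest box containing all positive instances."""

    def __init__(self):
        super().__init__()  # run the superclass initialization

    def fit(self, X: pd.DataFrame, y: pd.Series) -> dict:
        start_time = time.process_time()
        self._box_lbs = X[y == 1].min()  # per-feature lower bounds (pandas.Series)
        self._box_ubs = X[y == 1].max()  # per-feature upper bounds (pandas.Series)
        end_time = time.process_time()
        # predict() from the superclass now works with these bounds automatically:
        return {'objective_value': csd.wracc(y_true=y, y_pred=self.predict(X=X)),
                'optimization_status': None,
                'optimization_time': end_time - start_time}
```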
To support feature-cardinality constraints like the existing subgroup-discovery methods do,
you may want to allow setting an upper bound `k` on the number of selected features
during initialization and observe it during fitting.
We propose storing `k` in a field called `_k`;
however, the methods of the superclass `SubgroupDiscoverer` do not use it, so you need to implement the constraint yourself during fitting.
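Continuing the hypothetical sketch from above, `fit()` could enforce `_k` by un-restricting all but `k` features; which features to keep is up to the method (here: those whose bounds individually exclude the most negative instances):

```python
class PositiveBoxDiscovererWithK(PositiveBoxDiscoverer):
    """Hypothetical extension with a feature-cardinality constraint."""

    def __init__(self, k=None):
        super().__init__()
        self._k = k  # upper bound on the number of selected features

    def fit(self, X: pd.DataFrame, y: pd.Series) -> dict:
        result = super().fit(X=X, y=y)
        if self._k is not None and self._k < X.shape[1]:
            X_neg = X[y == 0]
            # Count the negatives excluded by each feature's bounds, keep the top k:
            excluded = ((X_neg < self._box_lbs) | (X_neg > self._box_ubs)).sum()
            dropped = excluded.nsmallest(X.shape[1] - self._k).index
            self._box_lbs[dropped] = float('-inf')  # un-restrict dropped features
            self._box_ubs[dropped] = float('inf')
            result['objective_value'] = csd.wracc(y_true=y, y_pred=self.predict(X=X))
        return result
```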
If your subgroup-discovery method should also support the search for alternative subgroup descriptions,
make it a subclass of `AlternativeSubgroupDiscoverer` (which inherits from `SubgroupDiscoverer`).
The search for alternatives and the fitting routine are already implemented there.
In particular, `fit()` dispatches to the method `_optimize()`, which you need to override.
This method should be able to both

- search for the original subgroup, optimizing WRAcc or another notion of subgroup quality, and
- search for alternative subgroup descriptions by
  - optimizing normalized Hamming similarity (see functions `hamming()` and `hamming_np()` in `csd`)
    to the original subgroup (or another notion of subgroup similarity) and
  - having a constraint that at least `tau_abs` features from each previous subgroup description need to be de-selected
    (though not more features than actually were selected, to prevent infeasibilities).
Your implementation of `_optimize()` has to switch the objective and add the constraints for alternatives automatically when necessary.
If the method's parameters `was_feature_selected_list` and `was_instance_in_box` are not `None`,
you should search for an alternative subgroup description; otherwise, search for the original subgroup.
Additionally, you should add parameters `a` and `tau_abs` in the initializer of your subgroup-discovery class,
storing them in the fields `_a` and `_tau_abs`.
These fields are automatically used in `evaluate()` (to decide whether to search only for an original subgroup
or also for alternative subgroup descriptions) and should also be used in your implementation of `_optimize()`.
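As a rough skeleton (the exact signature of `_optimize()` beyond the two parameters named above is an assumption), the dispatch could look like this:

```python
import csd


class MyAlternativeDiscoverer(csd.AlternativeSubgroupDiscoverer):
    """Hypothetical skeleton; only the dispatch logic is shown."""

    def _optimize(self, X, y, was_feature_selected_list=None, was_instance_in_box=None):
        if was_instance_in_box is None:
            # Original search: optimize WRAcc (or another notion of subgroup quality).
            ...
        else:
            # Alternative search: optimize normalized Hamming similarity to the original
            # subgroup membership (was_instance_in_box) and require that, for each previous
            # description in was_feature_selected_list, at least
            # min(self._tau_abs, number of features selected there) features are de-selected.
            ...
        # In both cases: set self._box_lbs / self._box_ubs and return the result dictionary.
```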