# `alfese` -- A Python Package for Alternative Feature Selection
The package `alfese` contains several methods for alternative feature selection.
Alternative feature selection is the problem of finding multiple feature sets (sequentially or simultaneously)
that optimize feature-set quality while being sufficiently dissimilar to each other.
Users can control the number of alternatives and a dissimilarity threshold.
The package can also be used for traditional feature selection (i.e., sequential search with zero alternatives)
but may not be the most efficient solution for this purpose.
This document provides:
- Steps for [setting up](#setup) the package.
- A short [overview](#functionality) of the (feature-selection) functionality.
- A [demo](#demo) of the functionality.
- [Guidelines for developers](#developer-info) who want to modify or extend the code base.
If you use this package for a scientific publication, please cite [our paper](https://doi.org/10.1007/s41060-024-00527-8)
```
@article{bach2024alternative,
title={Alternative feature selection with user control},
author={Bach, Jakob and B{\"o}hm, Klemens},
journal={International Journal of Data Science and Analytics},
year={2024},
doi={10.1007/s41060-024-00527-8}
}
```
(this version is partially outdated regarding our current implementation, e.g., it does not describe the heuristic search methods)
or [our other paper](https://doi.org/10.48550/arXiv.2307.11607)
```
@misc{bach2023finding,
title={Finding Optimal Diverse Feature Sets with Alternative Feature Selection},
author={Bach, Jakob},
howpublished={arXiv:2307.11607 [cs.LG]},
year={2023},
doi={10.48550/arXiv.2307.11607}
}
```
(at least the most recent version is more up-to-date than the journal version).
## Setup
You can install our package from [PyPI](https://pypi.org/):
```bash
python -m pip install alfese
```
Alternatively, you can install the package from GitHub:
```bash
python -m pip install git+https://github.com/Jakob-Bach/Alternative-Feature-Selection.git#subdirectory=alfese_package
```
If you already have the source code for the package (i.e., the directory in which this `README` resides)
as a local directory on your computer (e.g., after cloning the project), you can also perform a local install:
```bash
python -m pip install .
```
## Functionality
`alfese.py` contains six feature-selection methods as classes:
- `FCBFSelector`: (adapted version of) FCBF, a multivariate filter method
- `GreedyWrapperSelector`: a wrapper method (by default, using a decision tree as prediction model)
- `ManualQualityUnivariateSelector`: a univariate filter method where you can enter each feature's quality directly
(instead of computing it from a dataset)
- `MISelector`: a univariate filter method based on mutual information
- `ModelImportanceSelector`: a univariate filter method using feature importances from a prediction model
(by default, a decision tree)
- `MRMRSelector`: mRMR, a multivariate filter method
The feature-selection method determines the notion of feature-set quality, i.e., the optimization objective.
Additionally, there are the following abstract superclasses:
- `AlternativeFeatureSelector`: highest superclass; defines solver, constraints for alternatives,
and solver-based sequential/simultaneous search
- `LinearQualityFeatureSelector`: superclass for feature-selection methods with a linear objective;
  defines heuristic search methods for alternatives (which do not require a solver)
- `WhiteBoxFeatureSelector`: superclass for feature-selection methods with a white-box objective,
i.e., optimizing purely with a solver rather than using the solver in an algorithmic search routine
All feature-selection methods support sequential and simultaneous search for alternatives,
as demonstrated next.
## Demo
Running alternative feature selection only requires three steps:
1) Create the feature selector (our code contains six different ones),
thereby determining the notion of feature-set quality to be optimized.
2) Set the dataset (`set_data()`):
- Four parameters: the features and the prediction target are passed separately, each split into a training part and a test part
- Data types: `DataFrame` (feature parts) and `Series` (targets) from `pandas`
3) Run the search for alternatives:
- Method name (`search_sequentially()` / `search_simultaneously()`) determines whether
a (solver-based) sequential or a simultaneous search is run. `LinearQualityFeatureSelector`s
(like "MI" and model-based importance) also support the heuristic procedures
`search_greedy_replacement()` and `search_greedy_balancing()`.
- `k` determines the number of features to be selected.
- `num_alternatives` determines the number of alternative feature sets (the result additionally
  contains the original feature set, so `num_alternatives=5` yields six feature sets overall).
- `tau_abs` determines by how many features the feature sets should differ from each other.
  You can also provide a relative value (from the interval `[0,1]`) via `tau`,
  or change the dissimilarity `d_name` to `'jaccard'` (default is `'dice'`; both dissimilarities are sketched in code right after this list).
- `objective_agg` switches between min-aggregation and sum-aggregation in solver-based simultaneous search.
  It has no effect in sequential search (which optimizes only one feature set at a time, so there is no need
  to aggregate feature-set quality over feature sets) or in the heuristic search methods.
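To make the dissimilarity options concrete, here is a small standalone sketch of the two set dissimilarities (standard Dice/Jaccard definitions; these helper functions are illustrative and not part of the package):

```python
# Illustrative helpers (not part of `alfese`): the two set dissimilarities
# that the constraints for alternatives are based on.

def dice_dissimilarity(s1: set, s2: set) -> float:
    """1 - Dice coefficient: 0 for identical sets, 1 for disjoint sets."""
    return 1 - 2 * len(s1 & s2) / (len(s1) + len(s2))

def jaccard_dissimilarity(s1: set, s2: set) -> float:
    """1 - Jaccard coefficient: 0 for identical sets, 1 for disjoint sets."""
    return 1 - len(s1 & s2) / len(s1 | s2)

s1, s2 = {0, 2, 3}, {1, 2, 3}  # two feature sets with k=3 that differ in one feature
print(round(dice_dissimilarity(s1, s2), 2))     # 0.33, i.e., 1 - 2*2/(3+3)
print(round(jaccard_dissimilarity(s1, s2), 2))  # 0.5, i.e., 1 - 2/4
```

The full demo, running a sequential search on the `iris` dataset: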
```python
import alfese
import sklearn.datasets
import sklearn.model_selection
dataset = sklearn.datasets.load_iris(as_frame=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
dataset['data'], dataset['target'], train_size=0.8, random_state=25)
feature_selector = alfese.MISelector()
feature_selector.set_data(X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)
search_result = feature_selector.search_sequentially(k=3, num_alternatives=5, tau_abs=1)
print(search_result.drop(columns='optimization_time').round(2))
```
The search result is a `DataFrame` containing the indices of the selected features (can be used to
subset the columns in `X`), objective values on the training set and test set, optimization status,
and optimization time:
```
  selected_idxs  train_objective  test_objective  optimization_status
0     [0, 2, 3]             0.91            0.89                    0
1     [1, 2, 3]             0.83            0.78                    0
2     [0, 1, 3]             0.64            0.65                    0
3     [0, 1, 2]             0.62            0.68                    0
4            []              NaN             NaN                    2
5            []              NaN             NaN                    2
```
The search procedure ran out of features here, as the `iris` dataset only has four features.
The optimization statuses are:
- 0: `Optimal` (optimal solution found)
- 1: `Feasible` (a valid solution was found before the timeout, but it may not be optimal)
- 2: `Infeasible` (there is no valid solution)
- 6: `Not solved` (no valid solution was found before the timeout, but one may exist)
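For example, the selected feature indices of the best feature set can be used to subset the columns of the data (a minimal continuation of the demo above):

```python
top_idxs = search_result.loc[0, 'selected_idxs']  # positional indices, e.g., [0, 2, 3]
X_train_selected = X_train.iloc[:, top_idxs]  # subset the feature columns
print(X_train_selected.columns.tolist())
```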
If you don't want to provide a dataset but rather want to use manually defined univariate qualities
(which result in the same optimization problem as "MI" and model importance), you can do so as well:
```python
import alfese
feature_selector = alfese.ManualQualityUnivariateSelector()
feature_selector.set_data(q_train=[1, 2, 3, 7, 8, 9])
search_result = feature_selector.search_sequentially(k=3, num_alternatives=3, tau_abs=2)
print(search_result.drop(columns='optimization_time').round(2))
```
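Simultaneous search works analogously, finding all feature sets in one optimization run. Here is a sketch with the `iris` data from the first demo (we assume `'min'` is a valid value of `objective_agg`; check the method's documentation for the exact options):

```python
# Reuses X_train, X_test, y_train, y_test from the first demo above.
feature_selector = alfese.MISelector()
feature_selector.set_data(X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)
search_result = feature_selector.search_simultaneously(
    k=2, num_alternatives=1, tau_abs=1, objective_agg='min')  # optimize the worse of the two sets
print(search_result.drop(columns='optimization_time').round(2))
```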
## Developer Info
`AlternativeFeatureSelector` is the topmost abstract superclass.
It contains code for solver handling, the dissimilarity-based definition of alternatives, and the
two solver-based search procedures, i.e., sequential as well as simultaneous (sum-aggregation and min-aggregation).
For defining a new feature-selection method, you should create a subclass of `AlternativeFeatureSelector`.
In particular, you need to define how to solve the optimization problem of alternative feature selection
by overriding the abstract method `select_and_evaluate()`.
To this end, you may want to define the optimization problem
(objective function, which expresses feature-set quality, and maybe further constraints)
by overriding `initialize_solver()`.
You should also call the original implementation of this method via `super().initialize_solver()`
so as not to skip general initialization steps (solver configuration, cardinality constraints).
The sequential and simultaneous search procedures for alternatives implemented in `AlternativeFeatureSelector`
basically add further constraints (for alternatives) to the optimization problem and call `select_and_evaluate()`.
Thus, if the latter method is implemented properly, you do not need to override the search procedures,
as they should work as-is in new subclasses as well.
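As a schematic illustration of this subclassing pattern (the abstract methods' real parameter lists are simplified to `*args`/`**kwargs` here; consult `alfese.py` for the exact signatures):

```python
import alfese

class MyCustomSelector(alfese.AlternativeFeatureSelector):
    """Skeleton of a new feature-selection method (schematic sketch only)."""

    def initialize_solver(self, *args, **kwargs):
        # Keep the general initialization (solver configuration, cardinality
        # constraints) by calling the superclass implementation first ...
        super().initialize_solver(*args, **kwargs)
        # ... then add the method-specific objective and further constraints.

    def select_and_evaluate(self, *args, **kwargs):
        # Solve the optimization problem and return the evaluated feature set(s).
        raise NotImplementedError('Sketch only -- implement the optimization here.')
```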
There are further abstract superclasses extracting commonalities between feature-selection methods:
- `WhiteBoxFeatureSelector` is a good starting point if you want to optimize your objective with a solver
(rather than using the solver in an algorithmic search routine with a black-box objective, like Greedy Wrapper does).
When creating a subclass, you need to define the white-box objective by overriding the abstract method `create_objectives()`
(define objectives separately for training set and test set, as they may use different constants for feature qualities).
`select_and_evaluate()` and `initialize_solver()` need not be overridden in your subclass anymore.
- `LinearQualityFeatureSelector` is a good starting point if your objective is a plain sum of feature qualities.
When creating a subclass, you need to provide these qualities by overriding the abstract method `compute_qualities()`.
`select_and_evaluate()` and `initialize_solver()` need not be overridden in your subclass anymore.
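For instance, a selector whose feature qualities are absolute correlations with the target might look roughly like this (schematic; we assume `compute_qualities()` receives the feature data and target and returns one quality per feature -- check the abstract method for its exact signature):

```python
import alfese

class CorrelationSelector(alfese.LinearQualityFeatureSelector):
    """Schematic example: each feature's quality is its absolute Pearson
    correlation with the prediction target."""

    def compute_qualities(self, X, y):
        # One non-negative quality value per feature (column of X).
        return [abs(X[column].corr(y)) for column in X.columns]
```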