cffs


Namecffs JSON
Version 1.0.0 PyPI version JSON
download
home_pageNone
SummaryConstrained feature selection
upload_time2024-10-04 08:18:33
maintainerNone
docs_urlNone
authorNone
requires_python>=3.7
licenseNone
keywords feature selection constraints
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # `cffs` -- A Python Package for Constrained Filter Feature Selection

The package `cffs` contains classes to formulate and solve constrained-feature-selection problems.
In particular, we support univariate filter feature selection as the objective
and constraints from propositional logic and linear arithmetic.
We use an SMT (Satisfiability Modulo Theories) solver to find optimal feature sets under the constraints.

This document provides:

- Steps for [setting up](#setup) the package.
- A short [overview](#functionality) of the (constrained-feature-selection) functionality.
- A [demo](#demo) of the functionality.
- [Guidelines for developers](#developer-info) who want to modify or extend the code base.

If you use this package for a scientific publication, please cite [our paper](https://doi.org/10.1007/s42979-022-01338-z)

```
@article{bach2022empirical,
  title={An Empirical Evaluation of Constrained Feature Selection},
  author={Bach, Jakob and Zoller, Kolja and Trittenbach, Holger and Schulz, Katrin and B{\"o}hm, Klemens},
  journal={SN Computer Science},
  volume={3},
  number={6},
  year={2022},
  doi={10.1007/s42979-022-01338-z}
}
```

## Setup

You can install our package from [PyPI](https://pypi.org/):

```
python -m pip install cffs
```

Alternatively, you can install the package from GitHub:

```bash
python -m pip install git+https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection.git#subdirectory=src/cffs_package
```

If you already have the source code for the package (i.e., the directory in which this `README` resides)
as a local directory on your computer (e.g., after cloning the project), you can also perform a local install:

```bash
python -m pip install cffs_package/
```

## Functionality

The package `cffs` contains the following modules:

- `combi_expressions.py`: Formulate constraints in propositional logic and linear arithmetic,
  simultaneously for our own expression classes (`expressions.py`) and the solver `Z3`.
  These constraints may be added to an optimization problem in `combi_solving.py`.
- `combi_solving.py`: Formulate a constrained-filter-feature-selection *optimization* problem
  with constraints from `combi_expressions.py`.
  Count the number of solutions with our own implementation (`solving.py`) and optimize the problem with `Z3`.
- `expressions.py`: Formulate constraints in propositional logic and linear arithmetic,
  using our own expression classes.
  These constraints may be added to a satisfiability problem in `solving.py`.
- `feature_qualities.py`: Compute univariate feature qualities for the (linear) optimization objective.
- `solving.py`: Formulate a constrained-filter-feature-selection *satisfiability* problem
  with constraints from `expressions.py`.
  Count the number of solutions with our own implementation; optimization is not supported.

## Demo

For constrained filter feature selection, we first need to compute an individual quality score for each feature.
(We only consider univariate feature selection, so we ignore interactions between features.)
The objective value is the summed quality of all selected features.
To this end, `cffs.feature_qualities` provides functions for

- the absolute value of Pearson correlation between each feature and the prediction target (`abs_corr()`)
- the mutual information between each feature and the prediction target (`mut_info()`)

Both functions round the qualities to two digits to speed up solving.
(We found that the solver becomes slower the more precise the floats are,
as they are represented as rational numbers in the solver.)
As inputs, the quality functions require a dataset in X-y form (as used in `sklearn`).

After computing feature qualities, we set up an SMT optimization problem from `cffs.combi_solving`.
It is "combi" in the sense that our code wraps an existing SMT solver (`Z3`).
We retrieve the problem's decision variables (one binary variable for each feature) and use them to
formulate constraints with `cffs.combi_expressions`.
These constraints are added to `Z3` but also to our own expression tree,
which we use to count the number of valid solutions in the search space.
Finally, we start optimization.

```python
from cffs.combi_expressions import And, AtMost, Xor
from cffs.combi_solving import Problem
from cffs.feature_qualities import mut_info
import sklearn.datasets

X, y = sklearn.datasets.load_iris(as_frame=True, return_X_y=True)
feature_qualities = mut_info(X=X, y=y)
problem = Problem(variable_names=X.columns, qualities=feature_qualities)

print('--- Constrained problem ---')
variables = problem.get_variables()
problem.add_constraint(AtMost(variables, 2))
problem.add_constraint(And([Xor(variables[0], variables[1]), Xor(variables[2], variables[3])]))
print(problem.optimize())
print('Number of constraints:', problem.get_num_constraints())
print('Fraction of valid solutions:', problem.compute_solution_fraction())

print('\n--- Unconstrained problem ---')
problem.clear_constraints()
print(problem.optimize())
print('Number of constraints:', problem.get_num_constraints())
print('Fraction of valid solutions:', problem.compute_solution_fraction())
```

The output is the following:

```
--- Constrained problem ---
{'objective_value': 1.48, 'num_selected': 2, 'selected': ['petal width (cm)', 'sepal length (cm)']}
Number of constraints: 2
Fraction of valid solutions: 0.25

--- Unconstrained problem ---
{'objective_value': 2.71, 'num_selected': 4, 'selected': ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']}
Number of constraints: 0
Fraction of valid solutions: 1.0
```

The optimization procedure returns the objective value (summed quality of selected features)
and the feature selection.
To assess how strongly the constraints cut down the space of valid solutions,
we can use `compute_solution_fraction()`.
However, this function iterates over each solution candidate and checks whether it is valid or not,
which becomes very expensive with a growing number of features.
Alternatively, `estimate_solution_fraction()` randomly samples solutions to estimate this quantity.

Our code snippet also shows that you can remove all constraints via `clear_constraints()` without setting up a new optimization problem.
You can also add further constraints after optimization and then optimize again.
The optimizer keeps its state between optimizations, so you may benefit from a warm start.

## Developer Info

New operators / expressions types for constraints in `expressions.py` should (directly or indirectly)
inherit from the top-level superclass `Expression`. Preferably, inherit from one of its subclasses

- `BooleanExpression` (if your operator returns a boolean value; need to override `is_true()`) or
- `ArithmeticExpression` (if your operator returns a numeric value; need to override `get_value()`).

`is_true()` or `get_value()` express the logic of your operator, i.e., how it will be evaluated
(typically depending on the values of child expressions serving as operands).
Also, you need to override `get_children()` to return all (direct) child expressions,
so it is possible to traverse the full expression tree.

`solving.py` supports adding arbitrary `BooleanExpression`s from `expressions.py` as constraints.

New operators / expression types for constraints in `combi_expressions.py` should simultaneously inherit from
the superclass `BooleanExpression` in this module and one expression class from `expressions.py`.
Arithmetic expressions currently do not exist on their own in this module,
but only nested into more complex boolean expressions (like "weighted sum <= some threshold").
You need to initialize the `expressions.py` expression (preferably in the initializer by calling `super().__init__()`)
and store a corresponding `Z3` expression in the field `z3_expr`;
use `get_z3()` to access `Z3` representations of your child expressions serving as operands.

`combi_solving.py` supports adding arbitrary `BooleanExpression`s from `combi_expressions.py` as constraints.
Also, it supports arbitrary univariate feature qualities (e.g., computed with `feature_qualities.py`)
to formulate a univariate (= linear) objective function.

New functions for univariate feature qualities in `feature_qualities.py` should satisfy the following implicit interface:

- Two inputs: Dataset `X` (`pandas.DataFrame`) and prediction target `y` (`pandas.Series`).
- Output is a sequence (e.g., list) of floats whose length equals the number of columns in `X`.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "cffs",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "feature selection, constraints",
    "author": null,
    "author_email": "Jakob Bach <jakob.bach.ka@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/7c/29/afd5a61e686cc6fd57df9ba13cfc7ad3e7b9d78859647633ec41a78aeb36/cffs-1.0.0.tar.gz",
    "platform": null,
    "description": "# `cffs` -- A Python Package for Constrained Filter Feature Selection\r\n\r\nThe package `cffs` contains classes to formulate and solve constrained-feature-selection problems.\r\nIn particular, we support univariate filter feature selection as the objective\r\nand constraints from propositional logic and linear arithmetic.\r\nWe use an SMT (Satisfiability Modulo Theories) solver to find optimal feature sets under the constraints.\r\n\r\nThis document provides:\r\n\r\n- Steps for [setting up](#setup) the package.\r\n- A short [overview](#functionality) of the (constrained-feature-selection) functionality.\r\n- A [demo](#demo) of the functionality.\r\n- [Guidelines for developers](#developer-info) who want to modify or extend the code base.\r\n\r\nIf you use this package for a scientific publication, please cite [our paper](https://doi.org/10.1007/s42979-022-01338-z)\r\n\r\n```\r\n@article{bach2022empirical,\r\n  title={An Empirical Evaluation of Constrained Feature Selection},\r\n  author={Bach, Jakob and Zoller, Kolja and Trittenbach, Holger and Schulz, Katrin and B{\\\"o}hm, Klemens},\r\n  journal={SN Computer Science},\r\n  volume={3},\r\n  number={6},\r\n  year={2022},\r\n  doi={10.1007/s42979-022-01338-z}\r\n}\r\n```\r\n\r\n## Setup\r\n\r\nYou can install our package from [PyPI](https://pypi.org/):\r\n\r\n```\r\npython -m pip install cffs\r\n```\r\n\r\nAlternatively, you can install the package from GitHub:\r\n\r\n```bash\r\npython -m pip install git+https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection.git#subdirectory=src/cffs_package\r\n```\r\n\r\nIf you already have the source code for the package (i.e., the directory in which this `README` resides)\r\nas a local directory on your computer (e.g., after cloning the project), you can also perform a local install:\r\n\r\n```bash\r\npython -m pip install cffs_package/\r\n```\r\n\r\n## Functionality\r\n\r\nThe package `cffs` contains the following modules:\r\n\r\n- `combi_expressions.py`: Formulate constraints in propositional logic and linear arithmetic,\r\n  simultaneously for our own expression classes (`expressions.py`) and the solver `Z3`.\r\n  These constraints may be added to an optimization problem in `combi_solving.py`.\r\n- `combi_solving.py`: Formulate a constrained-filter-feature-selection *optimization* problem\r\n  with constraints from `combi_expressions.py`.\r\n  Count the number of solutions with our own implementation (`solving.py`) and optimize the problem with `Z3`.\r\n- `expressions.py`: Formulate constraints in propositional logic and linear arithmetic,\r\n  using our own expression classes.\r\n  These constraints may be added to a satisfiability problem in `solving.py`.\r\n- `feature_qualities.py`: Compute univariate feature qualities for the (linear) optimization objective.\r\n- `solving.py`: Formulate a constrained-filter-feature-selection *satisfiability* problem\r\n  with constraints from `expressions.py`.\r\n  Count the number of solutions with our own implementation; optimization is not supported.\r\n\r\n## Demo\r\n\r\nFor constrained filter feature selection, we first need to compute an individual quality score for each feature.\r\n(We only consider univariate feature selection, so we ignore interactions between features.)\r\nThe objective value is the summed quality of all selected features.\r\nTo this end, `cffs.feature_qualities` provides functions for\r\n\r\n- the absolute value of Pearson correlation between each feature and the prediction target (`abs_corr()`)\r\n- the mutual information between each feature and the prediction target (`mut_info()`)\r\n\r\nBoth functions round the qualities to two digits to speed up solving.\r\n(We found that the solver becomes slower the more precise the floats are,\r\nas they are represented as rational numbers in the solver.)\r\nAs inputs, the quality functions require a dataset in X-y form (as used in `sklearn`).\r\n\r\nAfter computing feature qualities, we set up an SMT optimization problem from `cffs.combi_solving`.\r\nIt is \"combi\" in the sense that our code wraps an existing SMT solver (`Z3`).\r\nWe retrieve the problem's decision variables (one binary variable for each feature) and use them to\r\nformulate constraints with `cffs.combi_expressions`.\r\nThese constraints are added to `Z3` but also to our own expression tree,\r\nwhich we use to count the number of valid solutions in the search space.\r\nFinally, we start optimization.\r\n\r\n```python\r\nfrom cffs.combi_expressions import And, AtMost, Xor\r\nfrom cffs.combi_solving import Problem\r\nfrom cffs.feature_qualities import mut_info\r\nimport sklearn.datasets\r\n\r\nX, y = sklearn.datasets.load_iris(as_frame=True, return_X_y=True)\r\nfeature_qualities = mut_info(X=X, y=y)\r\nproblem = Problem(variable_names=X.columns, qualities=feature_qualities)\r\n\r\nprint('--- Constrained problem ---')\r\nvariables = problem.get_variables()\r\nproblem.add_constraint(AtMost(variables, 2))\r\nproblem.add_constraint(And([Xor(variables[0], variables[1]), Xor(variables[2], variables[3])]))\r\nprint(problem.optimize())\r\nprint('Number of constraints:', problem.get_num_constraints())\r\nprint('Fraction of valid solutions:', problem.compute_solution_fraction())\r\n\r\nprint('\\n--- Unconstrained problem ---')\r\nproblem.clear_constraints()\r\nprint(problem.optimize())\r\nprint('Number of constraints:', problem.get_num_constraints())\r\nprint('Fraction of valid solutions:', problem.compute_solution_fraction())\r\n```\r\n\r\nThe output is the following:\r\n\r\n```\r\n--- Constrained problem ---\r\n{'objective_value': 1.48, 'num_selected': 2, 'selected': ['petal width (cm)', 'sepal length (cm)']}\r\nNumber of constraints: 2\r\nFraction of valid solutions: 0.25\r\n\r\n--- Unconstrained problem ---\r\n{'objective_value': 2.71, 'num_selected': 4, 'selected': ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']}\r\nNumber of constraints: 0\r\nFraction of valid solutions: 1.0\r\n```\r\n\r\nThe optimization procedure returns the objective value (summed quality of selected features)\r\nand the feature selection.\r\nTo assess how strongly the constraints cut down the space of valid solutions,\r\nwe can use `compute_solution_fraction()`.\r\nHowever, this function iterates over each solution candidate and checks whether it is valid or not,\r\nwhich becomes very expensive with a growing number of features.\r\nAlternatively, `estimate_solution_fraction()` randomly samples solutions to estimate this quantity.\r\n\r\nOur code snippet also shows that you can remove all constraints via `clear_constraints()` without setting up a new optimization problem.\r\nYou can also add further constraints after optimization and then optimize again.\r\nThe optimizer keeps its state between optimizations, so you may benefit from a warm start.\r\n\r\n## Developer Info\r\n\r\nNew operators / expressions types for constraints in `expressions.py` should (directly or indirectly)\r\ninherit from the top-level superclass `Expression`. Preferably, inherit from one of its subclasses\r\n\r\n- `BooleanExpression` (if your operator returns a boolean value; need to override `is_true()`) or\r\n- `ArithmeticExpression` (if your operator returns a numeric value; need to override `get_value()`).\r\n\r\n`is_true()` or `get_value()` express the logic of your operator, i.e., how it will be evaluated\r\n(typically depending on the values of child expressions serving as operands).\r\nAlso, you need to override `get_children()` to return all (direct) child expressions,\r\nso it is possible to traverse the full expression tree.\r\n\r\n`solving.py` supports adding arbitrary `BooleanExpression`s from `expressions.py` as constraints.\r\n\r\nNew operators / expression types for constraints in `combi_expressions.py` should simultaneously inherit from\r\nthe superclass `BooleanExpression` in this module and one expression class from `expressions.py`.\r\nArithmetic expressions currently do not exist on their own in this module,\r\nbut only nested into more complex boolean expressions (like \"weighted sum <= some threshold\").\r\nYou need to initialize the `expressions.py` expression (preferably in the initializer by calling `super().__init__()`)\r\nand store a corresponding `Z3` expression in the field `z3_expr`;\r\nuse `get_z3()` to access `Z3` representations of your child expressions serving as operands.\r\n\r\n`combi_solving.py` supports adding arbitrary `BooleanExpression`s from `combi_expressions.py` as constraints.\r\nAlso, it supports arbitrary univariate feature qualities (e.g., computed with `feature_qualities.py`)\r\nto formulate a univariate (= linear) objective function.\r\n\r\nNew functions for univariate feature qualities in `feature_qualities.py` should satisfy the following implicit interface:\r\n\r\n- Two inputs: Dataset `X` (`pandas.DataFrame`) and prediction target `y` (`pandas.Series`).\r\n- Output is a sequence (e.g., list) of floats whose length equals the number of columns in `X`.\r\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Constrained feature selection",
    "version": "1.0.0",
    "project_urls": {
        "Homepage": "https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/master/src/cffs_package",
        "Issues": "https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/issues"
    },
    "split_keywords": [
        "feature selection",
        " constraints"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "549b46fa15f708039f977dca1f8a42f7cc0961d7f43422bf4e55d17a048c6f94",
                "md5": "3620f4a280b81856a58efa2ce6166d57",
                "sha256": "ef5f49d25c33c5145e42469bb320b1aaedf148bd3c0e5b71a2d2a5faff70480a"
            },
            "downloads": -1,
            "filename": "cffs-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3620f4a280b81856a58efa2ce6166d57",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 14664,
            "upload_time": "2024-10-04T08:18:31",
            "upload_time_iso_8601": "2024-10-04T08:18:31.555767Z",
            "url": "https://files.pythonhosted.org/packages/54/9b/46fa15f708039f977dca1f8a42f7cc0961d7f43422bf4e55d17a048c6f94/cffs-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7c29afd5a61e686cc6fd57df9ba13cfc7ad3e7b9d78859647633ec41a78aeb36",
                "md5": "d0204e779d345360cb579589e4d362e3",
                "sha256": "8540c16a6488c350ed5c94ea4f98061f77357e420aedb43d37469c112f32d004"
            },
            "downloads": -1,
            "filename": "cffs-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "d0204e779d345360cb579589e4d362e3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 14783,
            "upload_time": "2024-10-04T08:18:33",
            "upload_time_iso_8601": "2024-10-04T08:18:33.564563Z",
            "url": "https://files.pythonhosted.org/packages/7c/29/afd5a61e686cc6fd57df9ba13cfc7ad3e7b9d78859647633ec41a78aeb36/cffs-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-04 08:18:33",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Jakob-Bach",
    "github_project": "Constrained-Filter-Feature-Selection",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "cffs"
}
        
Elapsed time: 9.86791s