# Gram-Schmidt Boruta Feature Selection

The `gsbfs` package provides two simple functions:

- `gso_rank(X_input, y_output)` that ranks candidate features.
- `gso_boruta_select(X_input, y_output)` that selects features more relevant than random probes.

Where `X_input` is the matrix of input features and `y_output` is the vector of output targets.
The use of the Gram-Schmidt Orthogonalization (GSO) procedure for ranking the variables of a model
that is linear with respect to its parameters was first described by
[Chen et al. (1989)](https://www.tandfonline.com/doi/abs/10.1080/00207178908953472).
The features are automatically selected using the Boruta algorithm described by
[Kursa & Rudnicki (2010)](https://www.jstatsoft.org/article/view/v036i11).

[![PyPI - Version](https://img.shields.io/pypi/v/gsbfs.svg)](https://pypi.org/project/gsbfs)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/gsbfs.svg)](https://pypi.org/project/gsbfs)

## Installation

The easiest way to get `gsbfs` is to use pip:
```console
$ pip install gsbfs
```
That will install the `gsbfs` package along with all the required dependencies.

## GSO Feature Ranking

We consider a model with P candidate features. The input matrix X has P columns (i.e. Xp input vectors)
corresponding to the P features, and N rows corresponding to the N observations. 
The vector y is an N-vector that contains the values of the output for each observation.
All Xp vectors and the y vector must be centered before proceeding with the GSO algorithm.

The first iteration of the procedure consists of finding the feature vector that "best explains" the output target,
i.e. the one that has the smallest angle with the output vector in the N-dimensional space of observations.
To this end, we calculate cos(Xp, y)**2, the squared cosine of the angle between each feature vector and
the target vector. The feature vector for which this quantity is largest is selected.

To discard what is already explained by the first selected vector, the P-1 remaining input vectors,
and the output vector, are projected onto the subspace (of dimension N-1) orthogonal to the selected feature.
Then, in that subspace, the projected input vector that best explains the projected output is selected,
and the P-2 remaining feature vectors are projected onto the subspace orthogonal to the first two ranked vectors.
The procedure is repeated until all P input vectors are ranked.
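
For illustration only, here is a minimal NumPy sketch of the ranking loop described above. The function name `gso_rank_sketch` is hypothetical, and this is not the `gsbfs` implementation:

```python
import numpy as np

def gso_rank_sketch(X, y):
    """Illustrative GSO ranking sketch; not the gsbfs implementation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = X - X.mean(axis=0)              # center each feature vector
    y = y - y.mean()                    # center the target vector
    eps = 1e-12                         # guards against zero-norm vectors
    remaining = list(range(X.shape[1]))
    ranked, cos_sq_max = [], []
    while remaining:
        # squared cosine of the angle between each candidate and the target
        cos_sq = [(X[:, j] @ y) ** 2 / ((X[:, j] @ X[:, j]) * (y @ y) + eps)
                  for j in remaining]
        best = remaining[int(np.argmax(cos_sq))]
        ranked.append(best)
        cos_sq_max.append(max(cos_sq))
        remaining.remove(best)
        # project the remaining candidates and the target onto the
        # subspace orthogonal to the selected feature
        v = X[:, best] / (np.linalg.norm(X[:, best]) + eps)
        for j in remaining:
            X[:, j] -= (X[:, j] @ v) * v
        y -= (y @ v) * v
    return np.array(ranked), np.array(cos_sq_max)
```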

## Boruta Feature Selection

The Boruta algorithm selects features that are more relevant than random probes. The latter are created by
randomly shuffling each feature of the *existing* observations. Each Xp input vector generates a new "shadow"
input vector (i.e. random probe) that is statistically irrelevant since the values of the feature for each
observation have been randomly permuted. The resulting input matrix X has 2*P columns (and still N rows),
combining the original vectors and the shadow vectors.
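
As an illustration, such an extended matrix can be built with NumPy as follows, assuming an input matrix X with N rows and P columns (a sketch, not the package's internal code):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Permute each column independently to create the shadow features,
# then append them to the original matrix: X_ext has 2*P columns.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_ext = np.hstack([X, X_shadow])
```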

All the candidate features are ranked following the GSO procedure described above.
The threshold for discarding features is then defined as the highest (best) rank recorded among the shadow features.
When an original feature ranks higher than this threshold (i.e. better than every random probe),
it scores a "hit".

However, let us consider two arbitrary vectors v1 and v2, and a random vector vr.
It is important to note that there is still a probability of 50% that cos(v1, vr)**2 > cos(v1, v2)**2,
so a "lucky" random probe can obtain a better rank than an original feature.
Therefore, the selection method consists of repeating the experiment (random shadow vectors + ranking) n times
and counting the number of "hits" for each feature. Since each independent experiment gives a
binary outcome (hit or no hit), a series of n trials follows a binomial distribution.
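
Combining the two sketches above, the hit counts over n trials could be accumulated as follows (again purely illustrative; the trial count and the package's actual internals may differ):

```python
n_trials = 20
P = X.shape[1]
hits = np.zeros(P, dtype=int)
for _ in range(n_trials):
    # fresh shadow features for every trial
    X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(P)])
    ranked, _ = gso_rank_sketch(np.hstack([X, X_shadow]), y)
    # position[i] = rank position of column i (0 = best)
    position = np.empty(2 * P, dtype=int)
    position[ranked] = np.arange(2 * P)
    best_shadow = position[P:].min()    # best rank among the shadows
    hits += position[:P] < best_shadow  # hit: better than every probe
```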

The statistical criterion for feature selection is then the maximum value of the probability mass function
at which a hit count is considered predictive (right tail) or, the binomial PMF being symmetrical,
non-predictive (left tail). The fraction of the PMF that is neither predictive nor non-predictive
is considered indecisive. For instance, with 20 trials and a maximum probability of 0.5%,
features scoring 16-20 hits will be selected, and features scoring 0-4 hits will be rejected.
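
These cut-offs can be checked against the binomial PMF, for instance with SciPy (an illustration assuming SciPy is installed; it is not a `gsbfs` dependency):

```python
import numpy as np
from scipy.stats import binom

n_trials, p_max = 20, 0.005            # 20 trials, 0.5% maximum probability
k = np.arange(n_trials + 1)
pmf = binom.pmf(k, n_trials, 0.5)      # hit counts under the null are Binomial(n, 0.5)
reject_max = k[(pmf < p_max) & (k < n_trials / 2)].max()  # -> 4
select_min = k[(pmf < p_max) & (k > n_trials / 2)].min()  # -> 16
print(reject_max, select_min)          # <= 4 hits rejected, >= 16 hits selected
```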

## Usage Example

Let us create a data set consisting of 50 features, including only 10 informative features,
and 5000 observations. The features will be ranked by running:

```python
from gsbfs.gsbfs import gso_rank, gso_boruta_select
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=50, n_informative=10, shuffle=False)
ranked_indexes, cos_sq_max = gso_rank(X, y)
```

which returns an array containing the ranked feature indexes (along with the corresponding squared cosines); see the `gso_rank()` function documentation.

Alternatively, the features can be selected by running:

```python
rejected_indexes, selected_indexes, indecisive_indexes = gso_boruta_select(X, y)
```
which returns arrays containing the rejected, selected, and indecisive feature indexes; see the `gso_boruta_select()`
function documentation. Since the process can take several minutes, or even hours, to complete when X is very large,
the function provides a `verbose` option to report its progress.
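For example (assuming `verbose` is passed as a keyword argument; see the function documentation for the exact signature):

```python
rejected_indexes, selected_indexes, indecisive_indexes = gso_boruta_select(X, y, verbose=True)
```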

Other usage examples are provided in [example.py](example.py).

Please note that the selection process relies on random probes, so running the procedure multiple
times may yield different results. Moreover, when the number of observations is significantly greater than the
number of features (N >> P), the informative features are more likely to be selected.

## License

`gsbfs` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.

            
