collinearity


- Name: collinearity
- Version: 0.6.1
- Home page: https://github.com/gianlucamalato/collinearity
- Summary: A Python library for removing collinearity in machine learning datasets
- Upload time: 2021-06-25 10:31:45
- Author: Gianluca Malato
- License: MIT
- Keywords: machine learning, collinearity, supervised models
- Requirements: No requirements were recorded.

# Introduction

This library implements functions for removing collinearity from a dataset of features. It can be used for both supervised and unsupervised machine learning problems.

Collinearity is evaluated by calculating __Pearson's linear correlation coefficient__ between the features. The user sets a __threshold__, which is the maximum absolute value allowed for the correlation coefficients in the correlation matrix.
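
To make the threshold criterion concrete, here is a minimal NumPy sketch that computes the largest absolute off-diagonal Pearson correlation of a feature matrix and compares it with a threshold. The helper name `max_abs_offdiag_corr` and the synthetic data are assumptions made for illustration; they are not part of the library.

```python
import numpy as np

def max_abs_offdiag_corr(X):
    """Largest absolute Pearson correlation between two distinct columns of X."""
    corr = np.corrcoef(X, rowvar=False)            # columns are features
    off_diag = ~np.eye(corr.shape[0], dtype=bool)  # mask out the diagonal of ones
    return np.abs(corr[off_diag]).max()

# Synthetic data (illustrative only): the third column nearly duplicates the first.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X_demo = np.column_stack([a, b, 2.0 * a + 0.01 * b])

threshold = 0.4
print(max_abs_offdiag_corr(X_demo) < threshold)         # all three columns
print(max_abs_offdiag_corr(X_demo[:, :2]) < threshold)  # first two columns only
```

A feature set satisfies the criterion only when this maximum stays below the threshold.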

For __unsupervised problems__, the algorithm selects only those features that produce a correlation matrix whose off-diagonal elements are, in absolute value, less than the threshold. 

For __supervised problems__, the importance of the features with respect to the target variable is calculated using a univariate approach. Then, the features are added with the same unsupervised approach, starting from the most important ones.

# Objects

The main object is __SelectNonCollinear__. It can be imported this way:

```python
from collinearity import SelectNonCollinear
```

> collinearity.__SelectNonCollinear__(_correlation_threshold=0.4, scoring=f_classif_)

Parameters:

__correlation_threshold : _float (between 0 and 1), default = 0.4___

Only those features that produce a correlation matrix with off-diagonal elements that are, in absolute value, less than this threshold will be chosen.

__scoring : _callable, default=f_classif___

The scoring function for supervised problems. It must be a function of the same kind accepted by [sklearn.feature_selection.SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html).
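
As an illustration of what such a scoring function returns, the sketch below calls scikit-learn's `f_regression` directly on synthetic data. It only shows the scorer's output; how `SelectNonCollinear` consumes that output internally is not documented here, and the data and variable names are assumptions.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data (illustrative only): only feature 0 drives the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# Univariate scorers of this kind return one score and one p-value per feature.
scores, p_values = f_regression(X_demo, y_demo)

# Feature indices ordered from most to least important.
ranking = np.argsort(scores)[::-1]
print(scores.shape, p_values.shape, ranking)
```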

# Methods

This object supports the main methods of scikit-learn Estimators:

> fit(X,y=None)

Identifies the features to select. For supervised problems, _y_ is the target array and the algorithm is:
- Sort the features by score in descending order
- Take the most important feature (i.e. the first feature)
- Take the next feature if its linear correlation coefficient with every already selected feature is, in absolute value, lower than the threshold
- Keep adding features as long as the correlation constraint holds
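
The supervised steps above can be sketched in plain NumPy as follows. This is an illustration of the described procedure, not the library's actual implementation: the function name `select_supervised` is hypothetical, the scores are passed in as a ready-made array, and the sketch assumes that a feature violating the constraint is skipped while later features are still examined.

```python
import numpy as np

def select_supervised(X, scores, threshold):
    """Greedy sketch: visit features by descending score and keep those whose
    absolute correlation with every kept feature is below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    order = np.argsort(scores)[::-1]      # most important feature first
    selected = [int(order[0])]            # the top-scoring feature is always kept
    for j in order[1:]:
        # Assumption: a rejected feature is skipped, the scan continues.
        if np.all(corr[j, selected] < threshold):
            selected.append(int(j))
    return selected

# Synthetic data (illustrative only): column 1 nearly duplicates column 0.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 300))
X_demo = np.column_stack([a, a + 0.05 * b, b])
demo_scores = np.array([10.0, 8.0, 5.0])  # hypothetical importance scores

selected = select_supervised(X_demo, demo_scores, threshold=0.4)
print(selected)
```

In this toy case the second-ranked feature is dropped because it is almost a copy of the top-ranked one, while the lower-ranked but uncorrelated feature is kept.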

For unsupervised problems, we have `y = None` and the algorithm is:
- Take the pair of features with the lowest absolute value of the linear correlation coefficient.
- If that value is lower than the threshold, select both features.
- Keep adding features as long as the correlation matrix doesn't show off-diagonal elements whose absolute value is greater than the threshold.
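
A rough NumPy sketch of the unsupervised steps is given below. The text does not specify the order in which the remaining features are examined, nor what happens when even the least correlated pair exceeds the threshold, so this sketch makes two assumptions: candidates are visited in index order, and a single feature is returned in the fallback case. The function name `select_unsupervised` is hypothetical, and the library's real behaviour may differ.

```python
import numpy as np

def select_unsupervised(X, threshold):
    """Greedy sketch: start from the least correlated pair, then add every
    feature that stays below the threshold against all selected ones."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    masked = corr.copy()
    np.fill_diagonal(masked, np.inf)                 # ignore self-correlations
    i, j = np.unravel_index(np.argmin(masked), masked.shape)
    if masked[i, j] >= threshold:
        return [int(i)]                              # assumption: fall back to one feature
    selected = [int(i), int(j)]
    for k in range(corr.shape[0]):                   # assumption: index order
        if k not in selected and np.all(corr[k, selected] < threshold):
            selected.append(k)
    return sorted(selected)

# Synthetic data (illustrative only): column 1 nearly duplicates column 0.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 300))
X_demo = np.column_stack([a, a + 0.05 * b, b, c])

kept = select_unsupervised(X_demo, threshold=0.4)
print(kept)
```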

> transform(X)

Selects the features according to the result of _fit_. It must be called after _fit_.

> fit_transform(X,y=None)

Calls _fit_ and then _transform_.

> get_support()

Returns an array of _True_ and _False_ of size X.shape[1]. A feature is selected if the value on this array corresponding to its index is _True_, otherwise it's not selected.
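
A boolean mask of this shape can be used directly for NumPy indexing. The sketch below uses a hand-written mask in place of a real `get_support()` result, and the feature names are hypothetical; it only shows how such a mask maps back to column indices, columns, and names.

```python
import numpy as np

X_demo = np.arange(12).reshape(3, 4)                   # 3 samples, 4 features
mask = np.array([True, False, True, False])            # stand-in for get_support()
feature_names = np.array(["age", "bmi", "bp", "s1"])   # hypothetical names

selected_idx = np.flatnonzero(mask)     # integer indices of the kept features
X_selected = X_demo[:, mask]            # keep only the columns flagged True
selected_names = feature_names[mask]    # names of the kept features

print(selected_idx, selected_names, X_selected.shape)
```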

# Examples

The following examples show how `SelectNonCollinear` works. Run this setup code first:

```python
from collinearity import SelectNonCollinear
from sklearn.feature_selection import f_regression
import numpy as np
from sklearn.datasets import load_diabetes

X,y = load_diabetes(return_X_y=True)
```

## Unsupervised problems


This example shows how to perform selection according to minimum collinearity in unsupervised problems. 

Let's consider, for this example, a threshold equal to 0.3.

```python
selector = SelectNonCollinear(0.3)
```

If we apply the selection to the features and calculate the correlation matrix, we have:

```python
np.corrcoef(selector.fit_transform(X),rowvar=False)

# array([[1.       , 0.1737371 , 0.18508467, 0.26006082],
#       [0.1737371 , 1.        , 0.0881614 , 0.03527682],
#       [0.18508467, 0.0881614 , 1.        , 0.24977742],
#       [0.26006082, 0.03527682, 0.24977742, 1.        ]])

```
As we can see, no off-diagonal element is greater than the threshold in absolute value.

## Supervised problems

For supervised problems, we must set the value of the `scoring` argument in the constructor.

Let's consider a threshold equal to 0.4 and a scoring equal to `f_regression`.

```python
selector = SelectNonCollinear(correlation_threshold=0.4,scoring=f_regression)

selector.fit(X,y)
```

The correlation matrix is:
```python
np.corrcoef(selector.transform(X),rowvar=False)

# array([[ 1.       ,  0.1737371 ,  0.18508467,  0.33542671,  0.26006082,
#        -0.07518097,  0.30173101],
#       [ 0.1737371 ,  1.        ,  0.0881614 ,  0.24101317,  0.03527682,
#        -0.37908963,  0.20813322],
#       [ 0.18508467,  0.0881614 ,  1.        ,  0.39541532,  0.24977742,
#        -0.36681098,  0.38867999],
#       [ 0.33542671,  0.24101317,  0.39541532,  1.        ,  0.24246971,
#        -0.17876121,  0.39042938],
#       [ 0.26006082,  0.03527682,  0.24977742,  0.24246971,  1.        ,
#         0.05151936,  0.32571675],
#       [-0.07518097, -0.37908963, -0.36681098, -0.17876121,  0.05151936,
#         1.        , -0.2736973 ],
#       [ 0.30173101,  0.20813322,  0.38867999,  0.39042938,  0.32571675,
#        -0.2736973 ,  1.        ]])
```

Again, no off-diagonal element is greater than the threshold in absolute value.

## Use in pipelines

It's possible to use `SelectNonCollinear` inside a pipeline, if necessary.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(SelectNonCollinear(correlation_threshold=0.4, scoring=f_regression), LinearRegression())
```

# Contact the author

For any questions, you can contact me at gianluca.malato@gmail.com
            
