# Introduction
This library implements functions for removing collinearity from a dataset of features. It can be used for both supervised and unsupervised machine learning problems.
Collinearity is evaluated by calculating __Pearson's linear correlation coefficient__ between the features. The user sets a __threshold__, which is the maximum absolute value allowed for the correlation coefficients in the correlation matrix.
For __unsupervised problems__, the algorithm selects only those features that produce a correlation matrix whose off-diagonal elements are, in absolute value, less than the threshold.
For __supervised problems__, the importance of the features with respect to the target variable is calculated using a univariate approach. Then, features are added one by one using the same correlation criterion as the unsupervised approach, starting from the most important ones.
# Objects
The main object is __SelectNonCollinear__. It can be imported this way:
```python
from collinearity import SelectNonCollinear
```
> collinearity.__SelectNonCollinear__(_correlation_threshold=0.4, scoring=f_classif_)
Parameters:
__correlation_threshold : _float (between 0 and 1), default = 0.4___
Only those features that produce a correlation matrix with off-diagonal elements that are, in absolute value, less than this threshold will be chosen.
__scoring : _callable, default=f_classif___
The scoring function for supervised problems. It must be the same accepted by [sklearn.feature_selection.SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html).
# Methods
This object supports the main methods of scikit-learn Estimators:
> fit(X,y=None)
Identifies the features to keep. For supervised problems, _y_ is the target array and the algorithm is (a sketch follows the list):
- Sort the features by descending score
- Take the most important feature (i.e. the first one)
- Add the next feature if its linear correlation coefficient with every already selected feature is, in absolute value, lower than the threshold
- Keep adding features as long as the correlation constraint holds
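A minimal sketch of this greedy procedure is shown below. The `greedy_select` helper is invented for illustration; it is not the library's actual implementation.

```python
import numpy as np

def greedy_select(X, scores, threshold):
    """Illustrative sketch of the supervised selection (hypothetical helper)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # absolute feature-feature correlations
    order = np.argsort(scores)[::-1]             # feature indices by descending score
    selected = [order[0]]                        # start from the most important feature
    for j in order[1:]:
        # add j only if it is weakly correlated with every feature selected so far
        if all(corr[j, k] < threshold for k in selected):
            selected.append(j)
    return sorted(selected)
```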
For unsupervised problems, we have `y = None` and the algorithm is (a sketch follows the list):
- Take the pair of features with the lowest absolute value of the linear correlation coefficient
- If that value is lower than the threshold, select both features
- Keep adding features as long as the correlation matrix doesn't show off-diagonal elements whose absolute value is greater than the threshold
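Again, a hedged sketch under the same assumptions (an illustrative helper, not the package's internals):

```python
import numpy as np

def unsupervised_select(X, threshold):
    """Illustrative sketch of the unsupervised selection (hypothetical helper)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = corr.shape[0]
    # start from the least correlated pair of features
    off = corr + np.eye(n)  # push the diagonal above 1 so argmin ignores it
    i, j = np.unravel_index(np.argmin(off), off.shape)
    if off[i, j] >= threshold:
        return []
    selected = [i, j]
    for k in range(n):
        if k in selected:
            continue
        # add k only if every off-diagonal element stays below the threshold
        if all(corr[k, s] < threshold for s in selected):
            selected.append(k)
    return sorted(selected)
```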
> transform(X)
Selects the features according to the result of _fit_. It must be called after _fit_.
> fit_transform(X,y=None)
Calls _fit_ and then _transform_.
> get_support()
Returns a boolean array of size X.shape[1]. A feature is selected if the value at its index is _True_; otherwise it's not selected.
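For instance, assuming an already fitted selector, one could recover the selected column indices like this (the variable names are illustrative):

```python
import numpy as np

mask = selector.get_support()        # boolean mask over the columns of X
selected_idx = np.flatnonzero(mask)  # indices of the selected features
print(selected_idx)
```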
# Examples
The following examples show how the main object works. Run this code first to initialize the environment:
```python
from collinearity import SelectNonCollinear
from sklearn.feature_selection import f_regression
import numpy as np
from sklearn.datasets import load_diabetes
X,y = load_diabetes(return_X_y=True)
```
## Unsupervised problems
This example shows how to perform selection according to minimum collinearity in unsupervised problems.
Let's consider, for this example, a threshold equal to 0.3.
```python
selector = SelectNonCollinear(0.3)
```
If we apply the selection to the features and calculate the correlation matrix, we have:
```python
np.corrcoef(selector.fit_transform(X),rowvar=False)
# array([[1. , 0.1737371 , 0.18508467, 0.26006082],
# [0.1737371 , 1. , 0.0881614 , 0.03527682],
# [0.18508467, 0.0881614 , 1. , 0.24977742],
# [0.26006082, 0.03527682, 0.24977742, 1. ]])
```
As we can see, no off-diagonal element is greater than the threshold in absolute value.
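This can also be verified programmatically; the snippet below (illustrative, not part of the library) checks the largest absolute off-diagonal value:

```python
corr = np.corrcoef(selector.transform(X), rowvar=False)
off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]  # drop the diagonal
print(np.abs(off_diagonal).max())  # expected to be below 0.3
```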
## Supervised problems
For supervised problems, we must set the `scoring` argument in the constructor.
Let's consider a threshold equal to 0.4 and a scoring equal to `f_regression`.
```python
selector = SelectNonCollinear(correlation_threshold=0.4,scoring=f_regression)
selector.fit(X,y)
```
The correlation matrix is:
```python
np.corrcoef(selector.transform(X),rowvar=False)
# array([[ 1. , 0.1737371 , 0.18508467, 0.33542671, 0.26006082,
# -0.07518097, 0.30173101],
# [ 0.1737371 , 1. , 0.0881614 , 0.24101317, 0.03527682,
# -0.37908963, 0.20813322],
# [ 0.18508467, 0.0881614 , 1. , 0.39541532, 0.24977742,
# -0.36681098, 0.38867999],
# [ 0.33542671, 0.24101317, 0.39541532, 1. , 0.24246971,
# -0.17876121, 0.39042938],
# [ 0.26006082, 0.03527682, 0.24977742, 0.24246971, 1. ,
# 0.05151936, 0.32571675],
# [-0.07518097, -0.37908963, -0.36681098, -0.17876121, 0.05151936,
# 1. , -0.2736973 ],
# [ 0.30173101, 0.20813322, 0.38867999, 0.39042938, 0.32571675,
# -0.2736973 , 1. ]])
```
Again, no off-diagonal element is greater than the threshold in absolute value.
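To see which of the original columns were kept, one could inspect the support mask (a usage sketch; the exact mask depends on the fitted selector):

```python
mask = selector.get_support()
print(np.flatnonzero(mask))  # indices of the selected features among the inputs
```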
## Use in pipelines
It's possible to use `SelectNonCollinear` inside a pipeline, if necessary. Remember to import the required scikit-learn objects:
```python
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

pipeline = make_pipeline(SelectNonCollinear(correlation_threshold=0.4, scoring=f_regression), LinearRegression())
```
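The pipeline can then be used like any scikit-learn estimator, for example (a hedged usage sketch):

```python
pipeline.fit(X, y)
print(pipeline.score(X, y))  # R^2 of the linear model on the selected features
```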
# Contact the author
For any questions, you can contact me at gianluca.malato@gmail.com.