<p align="center">
<img src="./docs/img/mrmr_logo_white_bck.png" alt="drawing" width="450"/>
</p>
## What is mRMR
*mRMR*, which stands for "minimum Redundancy - Maximum Relevance", is a feature selection algorithm.
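In a nutshell, *mRMR* works iteratively: at each step it picks the feature that is most relevant to the target and least redundant with the features already selected. The toy sketch below illustrates this idea using the F-statistic for relevance and mean absolute correlation for redundancy (the "FCQ" scheme from the literature; it is meant as an illustration, not as this package's exact implementation):
```python
# toy mRMR: greedy selection balancing relevance and redundancy
import pandas as pd
from sklearn.feature_selection import f_classif

def mrmr_sketch(X: pd.DataFrame, y: pd.Series, K: int) -> list:
    # relevance of each feature: F-statistic w.r.t. the target
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    selected, candidates = [], list(X.columns)
    for _ in range(min(K, len(candidates))):
        if selected:
            # redundancy: mean absolute correlation with already-selected features
            redundancy = pd.Series(
                {c: X[selected].corrwith(X[c]).abs().mean() for c in candidates}
            )
            scores = relevance[candidates] / redundancy.clip(lower=1e-6)
        else:
            scores = relevance[candidates]  # first pick: pure relevance
        best = scores.idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected
```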
## Why is it unique
The distinguishing trait of *mRMR* is that it is a **minimal-optimal** feature selection algorithm. <br/>
This means it is designed to find the smallest relevant subset of features for a given Machine Learning task.
Selecting the minimum number of useful features is desirable for many reasons:
- lower memory consumption,
- shorter training and inference times,
- often better predictive performance,
- more explainable results.
This is why a minimal-optimal method such as *mRMR* is often preferable.
By contrast, most other methods (for instance, Boruta or Positive-Feature-Importance) are classified as **all-relevant**,
since they identify all the features that have some kind of relationship with the target variable.
## When to use mRMR
Due to its efficiency, *mRMR* is ideal for practical ML applications,
where feature selection must be performed frequently, automatically,
and in a relatively short amount of time.
For instance, in **2019**, **Uber** engineers published a paper describing how they implemented
*mRMR* in their marketing machine learning platform: [Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform](https://eng.uber.com/research/maximum-relevance-and-minimum-redundancy-feature-selection-methods-for-a-marketing-machine-learning-platform/).
## How to install this package
You can install this package in your environment via pip:
<pre>
pip install mrmr_selection
</pre>
Then import it in Python with:
<pre>
import mrmr
</pre>
## How to use this package
This package implements *mRMR* selection on top of several data tools, so you can pick the one that fits your needs and constraints.
Currently, the following tools are supported (others will be added):
- **Pandas**
- **Polars**
- **Spark**
- **Google BigQuery**
The package has a module for each supported tool. Each module has *at least* these 2 functions:
- `mrmr_classif`, for feature selection when the target variable is categorical (binary or multiclass).
- `mrmr_regression`, for feature selection when the target variable is numeric.
Let's see some examples.
#### 1. Pandas example
You have a Pandas DataFrame (`X`) and a Series which is your target variable (`y`).
You want to select the best `K` features to make predictions on `y`.
```python
# create some pandas data
import pandas as pd
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, n_redundant=40)
X = pd.DataFrame(X)
y = pd.Series(y)
# select top 10 features using mRMR
from mrmr import mrmr_classif
selected_features = mrmr_classif(X=X, y=y, K=10)
```
Note: the output of `mrmr_classif` is a list containing the `K` selected features. This list is a **ranking**: if you want to make a further selection, simply take its first elements.
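For example, keeping only the best 5 of the 10 features returned above is just a slice:
```python
# the returned list is ordered by selection step, so the best features come first
top_5_features = selected_features[:5]
```
`mrmr_regression` works the same way when the target is numeric. A minimal sketch on synthetic data (the call below assumes the `mrmr.pandas` module mirrors the `mrmr_classif` signature shown above):
```python
# select top 10 features for a numeric target using mRMR
import pandas as pd
from sklearn.datasets import make_regression
import mrmr

X_reg, y_reg = make_regression(n_samples=1000, n_features=50, n_informative=10)
X_reg = pd.DataFrame(X_reg)
y_reg = pd.Series(y_reg)
selected_features = mrmr.pandas.mrmr_regression(X=X_reg, y=y_reg, K=10)
```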
#### 2. Polars example
```python
# create some polars data
import polars
data = [(1.0, 1.0, 1.0, 7.0, 1.5, -2.3),
        (2.0, None, 2.0, 7.0, 8.5, 6.7),
        (2.0, None, 3.0, 7.0, -2.3, 4.4),
        (3.0, 4.0, 3.0, 7.0, 0.0, 0.0),
        (4.0, 5.0, 4.0, 7.0, 12.1, -5.2)]
columns = ["target", "some_null", "feature", "constant", "other_feature", "another_feature"]
df_polars = polars.DataFrame(data=data, schema=columns)
# select top 2 features using mRMR
import mrmr
selected_features = mrmr.polars.mrmr_regression(df=df_polars, target_column="target", K=2)
```
#### 3. Spark example
```python
# create some spark data
import pyspark
session = pyspark.sql.SparkSession.builder.getOrCreate()
data = [(1.0, 1.0, 1.0, 7.0, 1.5, -2.3),
        (2.0, float('NaN'), 2.0, 7.0, 8.5, 6.7),
        (2.0, float('NaN'), 3.0, 7.0, -2.3, 4.4),
        (3.0, 4.0, 3.0, 7.0, 0.0, 0.0),
        (4.0, 5.0, 4.0, 7.0, 12.1, -5.2)]
columns = ["target", "some_null", "feature", "constant", "other_feature", "another_feature"]
df_spark = session.createDataFrame(data=data, schema=columns)
# select top 2 features using mRMR
import mrmr
selected_features = mrmr.spark.mrmr_regression(df=df_spark, target_column="target", K=2)
```
#### 4. Google BigQuery example
```python
# initialize BigQuery client
from google.cloud.bigquery import Client
bq_client = Client(credentials=your_credentials)  # your_credentials: your google.auth credentials object
# select top 20 features using mRMR
import mrmr
selected_features = mrmr.bigquery.mrmr_regression(
    bq_client=bq_client,
    table_id='bigquery-public-data.covid19_open_data.covid19_open_data',
    target_column='new_deceased',
    K=20
)
```
## Reference
For an easy-going introduction to *mRMR*, read my article on **Towards Data Science**: [“MRMR” Explained Exactly How You Wished Someone Explained to You](https://towardsdatascience.com/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b).
Also, this article describes an example of *mRMR* used on the world-famous **MNIST** dataset: [Feature Selection: How To Throw Away 95% of Your Data and Get 95% Accuracy](https://towardsdatascience.com/feature-selection-how-to-throw-away-95-of-your-data-and-get-95-accuracy-ad41ca016877).
*mRMR* was born in **2003**; this is the original paper: [Minimum Redundancy Feature Selection From Microarray Gene Expression Data](https://www.researchgate.net/publication/4033100_Minimum_Redundancy_Feature_Selection_From_Microarray_Gene_Expression_Data).
Since then, it has been used in many practical applications, due to its simplicity and effectiveness.
For instance, in **2019**, **Uber** engineers published a paper describing how they implemented *mRMR* in their marketing machine learning platform: [Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform](https://eng.uber.com/research/maximum-relevance-and-minimum-redundancy-feature-selection-methods-for-a-marketing-machine-learning-platform/).