selective 1.2.0

Home page: https://github.com/fidelity/selective
Summary: feature selection library
Author: FMR LLC
Requires-Python: >=3.8
Upload time: 2025-09-04 20:23:41
Requirements: catboost, joblib, lightgbm, numpy, mip, pandas, scikit-learn, seaborn, statsmodels, textwiser, xgboost
[![ci](https://github.com/fidelity/selective/actions/workflows/ci.yml/badge.svg?branch=master)](https://github.com/fidelity/selective/actions/workflows/ci.yml) [![PyPI version fury.io](https://badge.fury.io/py/selective.svg)](https://pypi.python.org/pypi/selective/) [![PyPI license](https://img.shields.io/pypi/l/selective.svg)](https://pypi.python.org/pypi/selective/) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com) [![Downloads](https://static.pepy.tech/personalized-badge/selective?period=total&units=international_system&left_color=grey&right_color=orange&left_text=Downloads)](https://pepy.tech/project/selective)


# Selective: Feature Selection Library
**Selective** is a white-box feature selection library that supports supervised and unsupervised selection methods for classification and regression tasks. 

The library provides:

* Simple to complex selection methods: Variance, Correlation, Statistical, Linear, Tree-based, or Customized.
* [Text-based selection](#text-based-selection) to maximize diversity in text embeddings and metadata coverage.
* Operates directly on pandas data frames as input.
* Automated task detection. No need to know what feature selection method works with what machine learning task.
* Benchmarking multiple selectors using cross-validation with built-in parallelization.
* Inspection of the results and feature importance. 

Selective also provides optimized item selection based on diversity of text embeddings via [TextWiser](https://github.com/fidelity/textwiser) and 
coverage of binary labels via multi-objective optimization ([AMAI'24](https://doi.org/10.1007/s10472-024-09941-x), [CPAIOR'21](https://link.springer.com/chapter/10.1007/978-3-030-78230-6_27), [DSO@IJCAI'22](https://arxiv.org/abs/2112.03105)). This approach speeds up online experimentation and significantly boosts recommender systems, as presented at [NVIDIA GTC'22](https://www.youtube.com/watch?v=_v-B2nRy79w).

Selective is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments.

## Quick Start
```python
# Imports
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from feature.selector import Selective, SelectionMethod

# Data
data, label = get_data_label(fetch_california_housing())

# Feature selectors from simple to more complex
selector = Selective(SelectionMethod.Variance(threshold=0.0))
selector = Selective(SelectionMethod.Correlation(threshold=0.5, method="pearson"))
selector = Selective(SelectionMethod.Statistical(num_features=3, method="anova"))
selector = Selective(SelectionMethod.Linear(num_features=3, regularization="none"))
selector = Selective(SelectionMethod.TreeBased(num_features=3))
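# Note: each assignment above overwrites `selector`; the TreeBased
# selector on the last line is the one fitted below.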

# Feature reduction
subset = selector.fit_transform(data, label)
print("Reduction:", list(subset.columns))
print("Scores:", list(selector.get_absolute_scores()))
```
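Since `get_absolute_scores()` returns one score per input column (the Visualization section below relies on the same alignment), the scores can be indexed by `data.columns` to rank all features, not just the selected subset:

```python
import pandas as pd

# Rank every input feature by its absolute selection score
scores = pd.Series(selector.get_absolute_scores(), index=data.columns)
print(scores.sort_values(ascending=False))
```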


## Available Methods

|                                                           Method                                                           |                                                                                                                                                                                                                                                                                                                                                                                                                                        Options                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|:--------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| [Variance per Feature](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) |                                                                                                                                                                                                                                                                                                                                                                                                                                      `threshold`                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| [Pairwise Correlation between Features](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) | [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) <br> [Kendall Rank Correlation Coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient) <br> [Spearman's Rank Correlation Coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) |
|    [Statistical Analysis](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection)     |                                                                                                             [ANOVA F-test Classification](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html) <br> [F-value Regression](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html) <br> [Chi-Square](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) <br> [KL Divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) <br> [Mutual Information Classification](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) <br> [Variance Inflation Factor](https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html)                                                                                                               |
|                             [Linear Methods](https://en.wikipedia.org/wiki/Linear_regression)                              |                                                                                                   [Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linear%20regression#sklearn.linear_model.LinearRegression) <br> [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regression#sklearn.linear_model.LogisticRegression) <br> [Lasso Regularization](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso) <br> [Ridge Regularization](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) <br>                                                                                                    |
|                          [Tree-based Methods](https://scikit-learn.org/stable/modules/tree.html)                           | [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) <br> [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier) <br> [Extra Trees Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html) <br> [XGBoost](https://xgboost.readthedocs.io/en/latest/) <br> [LightGBM](https://lightgbm.readthedocs.io/en/latest/) <br> [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) <br> [CatBoost](https://github.com/catboost)<br> [Gradient Boosting Tree](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) <br> |
|  [Text-based Methods](https://link.springer.com/chapter/10.1007/978-3-030-78230-6_27)  |                                                                                                                                                                                                                                                                                                                                              `featurization_method` = [TextWiser](https://github.com/fidelity/textwiser) <br> `optimization_method = ["exact", "greedy", "kmeans", "random"]` <br> `cost_metric = ["unicost", "diverse"]`                                                                                                                                                                                                                                                                                                                                              |



## Benchmarking

```python
# Imports
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from xgboost import XGBClassifier, XGBRegressor
from feature.selector import SelectionMethod, benchmark, calculate_statistics

# Data
data, label = get_data_label(fetch_california_housing())

# Selectors
corr_threshold = 0.5
num_features = 3
tree_params = {"n_estimators": 50, "max_depth": 5, "random_state": 111, "n_jobs": 4}
selectors = {

  # Correlation methods
  "corr_pearson": SelectionMethod.Correlation(corr_threshold, method="pearson"),
  "corr_kendall": SelectionMethod.Correlation(corr_threshold, method="kendall"),
  "corr_spearman": SelectionMethod.Correlation(corr_threshold, method="spearman"),
  
  # Statistical methods
  "stat_anova": SelectionMethod.Statistical(num_features, method="anova"),
  "stat_chi_square": SelectionMethod.Statistical(num_features, method="chi_square"),
  "stat_kl_divergence": SelectionMethod.Statistical(num_features, method="kl_divergence"),
  "stat_mutual_info": SelectionMethod.Statistical(num_features, method="mutual_info"),
  
  # Linear methods
  "linear": SelectionMethod.Linear(num_features, regularization="none"),
  "lasso": SelectionMethod.Linear(num_features, regularization="lasso", alpha=1000),
  "ridge": SelectionMethod.Linear(num_features, regularization="ridge", alpha=1000),
  
  # Non-linear tree-based methods
  "random_forest": SelectionMethod.TreeBased(num_features),
  "xgboost_classif": SelectionMethod.TreeBased(num_features, estimator=XGBClassifier(**tree_params)),
  "xgboost_regress": SelectionMethod.TreeBased(num_features, estimator=XGBRegressor(**tree_params))
}

# Benchmark (sequential)
score_df, selected_df, runtime_df = benchmark(selectors, data, label, cv=5)
print(score_df, "\n\n", selected_df, "\n\n", runtime_df)

# Benchmark (in parallel)
score_df, selected_df, runtime_df = benchmark(selectors, data, label, cv=5, n_jobs=4)
print(score_df, "\n\n", selected_df, "\n\n", runtime_df)

# Get benchmark statistics by feature
stats_df = calculate_statistics(score_df, selected_df)
print(stats_df)
```

## Text-based Selection
This example shows how to use text-based selection. In this scenario, we would like to select a subset of articles that is most diverse in the text embedding space and covers a range of topics. 

```python
# Import Selective and TextWiser
import pandas as pd
from feature.selector import Selective, SelectionMethod
from textwiser import TextWiser, Embedding, Transformation

# Data with the text content of each article
data = pd.DataFrame({"article_1": ["article text here"],
                     "article_2": ["article text here"],
                     "article_3": ["article text here"],
                     "article_4": ["article text here"],
                     "article_5": ["article text here"]})

# Labels to denote 0/1 coverage metadata for each article 
# across four labels, e.g., sports, international, entertainment, science    
labels = pd.DataFrame({"article_1": [1, 1, 0, 1],
                       "article_2": [0, 1, 0, 0],
                       "article_3": [0, 0, 1, 0],
                       "article_4": [0, 0, 1, 1],
                       "article_5": [1, 1, 1, 0]},
                      index=["label_1", "label_2", "label_3", "label_4"])

# TextWiser featurization method to create text embeddings
textwiser = TextWiser(Embedding.TfIdf(), Transformation.NMF(n_components=20))

# Text-based selection
# The goal is to select a subset of articles 
# that is most diverse in the text embedding space of articles
# and covers the most labels in each topic
selector = Selective(SelectionMethod.TextBased(num_features=2, featurization_method=textwiser))

# Feature reduction
subset = selector.fit_transform(data, labels)
print("Reduction:", list(subset.columns))
```
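The `optimization_method` and `cost_metric` options listed in the Available Methods table can also be set explicitly. A minimal sketch, with keyword names taken from that table (defaults and accepted values should be checked against the library documentation):

```python
# Sketch: explicit solver and cost options for text-based selection
selector = Selective(SelectionMethod.TextBased(num_features=2,
                                               featurization_method=textwiser,
                                               optimization_method="greedy",
                                               cost_metric="diverse"))
subset = selector.fit_transform(data, labels)
print("Reduction:", list(subset.columns))
```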

## Visualization

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from feature.selector import SelectionMethod, Selective, plot_importance

# Data
data, label = get_data_label(fetch_california_housing())

# Feature Selector
selector = Selective(SelectionMethod.Linear(num_features=8, regularization="none"))
subset = selector.fit_transform(data, label)

# Plot Feature Importance
df = pd.DataFrame(selector.get_absolute_scores(), index=data.columns)
plot_importance(df)
```
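To persist the chart, a minimal sketch assuming `plot_importance` draws onto the active matplotlib figure (matplotlib backs the seaborn dependency); adjust if the function returns its own axes:

```python
import matplotlib.pyplot as plt

# Assumption: plot_importance renders on the current matplotlib figure
plot_importance(df)
plt.savefig("feature_importance.png", dpi=150, bbox_inches="tight")
```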

## Installation

Selective requires **Python 3.8+** and can be installed from PyPI using ``pip install selective``.
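In a fresh virtual environment:

```bash
pip install selective
```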

## Source 

Alternatively, you can build the wheel package from the source code and install it:

```bash
git clone https://github.com/fidelity/selective.git
cd selective
pip install setuptools wheel # if wheel is not installed
python setup.py sdist bdist_wheel
pip install dist/selective-X.X.X-py3-none-any.whl
```
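For development, a standard editable install (plain `pip` behavior, not specific to Selective) avoids rebuilding the wheel after every change:

```bash
pip install -e .
```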

## Test your setup

```bash
cd selective
python -m unittest discover tests
```

## Citation

If you use Selective in a publication, please cite it as:

```bibtex
@article{kadioglu2024integrating,
  author  = {Kad\i{}o\u{g}lu, Serdar and Kleynhans, Bernard and Wang, Xin},
  title   = {Integrating optimized item selection with active learning for continuous exploration in recommender systems},
  journal = {Ann. Math. Artif. Intell.},
  year    = {2024},
  url     = {https://doi.org/10.1007/s10472-024-09941-x},
  doi     = {10.1007/s10472-024-09941-x}
}
```

## Support

Please submit bug reports and feature requests as [Issues](https://github.com/fidelity/selective/issues).

## License
Selective is licensed under the [Apache 2.0 License](https://github.com/fidelity/selective/blob/master/LICENSE.md).

            
