# dataclr: The feature selection library
[![PyPI version](https://img.shields.io/pypi/v/dataclr?label=PyPI&color=blue)](https://pypi.org/project/dataclr/)
[![Python Versions](https://img.shields.io/badge/python-3.9%20|%203.10%20|%203.11%20|%203.12%20|%203.13-blue)](https://www.python.org/)
[![License](https://img.shields.io/github/license/dataclr/dataclr?color=blue)](https://github.com/dataclr/dataclr/blob/main/LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/dataclr/dataclr?label=Stars&color=yellow)](https://github.com/dataclr/dataclr/stargazers)
<div align="center">
<a href="https://www.dataclr.com/">Docs</a>
<span> • </span>
<a href="https://www.dataclr.com/">Website</a>
<hr />
</div>
_dataclr_ is a Python library for feature selection, enabling data scientists and ML engineers to identify optimal features from tabular datasets. By combining filter and wrapper methods, it achieves _state-of-the-art_ results, enhancing model performance and simplifying feature engineering.
## Features
- **Comprehensive Methods**:

  - **Filter Methods**: Statistical and data-driven approaches like `ANOVA`, `MutualInformation`, and `VarianceThreshold`.

    | Method                           | Regression | Classification |
    | -------------------------------- | ---------- | -------------- |
    | `ANOVA`                          | Yes        | Yes            |
    | `Chi2`                           | No         | Yes            |
    | `CumulativeDistributionFunction` | Yes        | Yes            |
    | `CohensD`                        | No         | Yes            |
    | `CramersV`                       | No         | Yes            |
    | `DistanceCorrelation`            | Yes        | Yes            |
    | `Entropy`                        | Yes        | Yes            |
    | `KendallCorrelation`             | Yes        | Yes            |
    | `Kurtosis`                       | Yes        | Yes            |
    | `LinearCorrelation`              | Yes        | Yes            |
    | `MaximalInformationCoefficient`  | Yes        | Yes            |
    | `MeanAbsoluteDeviation`          | Yes        | Yes            |
    | `mRMR`                           | Yes        | Yes            |
    | `MutualInformation`              | Yes        | Yes            |
    | `Skewness`                       | Yes        | Yes            |
    | `SpearmanCorrelation`            | Yes        | Yes            |
    | `VarianceThreshold`              | Yes        | Yes            |
    | `VarianceInflationFactor`        | Yes        | Yes            |
    | `ZScore`                         | Yes        | Yes            |

  - **Wrapper Methods**: Model-based iterative methods like `BorutaMethod`, `ShapMethod`, and `OptunaMethod`.

    | Method           | Regression | Classification |
    | ---------------- | ---------- | -------------- |
    | `BorutaMethod`   | Yes        | Yes            |
    | `HyperoptMethod` | Yes        | Yes            |
    | `OptunaMethod`   | Yes        | Yes            |
    | `ShapMethod`     | Yes        | Yes            |
- **Flexible and Scalable**:

  - Supports both regression and classification tasks.
  - Handles high-dimensional datasets efficiently.

- **Interpretable Results**:

  - Provides ranked feature lists with detailed importance scores.
  - Reports which methods were used, along with their parameters.

- **Seamless Integration**:
  - Works with popular Python libraries like `pandas` and `scikit-learn`.
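To make the filter/wrapper distinction concrete, here is a minimal NumPy-only sketch. It is not dataclr's implementation: `filter_rank`, `holdout_mse`, and `wrapper_forward_select` are hypothetical helpers written for this illustration, and the data is synthetic. The filter step scores each column on its own (absolute Pearson correlation with the target), while the wrapper step repeatedly refits a model and keeps the column that helps it most.

```python
import numpy as np


def filter_rank(X, y):
    """Filter-style ranking: score each column by |Pearson r| with the target."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1]  # column indices, most relevant first


def holdout_mse(X_fit, y_fit, X_val, y_val, cols):
    """Fit ordinary least squares on the chosen columns; return validation MSE."""
    A_fit = np.column_stack([np.ones(len(X_fit)), X_fit[:, cols]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, cols]])
    coef, *_ = np.linalg.lstsq(A_fit, y_fit, rcond=None)
    return float(np.mean((A_val @ coef - y_val) ** 2))


def wrapper_forward_select(X_fit, y_fit, X_val, y_val, k):
    """Wrapper-style search: greedily add the column that most lowers validation MSE."""
    chosen = []
    remaining = list(range(X_fit.shape[1]))
    while len(chosen) < k and remaining:
        best = min(
            remaining,
            key=lambda j: holdout_mse(X_fit, y_fit, X_val, y_val, chosen + [j]),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Synthetic regression data: only columns 0 and 2 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

ranking = filter_rank(X, y)
selected = wrapper_forward_select(X[:150], y[:150], X[150:], y[150:], k=2)
print("filter ranking:", ranking.tolist())
print("wrapper selection:", selected)
```

Filter scores are cheap because no model is trained, whereas the wrapper loop pays for one model fit per candidate column per step. That cost difference is a common reason to combine the two families.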
## Installation
Install `dataclr` using pip:
```bash
pip install dataclr
```
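To confirm the install from Python, one option is the standard library's `importlib.metadata`. This is a small sketch; `installed_version` is a hypothetical helper name, not part of dataclr.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print(installed_version("dataclr"))
```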
## Getting Started
### 1. Load Your Dataset
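Prepare your dataset as pandas DataFrames or Series, preprocess it (e.g., encode categorical features and normalize numerical values), and split it into the training and test sets used in the following steps:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Example dataset
X = pd.DataFrame({...})  # Replace with your feature matrix
y = pd.Series([...])  # Replace with your target variable

# Preprocessing
X_encoded = pd.get_dummies(X)  # Encode categorical features
scaler = StandardScaler()
X_normalized = pd.DataFrame(scaler.fit_transform(X_encoded), columns=X_encoded.columns)

# Hold out a test set; the examples below expect X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2, random_state=42)
```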
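As a concrete variant, the sketch below runs the same preprocessing on a small, made-up dataset (the column names and values are hypothetical). It splits first and fits the scaler on the training rows only, a common precaution so that test-set statistics do not leak into preprocessing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
X = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 38, 29, 44, 60],
        "income": [30_000, 42_000, 58_000, 61_000, 50_000, 35_000, 52_000, 70_000],
        "city": ["A", "B", "A", "C", "B", "C", "A", "B"],
    }
)
y = pd.Series([0, 0, 1, 1, 1, 0, 1, 1], name="target")

# One-hot encode before splitting so both splits share the same columns.
X_encoded = pd.get_dummies(X, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

print(X_train.shape, X_test.shape)
```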
### 2. Use `FeatureSelector`
The `FeatureSelector` is a high-level API that combines multiple methods to select the best feature subsets:
```python
from sklearn.ensemble import RandomForestClassifier
from dataclr.feature_selection import FeatureSelector

# Define a scikit-learn model
my_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Initialize the FeatureSelector
selector = FeatureSelector(
    model=my_model,
    metric="accuracy",
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)

# Perform feature selection
selected_features = selector.select_features(n_results=5)
print(selected_features)
```
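To try the call above end to end without your own data, one option is to use a dataset bundled with scikit-learn. In this sketch, `make_splits` and `run_feature_selector` are hypothetical helper names. The `FeatureSelector` arguments and the `select_features(n_results=...)` call are copied from the snippet above. The choice of dataset and the stratified 80/20 split are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def make_splits(test_size=0.2, seed=42):
    """Load a small built-in classification dataset and split it."""
    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    # Returns X_train, X_test, y_train, y_test (in that order).
    return train_test_split(X, y, test_size=test_size, random_state=seed, stratify=y)


def run_feature_selector(X_train, X_test, y_train, y_test, n_results=5):
    """Call FeatureSelector as in the snippet above (requires dataclr)."""
    from dataclr.feature_selection import FeatureSelector  # imported lazily

    selector = FeatureSelector(
        model=RandomForestClassifier(n_estimators=100, random_state=42),
        metric="accuracy",
        X_train=X_train,
        X_test=X_test,
        y_train=y_train,
        y_test=y_test,
    )
    return selector.select_features(n_results=n_results)


if __name__ == "__main__":
    splits = make_splits()
    try:
        print(run_feature_selector(*splits))
    except ImportError:
        print("dataclr is not installed; run `pip install dataclr` first")
```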
### 3. Use Individual Methods
For granular control, you can use individual feature selection methods:
```python
from sklearn.linear_model import LogisticRegression
from dataclr.methods import MutualInformation
# Define a scikit-learn model
my_model = LogisticRegression(solver="liblinear", max_iter=1000)
# Initialize a method
method = MutualInformation(model=my_model, metric="accuracy")
# Fit and transform
results = method.fit_transform(X_train, X_test, y_train, y_test)
print(results)
```
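For intuition about what a mutual-information filter measures, the sketch below uses scikit-learn's `mutual_info_classif` on synthetic data. It illustrates the underlying statistic only and is not a description of how dataclr's `MutualInformation` is implemented. The column names `signal` and `noise` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)

# Hypothetical features: "signal" tracks the label closely, "noise" ignores it.
X = pd.DataFrame(
    {
        "signal": y + rng.normal(scale=0.1, size=400),
        "noise": rng.normal(size=400),
    }
)

# Estimate the mutual information between each column and the class label.
scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(scores.sort_values(ascending=False))
```

A feature that carries information about the label receives a clearly higher score than one that is independent of it, which is what lets a filter method rank columns without training the downstream model.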
## Benchmarks
Because the algorithm returns multiple results, the benchmarks below show results chosen to balance feature count against performance. The algorithm can still reach the best performance when that is the priority.
![benchmark_bank](https://i.imgur.com/qiG1L9j.png)
![benchmark_students](https://i.imgur.com/FpY3N9h.png)
![benchmark_fifa](https://i.imgur.com/BDTkYgL.png)
![benchmark_uber](https://i.imgur.com/X3uYyCX.png)
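One simple way to express a "balance feature count against performance" rule in code is to take the smallest feature set whose score is within a tolerance of the best score. The sketch below is an assumption for illustration, not the procedure used to produce the benchmarks above. `pick_compact_result` is a hypothetical helper and the numbers are made up.

```python
def pick_compact_result(results, tolerance=0.01):
    """Return the (n_features, score) pair with the fewest features whose
    score is within `tolerance` of the best score (higher score is better)."""
    best_score = max(score for _, score in results)
    good_enough = [r for r in results if r[1] >= best_score - tolerance]
    # Prefer fewer features; break ties by the higher score.
    return min(good_enough, key=lambda r: (r[0], -r[1]))


# Made-up numbers for illustration only; these are not benchmark results.
candidates = [(40, 0.912), (18, 0.910), (9, 0.905), (4, 0.871)]
print(pick_compact_result(candidates))
```

Setting `tolerance=0.0` reduces the rule to picking the best-scoring result regardless of size.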
## Documentation
Explore the <a href="https://www.dataclr.com">full documentation</a> for detailed usage
instructions, API references, and examples.
Raw data
{
"_id": null,
"home_page": null,
"name": "dataclr",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "feature selection, data science, machine learning, tabular data",
"author": null,
"author_email": "Lukasz Machutt <lukasz.machutt@gmail.com>, Jakub Nurkiewicz <jakub.nurkiewicz.2003@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/b3/80/2d2ceb8bdbb01c6bb9951dd0c8311f5b8a33f578459d3bc705e52a1e96a8/dataclr-0.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A Python library for feature selection in tabular datasets",
"version": "0.2.0",
"project_urls": {
"Documentation": "https://www.dataclr.com",
"Homepage": "https://github.com/dataclr/dataclr"
},
"split_keywords": [
"feature selection",
" data science",
" machine learning",
" tabular data"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "30b1d8c4144486911f88ef7065de9eb3dba2698e4c93c830a41ad20130ecf0a9",
"md5": "17f933e2eb8ba3f30351aae7083e7418",
"sha256": "b6827d48422718cbcda114dd772041dbbf7a37b9cbe67e692348e39a730606bc"
},
"downloads": -1,
"filename": "dataclr-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "17f933e2eb8ba3f30351aae7083e7418",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 58165,
"upload_time": "2025-01-06T10:55:54",
"upload_time_iso_8601": "2025-01-06T10:55:54.485788Z",
"url": "https://files.pythonhosted.org/packages/30/b1/d8c4144486911f88ef7065de9eb3dba2698e4c93c830a41ad20130ecf0a9/dataclr-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "b3802d2ceb8bdbb01c6bb9951dd0c8311f5b8a33f578459d3bc705e52a1e96a8",
"md5": "5cb8145fc579db4efb712cb43bb0a899",
"sha256": "bbdf7e399c9e84d927483c9749e14940f001016a7641f40351bc8a8ad08c668a"
},
"downloads": -1,
"filename": "dataclr-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "5cb8145fc579db4efb712cb43bb0a899",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 37286,
"upload_time": "2025-01-06T10:55:56",
"upload_time_iso_8601": "2025-01-06T10:55:56.734403Z",
"url": "https://files.pythonhosted.org/packages/b3/80/2d2ceb8bdbb01c6bb9951dd0c8311f5b8a33f578459d3bc705e52a1e96a8/dataclr-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-06 10:55:56",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "dataclr",
"github_project": "dataclr",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "dataclr"
}