# dataclr: The feature selection library
<div align="center">
<a href="https://www.dataclr.com/">Docs</a>
<span> • </span>
<a href="https://www.dataclr.com/">Website</a>
<hr />
</div>
_dataclr_ is a Python library for feature selection, enabling data scientists and ML engineers to identify optimal features from tabular datasets. By combining filter and wrapper methods, it achieves _state-of-the-art_ results, enhancing model performance and simplifying feature engineering.
## Features
- **Comprehensive Methods**:

  - **Filter Methods**: Statistical and data-driven approaches like `ANOVA`, `MutualInformation`, and `VarianceThreshold`.

    | Method                           | Regression | Classification |
    | -------------------------------- | ---------- | -------------- |
    | `ANOVA`                          | Yes        | Yes            |
    | `Chi2`                           | No         | Yes            |
    | `CumulativeDistributionFunction` | Yes        | Yes            |
    | `CohensD`                        | No         | Yes            |
    | `CramersV`                       | No         | Yes            |
    | `DistanceCorrelation`            | Yes        | Yes            |
    | `Entropy`                        | Yes        | Yes            |
    | `KendallCorrelation`             | Yes        | Yes            |
    | `Kurtosis`                       | Yes        | Yes            |
    | `LinearCorrelation`              | Yes        | Yes            |
    | `MaximalInformationCoefficient`  | Yes        | Yes            |
    | `MeanAbsoluteDeviation`          | Yes        | Yes            |
    | `mRMR`                           | Yes        | Yes            |
    | `MutualInformation`              | Yes        | Yes            |
    | `Skewness`                       | Yes        | Yes            |
    | `SpearmanCorrelation`            | Yes        | Yes            |
    | `VarianceThreshold`              | Yes        | Yes            |
    | `VarianceInflationFactor`        | Yes        | Yes            |
    | `ZScore`                         | Yes        | Yes            |

  - **Wrapper Methods**: Model-based iterative methods like `BorutaMethod`, `ShapMethod`, and `OptunaMethod`.

    | Method                           | Regression | Classification |
    | -------------------------------- | ---------- | -------------- |
    | `BorutaMethod`                   | Yes        | Yes            |
    | `HyperoptMethod`                 | Yes        | Yes            |
    | `OptunaMethod`                   | Yes        | Yes            |
    | `ShapMethod`                     | Yes        | Yes            |
    | `Recursive Feature Elimination`  | Yes        | Yes            |
    | `Recursive Feature Addition`     | Yes        | Yes            |
- **Flexible and Scalable**:
  - Supports both regression and classification tasks.
  - Handles high-dimensional datasets efficiently.
- **Interpretable Results**:
  - Provides ranked feature lists with detailed importance scores.
  - Reports the methods used along with their parameters.
- **Seamless Integration**:
  - Works with popular Python libraries like `pandas` and `scikit-learn`.
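
To make the filter/wrapper distinction concrete, here is a minimal sketch that uses scikit-learn only. It does not use dataclr's classes and says nothing about how dataclr implements them; the synthetic dataset, the choice of three features, and the logistic regression model are all assumptions made for illustration. A filter method scores each feature from the data alone (here an ANOVA F-test), while a wrapper method repeatedly fits a model to decide which features to keep (here recursive feature elimination).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only): 8 features, 3 of which carry signal
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Filter: score every feature with an ANOVA F-test, independent of any model
filter_selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
filter_mask = filter_selector.get_support()

# Wrapper: repeatedly fit a model and drop the weakest feature each round
wrapper_selector = RFE(
    estimator=LogisticRegression(max_iter=1000), n_features_to_select=3
).fit(X, y)
wrapper_mask = wrapper_selector.get_support()

print("Filter keeps feature indices: ", filter_mask.nonzero()[0].tolist())
print("Wrapper keeps feature indices:", wrapper_mask.nonzero()[0].tolist())
```

The two approaches can disagree on which features to keep, which is one reason to combine them.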
## Installation
Install `dataclr` using pip:
```bash
pip install dataclr
```
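
To confirm the installation from Python, a small check like the following can help. It relies only on the standard library; the helper name `installed_version` is made up for this sketch, and the `>=3.9` requirement comes from the package metadata.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# The package metadata declares requires_python ">=3.9"
python_ok = sys.version_info >= (3, 9)
print("Python >= 3.9:", python_ok)
print("dataclr version:", installed_version("dataclr"))
```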
## Getting Started
### 1. Load Your Dataset
Prepare your dataset as pandas DataFrames or Series and preprocess it (e.g., encode categorical features and normalize numerical values):
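```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Example dataset
X = pd.DataFrame({...})  # Replace with your feature matrix
y = pd.Series([...])  # Replace with your target variable

# Encode categorical features
X_encoded = pd.get_dummies(X)

# Split before scaling so the scaler is fit on the training data only
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)

# Normalize numerical values
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
```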
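
Since the snippet above uses placeholders, here is a small runnable sketch of the encoding and scaling steps on a made-up frame. The column names and values are invented for illustration, and it scales the whole frame for brevity; in a real workflow the scaler would normally be fit on the training split only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with one numeric and one categorical column (made up)
X = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "city": ["Krakow", "Warsaw", "Krakow", "Gdansk"],
    }
)

# One-hot encode the categorical column; numeric columns pass through unchanged
X_encoded = pd.get_dummies(X, dtype=float)

# Standardize every column to zero mean and unit variance
scaler = StandardScaler()
X_normalized = pd.DataFrame(
    scaler.fit_transform(X_encoded),
    columns=X_encoded.columns,
    index=X_encoded.index,
)

print(list(X_normalized.columns))
print(X_normalized.round(2))
```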
### 2. Use `FeatureSelector`
The `FeatureSelector` is a high-level API that combines multiple methods to select the best feature subsets:
```python
from sklearn.ensemble import RandomForestClassifier
from dataclr.feature_selection import FeatureSelector

# Define a scikit-learn model
my_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Initialize the FeatureSelector
selector = FeatureSelector(
    model=my_model,
    metric="accuracy",
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)

# Perform feature selection
selected_features = selector.select_features(n_results=5)
print(selected_features)
```
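
How you read the chosen columns out of `selected_features` depends on dataclr's result objects, which this README does not describe. As a library-independent sketch, assuming you end up with a plain list of column names, you can sanity-check a subset by refitting the model on those columns and comparing the test metric against the full feature set. The `chosen_columns` list, the helper `holdout_accuracy`, and the synthetic data below are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset; with shuffle=False the
# informative features are the first four columns (f0..f3)
X_arr, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, n_redundant=0,
    shuffle=False, random_state=42,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)


def holdout_accuracy(columns):
    """Fit a fresh model on the given columns and score it on the test split."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train[columns], y_train)
    return accuracy_score(y_test, model.predict(X_test[columns]))


chosen_columns = ["f0", "f1", "f2", "f3"]  # hypothetical selection result
full_score = holdout_accuracy(list(X.columns))
subset_score = holdout_accuracy(chosen_columns)
print(f"all {X.shape[1]} features: {full_score:.3f}")
print(f"{len(chosen_columns)} features: {subset_score:.3f}")
```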
### 3. Use Individual Methods
For granular control, you can use individual feature selection methods:
```python
from sklearn.linear_model import LogisticRegression
from dataclr.methods import MutualInformation
# Define a scikit-learn model
my_model = LogisticRegression(solver="liblinear", max_iter=1000)
# Initialize a method
method = MutualInformation(model=my_model, metric="accuracy")
# Fit and transform
results = method.fit_transform(X_train, X_test, y_train, y_test)
print(results)
```
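
For intuition about what a mutual-information filter measures, the following sketch ranks features with scikit-learn's `mutual_info_classif`. It is not dataclr's `MutualInformation` class and makes no claim about its internals; the synthetic data and column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: with shuffle=False the two informative features come first
X_arr, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Estimate the mutual information between each feature and the target
scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

# Rank features from most to least informative
ranking = scores.sort_values(ascending=False)
print(ranking.round(3))
```

Features near the top of the ranking share the most information with the target; a filter method would keep those and drop the rest.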
## Benchmarks
Because the algorithm returns several candidate feature subsets, the benchmarks report a result that balances feature count against performance; the same run can also yield the best-performing subset when that is the priority.
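
The exact rule used to pick the reported result is not stated here. One plausible way to express such a trade-off, shown purely as an assumption, is to take the smallest subset whose score is within a tolerance of the best score. The `Candidate` type, the `pick_balanced` helper, and the numbers below are made up for illustration and are not dataclr benchmark results.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    features: tuple
    score: float


def pick_balanced(candidates, tolerance=0.01):
    """Return the smallest feature subset scoring within `tolerance` of the best."""
    best = max(c.score for c in candidates)
    close_enough = [c for c in candidates if c.score >= best - tolerance]
    # Prefer fewer features; break ties by higher score
    return min(close_enough, key=lambda c: (len(c.features), -c.score))


# Made-up candidates for illustration; these are not benchmark numbers
candidates = [
    Candidate(("a", "b", "c", "d", "e"), 0.912),
    Candidate(("a", "b", "c"), 0.905),
    Candidate(("a", "b"), 0.871),
]
chosen = pick_balanced(candidates, tolerance=0.01)
print(chosen)
```

Setting `tolerance=0.0` recovers the best-scoring subset regardless of its size.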




## Documentation
Explore the <a href="https://www.dataclr.com">full documentation</a> for detailed usage
instructions, API references, and examples.
## Raw data
{
"_id": null,
"home_page": null,
"name": "dataclr",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "feature selection, data science, machine learning, tabular data",
"author": null,
"author_email": "Lukasz Machutt <lukasz.machutt@gmail.com>, Jakub Nurkiewicz <jakub.nurkiewicz.2003@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/9e/58/bd9aabc19a2a9574418c86f1ee4fa0a54131731e0b6afff807e654962a9b/dataclr-0.3.0.tar.gz",
"platform": null,
"description": "# dataclr: The feature selection library\n\n[](https://pypi.org/project/dataclr/)\n[](https://www.python.org/)\n[](https://github.com/dataclr/dataclr/blob/main/LICENSE)\n[](https://github.com/dataclr/dataclr/stargazers)\n\n<div align=\"center\">\n <a href=\"https://www.dataclr.com/\">Docs</a>\n <span> \u2022 </span>\n <a href=\"https://www.dataclr.com/\">Website</a>\n <hr />\n</div>\n\n_dataclr_ is a Python library for feature selection, enabling data scientists and ML engineers to identify optimal features from tabular datasets. By combining filter and wrapper methods, it achieves _state-of-the-art_ results, enhancing model performance and simplifying feature engineering.\n\n## Features\n\n- **Comprehensive Methods**:\n\n - **Filter Methods**: Statistical and data-driven approaches like `ANOVA`, `MutualInformation`, and `VarianceThreshold`.\n\n | Method | Regression | Classification |\n | -------------------------------- | ---------- | -------------- |\n | `ANOVA` | Yes | Yes |\n | `Chi2` | No | Yes |\n | `CumulativeDistributionFunction` | Yes | Yes |\n | `CohensD` | No | Yes |\n | `CramersV` | No | Yes |\n | `DistanceCorrelation` | Yes | Yes |\n | `Entropy` | Yes | Yes |\n | `KendallCorrelation` | Yes | Yes |\n | `Kurtosis` | Yes | Yes |\n | `LinearCorrelation` | Yes | Yes |\n | `MaximalInformationCoefficient` | Yes | Yes |\n | `MeanAbsoluteDeviation` | Yes | Yes |\n | `mRMR` | Yes | Yes |\n | `MutualInformation` | Yes | Yes |\n | `Skewness` | Yes | Yes |\n | `SpearmanCorrelation` | Yes | Yes |\n | `VarianceThreshold` | Yes | Yes |\n | `VarianceInflationFactor` | Yes | Yes |\n | `ZScore` | Yes | Yes |\n\n - **Wrapper Methods**: Model-based iterative methods like `BorutaMethod`, `ShapMethod`, and `OptunaMethod`.\n\n | Method | Regression | Classification |\n | -------------------------------- | ---------- | -------------- |\n | `BorutaMethod` | Yes | Yes |\n | `HyperoptMethod` | Yes | Yes |\n | `OptunaMethod` | Yes | Yes |\n | `ShapMethod` | Yes | 
Yes |\n | `Recursive Feature Elimination` | Yes | Yes |\n | `Recursive Feature Addition` | Yes | Yes |\n\n- **Flexible and Scalable**:\n\n - Supports both regression and classification tasks.\n - Handles high-dimensional datasets efficiently.\n\n- **Interpretable Results**:\n\n - Provides ranked feature lists with detailed importance scores.\n - Shows used methods along with their parameters.\n\n- **Seamless Integration**:\n - Works with popular Python libraries like `pandas` and `scikit-learn`.\n\n## Installation\n\nInstall `dataclr` using pip:\n\n```bash\npip install dataclr\n```\n\n## Getting Started\n\n### 1. Load Your Dataset\n\nPrepare your dataset as pandas DataFrames or Series and preprocess it (e.g., encode categorical features and normalize numerical values):\n\n```bash\nimport pandas as pd\nfrom sklearn.preprocessing import StandardScaler\n\n# Example dataset\nX = pd.DataFrame({...}) # Replace with your feature matrix\ny = pd.Series([...]) # Replace with your target variable\n\n# Preprocessing\nX_encoded = pd.get_dummies(X) # Encode categorical features\nscaler = StandardScaler()\nX_normalized = pd.DataFrame(scaler.fit_transform(X_encoded), columns=X_encoded.columns)\n```\n\n### 2. Use `FeatureSelector`\n\nThe `FeatureSelector` is a high-level API that combines multiple methods to select the best feature subsets:\n\n```bash\nfrom sklearn.ensemble import RandomForestClassifier\nfrom dataclr.feature_selection import FeatureSelector\n\n# Define a scikit-learn model\nmy_model = RandomForestClassifier(n_estimators=100, random_state=42)\n\n# Initialize the FeatureSelector\nselector = FeatureSelector(\n model=my_model,\n metric=\"accuracy\",\n X_train=X_train,\n X_test=X_test,\n y_train=y_train,\n y_test=y_test,\n)\n\n# Perform feature selection\nselected_features = selector.select_features(n_results=5)\nprint(selected_features)\n```\n\n### 3. 
Use Singular Methods\n\nFor granular control, you can use individual feature selection methods:\n\n```bash\nfrom sklearn.linear_model import LogisticRegression\nfrom dataclr.methods import MutualInformation\n\n# Define a scikit-learn model\nmy_model = LogisticRegression(solver=\"liblinear\", max_iter=1000)\n\n# Initialize a method\nmethod = MutualInformation(model=my_model, metric=\"accuracy\")\n\n# Fit and transform\nresults = method.fit_transform(X_train, X_test, y_train, y_test)\nprint(results)\n```\n\n## Benchmarks\n\nAs our algorithm produces multiple results, we selected benchmark results that balance feature count with performance, while being capable of achieving the best performance if needed.\n\n\n\n\n\n\n## Documentation\n\nExplore the <a href=\"https://www.dataclr.com\">full documentation</a> for detailed usage\ninstructions, API references, and examples.\n",
"bugtrack_url": null,
"license": null,
"summary": "A Python library for feature selection in tabular datasets",
"version": "0.3.0",
"project_urls": {
"Documentation": "https://www.dataclr.com",
"Homepage": "https://github.com/dataclr/dataclr"
},
"split_keywords": [
"feature selection",
" data science",
" machine learning",
" tabular data"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "2c8ee0cb6dbe0a84a35b034952aa3b7b4e1951292e909420de19e6788a365cf1",
"md5": "c2477480641bc3812cb12fde1db3de71",
"sha256": "f4f4a856ab6d86f5fff735167cf787113fbfc3b97e5e04838dab50e0e5e419ba"
},
"downloads": -1,
"filename": "dataclr-0.3.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c2477480641bc3812cb12fde1db3de71",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 61092,
"upload_time": "2025-03-05T22:08:38",
"upload_time_iso_8601": "2025-03-05T22:08:38.009507Z",
"url": "https://files.pythonhosted.org/packages/2c/8e/e0cb6dbe0a84a35b034952aa3b7b4e1951292e909420de19e6788a365cf1/dataclr-0.3.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "9e58bd9aabc19a2a9574418c86f1ee4fa0a54131731e0b6afff807e654962a9b",
"md5": "2da432325508240d36229e9ffe3a6f2e",
"sha256": "9c98e3d08bfc34a94ce0018ec5837e1ccdb1f74df3100bc4c5c6147fa702d2d7"
},
"downloads": -1,
"filename": "dataclr-0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "2da432325508240d36229e9ffe3a6f2e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 39495,
"upload_time": "2025-03-05T22:08:39",
"upload_time_iso_8601": "2025-03-05T22:08:39.820148Z",
"url": "https://files.pythonhosted.org/packages/9e/58/bd9aabc19a2a9574418c86f1ee4fa0a54131731e0b6afff807e654962a9b/dataclr-0.3.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-03-05 22:08:39",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "dataclr",
"github_project": "dataclr",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "dataclr"
}