dataclr

Name: dataclr
Version: 0.3.0
Summary: A Python library for feature selection in tabular datasets
Authors: Lukasz Machutt <lukasz.machutt@gmail.com>, Jakub Nurkiewicz <jakub.nurkiewicz.2003@gmail.com>
Homepage: https://github.com/dataclr/dataclr
Documentation: https://www.dataclr.com
Requires Python: >=3.9
Keywords: feature selection, data science, machine learning, tabular data
Upload time: 2025-03-05 22:08:39
# dataclr: The feature selection library

[![PyPI version](https://img.shields.io/pypi/v/dataclr?label=PyPI&color=blue)](https://pypi.org/project/dataclr/)
[![Python Versions](https://img.shields.io/badge/python-3.9%20|%203.10%20|%203.11%20|%203.12%20|%203.13-blue)](https://www.python.org/)
[![License](https://img.shields.io/github/license/dataclr/dataclr?color=blue)](https://github.com/dataclr/dataclr/blob/main/LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/dataclr/dataclr?label=Stars&color=yellow)](https://github.com/dataclr/dataclr/stargazers)

<div align="center">
  <a href="https://www.dataclr.com/">Docs</a>
  <span>&nbsp;&nbsp;•&nbsp;&nbsp;</span>
  <a href="https://www.dataclr.com/">Website</a>
  <hr />
</div>

_dataclr_ is a Python library for feature selection, enabling data scientists and ML engineers to identify optimal features from tabular datasets. By combining filter and wrapper methods, it achieves _state-of-the-art_ results, enhancing model performance and simplifying feature engineering.

## Features

- **Comprehensive Methods**:

  - **Filter Methods**: Statistical and data-driven approaches like `ANOVA`, `MutualInformation`, and `VarianceThreshold`.

    | Method                           | Regression | Classification |
    | -------------------------------- | ---------- | -------------- |
    | `ANOVA`                          | Yes        | Yes            |
    | `Chi2`                           | No         | Yes            |
    | `CumulativeDistributionFunction` | Yes        | Yes            |
    | `CohensD`                        | No         | Yes            |
    | `CramersV`                       | No         | Yes            |
    | `DistanceCorrelation`            | Yes        | Yes            |
    | `Entropy`                        | Yes        | Yes            |
    | `KendallCorrelation`             | Yes        | Yes            |
    | `Kurtosis`                       | Yes        | Yes            |
    | `LinearCorrelation`              | Yes        | Yes            |
    | `MaximalInformationCoefficient`  | Yes        | Yes            |
    | `MeanAbsoluteDeviation`          | Yes        | Yes            |
    | `mRMR`                           | Yes        | Yes            |
    | `MutualInformation`              | Yes        | Yes            |
    | `Skewness`                       | Yes        | Yes            |
    | `SpearmanCorrelation`            | Yes        | Yes            |
    | `VarianceThreshold`              | Yes        | Yes            |
    | `VarianceInflationFactor`        | Yes        | Yes            |
    | `ZScore`                         | Yes        | Yes            |

  - **Wrapper Methods**: Model-based iterative methods like `BorutaMethod`, `ShapMethod`, and `OptunaMethod`.

    | Method                           | Regression | Classification |
    | -------------------------------- | ---------- | -------------- |
    | `BorutaMethod`                   | Yes        | Yes            |
    | `HyperoptMethod`                 | Yes        | Yes            |
    | `OptunaMethod`                   | Yes        | Yes            |
    | `ShapMethod`                     | Yes        | Yes            |
    | `Recursive Feature Elimination`  | Yes        | Yes            |
    | `Recursive Feature Addition`     | Yes        | Yes            |

- **Flexible and Scalable**:

  - Supports both regression and classification tasks.
  - Handles high-dimensional datasets efficiently.

- **Interpretable Results**:

  - Provides ranked feature lists with detailed importance scores.
  - Reports which methods were used, along with their parameters.

- **Seamless Integration**:
  - Works with popular Python libraries like `pandas` and `scikit-learn`.

## Installation

Install `dataclr` using pip:

```bash
pip install dataclr
```
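
To verify that the package resolved correctly, `pip show` prints the installed version and metadata:

```bash
pip show dataclr
```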

## Getting Started

### 1. Load Your Dataset

Prepare your dataset as a pandas `DataFrame` (features) and `Series` (target), then preprocess it (e.g., encode categorical features and normalize numerical values):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Example dataset
X = pd.DataFrame({...})  # Replace with your feature matrix
y = pd.Series([...])     # Replace with your target variable

# Preprocessing
X_encoded = pd.get_dummies(X)  # Encode categorical features
scaler = StandardScaler()
X_normalized = pd.DataFrame(scaler.fit_transform(X_encoded), columns=X_encoded.columns)
```
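
The snippets below reference `X_train`, `X_test`, `y_train`, and `y_test`, which are not created above. A minimal way to produce them, assuming a standard scikit-learn split:

```python
from sklearn.model_selection import train_test_split

# Split the preprocessed features and target into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_normalized, y, test_size=0.2, random_state=42
)
```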

### 2. Use `FeatureSelector`

The `FeatureSelector` is a high-level API that combines multiple methods to select the best feature subsets:

```python
from sklearn.ensemble import RandomForestClassifier
from dataclr.feature_selection import FeatureSelector

# Define a scikit-learn model
my_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Initialize the FeatureSelector
selector = FeatureSelector(
    model=my_model,
    metric="accuracy",
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)

# Perform feature selection
selected_features = selector.select_features(n_results=5)
print(selected_features)
```
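
For a self-contained run, here is a minimal sketch of the same flow on a synthetic dataset. It assumes only the `FeatureSelector` calls shown above; the synthetic data and column names are illustrative:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from dataclr.feature_selection import FeatureSelector

# Synthetic binary-classification data with a handful of informative features
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
y = pd.Series(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

selector = FeatureSelector(
    model=RandomForestClassifier(n_estimators=100, random_state=42),
    metric="accuracy",
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)

# The README does not specify the exact shape of the returned results,
# so we print them as in the example above
selected_features = selector.select_features(n_results=5)
print(selected_features)
```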

### 3. Use Individual Methods

For granular control, you can use individual feature selection methods:

```python
from sklearn.linear_model import LogisticRegression
from dataclr.methods import MutualInformation

# Define a scikit-learn model
my_model = LogisticRegression(solver="liblinear", max_iter=1000)

# Initialize a method
method = MutualInformation(model=my_model, metric="accuracy")

# Fit and transform
results = method.fit_transform(X_train, X_test, y_train, y_test)
print(results)
```
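
The same constructor-and-`fit_transform` pattern presumably extends to regression. Here is a sketch under two assumptions not confirmed by this README: that `SpearmanCorrelation` is importable from `dataclr.methods` like `MutualInformation`, and that `"r2"` is an accepted metric name:

```python
from sklearn.linear_model import Ridge

from dataclr.methods import SpearmanCorrelation  # assumed import path, mirroring MutualInformation

# Assumption: "r2" is a valid metric string; only "accuracy" appears in this README
method = SpearmanCorrelation(model=Ridge(alpha=1.0), metric="r2")

# Same call signature as the classification example above
results = method.fit_transform(X_train, X_test, y_train, y_test)
print(results)
```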

## Benchmarks

Because our algorithm produces multiple candidate results, the benchmarks below show results chosen to balance feature count against performance; the algorithm can also target the best raw performance when needed.

![benchmark_bank](https://i.imgur.com/qiG1L9j.png)
![benchmark_students](https://i.imgur.com/FpY3N9h.png)
![benchmark_fifa](https://i.imgur.com/BDTkYgL.png)
![benchmark_uber](https://i.imgur.com/X3uYyCX.png)

## Documentation

Explore the <a href="https://www.dataclr.com">full documentation</a> for detailed usage
instructions, API references, and examples.

            
