dataclr


| Field           | Value                                                           |
| --------------- | --------------------------------------------------------------- |
| Name            | dataclr                                                         |
| Version         | 0.2.0                                                           |
| Summary         | A Python library for feature selection in tabular datasets      |
| Upload time     | 2025-01-06 10:55:56                                             |
| Requires Python | >=3.9                                                           |
| Keywords        | feature selection, data science, machine learning, tabular data |

# dataclr: The feature selection library

[![PyPI version](https://img.shields.io/pypi/v/dataclr?label=PyPI&color=blue)](https://pypi.org/project/dataclr/)
[![Python Versions](https://img.shields.io/badge/python-3.9%20|%203.10%20|%203.11%20|%203.12%20|%203.13-blue)](https://www.python.org/)
[![License](https://img.shields.io/github/license/dataclr/dataclr?color=blue)](https://github.com/dataclr/dataclr/blob/main/LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/dataclr/dataclr?label=Stars&color=yellow)](https://github.com/dataclr/dataclr/stargazers)

<div align="center">
  <a href="https://www.dataclr.com/">Docs</a>
  <span>&nbsp;&nbsp;•&nbsp;&nbsp;</span>
  <a href="https://www.dataclr.com/">Website</a>
  <hr />
</div>

_dataclr_ is a Python library for feature selection that helps data scientists and ML engineers identify optimal features in tabular datasets. It combines filter and wrapper methods with the aim of _state-of-the-art_ results (see Benchmarks below), improving model performance and simplifying feature engineering.

## Features

- **Comprehensive Methods**:

  - **Filter Methods**: Statistical and data-driven approaches like `ANOVA`, `MutualInformation`, and `VarianceThreshold`.

    | Method                           | Regression | Classification |
    | -------------------------------- | ---------- | -------------- |
    | `ANOVA`                          | Yes        | Yes            |
    | `Chi2`                           | No         | Yes            |
    | `CumulativeDistributionFunction` | Yes        | Yes            |
    | `CohensD`                        | No         | Yes            |
    | `CramersV`                       | No         | Yes            |
    | `DistanceCorrelation`            | Yes        | Yes            |
    | `Entropy`                        | Yes        | Yes            |
    | `KendallCorrelation`             | Yes        | Yes            |
    | `Kurtosis`                       | Yes        | Yes            |
    | `LinearCorrelation`              | Yes        | Yes            |
    | `MaximalInformationCoefficient`  | Yes        | Yes            |
    | `MeanAbsoluteDeviation`          | Yes        | Yes            |
    | `mRMR`                           | Yes        | Yes            |
    | `MutualInformation`              | Yes        | Yes            |
    | `Skewness`                       | Yes        | Yes            |
    | `SpearmanCorrelation`            | Yes        | Yes            |
    | `VarianceThreshold`              | Yes        | Yes            |
    | `VarianceInflationFactor`        | Yes        | Yes            |
    | `ZScore`                         | Yes        | Yes            |

  - **Wrapper Methods**: Model-based iterative methods like `BorutaMethod`, `ShapMethod`, and `OptunaMethod`.

    | Method           | Regression | Classification |
    | ---------------- | ---------- | -------------- |
    | `BorutaMethod`   | Yes        | Yes            |
    | `HyperoptMethod` | Yes        | Yes            |
    | `OptunaMethod`   | Yes        | Yes            |
    | `ShapMethod`     | Yes        | Yes            |

- **Flexible and Scalable**:

  - Supports both regression and classification tasks.
  - Handles high-dimensional datasets efficiently.

- **Interpretable Results**:

  - Provides ranked feature lists with detailed importance scores.
  - Reports which methods were used, along with their parameters.

- **Seamless Integration**:
  - Works with popular Python libraries like `pandas` and `scikit-learn`.
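
The difference between the two families can be illustrated with scikit-learn alone. The sketch below is not dataclr's API: it scores features with an ANOVA F-test (a filter, with no model involved) and then runs a greedy forward search that retrains a model on candidate subsets (a wrapper). The synthetic dataset, the column names `f0` to `f7`, and the use of `SequentialFeatureSelector` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 3 informative features and 5 noise features
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Filter: score every feature against the target, no model involved
f_scores, _ = f_classif(X, y)
filter_ranking = pd.Series(f_scores, index=X.columns).sort_values(ascending=False)

# Wrapper: greedily add the feature that most improves cross-validated accuracy
wrapper = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    scoring="accuracy",
    cv=3,
)
wrapper.fit(X, y)
wrapper_choice = list(X.columns[wrapper.get_support()])

print(filter_ranking.head(3))
print(wrapper_choice)
```

Filters are cheap because they never train a model, while wrappers are slower but account for how the chosen model actually uses the features.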

## Installation

Install `dataclr` using pip:

```bash
pip install dataclr
```

## Getting Started

### 1. Load Your Dataset

Prepare your dataset as pandas DataFrames or Series and preprocess it (e.g., encode categorical features and normalize numerical values):

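```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Example dataset
X = pd.DataFrame({...})  # Replace with your feature matrix
y = pd.Series([...])     # Replace with your target variable

# Preprocessing
X_encoded = pd.get_dummies(X)  # Encode categorical features
scaler = StandardScaler()
X_normalized = pd.DataFrame(
    scaler.fit_transform(X_encoded),
    columns=X_encoded.columns,
    index=X_encoded.index,  # Keep row labels aligned with y
)

# Hold out a test split (test_size is an example value); later steps use these variables
X_train, X_test, y_train, y_test = train_test_split(
    X_normalized, y, test_size=0.2, random_state=42
)
```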
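As a concrete run of the same preprocessing steps, the sketch below uses a small made-up table. The column names (`age`, `city`) and their values are illustrative assumptions and are not part of dataclr.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with one numerical and one categorical column (made-up values)
X = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
})

# One-hot encode the categorical column; numerical columns pass through unchanged
X_encoded = pd.get_dummies(X, dtype=float)

# Standardize every column to zero mean and unit variance
scaler = StandardScaler()
X_normalized = pd.DataFrame(
    scaler.fit_transform(X_encoded),
    columns=X_encoded.columns,
    index=X_encoded.index,
)

print(list(X_normalized.columns))
print(X_normalized.round(2))
```

After encoding, the single `city` column becomes one indicator column per distinct city, so the feature count grows before any selection takes place.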

### 2. Use `FeatureSelector`

The `FeatureSelector` is a high-level API that combines multiple methods to select the best feature subsets:

```python
from sklearn.ensemble import RandomForestClassifier
from dataclr.feature_selection import FeatureSelector

# Define a scikit-learn model
my_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Initialize the FeatureSelector
selector = FeatureSelector(
    model=my_model,
    metric="accuracy",
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)

# Perform feature selection
selected_features = selector.select_features(n_results=5)
print(selected_features)
```
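Since the selector receives a model, a metric, and separate train and test splits, a natural sanity check for any selected subset is to compare it with the all-features baseline on the held-out split. The sketch below does this with scikit-learn only. The synthetic data, the helper `holdout_accuracy`, and the `chosen_columns` list are hypothetical stand-ins, and nothing is assumed about the structure of the `select_features` result.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real tabular dataset
X_arr, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, n_redundant=0,
    shuffle=False, random_state=42,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)


def holdout_accuracy(columns):
    """Fit a fresh model on the given columns and score it on the test split."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train[columns], y_train)
    return accuracy_score(y_test, model.predict(X_test[columns]))


# Hypothetical subset; in practice this would come from a feature selection result
chosen_columns = ["f0", "f1", "f2", "f3"]

baseline = holdout_accuracy(list(X.columns))
subset = holdout_accuracy(chosen_columns)
print(f"all features: {baseline:.3f}, subset: {subset:.3f}")
```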

### 3. Use Individual Methods

For granular control, you can use individual feature selection methods:

```python
from sklearn.linear_model import LogisticRegression
from dataclr.methods import MutualInformation

# Define a scikit-learn model
my_model = LogisticRegression(solver="liblinear", max_iter=1000)

# Initialize a method
method = MutualInformation(model=my_model, metric="accuracy")

# Fit and transform
results = method.fit_transform(X_train, X_test, y_train, y_test)
print(results)
```
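To give a sense of what a mutual-information filter measures, the sketch below ranks two features with scikit-learn's `mutual_info_classif`. It illustrates the statistic only; how dataclr's `MutualInformation` computes and thresholds scores internally is not shown here. The toy columns `signal` and `noise` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)

X = pd.DataFrame({
    # Strongly tied to the target: class label plus small noise
    "signal": y + rng.normal(scale=0.1, size=500),
    # Independent of the target
    "noise": rng.normal(size=500),
})

# Estimate mutual information between each feature and the class label
mi = mutual_info_classif(X, y, random_state=0)
ranking = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(ranking)
```

A feature that carries information about the label receives a clearly higher score than an independent one, which is what makes the statistic usable for ranking.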

## Benchmarks

Because the algorithm returns multiple candidate results, the benchmarks below show results chosen to balance feature count against performance. The algorithm can also return the best-performing result when that matters more than a small feature set.

![benchmark_bank](https://i.imgur.com/qiG1L9j.png)
![benchmark_students](https://i.imgur.com/FpY3N9h.png)
![benchmark_fifa](https://i.imgur.com/BDTkYgL.png)
![benchmark_uber](https://i.imgur.com/X3uYyCX.png)

## Documentation

Explore the <a href="https://www.dataclr.com">full documentation</a> for detailed usage
instructions, API references, and examples.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "dataclr",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "feature selection, data science, machine learning, tabular data",
    "author": null,
    "author_email": "Lukasz Machutt <lukasz.machutt@gmail.com>, Jakub Nurkiewicz <jakub.nurkiewicz.2003@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/b3/80/2d2ceb8bdbb01c6bb9951dd0c8311f5b8a33f578459d3bc705e52a1e96a8/dataclr-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A Python library for feature selection in tabular datasets",
    "version": "0.2.0",
    "project_urls": {
        "Documentation": "https://www.dataclr.com",
        "Homepage": "https://github.com/dataclr/dataclr"
    },
    "split_keywords": [
        "feature selection",
        " data science",
        " machine learning",
        " tabular data"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "30b1d8c4144486911f88ef7065de9eb3dba2698e4c93c830a41ad20130ecf0a9",
                "md5": "17f933e2eb8ba3f30351aae7083e7418",
                "sha256": "b6827d48422718cbcda114dd772041dbbf7a37b9cbe67e692348e39a730606bc"
            },
            "downloads": -1,
            "filename": "dataclr-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "17f933e2eb8ba3f30351aae7083e7418",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 58165,
            "upload_time": "2025-01-06T10:55:54",
            "upload_time_iso_8601": "2025-01-06T10:55:54.485788Z",
            "url": "https://files.pythonhosted.org/packages/30/b1/d8c4144486911f88ef7065de9eb3dba2698e4c93c830a41ad20130ecf0a9/dataclr-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b3802d2ceb8bdbb01c6bb9951dd0c8311f5b8a33f578459d3bc705e52a1e96a8",
                "md5": "5cb8145fc579db4efb712cb43bb0a899",
                "sha256": "bbdf7e399c9e84d927483c9749e14940f001016a7641f40351bc8a8ad08c668a"
            },
            "downloads": -1,
            "filename": "dataclr-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "5cb8145fc579db4efb712cb43bb0a899",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 37286,
            "upload_time": "2025-01-06T10:55:56",
            "upload_time_iso_8601": "2025-01-06T10:55:56.734403Z",
            "url": "https://files.pythonhosted.org/packages/b3/80/2d2ceb8bdbb01c6bb9951dd0c8311f5b8a33f578459d3bc705e52a1e96a8/dataclr-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-06 10:55:56",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "dataclr",
    "github_project": "dataclr",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "dataclr"
}
        