Name: kditransform
Version: 1.0.0
Home page: http://github.com/calvinmccarter/kditransform
Summary: Kernel density integral transformation
Author / maintainer: Calvin McCarter
Upload time: 2025-07-20 01:20:16
Keywords: preprocessing, integral transformation, kernel smoothing
Requirements: numba, numpy, pre-commit, pytest, scikit-learn, scipy

# kditransform

[![PyPI version](https://badge.fury.io/py/kditransform.svg)](https://badge.fury.io/py/kditransform)
[![Downloads](https://pepy.tech/badge/kditransform)](https://pepy.tech/project/kditransform)

The kernel-density integral transformation [(McCarter, 2023, TMLR)](https://openreview.net/pdf?id=6OEcDKZj5j), like [min-max scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) and [quantile transformation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html), maps continuous features to the range `[0, 1]`.
It achieves a happy balance between these two transforms, preserving the shape of the input distribution like min-max scaling, while nonlinearly attenuating the effect of outliers like quantile transformation.
It can also be used to discretize features, offering a data-driven alternative to univariate clustering or [K-bins discretization](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-discretization).

You can tune the interpolation parameter $\alpha$ between 0 (quantile transform) and $\infty$ (min-max transform), but a good default is $\alpha=1$, which is equivalent to using `scipy.stats.gaussian_kde(bw_method=1)`. This is an easy way to improve performance on many supervised learning problems. See [this notebook](https://github.com/calvinmccarter/kditransform/blob/master/examples/regression-plots.ipynb) for example usage and the [paper](https://openreview.net/pdf?id=6OEcDKZj5j) for a detailed description of the method.
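
As a quick illustration of this interpolation, here is a minimal sketch; the extreme `alpha` values and the closeness checks are illustrative assumptions, not exact equivalences:

```
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer
from kditransform import KDITransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(1000, 1))  # right-skewed data with a tail

# Small alpha approaches the quantile transform; large alpha approaches min-max.
Y_small = KDITransformer(alpha=0.01).fit_transform(X)
Y_large = KDITransformer(alpha=100.).fit_transform(X)
Y_quantile = QuantileTransformer(n_quantiles=1000).fit_transform(X)
Y_minmax = MinMaxScaler().fit_transform(X)

print(np.abs(Y_small - Y_quantile).mean())  # near 0
print(np.abs(Y_large - Y_minmax).mean())    # near 0
```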

<figure>
  <figcaption><i>Accuracy on Iris</i></figcaption>
  <img src="examples/Accuracy-vs-bwf-iris-pca.jpg" alt="drawing" width="300"/>
</figure>
<figure>
  <figcaption><i>rMSE on CA Housing</i></figcaption>
  <img src="examples/MSE-vs-bwf-cahousing-linr-nolegend.jpg" alt="drawing" width="300"/>
</figure>
    

## Installation 

### Installation from PyPI
```
pip install kditransform
```

### Installation from source
After cloning this repo, install the dependencies from the command line, then install kditransform in editable mode and run the tests:
```
pip install -r requirements.txt
pip install -e .
pytest
```

## Usage

`kditransform.KDITransformer` is a drop-in replacement for [sklearn.preprocessing.QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html). When `alpha` (defaults to 1.0) is small, our method behaves like the QuantileTransformer; when `alpha` is large, it behaves like [sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).

To produce features that are roughly scaled like z-scores as in `StandardScaler`, use `KDITransformer(output_distribution='normal')`. This applies the standard normal inverse CDF transform after the KDI transform.

```
import numpy as np
from kditransform import KDITransformer
X = np.random.uniform(size=(500, 1))
kdt = KDITransformer(alpha=1.)
Y = kdt.fit_transform(X)
```
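
The snippet below sketches the `output_distribution='normal'` option described above; the mean/std printout is only an approximate sanity check:

```
import numpy as np
from kditransform import KDITransformer

X = np.random.uniform(size=(500, 1))
# The [0, 1] KDI output is passed through the standard normal inverse CDF,
# yielding roughly z-score-scaled features.
Z = KDITransformer(alpha=1., output_distribution='normal').fit_transform(X)
print(Z.mean(), Z.std())  # approximately 0 and 1
```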

`kditransform.KDIDiscretizer` offers an API based on [sklearn.preprocessing.KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html). It encodes each feature ordinally, similarly to `KBinsDiscretizer(encode='ordinal')`.

```
import numpy as np
from kditransform import KDIDiscretizer

N = 1000  # total sample size; any reasonably large value works
rng = np.random.default_rng(1)
x1 = rng.normal(1, 0.75, size=int(0.55*N))
x2 = rng.normal(4, 1, size=int(0.3*N))
x3 = rng.uniform(0, 20, size=int(0.15*N))
X = np.sort(np.r_[x1, x2, x3]).reshape(-1, 1)
kdd = KDIDiscretizer()
T = kdd.fit_transform(X)
```
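
Continuing the snippet above, the sketch below contrasts this with `KBinsDiscretizer(encode='ordinal')`, assuming (per the description above) that `KDIDiscretizer` selects its bins from the data rather than taking `n_bins` up front:

```
from sklearn.preprocessing import KBinsDiscretizer

kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
T_kbins = kbd.fit_transform(X)
print(np.unique(T))        # bin labels chosen by KDIDiscretizer from the data
print(np.unique(T_kbins))  # bin labels fixed in advance: [0. 1. 2.]
```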

When initialized as `KDIDiscretizer(enable_predict_proba=True)`, it can also output one-hot encodings and probabilistic one-hot encodings of single-feature input data.

```
kdd = KDIDiscretizer(enable_predict_proba=True).fit(X)
P = kdd.predict(X)  # one-hot encoding
P = kdd.predict_proba(X)  # probabilistic one-hot encoding
```

## Citing this method

If you use this tool, please cite KDITransform
using the following reference to our [TMLR paper](https://openreview.net/pdf?id=6OEcDKZj5j):

In BibTeX format:

```bibtex
@article{mccarter2023the,
  title={The Kernel Density Integral Transformation},
  author={Calvin McCarter},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=6OEcDKZj5j},
}
```

## Usage with TabPFN

[TabPFN](https://arxiv.org/abs/2207.01848) is a meta-learned Transformer model for tabular classification. In the TabPFN paper, features are preprocessed by concatenating z-scored and power-transformed features. After simply [adding KDITransform'ed features](https://github.com/calvinmccarter/TabPFN/commit/e51e6621e2f1820d5646b14640fcfb9ef13f3c2d#diff-6e18bf62a38856a86e8846cefd2d9fd323dc178c161d4e63d23bf613dc6de654), I observed [improvements](https://github.com/calvinmccarter/TabPFN/blob/e51e6621e2f1820d5646b14640fcfb9ef13f3c2d/replicate-kditransform.ipynb) on the reported benchmarks. In particular, on the 30 test datasets in OpenML-CC18, mean AUC OVO increases from 0.8943 to 0.8950; on the subset of 18 numerical datasets in Table 1 of the TabPFN paper, mean AUC OVO increases from 0.9335 to 0.9344.
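
The linked commit shows the actual TabPFN modification; the following is only a minimal sketch of the feature-concatenation idea, on an arbitrary synthetic input:

```
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler
from kditransform import KDITransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Augment z-scored and power-transformed features with KDI-transformed copies.
X_aug = np.hstack([
    StandardScaler().fit_transform(X),
    PowerTransformer().fit_transform(X),
    KDITransformer(alpha=1.).fit_transform(X),
])
print(X_aug.shape)  # (200, 12): three transformed views of 4 original features
```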

            
