# kditransform
[PyPI version](https://badge.fury.io/py/kditransform)
[Downloads](https://pepy.tech/project/kditransform)
The kernel-density integral transformation [(McCarter, 2023, TMLR)](https://openreview.net/pdf?id=6OEcDKZj5j), like [min-max scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) and [quantile transformation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html), maps continuous features to the range `[0, 1]`.
It achieves a happy balance between these two transforms, preserving the shape of the input distribution like min-max scaling, while nonlinearly attenuating the effect of outliers like quantile transformation.
It can also be used to discretize features, offering a data-driven alternative to univariate clustering or [K-bins discretization](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-discretization).
You can tune the interpolation $\alpha$ between 0 (quantile transform) and $\infty$ (min-max transform), but a good default is $\alpha=1$, which is equivalent to using `scipy.stats.gaussian_kde(bw_method=1)`. This is an easy way to improve performance on many supervised learning problems. See [this notebook](https://github.com/calvinmccarter/kditransform/blob/master/examples/regression-plots.ipynb) for example usage and the [paper](https://openreview.net/pdf?id=6OEcDKZj5j) for a detailed description of the method.
<figure>
<figcaption><i>Accuracy on Iris</i></figcaption>
<img src="examples/Accuracy-vs-bwf-iris-pca.jpg" alt="drawing" width="300"/>
</figure>
<figure>
<figcaption><i>rMSE on CA Housing</i></figcaption>
<img src="examples/MSE-vs-bwf-cahousing-linr-nolegend.jpg" alt="drawing" width="300"/>
</figure>
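As a quick illustration of this interpolation (a minimal sketch, not taken from the package's examples), the snippet below transforms data containing one large outlier with all three methods. Min-max scaling squashes the bulk of the data toward 0, the quantile transform spreads it uniformly, and `alpha=1` lands in between:

```
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer
from kditransform import KDITransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
X[0, 0] = 25.0  # inject a single large outlier

Y_minmax = MinMaxScaler().fit_transform(X)
Y_quant = QuantileTransformer(n_quantiles=200).fit_transform(X)
Y_kdi = KDITransformer(alpha=1.0).fit_transform(X)

for name, Y in [("min-max", Y_minmax), ("quantile", Y_quant), ("kdi", Y_kdi)]:
    print(name, float(np.median(Y)))  # where the bulk of the data lands in [0, 1]
```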
## Installation
### Installation from PyPI
```
pip install kditransform
```
### Installation from source
After cloning this repo, install the dependencies from the command line, then install kditransform in editable mode and run the tests:
```
pip install -r requirements.txt
pip install -e .
pytest
```
## Usage
`kditransform.KDITransformer` is a drop-in replacement for [sklearn.preprocessing.QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html). When `alpha` (defaults to 1.0) is small, our method behaves like the QuantileTransformer; when `alpha` is large, it behaves like [sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).
To produce features that are roughly scaled like z-scores as in `StandardScaler`, use `KDITransformer(output_distribution='normal')`. This applies the standard normal inverse CDF transform after the KDI transform.
```
import numpy as np
from kditransform import KDITransformer
X = np.random.uniform(size=(500, 1))
kdt = KDITransformer(alpha=1.)
Y = kdt.fit_transform(X)
```
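A short sketch of the `output_distribution='normal'` option described above (same toy data pattern as the previous snippet):

```
import numpy as np
from kditransform import KDITransformer

# KDI-transform to [0, 1], then apply the standard normal inverse CDF,
# yielding features on roughly the same scale as z-scores.
X = np.random.uniform(size=(500, 1))
kdt = KDITransformer(alpha=1., output_distribution='normal')
Z = kdt.fit_transform(X)
print(Z.mean(), Z.std())  # roughly 0 and 1 for well-behaved inputs
```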
`kditransform.KDIDiscretizer` offers an API based on [sklearn.preprocessing.KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html). It encodes each feature ordinally, similarly to `KBinsDiscretizer(encode='ordinal')`.
```
import numpy as np
from kditransform import KDIDiscretizer

# Simulate N samples from a mixture of two Gaussians plus a uniform component.
N = 1000  # total number of samples
rng = np.random.default_rng(1)
x1 = rng.normal(1, 0.75, size=int(0.55*N))
x2 = rng.normal(4, 1, size=int(0.3*N))
x3 = rng.uniform(0, 20, size=int(0.15*N))
X = np.sort(np.r_[x1, x2, x3]).reshape(-1, 1)
kdd = KDIDiscretizer()
T = kdd.fit_transform(X)
```
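For comparison (a hedged sketch, not from the original README; it reuses `X` and `T` from the snippet above), `KBinsDiscretizer` requires the number of bins to be fixed up front, whereas `KDIDiscretizer` chooses its bins from the estimated density:

```
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

print(np.unique(T))  # ordinal bin labels chosen by KDIDiscretizer

kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
print(np.unique(kbd.fit_transform(X)))  # array([0., 1., 2.])
```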
When initialized with `KDIDiscretizer(enable_predict_proba=True)`, it can also output one-hot encodings and probabilistic one-hot encodings of single-feature input data.
```
kdd = KDIDiscretizer(enable_predict_proba=True).fit(X)
P = kdd.predict(X) # one-hot encoding
P = kdd.predict_proba(X) # probabilistic one-hot encoding
```
## Citing this method
If you use this tool, please cite KDITransform using the following BibTeX reference to our [TMLR paper](https://openreview.net/pdf?id=6OEcDKZj5j):
```bibtex
@article{mccarter2023the,
  title={The Kernel Density Integral Transformation},
  author={Calvin McCarter},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=6OEcDKZj5j},
}
```
## Usage with TabPFN
[TabPFN](https://arxiv.org/abs/2207.01848) is a meta-learned Transformer model for tabular classification. In the TabPFN paper, features are preprocessed by concatenating z-scored and power-transformed features. After simply [adding KDITransform'ed features](https://github.com/calvinmccarter/TabPFN/commit/e51e6621e2f1820d5646b14640fcfb9ef13f3c2d#diff-6e18bf62a38856a86e8846cefd2d9fd323dc178c161d4e63d23bf613dc6de654), I observed [improvements](https://github.com/calvinmccarter/TabPFN/blob/e51e6621e2f1820d5646b14640fcfb9ef13f3c2d/replicate-kditransform.ipynb) on the reported benchmarks. In particular, on the 30 test datasets in OpenML-CC18, mean AUC OVO increases from 0.8943 to 0.8950; on the subset of 18 numerical datasets in Table 1 of the TabPFN paper, mean AUC OVO increases from 0.9335 to 0.9344.
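For a rough sense of that setup (a hypothetical, standalone sketch; it is not the linked TabPFN patch), one can concatenate KDI-transformed copies of the features with z-scored and power-transformed ones:

```
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler
from kditransform import KDITransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(300, 5))  # skewed toy features standing in for a tabular dataset

# Baseline preprocessing in the spirit of the TabPFN setup: z-scored and
# power-transformed features, plus KDI-transformed copies as extra columns.
X_std = StandardScaler().fit_transform(X)
X_pow = PowerTransformer().fit_transform(X)
X_kdi = KDITransformer(alpha=1.0).fit_transform(X)
X_all = np.hstack([X_std, X_pow, X_kdi])
print(X_all.shape)  # (300, 15)
```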