kditransform

Name: kditransform
Version: 0.2.0
Summary: Kernel density integral transformation
Home page: http://github.com/calvinmccarter/kditransform
Author / maintainer: Calvin McCarter
Upload time: 2023-11-28 23:35:32
Keywords: preprocessing, integral transformation, kernel smoothing
# kditransform

The kernel-density integral transformation [(McCarter, 2023, TMLR)](https://openreview.net/pdf?id=6OEcDKZj5j), like [min-max scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) and [quantile transformation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html), maps continuous features to the range `[0, 1]`.
It achieves a happy balance between these two transforms, preserving the shape of the input distribution like min-max scaling, while nonlinearly attenuating the effect of outliers like quantile transformation.
It can also be used to discretize features, offering a data-driven alternative to univariate clustering or [K-bins discretization](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-discretization).

You can tune the interpolation $\alpha$ between 0 (quantile transform) and $\infty$ (min-max transform), but a good default is $\alpha=1$, which is equivalent to using `scipy.stats.gaussian_kde(bw_method=1)`. This is an easy way to improve performance on many supervised learning problems. See [this notebook](https://github.com/calvinmccarter/kditransform/blob/master/examples/regression-plots.ipynb) for example usage and the [paper](https://openreview.net/pdf?id=6OEcDKZj5j) for a detailed description of the method.

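A quick way to see this interpolation in practice is the sketch below (not from the package docs; the skewed synthetic data and the small/large `alpha` values are chosen only for illustration, assuming such values are accepted as described above). It compares `KDITransformer` output against scikit-learn's `QuantileTransformer` and `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer
from kditransform import KDITransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 1))  # skewed data with a heavy right tail

Y_quantile = QuantileTransformer(n_quantiles=1000).fit_transform(X)
Y_minmax = MinMaxScaler().fit_transform(X)

# Small alpha should approach the quantile transform, large alpha the min-max scaler.
for alpha in [0.01, 1.0, 100.0]:
    Y = KDITransformer(alpha=alpha).fit_transform(X)
    print(
        f"alpha={alpha:7.2f}  "
        f"max diff vs quantile: {np.abs(Y - Y_quantile).max():.3f}  "
        f"max diff vs min-max: {np.abs(Y - Y_minmax).max():.3f}"
    )
```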
<figure>
  <figcaption><i>Accuracy on Iris</i></figcaption>
  <img src="examples/Accuracy-vs-bwf-iris-pca.jpg" alt="drawing" width="300"/>
</figure>
<figure>
  <figcaption><i>rMSE on CA Housing</i></figcaption>
  <img src="examples/MSE-vs-bwf-cahousing-linr-nolegend.jpg" alt="drawing" width="300"/>
</figure>
    

## Installation 

### Installation from PyPI
```bash
pip install kditransform
```

### Installation from source
After cloning this repo, install the dependencies from the command line, then install kditransform and run the tests:
```bash
pip install -r requirements.txt
pip install -e .
pytest
```

## Usage

`kditransform.KDITransformer` is a drop-in replacement for [sklearn.preprocessing.QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html). When `alpha` (which defaults to 1.0) is small, our method behaves like the QuantileTransformer; when `alpha` is large, it behaves like [sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).

```python
import numpy as np
from kditransform import KDITransformer
X = np.random.uniform(size=(500, 1))
kdt = KDITransformer(alpha=1.)
Y = kdt.fit_transform(X)
```
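Because `KDITransformer` is described above as a drop-in replacement for a scikit-learn transformer, it should also compose with `Pipeline`. A minimal sketch (the Iris dataset and the logistic-regression model are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from kditransform import KDITransformer

X, y = load_iris(return_X_y=True)

# Preprocess each feature with the kernel-density integral transform, then classify.
pipe = make_pipeline(KDITransformer(alpha=1.0), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())
```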

`kditransform.KDIDiscretizer` offers an API based on [sklearn.preprocessing.KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html). It encodes each feature ordinally, similarly to `KBinsDiscretizer(encode='ordinal')`.

```python
import numpy as np
from kditransform import KDIDiscretizer

N = 1000  # total sample size for this synthetic mixture example
rng = np.random.default_rng(1)
x1 = rng.normal(1, 0.75, size=int(0.55*N))
x2 = rng.normal(4, 1, size=int(0.3*N))
x3 = rng.uniform(0, 20, size=int(0.15*N))
X = np.sort(np.r_[x1, x2, x3]).reshape(-1, 1)
kdd = KDIDiscretizer()
T = kdd.fit_transform(X)
```
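Assuming `T` holds per-sample ordinal bin indices, as with `KBinsDiscretizer(encode='ordinal')`, the data-driven bins can be inspected directly; a small usage sketch (not part of the documented API):

```python
import numpy as np

bins = np.unique(T)
print(f"{len(bins)} data-driven bins")
for b in bins:
    # Range of the raw values assigned to each bin.
    members = X[T.ravel() == b]
    print(f"bin {int(b)}: [{members.min():.2f}, {members.max():.2f}]")
```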

When initialized as `KDIDiscretizer(enable_predict_proba=True)`, it can also output one-hot encodings and probabilistic one-hot encodings of single-feature input data.

```python
kdd = KDIDiscretizer(enable_predict_proba=True).fit(X)
P = kdd.predict(X)  # one-hot encoding
P = kdd.predict_proba(X)  # probabilistic one-hot encoding
```

## Citing this method

If you use this tool, please cite KDITransform
using the following BibTeX reference to our [TMLR paper](https://openreview.net/pdf?id=6OEcDKZj5j):

```bibtex
@article{
mccarter2023the,
title={The Kernel Density Integral Transformation},
author={Calvin McCarter},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=6OEcDKZj5j},
note={}
}
```

## Usage with TabPFN

[TabPFN](https://arxiv.org/abs/2207.01848) is a meta-learned Transformer model for tabular classification. In the TabPFN paper, each feature is preprocessed into a concatenation of its z-scored and power-transformed versions. After simply [adding KDITransform'ed features](https://github.com/calvinmccarter/TabPFN/commit/e51e6621e2f1820d5646b14640fcfb9ef13f3c2d#diff-6e18bf62a38856a86e8846cefd2d9fd323dc178c161d4e63d23bf613dc6de654) to this representation, I observed [improvements](https://github.com/calvinmccarter/TabPFN/blob/e51e6621e2f1820d5646b14640fcfb9ef13f3c2d/replicate-kditransform.ipynb) on the reported benchmarks. In particular, on the 30 test datasets in OpenML-CC18, mean AUC OVO increases from 0.8943 to 0.8950; on the subset of 18 numerical datasets in Table 1 of the TabPFN paper, mean AUC OVO increases from 0.9335 to 0.9344.
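For anyone who wants to try the same idea outside of TabPFN, a rough sketch of the general recipe (this is not the linked TabPFN patch; it simply concatenates a KDITransform'ed view with z-scored and power-transformed views of the same features):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import PowerTransformer, StandardScaler
from kditransform import KDITransformer

# Three parallel "views" of the same features, concatenated column-wise.
preprocess = FeatureUnion([
    ("zscore", StandardScaler()),
    ("power", PowerTransformer()),       # Yeo-Johnson power transform, standardized
    ("kdi", KDITransformer(alpha=1.0)),  # KDITransform'ed copy of the features
])

rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 4))
X_aug = preprocess.fit_transform(X)
print(X_aug.shape)  # (200, 12): three views of the 4 original features
```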

            
