# unmasking-trees 😷➡️🥳 🌲🌲🌲
[![PyPI version](https://badge.fury.io/py/utrees.svg)](https://badge.fury.io/py/utrees)
[![Downloads](https://static.pepy.tech/badge/utrees)](https://pepy.tech/project/utrees)
UnmaskingTrees is a method for tabular data generation and imputation. It's an order-agnostic autoregressive diffusion model, wherein a training dataset is constructed by incrementally masking features in random order. Per-feature gradient-boosted trees are then trained to unmask each feature. Read more about it in my [blog post](https://calvinmccarter.substack.com/p/unmasking-trees-for-tabular-data)!
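To make the masking procedure concrete, here is a minimal NumPy sketch of how such a training set could be built. It is an illustration under simplifying assumptions, not the code in `utrees`: the helper name `build_masked_training_set` is hypothetical, masked entries are marked with NaN, and each sample gets a single random masking order.

```python
import numpy as np


def build_masked_training_set(X, rng):
    """Mask features one at a time, in a random order per sample.

    Returns the partially masked rows, the index of the feature that was
    just masked in each row, and the value that was hidden.
    """
    n_samples, n_dims = X.shape
    inputs, feature_idx, targets = [], [], []
    for row in X:
        order = rng.permutation(n_dims)  # random masking order for this sample
        masked = row.astype(float)       # astype returns a copy
        for j in order:
            targets.append(masked[j])    # value the model must recover
            masked[j] = np.nan           # mask one more feature
            inputs.append(masked.copy())
            feature_idx.append(j)
    return np.array(inputs), np.array(feature_idx), np.array(targets)


rng = np.random.default_rng(0)
X = np.arange(12, dtype=float).reshape(4, 3)  # toy data: 4 samples, 3 features
inputs, feature_idx, targets = build_masked_training_set(X, rng)
print(inputs.shape)  # (12, 3): n_samples * n_dims rows
```

Each training row pairs a partially masked sample with the value of the feature that was just hidden. A per-feature model would then be trained on the rows whose `feature_idx` matches that feature.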
To better model multi-modal conditional distributions ("modal" as in "modes", not as in "modalities"), continuous features are discretized into `n_bins` bins by default. You can customize this via the `quantize_cols` parameter of the `fit` method: provide a list of length `n_dims`, with values in `('continuous', 'categorical', 'integer')`. Given `continuous`, the feature is quantized; given `categorical`, quantization is skipped for that feature; given `integer`, the feature is quantized only if its number of unique values exceeds `n_bins`.
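The per-column rule can be written out as a small decision function. This is only a sketch of the rule as described here: `should_quantize` is a hypothetical helper, not part of the `utrees` API, and `n_bins = 20` is an arbitrary value chosen for the example.

```python
import numpy as np


def should_quantize(column, kind, n_bins):
    """Decide whether a column gets discretized, following the rule above."""
    if kind == "continuous":
        return True
    if kind == "categorical":
        return False
    if kind == "integer":
        observed = column[~np.isnan(column)]  # ignore missing entries
        return bool(np.unique(observed).size > n_bins)
    raise ValueError(f"unknown column kind: {kind!r}")


rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=200),           # continuous feature
    rng.integers(0, 3, size=200),   # categorical codes
    np.arange(200) % 100,           # integer feature with 100 distinct values
    rng.integers(0, 5, size=200),   # integer feature with at most 5 distinct values
]).astype(float)

quantize_cols = ["continuous", "categorical", "integer", "integer"]
n_bins = 20  # assumed value, for illustration only
decisions = [
    should_quantize(X[:, j], kind, n_bins)
    for j, kind in enumerate(quantize_cols)
]
print(decisions)  # [True, False, True, False]
```

The same `quantize_cols` list is what you would pass to `fit`, with one entry per column of your data.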
<figure>
<figcaption><i>Here's how well it works on imputation with the <a href="https://github.com/calvinmccarter/unmasking-trees/blob/master/paper/moons.ipynb">Two Moons</a> synthetic dataset:</i></figcaption>
<img src="paper/moons-imputation.png" alt="drawing" width="600"/>
</figure>
## Installation
### Installation from PyPI
```
pip install utrees
```
### Installation from source
After cloning this repo, install the dependencies from the command line, then install utrees in editable mode and run the tests:
```
pip install -r requirements.txt
pip install -e .
pytest
```
## Usage
Check out [this notebook](https://github.com/calvinmccarter/unmasking-trees/blob/master/paper/moons.ipynb) with the Two Moons example, or [this one](https://github.com/calvinmccarter/unmasking-trees/blob/master/paper/iris.ipynb) with the Iris dataset.
### Synthetic data generation
You can fit `utrees.UnmaskingTrees` the way you would an sklearn model, with the added option of passing `quantize_cols` to `fit`: a list that marks each column as `'continuous'`, `'categorical'`, or `'integer'`, so that only the columns that need it are discretized. By default, all columns are assumed to contain continuous features.
```
import numpy as np
from sklearn.datasets import make_moons
from utrees import UnmaskingTrees
data, labels = make_moons((100, 100), shuffle=False, noise=0.1, random_state=123) # size (200, 2)
utree = UnmaskingTrees().fit(data)
```
Then, you can generate new data:
```
newdata = utree.generate(n_generate=123) # size (123, 2)
```
### Missing data imputation
You can fit your `UnmaskingTrees` model on data with missing elements, provided as `np.nan`. You can then impute the missing values, potentially with multiple imputations per missing element. Given an array of shape `(n_samples, n_dims)`, you will get back an array of shape `(n_impute, n_samples, n_dims)`, in which the NaNs have been replaced and all other entries are unchanged.
```
data4impute = data.copy()
data4impute[:, 1] = np.nan
X = np.concatenate([data, data4impute], axis=0) # size (400, 2)
utree = UnmaskingTrees().fit(X)
imputeddata = utree.impute(n_impute=5) # size (5, 400, 2)
```
You can also provide a totally new dataset to be imputed, so the model performs imputation without retraining:
```
utree = UnmaskingTrees().fit(data)
imputeddata = utree.impute(n_impute=5, X=data4impute) # size (5, 200, 2)
```
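Since `impute` returns one completed copy of the data per imputation, you will often want to summarize across the first axis. Here is a sketch of doing that with plain NumPy. So that it runs without fitting a model, the `imputed` array below is a random stand-in with the same layout as the output described above; it is not real model output.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.1, np.nan],
              [0.5, 0.2],
              [np.nan, 0.9]])
missing = np.isnan(X)
n_impute = 5

# Stand-in for utree.impute(n_impute=5, X=X): observed entries are copied
# as-is and NaNs are filled with random draws.
imputed = np.repeat(X[None, :, :], n_impute, axis=0)
imputed[:, missing] = rng.normal(size=(n_impute, missing.sum()))

point_estimate = imputed.mean(axis=0)  # per-entry average over imputations
spread = imputed.std(axis=0)           # per-entry disagreement between imputations

print(imputed.shape)  # (5, 3, 2)
```

For observed entries the spread is zero, because every imputation leaves them unchanged; for missing entries it gives a rough sense of how uncertain the imputation is.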
### Hyperparameters
- `depth`: Depth of the balanced binary tree used to recursively quantize each feature.
- `duplicate_K`: Number of random masking orders per actual sample. The training dataset will be of size `(n_samples * n_dims * duplicate_K, n_dims)`.
- `xgboost_kwargs`: Dict of keyword arguments to pass to `XGBClassifier`.
- `strategy`: How to quantize continuous features (`'kdiquantile'`, `'quantile'`, `'uniform'`, or `'kmeans'`).
- `random_state`: Controls randomness.
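One way to build intuition for `depth` is the rough sketch below. It assumes that each node of the tree splits a feature's values into two halves at the median, so a tree of depth `d` ends up with `2**d` bins. The library's own strategies (such as `'kdiquantile'`) may place the splits differently, and `recursive_bin_edges` is a hypothetical helper, not a `utrees` function.

```python
import numpy as np


def recursive_bin_edges(values, depth):
    """Return sorted interior bin edges from recursive median splits."""
    if depth == 0 or values.size < 2:
        return []
    median = np.median(values)
    left = values[values <= median]
    right = values[values > median]
    return (
        recursive_bin_edges(left, depth - 1)
        + [median]
        + recursive_bin_edges(right, depth - 1)
    )


rng = np.random.default_rng(0)
x = rng.normal(size=1000)

edges = np.array(recursive_bin_edges(x, depth=3))
# right=True sends values equal to an edge to the lower bin,
# matching the `<=` used in the split above.
bins = np.digitize(x, edges, right=True)

print(len(edges), np.unique(bins).size)  # 7 8
```

With `depth=3` there are 7 interior edges and 8 bins, and each bin holds the same number of points because every split is at a median.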
## Citing this method
Please consider citing the UnmaskingTrees [arXiv preprint](https://arxiv.org/pdf/2407.05593). The BibTeX entry is:
```
@article{mccarter2024unmasking,
title={Unmasking Trees for Tabular Data},
author={McCarter, Calvin},
journal={arXiv preprint arXiv:2407.05593},
year={2024}
}
```
Also, please consider citing ForestDiffusion ([code](https://github.com/SamsungSAILMontreal/ForestDiffusion) and [paper](https://arxiv.org/abs/2309.09968)), which this work builds on.