
- Name: utrees
- Version: 0.3.0
- Summary: Unmasking trees for tabular data generation and imputation
- Upload time: 2024-09-21 00:54:45
- Maintainer: Calvin McCarter
- Author: Calvin McCarter
- Keywords: tabular, imputation, generation
- Requirements: numpy, pre-commit, pytest, scikit-learn, scipy, tqdm, torch, lightgbm, xgboost, pandas, matplotlib, jupyterlab, seaborn
# unmasking-trees 😷➡️🥳 🌲🌲🌲

[![PyPI version](https://badge.fury.io/py/utrees.svg)](https://badge.fury.io/py/utrees)

UnmaskingTrees is a method for tabular data generation and imputation. It's an order-agnostic autoregressive diffusion model, wherein a training dataset is constructed by incrementally masking features in random order. Per-feature gradient-boosted trees are then trained to unmask each feature. Read more about it in my [blog post](https://calvinmccarter.substack.com/p/unmasking-trees-for-tabular-data)!

To better model multi-modal conditional distributions ("modal" as in "modes", not as in "modalities"), continuous features are discretized into `n_bins` bins by default. You can customize this via the `quantize_cols` parameter of the `fit` method: provide a list of length `n_dims` with values in `('continuous', 'categorical', 'integer')`. A `categorical` feature is never quantized; an `integer` feature is quantized only if its number of unique values exceeds `n_bins`.
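
The per-column rule above can be sketched in plain Python. This is only an illustration of the rule as described here, not code from utrees: `plan_quantization` is a hypothetical helper, and `n_bins` is passed in explicitly because this README does not state its default.

```python
def plan_quantization(columns, quantize_cols, n_bins):
    """Return one bool per column: would the rule described above discretize it?

    Hypothetical helper for illustration; it is not part of utrees.
    `columns` is a list of per-feature value lists (no missing values assumed).
    """
    allowed = ("continuous", "categorical", "integer")
    if len(quantize_cols) != len(columns):
        raise ValueError("quantize_cols must have one entry per feature")
    plan = []
    for values, kind in zip(columns, quantize_cols):
        if kind not in allowed:
            raise ValueError(f"unknown column kind: {kind!r}")
        if kind == "continuous":
            plan.append(True)  # continuous features are discretized
        elif kind == "categorical":
            plan.append(False)  # categorical features skip quantization
        else:
            # integer features are quantized only with more than n_bins unique values
            plan.append(len(set(values)) > n_bins)
    return plan


columns = [
    [0.1, 0.5, 0.9, 1.3],  # continuous measurement
    [0, 1, 0, 1],          # category codes
    [1, 2, 3, 4],          # integer counts
]
quantize_cols = ["continuous", "categorical", "integer"]
print(plan_quantization(columns, quantize_cols, n_bins=3))

# With utrees installed, the same list would be passed to fit, e.g.:
# UnmaskingTrees().fit(X, quantize_cols=quantize_cols)
```

The integer column flips between quantized and not quantized depending on how `n_bins` compares with its number of unique values, which is the only data-dependent part of the rule.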

<figure>
  <figcaption><i>Here's how well it works on imputation with the <a href="https://github.com/calvinmccarter/unmasking-trees/blob/master/paper/moons.ipynb">Two Moons</a> synthetic dataset:</i></figcaption>
  <img src="paper/moons-imputation.png" alt="drawing" width="600"/>
</figure>

## Installation 

### Installation from PyPI
```
pip install utrees
```

### Installation from source
After cloning this repo, install the dependencies on the command-line, then install utrees:
```
pip install -r requirements.txt
pip install -e .
```

## Usage

Check out [this notebook](https://github.com/calvinmccarter/unmasking-trees/blob/master/paper/moons.ipynb) with the Two Moons example, or [this one](https://github.com/calvinmccarter/unmasking-trees/blob/master/paper/iris.ipynb) with the Iris dataset.

### Synthetic data generation

You can fit `utrees.UnmaskingTrees` the way you would a scikit-learn model. Optionally, pass `quantize_cols` to `fit` to specify which columns are continuous (and therefore need to be discretized). By default, all columns are assumed to contain continuous features.

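```
import numpy as np
from sklearn.datasets import make_moons
from utrees import UnmaskingTrees

# Two interleaving half-moons with 100 points each
data, labels = make_moons((100, 100), shuffle=False, noise=0.1, random_state=123)  # size (200, 2)
# Fit with default settings (all columns treated as continuous)
utree = UnmaskingTrees().fit(data)
```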

Then, you can generate new data:

```
newdata = utree.generate(n_generate=123)  # size (123, 2)
```

### Missing data imputation

You can fit your `UnmaskingTrees` model on data with missing elements, provided as `np.nan`. You can then impute the missing values, potentially with multiple imputations per missing element. Given an array of shape `(n_samples, n_dims)`, you will get back an array of shape `(n_impute, n_samples, n_dims)`, in which the NaNs have been replaced and the observed values are unchanged.

```
data4impute = data.copy()
data4impute[:, 1] = np.nan  # hide the second feature in every row
X = np.concatenate([data, data4impute], axis=0)  # size (400, 2)
utree = UnmaskingTrees().fit(X)
imputeddata = utree.impute(n_impute=5)  # size (5, 400, 2)
```

You can also provide a totally new dataset to be imputed, so the model performs imputation without retraining:

```
utree = UnmaskingTrees().fit(data)
imputeddata = utree.impute(n_impute=5, X=data4impute)  # size (5, 200, 2)
```
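
One way to sanity-check the output contract described in this section (shape `(n_impute, n_samples, n_dims)`, NaNs filled, observed values untouched) is a small checker like the one below. `check_imputations` is a hypothetical helper, not part of utrees, and the mean-filling "imputer" is only a stand-in so the snippet runs without the library installed.

```python
import numpy as np


def check_imputations(X, imputed):
    """Return True if `imputed` matches the imputation contract for input `X`.

    Hypothetical helper for illustration; it is not part of utrees.
    """
    X = np.asarray(X, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    # Expect shape (n_impute, n_samples, n_dims)
    if imputed.ndim != 3 or imputed.shape[1:] != X.shape:
        return False
    observed = ~np.isnan(X)
    # Every originally observed entry must be identical in every imputation
    unchanged = bool(np.all(imputed[:, observed] == X[observed]))
    # No NaNs may remain after imputation
    filled = not np.isnan(imputed).any()
    return unchanged and filled


# Stand-in imputer for demonstration only: fill NaNs with column means.
X = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 6.0]])
col_means = np.nanmean(X, axis=0)
fake = np.stack([np.where(np.isnan(X), col_means, X)] * 5)  # shape (5, 3, 2)
print(check_imputations(X, fake))
```

With utrees installed, the same check could be applied to the arrays from the examples above, for instance `check_imputations(data4impute, imputeddata)`.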

### Hyperparameters

- `depth`: depth of the balanced binary tree used to recursively quantize each feature.
- `duplicate_K`: number of random masking orders per actual sample. The training dataset will be of size `(n_samples * n_dims * duplicate_K, n_dims)`.
- `xgboost_kwargs`: dict to pass to `XGBClassifier`.
- `strategy`: how to quantize continuous features (`'kdiquantile'`, `'quantile'`, `'uniform'`, or `'kmeans'`).
- `random_state`: controls randomness.
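
To make the training-set size given for `duplicate_K` concrete, here is a minimal NumPy sketch of building a dataset by incrementally masking features in random order. It is an assumption about how that construction could look, not the library's actual code; `build_masked_training_set` and its return values are hypothetical names.

```python
import numpy as np


def build_masked_training_set(X, duplicate_K=1, random_state=0):
    """Build masked inputs plus (feature index, value) targets.

    Illustrative sketch only; not the utrees implementation.
    """
    rng = np.random.default_rng(random_state)
    n_samples, n_dims = X.shape
    n_rows = n_samples * n_dims * duplicate_K
    inputs = np.empty((n_rows, n_dims))
    target_dims = np.empty(n_rows, dtype=int)
    target_vals = np.empty(n_rows)
    row = 0
    for i in range(n_samples):
        for _ in range(duplicate_K):
            masked = X[i].astype(float)  # astype returns a fresh copy
            # Mask one more feature per step, in a freshly sampled random order
            for dim in rng.permutation(n_dims):
                target_dims[row] = dim          # feature to be unmasked
                target_vals[row] = masked[dim]  # its true value
                masked[dim] = np.nan
                inputs[row] = masked            # row with `dim` (and earlier ones) hidden
                row += 1
    return inputs, target_dims, target_vals


X_demo = np.arange(12, dtype=float).reshape(4, 3)
inputs, target_dims, target_vals = build_masked_training_set(X_demo, duplicate_K=2)
print(inputs.shape)  # n_samples * n_dims * duplicate_K rows, n_dims columns
```

In this sketch, a per-feature model for feature `j` would be trained on the rows where `target_dims == j`, using `inputs` as features and `target_vals` (after quantization) as labels.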

## Citing this method

Please consider citing the UnmaskingTrees [arXiv preprint](https://arxiv.org/pdf/2407.05593). The bibtex is:

```
@article{mccarter2024unmasking,
  title={Unmasking Trees for Tabular Data},
  author={McCarter, Calvin},
  journal={arXiv preprint arXiv:2407.05593},
  year={2024}
}
```

Also, please consider citing ForestDiffusion ([code](https://github.com/SamsungSAILMontreal/ForestDiffusion) and [paper](https://arxiv.org/abs/2309.09968)), which this work builds on.

