- Name: utrees
- Version: 0.3.0
- Home page: http://github.com/calvinmccarter/unmasking-trees
- Summary: Unmasking trees for tabular data generation and imputation
- Upload time: 2024-09-21 00:54:45
- Author and maintainer: Calvin McCarter
- Keywords: tabular, imputation, generation
- Requirements: numpy, pre-commit, pytest, scikit-learn, scipy, tqdm, torch, lightgbm, xgboost, pandas, matplotlib, jupyterlab, seaborn

# unmasking-trees 😷➡️🥳 🌲🌲🌲

[![PyPI version](https://badge.fury.io/py/utrees.svg)](https://badge.fury.io/py/utrees)
[![Downloads](https://static.pepy.tech/badge/utrees)](https://pepy.tech/project/utrees)

UnmaskingTrees is a method for tabular data generation and imputation. It's an order-agnostic autoregressive diffusion model, wherein a training dataset is constructed by incrementally masking features in random order. Per-feature gradient-boosted trees are then trained to unmask each feature. Read more about it in my [blog post](https://calvinmccarter.substack.com/p/unmasking-trees-for-tabular-data)!
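
To make the training-set construction concrete, here is a minimal sketch of the masking step in plain Python. It is not the library's implementation: `build_masked_training_set` and its arguments are hypothetical names, NaN stands in for the mask, and the per-feature tree training is left out entirely.

```python
import math
import random


def build_masked_training_set(rows, duplicate_k=1, seed=0):
    """Return (masked_row, target_index, target_value) triples.

    Hypothetical helper, not part of utrees: for every row and every one of
    `duplicate_k` random feature orders, features are masked one at a time.
    After each masking step, the partially masked row is the model input and
    the feature that was just masked is the value to be predicted.
    """
    rng = random.Random(seed)
    examples = []
    for row in rows:
        n_dims = len(row)
        for _ in range(duplicate_k):
            order = list(range(n_dims))
            rng.shuffle(order)  # random masking order for this copy
            masked = list(row)
            for j in order:
                target = masked[j]
                masked[j] = math.nan  # NaN plays the role of the mask
                examples.append((list(masked), j, target))
    return examples


rows = [[0.1, 1.5, 3.0], [0.4, 1.1, 2.2]]
examples = build_masked_training_set(rows, duplicate_k=2)
print(len(examples))  # n_samples * n_dims * duplicate_k = 2 * 3 * 2 = 12
```

Grouping the triples by `target_index` gives one training set per feature, which is the input you would hand to a per-feature gradient-boosted model.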

To better model multi-modal conditional distributions ("modal" as in "modes", not as in "modalities"), continuous features are discretized into `n_bins` bins by default. You can customize this via the `quantize_cols` parameter of the `fit` method: provide a list of length `n_dims`, with values in `('continuous', 'categorical', 'integer')`. A `categorical` feature is never quantized; an `integer` feature is quantized only if its number of unique values exceeds `n_bins`.
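
The rule for the three column types can be written out as a small decision function. This is only a sketch of the rule as described above: `should_quantize` is a hypothetical helper, not a function exported by utrees, and the example columns are made up.

```python
def should_quantize(values, col_type, n_bins):
    """Decide whether a column would be discretized under the rule above.

    Hypothetical helper for illustration only; it mirrors the documented
    behaviour of the `quantize_cols` values, not the library's source code.
    """
    if col_type == "continuous":
        return True
    if col_type == "categorical":
        return False
    if col_type == "integer":
        # Only quantize when there are more distinct values than bins.
        return len(set(values)) > n_bins
    raise ValueError(f"unknown column type: {col_type!r}")


quantize_cols = ["continuous", "categorical", "integer"]
columns = [
    [0.12, 0.57, 0.33, 0.91],  # continuous measurements
    [0, 1, 1, 0],              # category codes
    [1, 2, 3, 4],              # integer counts with 4 unique values
]
decisions = [
    should_quantize(col, col_type, n_bins=3)
    for col, col_type in zip(columns, quantize_cols)
]
print(decisions)  # [True, False, True]
```

With the library itself, the same list would be passed as `quantize_cols` when calling `fit`.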


<figure>
  <figcaption><i>Here's how well it works on imputation with the <a href="https://github.com/calvinmccarter/unmasking-trees/blob/master/paper/moons.ipynb">Two Moons</a> synthetic dataset:</i></figcaption>
  <img src="paper/moons-imputation.png" alt="drawing" width="600"/>
</figure>

## Installation 

### Installation from PyPI
```
pip install utrees
```

### Installation from source
After cloning this repo, install the dependencies from the command line, install utrees in editable mode, and run the tests:
```
pip install -r requirements.txt
pip install -e .
pytest
```

## Usage

Check out [this notebook](https://github.com/calvinmccarter/unmasking-trees/blob/master/paper/moons.ipynb) with the Two Moons example, or [this one](https://github.com/calvinmccarter/unmasking-trees/blob/master/paper/iris.ipynb) with the Iris dataset.

### Synthetic data generation

You can fit `utrees.UnmaskingTrees` the way you would an sklearn model. Optionally, pass `quantize_cols` to `fit`: a list giving the type of each column, so that continuous columns are discretized. By default, all columns are assumed to contain continuous features.

```
import numpy as np
from sklearn.datasets import make_moons
from utrees import UnmaskingTrees
data, labels = make_moons((100, 100), shuffle=False, noise=0.1, random_state=123)  # size (200, 2)
utree = UnmaskingTrees().fit(data)
```

Then, you can generate new data:

```
newdata = utree.generate(n_generate=123)  # size (123, 2)
```

### Missing data imputation

You can fit your `UnmaskingTrees` model on data with missing elements, represented as `np.nan`, and then impute the missing values, optionally drawing multiple imputations per missing element. Given an array of shape `(n_samples, n_dims)`, you get back an array of shape `(n_impute, n_samples, n_dims)`, in which the NaNs have been replaced and all other entries are unchanged.

```
data4impute = data.copy()
data4impute[:, 1] = np.nan  # hide the second feature in every row
X = np.concatenate([data, data4impute], axis=0)  # size (400, 2)
utree = UnmaskingTrees().fit(X)
imputeddata = utree.impute(n_impute=5)  # size (5, 400, 2)
```
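
If you want to sanity-check the output contract described above, a small helper along these lines may be useful. It is a sketch, not part of utrees: `check_imputations` is a hypothetical name, and the array `fake_imputed` is a stand-in built from random numbers so that the example runs without a fitted model.

```python
import numpy as np


def check_imputations(X, imputed):
    """Check the imputation contract (hypothetical helper, not in utrees).

    X has shape (n_samples, n_dims) with np.nan marking missing entries;
    imputed has shape (n_impute, n_samples, n_dims). Returns True when no
    NaNs remain and every observed entry of X is unchanged in each draw.
    """
    X = np.asarray(X, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    if imputed.ndim != 3 or imputed.shape[1:] != X.shape:
        raise ValueError("expected shape (n_impute, n_samples, n_dims)")
    observed = ~np.isnan(X)
    no_nans_left = not np.isnan(imputed).any()
    observed_kept = all(
        np.array_equal(draw[observed], X[observed]) for draw in imputed
    )
    return no_nans_left and observed_kept


# Stand-in for the model output: fill NaNs with random draws, three times.
rng = np.random.default_rng(0)
X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 0.7]])
fake_imputed = np.stack(
    [np.where(np.isnan(X), rng.normal(size=X.shape), X) for _ in range(3)]
)
print(fake_imputed.shape)                 # (3, 3, 2)
print(check_imputations(X, fake_imputed))  # True
```

In real use you would pass the array returned by `impute` in place of `fake_imputed`.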

You can also provide a totally new dataset to be imputed, so the model performs imputation without retraining:

```
utree = UnmaskingTrees().fit(data)
imputeddata = utree.impute(n_impute=5, X=data4impute)  # size (5, 200, 2)
```

### Hyperparameters

- `depth`: depth of the balanced binary tree used to recursively quantize each feature.
- `duplicate_K`: number of random masking orders per actual sample. The training dataset will be of size `(n_samples * n_dims * duplicate_K, n_dims)`.
- `xgboost_kwargs`: dict of keyword arguments to pass to `XGBClassifier`.
- `strategy`: how to quantize continuous features (`'kdiquantile'`, `'quantile'`, `'uniform'`, or `'kmeans'`).
- `random_state`: controls randomness.
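
To give some intuition for `depth`, here is a toy sketch of recursive quantization with a balanced binary tree, in which a tree of depth `d` yields up to `2**d` bins. This is an assumption-laden illustration rather than the library's code: `balanced_binary_codes` is a hypothetical name, and it splits by rank into equal halves, which may differ from what the actual `strategy` options do.

```python
def balanced_binary_codes(values, depth):
    """Assign each value a bin index in [0, 2**depth) by recursive halving.

    Hypothetical illustration, not utrees code: values are ranked, and each
    level of the tree splits the current group into a lower and an upper
    half, contributing one bit to the bin index.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    codes = [0] * len(values)

    def split(indices, level):
        if level == depth:
            return
        mid = len(indices) // 2
        lower, upper = indices[:mid], indices[mid:]
        for i in upper:
            codes[i] |= 1 << (depth - 1 - level)  # set this level's bit
        split(lower, level + 1)
        split(upper, level + 1)

    split(order, 0)
    return codes


values = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
codes = balanced_binary_codes(values, depth=2)
print(codes)  # [3, 0, 2, 1, 2, 0, 3, 1]
```

Each bit of a bin index corresponds to one left-or-right decision in the tree, so increasing `depth` by one doubles the number of bins.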


## Citing this method

Please consider citing the UnmaskingTrees [arXiv preprint](https://arxiv.org/pdf/2407.05593). The BibTeX entry is:

```
@article{mccarter2024unmasking,
  title={Unmasking Trees for Tabular Data},
  author={McCarter, Calvin},
  journal={arXiv preprint arXiv:2407.05593},
  year={2024}
}
```

Also, please consider citing ForestDiffusion ([code](https://github.com/SamsungSAILMontreal/ForestDiffusion) and [paper](https://arxiv.org/abs/2309.09968)), which this work builds on.

            
