# GANBLR Toolbox
GANBLR Toolbox contains the GANBLR models proposed by `Tulip Lab` for tabular data generation. The models learn from real data and can then sample fully synthetic data.
Currently, this package contains the following GANBLR models:
- GANBLR
- GANBLR++
For a quick start, you can check out this usage example in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1w7A26JRkrXPeeA9q1Kbi_CRjbptkr8Ls?usp=sharing)
# Install
We recommend installing `ganblr` through pip:
```bash
pip install ganblr
```
Alternatively, you can clone the repository and install it from source.
```bash
git clone git@github.com:tulip-lab/ganblr.git
cd ganblr
python setup.py install
```
# Usage Example
In this example we load the [Adult Dataset](https://archive.ics.uci.edu/ml/datasets/Adult), which is a built-in demo dataset. We use `GANBLR` to learn from the real data and then generate some synthetic data.
```python3
from ganblr import get_demo_data
from ganblr.models import GANBLR
# This is a discretized version of Adult, since GANBLR requires discrete data.
df = get_demo_data('adult')
# The last column is the label; all other columns are features.
x, y = df.values[:,:-1], df.values[:,-1]
model = GANBLR()
model.fit(x, y, epochs=10)
# Generate synthetic data
synthetic_data = model.sample(1000)
```
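Because `GANBLR` expects discrete inputs, numerical columns in your own data need to be binned before fitting. The following is a minimal sketch of equal-width binning in plain Python. The helper name `equal_width_bins` and the sample ages are hypothetical, and this is not necessarily how the built-in `adult` demo dataset was discretized.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to an integer bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into one bin.
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical ages, discretized into 4 equal-width bins.
ages = [17, 25, 38, 52, 90]
age_bins = equal_width_bins(ages, 4)
print(age_bins)
```

Equal-width binning is only one option. Quantile-based or supervised discretization may suit skewed columns better.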
The steps to generate synthetic data using `GANBLR++` are similar to those for `GANBLR`, but the model requires an additional parameter, `numerical_columns`, which gives the indices of the numerical columns.
```python3
from ganblr import get_demo_data
from ganblr.models import GANBLRPP
import numpy as np
# Raw (non-discretized) Adult data
df = get_demo_data('adult-raw')
x, y = df.values[:,:-1], df.values[:,-1]
def is_numerical(dtype):
    # NumPy dtype kinds: 'i' signed int, 'u' unsigned int, 'f' float
    return dtype.kind in 'iuf'
column_is_numerical = df.dtypes.apply(is_numerical).values
numerical_columns = np.argwhere(column_is_numerical).ravel()
model = GANBLRPP(numerical_columns)
model.fit(x, y, epochs=10)
# Generate synthetic data
synthetic_data = model.sample(1000)
```
# Documentation
You can check the documentation at [https://ganblr-docs.readthedocs.io/en/latest/](https://ganblr-docs.readthedocs.io/en/latest/).
# Leaderboard
Here we show the results of the TSTR (Train on Synthetic, Test on Real) evaluation on the `Adult` dataset, based on the experiments in our paper.
TRTR (Train on Real, Test on Real) is used as the baseline for comparison. You are welcome to update this leaderboard.
| | LR | MLP | RF | XGBT |
|----------|--------|--------|--------|--------|
| TRTR | 0.8741 | 0.8561 | 0.8379 | 0.8562 |
| GANBLR | 0.74 | 0.842 | 0.81 | 0.851 |
| CTGAN | 0.787 | 0.831 | 0.792 | 0.839 |
| ... | ... | ... | ... | ... |
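The TSTR and TRTR protocols can be sketched as follows. This is an illustrative outline only, not the evaluation code used in the paper: `MajorityClassifier`, `accuracy`, `evaluate`, and the toy data are all hypothetical stand-ins. Both protocols score on the same real test set and differ only in which data the classifier is trained on.

```python
from collections import Counter


class MajorityClassifier:
    """Toy stand-in for a real classifier: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def evaluate(classifier, X_train, y_train, X_test, y_test):
    """Fit on the training split, then score on the held-out real test split."""
    classifier.fit(X_train, y_train)
    return accuracy(y_test, classifier.predict(X_test))


# Hypothetical toy data: rows are feature lists, labels are class names.
X_real_train = [[0, 1], [1, 1], [1, 0], [0, 0]]
y_real_train = ["<=50K", "<=50K", "<=50K", ">50K"]
X_synthetic = [[0, 1], [1, 0], [1, 1]]
y_synthetic = [">50K", ">50K", "<=50K"]
X_real_test = [[1, 1], [0, 0], [0, 1], [1, 0]]
y_real_test = ["<=50K", ">50K", "<=50K", "<=50K"]

# TRTR baseline: train on real data, test on real data.
trtr = evaluate(MajorityClassifier(), X_real_train, y_real_train, X_real_test, y_real_test)
# TSTR: train on synthetic data, test on the same real test set.
tstr = evaluate(MajorityClassifier(), X_synthetic, y_synthetic, X_real_test, y_real_test)
print(f"TRTR accuracy: {trtr:.2f}, TSTR accuracy: {tstr:.2f}")
```

In practice, any estimator that exposes `fit` and `predict` could replace the toy classifier, and the synthetic training set would come from `model.sample(...)`. The closer the TSTR score is to the TRTR baseline, the better the synthetic data preserves the signal a classifier needs.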
# Citation
If you use GANBLR, please cite the following work:
*Y. Zhang, N. A. Zaidi, J. Zhou and G. Li*, "GANBLR: A Tabular Data Generation Model," 2021 IEEE International Conference on Data Mining (ICDM), 2021, pp. 181-190, doi: 10.1109/ICDM51629.2021.00103.
```LaTeX
@inproceedings{ganblr,
author={Zhang, Yishuo and Zaidi, Nayyar A. and Zhou, Jiahui and Li, Gang},
booktitle={2021 IEEE International Conference on Data Mining (ICDM)},
title={GANBLR: A Tabular Data Generation Model},
year={2021},
pages={181-190},
doi={10.1109/ICDM51629.2021.00103}
}
@inbook{ganblrpp,
author = {Yishuo Zhang and Nayyar Zaidi and Jiahui Zhou and Gang Li},
title = {{GANBLR++}: Incorporating Capacity to Generate Numeric Attributes and Leveraging Unrestricted Bayesian Networks},
booktitle = {Proceedings of the 2022 SIAM International Conference on Data Mining (SDM)},
pages = {298-306},
doi = {10.1137/1.9781611977172.34},
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/tulip-lab/ganblr",
"name": "ganblr",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.5.0",
"maintainer_email": null,
"keywords": "ganblr, tulip",
"author": "kae zhou",
"author_email": "kaezhou@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/e6/62/4b9cb9130d2346651a777465eb6a98321a3743a2583f9d791af2370e3c10/ganblr-0.1.3.tar.gz",
"platform": "any",
"bugtrack_url": null,
"license": "MIT Licence",
"summary": "Ganblr Toolbox",
"version": "0.1.3",
"project_urls": {
"Homepage": "https://github.com/tulip-lab/ganblr"
},
"split_keywords": [
"ganblr",
" tulip"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0f63a056fa1df3e8a65c3f2dcfdf5378ec03242ef7c9661cac3b089e43b221e2",
"md5": "2fa951c1c15be572af2cdc5a00de9adf",
"sha256": "d5c1027406801f0a82611adc4bed3c4bbf0c726570b55f08c723a84e6766e25b"
},
"downloads": -1,
"filename": "ganblr-0.1.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2fa951c1c15be572af2cdc5a00de9adf",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.5.0",
"size": 45674,
"upload_time": "2024-05-20T08:09:31",
"upload_time_iso_8601": "2024-05-20T08:09:31.595142Z",
"url": "https://files.pythonhosted.org/packages/0f/63/a056fa1df3e8a65c3f2dcfdf5378ec03242ef7c9661cac3b089e43b221e2/ganblr-0.1.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e6624b9cb9130d2346651a777465eb6a98321a3743a2583f9d791af2370e3c10",
"md5": "45a2cbdc7407b07f6d7370be6d3d7476",
"sha256": "bfda8ae84f67de31a2e6798b4706cfa09f5184be99f5a98cf5d1596d3191d696"
},
"downloads": -1,
"filename": "ganblr-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "45a2cbdc7407b07f6d7370be6d3d7476",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5.0",
"size": 41732,
"upload_time": "2024-05-20T08:09:33",
"upload_time_iso_8601": "2024-05-20T08:09:33.663108Z",
"url": "https://files.pythonhosted.org/packages/e6/62/4b9cb9130d2346651a777465eb6a98321a3743a2583f9d791af2370e3c10/ganblr-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-20 08:09:33",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "tulip-lab",
"github_project": "ganblr",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "ganblr"
}