ganblr


| Field           | Value                               |
|-----------------|-------------------------------------|
| Name            | ganblr                              |
| Version         | 0.1.1                               |
| Home page       | https://github.com/tulip-lab/ganblr |
| Summary         | Ganblr Toolbox                      |
| Upload time     | 2022-12-14 17:29:27                 |
| Author          | kae zhou                            |
| Requires Python | >=3.5.0                             |
| License         | MIT Licence                         |
| Keywords        | ganblr, tulip                       |
| Requirements    | No requirements were recorded.      |

# GANBLR Toolbox

The GANBLR Toolbox contains the GANBLR models proposed by `Tulip Lab` for tabular data generation. The models learn from real data and can then sample fully artificial data.

Currently, this package contains the following GANBLR models:

- GANBLR
- GANBLR++

For a quick start, see the usage example in Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1w7A26JRkrXPeeA9q1Kbi_CRjbptkr8Ls?usp=sharing)

# Install

We recommend installing ganblr through pip:

```bash
pip install ganblr
```

Alternatively, you can clone the repository and install it from source:

```bash
git clone git@github.com:tulip-lab/ganblr.git
cd ganblr
python setup.py install
```
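
After either installation route, it can help to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_version` is illustrative and not part of ganblr.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("ganblr")
if found is None:
    print("ganblr is not installed in this environment")
else:
    print("ganblr", found, "is installed")
```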

# Usage Example

In this example, we load the [Adult Dataset](https://archive.ics.uci.edu/ml/datasets/Adult), which is included as a built-in demo dataset. We use `GANBLR` to learn from the real data and then generate synthetic data.

```python3
from ganblr import get_demo_data
from ganblr.models import GANBLR

# Load the discrete version of Adult, since GANBLR requires discrete data.
df = get_demo_data('adult')
# Features are all columns except the last; the last column is the label.
x, y = df.values[:, :-1], df.values[:, -1]

model = GANBLR()
model.fit(x, y, epochs=10)

# Generate synthetic data; 1000 is the requested sample size.
synthetic_data = model.sample(1000)
```
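
The comment in the example notes that GANBLR requires discrete data. If you want to try it on your own table with numeric columns, those columns would need to be discretized first. The sketch below shows one simple option, equal-width binning, in plain Python. The helper `equal_width_bins` and the bin count are assumptions for illustration, and this is not necessarily how the built-in `adult` demo data was prepared.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to an integer bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has a single value; put everything in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy values standing in for a numeric column such as age.
ages = [17, 25, 38, 52, 90]
print(equal_width_bins(ages, 4))  # [0, 0, 1, 1, 3]
```

Equal-width bins are easy to reason about but can be sparse for skewed columns; other discretization schemes may suit such data better.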

The steps for generating synthetic data with `GANBLR++` are similar to those for `GANBLR`, but the model requires an additional parameter, `numerical_columns`, which gives the indices of the numerical columns.

```python3
from ganblr import get_demo_data
from ganblr.models import GANBLRPP
import numpy as np

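# Load the raw Adult data, which still contains numerical columns.
df = get_demo_data('adult-raw')
# Features are all columns except the last; the last column is the label.
x, y = df.values[:, :-1], df.values[:, -1]

def is_numerical(dtype):
    # NumPy dtype kinds: 'i' signed int, 'u' unsigned int, 'f' float.
    return dtype.kind in 'iuf'

# Boolean mask over the columns, then the positions where it is True.
column_is_numerical = df.dtypes.apply(is_numerical).values
numerical_columns = np.argwhere(column_is_numerical).ravel()

model = GANBLRPP(numerical_columns)
model.fit(x, y, epochs=10)

# Generate synthetic data; 1000 is the requested sample size.
synthetic_data = model.sample(1000)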
```
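
The example above derives `numerical_columns` from the pandas dtypes of the demo DataFrame. When the data is held as plain Python rows rather than a DataFrame, the same idea can be sketched without pandas. The helper `numerical_column_indices` below is hypothetical; it treats a column as numerical only if every value is a real number, with booleans excluded. The rows are toy data loosely modeled on Adult.

```python
from numbers import Real


def numerical_column_indices(rows):
    """Return the indices of columns whose values are all non-boolean real numbers."""
    if not rows:
        return []
    indices = []
    for col in range(len(rows[0])):
        column = [row[col] for row in rows]
        if all(isinstance(v, Real) and not isinstance(v, bool) for v in column):
            indices.append(col)
    return indices


# Toy rows for illustration only; the last column plays the role of the label.
rows = [
    [39, "State-gov", 77516, "Bachelors", "<=50K"],
    [50, "Self-emp-not-inc", 83311, "Bachelors", "<=50K"],
]
print(numerical_column_indices(rows))  # [0, 2]
```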

# Documentation

The documentation is available at [https://ganblr-docs.readthedocs.io/en/latest/](https://ganblr-docs.readthedocs.io/en/latest/).

# Leaderboard

Here we show the results of the TSTR (Training on Synthetic data, Testing on Real data) evaluation on the `Adult` dataset, based on the experiments in our paper.

TRTR (Training on Real data, Testing on Real data) is used as the baseline for comparison. You are welcome to update this leaderboard.

|          | LR     | MLP    | RF     | XGBT   |
|----------|--------|--------|--------|--------|
| TRTR     | 0.8741 | 0.8561 | 0.8379 | 0.8562 |
| GANBLR   | 0.74   | 0.842  | 0.81   | 0.851  |
| CTGAN    | 0.787  | 0.831  | 0.792  | 0.839  |
| ...      | ...    | ...    | ...    | ...    |
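
To make the two protocols concrete: TSTR fits a classifier on synthetic rows and scores it on held-out real rows, while TRTR fits on real training rows and scores on the same held-out real rows. The sketch below illustrates this with a deliberately trivial majority-class classifier in plain Python so that it runs without dependencies. The helper names and toy labels are assumptions, and the leaderboard itself reports LR, MLP, RF and XGBT classifiers rather than this stand-in.

```python
from collections import Counter


def fit_majority(train_labels):
    """'Train' a trivial classifier that always predicts the most common label."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted_label, test_labels):
    """Fraction of test labels that match the constant prediction."""
    hits = sum(1 for y in test_labels if y == predicted_label)
    return hits / len(test_labels)


# Toy labels only; these are not results from the paper.
real_train = ["<=50K", "<=50K", "<=50K", ">50K"]
synthetic_train = [">50K", ">50K", "<=50K"]
real_test = ["<=50K", "<=50K", ">50K", "<=50K"]

trtr = accuracy(fit_majority(real_train), real_test)       # train on real, test on real
tstr = accuracy(fit_majority(synthetic_train), real_test)  # train on synthetic, test on real
print(trtr, tstr)  # 0.75 0.25
```

In this reading, a TSTR score close to the TRTR baseline suggests that the synthetic data preserves the signal a classifier needs.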

# Citation

If you use GANBLR or GANBLR++, please cite the corresponding work:

*Y. Zhang, N. A. Zaidi, J. Zhou and G. Li*, "GANBLR: A Tabular Data Generation Model," 2021 IEEE International Conference on Data Mining (ICDM), 2021, pp. 181-190, doi: 10.1109/ICDM51629.2021.00103.

```LaTeX
@inproceedings{ganblr,
    author={Zhang, Yishuo and Zaidi, Nayyar A. and Zhou, Jiahui and Li, Gang},
    booktitle={2021 IEEE International Conference on Data Mining (ICDM)},
    title={GANBLR: A Tabular Data Generation Model},
    year={2021},
    pages={181-190},
    doi={10.1109/ICDM51629.2021.00103}
}
@inbook{ganblrpp,
    author = {Yishuo Zhang and Nayyar Zaidi and Jiahui Zhou and Gang Li},
    title = {GANBLR++: Incorporating Capacity to Generate Numeric Attributes and Leveraging Unrestricted Bayesian Networks},
    booktitle = {Proceedings of the 2022 SIAM International Conference on Data Mining (SDM)},
    pages = {298-306},
    doi = {10.1137/1.9781611977172.34},
}
```


            
