- Name: gentab
- Version: 0.1.2
- Summary: A synthetic tabular data generation library.
- Upload time: 2024-04-25 12:00:58
- Requires Python: >=3.9
- Keywords: data, deep learning, generation, machine learning, synthetic, tabular

<p align="center">
    <a>
	    <img src='https://img.shields.io/badge/python-3.10%2B-blueviolet' alt='Python' />
	</a>
    <a>
	    <img src='https://img.shields.io/badge/code%20style-black-black' />
	</a>
	<a href="">
  		<img src="https://colab.research.google.com/assets/colab-badge.svg"/>
	</a>
    <a href='https://opensource.org/license/gpl-3-0'>
	    <img src='https://img.shields.io/badge/license-GPLv3-blue' />
	</a>
</p>

# GenTab

Synthetic Tabular Data Generation Library

## Overview

GenTab is a Python library for generating synthetic tabular data. It provides statistical, Machine Learning (ML), and Deep Learning (DL) methods that learn the patterns in a real dataset and reproduce them in synthetic data. Typical applications include pre-processing tabular datasets, balancing classes, and resampling.

## Features

:nut_and_bolt: Pre-process your data.

:clock130: State-of-the-art models.

:recycle: Easy to use and customize. 

## Install

The `gentab` library can be installed with pip. We recommend installing it in a virtual environment to avoid conflicts with other packages on your machine.

``` bash
pip install gentab
```
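
The package metadata lists `requires_python >=3.9`, while the badge at the top of this README advertises Python 3.10+. A quick interpreter check before installing might look like the sketch below. The helper name `meets_minimum` is hypothetical and is not part of gentab.

```python
import sys


def meets_minimum(version_info, minimum=(3, 9)):
    """Return True if the (major, minor) part of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    ok = meets_minimum(sys.version_info)
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} supported: {ok}")
```

Passing `minimum=(3, 10)` checks against the stricter version shown on the badge instead.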

## Available Generators

Below is the list of the generators currently available in the library.

### Linear

| Model | Example | Paper |
|:---:|:---:|:---:|
| Random Over-Sampling | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://link.springer.com/article/10.1007/s10618-012-0295-5) |
| SMOTE | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/1106.1813) |
| ADASYN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://ieeexplore.ieee.org/document/4633969) |
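
The methods in this group rebalance a dataset by adding minority-class rows. As a rough illustration of the idea, and not gentab's implementation, the sketch below shows random over-sampling, which duplicates existing minority rows, and a SMOTE-style step, which interpolates between a minority row and one of its nearest minority neighbours. ADASYN builds on the same interpolation but generates more samples around minority points that are harder to learn; that weighting is omitted here. All function names are hypothetical, the data is made up, and only the standard library is used.

```python
import math
import random


def random_oversample(rows, n_new, rng):
    """Random over-sampling: draw existing minority rows with replacement."""
    return [list(rng.choice(rows)) for _ in range(n_new)]


def smote_like(rows, n_new, k, rng):
    """SMOTE-style sampling: interpolate between a row and a near neighbour.

    Needs at least two minority rows.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        # k nearest minority neighbours of `base` (excluding itself), Euclidean.
        neighbours = sorted(
            (r for r in rows if r is not base),
            key=lambda r: math.dist(base, r),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Made-up minority-class rows with two numeric features.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
rng = random.Random(0)
duplicates = random_oversample(minority, 3, rng)
interpolated = smote_like(minority, 3, k=2, rng=rng)
```

Every interpolated row lies on a segment between two real minority rows, so it stays inside the region the minority class already occupies.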

### PDF

| Model | Example | Paper |
|:---:|:---:|:---:|
| Gaussian Copula | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://ieeexplore.ieee.org/abstract/document/7796926) |
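
A Gaussian copula models each column's marginal distribution separately and captures the dependence between columns through correlated normal variables. The two-column sketch below uses empirical marginals and only the standard library. It is an illustration under those simplifying assumptions, not gentab's implementation; all function names and the sample data are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(column):
    """Map values to standard-normal scores through their ranks."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def empirical_quantile(sorted_column, u):
    """Inverse empirical CDF: the observed value at probability level u."""
    index = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[index]


def sample_gaussian_copula(col_x, col_y, n_samples, rng):
    """Sample (x, y) pairs that reuse the marginals and the rank dependence."""
    rho = pearson(normal_scores(col_x), normal_scores(col_y))
    sorted_x, sorted_y = sorted(col_x), sorted(col_y)
    # Clamp guards against tiny negative values from floating-point error.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        # Second normal, correlated with z1 via the 2x2 Cholesky factor.
        z2 = rho * z1 + residual * rng.gauss(0.0, 1.0)
        x = empirical_quantile(sorted_x, STD_NORMAL.cdf(z1))
        y = empirical_quantile(sorted_y, STD_NORMAL.cdf(z2))
        samples.append((x, y))
    return samples


# Made-up columns for illustration.
ages = [23, 35, 41, 52, 29, 60, 47, 33]
incomes = [21, 48, 40, 65, 30, 80, 37, 55]
synthetic = sample_gaussian_copula(ages, incomes, 5, random.Random(0))
```

Because the marginals are empirical, every sampled value is one that appears in the original column; a fuller implementation would typically fit a parametric distribution per column.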


### AE

| Model | Example | Paper |
|:---:|:---:|:---:|
| TVAE | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/1907.00503) |

### GAN

| Model | Example | Paper |
|:---:|:---:|:---:|
| CTGAN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/1907.00503) |
| CTAB-GAN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://proceedings.mlr.press/v157/zhao21a.html) |
| CTAB-GAN+ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2204.00401) |

### Diffusion

| Model | Example | Paper |
|:---:|:---:|:---:|
| ForestDiffusion | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2309.09968) |

### LLM

| Model | Example | Paper |
|:---:|:---:|:---:|
| GReaT | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2210.06280) |
| Tabula | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2310.12746) |

### Hybrid

| Model | Example | Papers |
|:---:|:---:|:---:|
| Copula GAN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://ieeexplore.ieee.org/abstract/document/7796926) [link](https://arxiv.org/abs/1907.00503) |
| AutoDiffusion | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2310.15479) |

## Examples

### Generation

``` python
from gentab.generators import AutoDiffusion
from gentab.evaluators import MLP
from gentab.data import Config, Dataset
from gentab.utils import console

# Load the dataset described by the configuration file.
config = Config("configs/playnet.json")

dataset = Dataset(config)
# Reduce the size of the listed classes using the given per-class ratios.
dataset.reduce_size(
    {
        "left_attack": 0.97,
        "right_attack": 0.97,
        "right_transition": 0.9,
        "left_transition": 0.9,
        "time_out": 0.8,
        "left_penal": 0.5,
        "right_penal": 0.5,
    }
)
# Merge related classes under a single label.
dataset.merge_classes(
    {
        "attack": ["left_attack", "right_attack"],
        "transition": ["left_transition", "right_transition"],
        "penalty": ["left_penal", "right_penal"],
    }
)
dataset.reduce_mem()

# Print class and row counts before and after generation.
console.print(dataset.class_counts(), dataset.row_count())
generator = AutoDiffusion(dataset)
generator.generate()
console.print(dataset.generated_class_counts(), dataset.generated_row_count())

# Evaluate the generated data with the MLP evaluator.
evaluator = MLP(generator)
evaluator.evaluate()

# Store the generated dataset on disk.
dataset.save_to_disk(generator)
```
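
The `merge_classes` call above maps several original labels onto one merged label. How gentab implements this is not shown here; the plain-Python sketch below only illustrates the relabelling idea on a list of labels. `merge_labels` is a hypothetical helper and the label list is made up for illustration.

```python
from collections import Counter


def merge_labels(labels, mapping):
    """Relabel each item whose label appears in a merge group."""
    # Invert {"new": ["old1", "old2"]} into {"old1": "new", "old2": "new"}.
    lookup = {old: new for new, olds in mapping.items() for old in olds}
    # Labels that are not part of any group are kept unchanged.
    return [lookup.get(label, label) for label in labels]


labels = ["left_attack", "right_attack", "time_out", "left_penal", "right_penal"]
merged = merge_labels(
    labels,
    {
        "attack": ["left_attack", "right_attack"],
        "penalty": ["left_penal", "right_penal"],
    },
)
counts = Counter(merged)
print(counts)
```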

### Tuning

``` python
from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

config = Config("configs/adult.json")

dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()

generator = AutoDiffusion(dataset)

evaluator = LightGBM(generator)

trials = 10
timeout = 60 * 60 * 8  # 8 hours, assuming the timeout is given in seconds
tuner = AutoDiffusionTuner(evaluator, trials, timeout=timeout)
tuner.tune()
tuner.save_to_disk()
```
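
The tuner above is given two budgets: a number of trials (10) and a timeout (`60 * 60 * 8`, which is 8 hours if the unit is seconds). The example does not show how `AutoDiffusionTuner` searches, so the sketch below is only a generic random-search loop that stops at whichever budget runs out first. The objective, the search space, and `random_search` are all hypothetical.

```python
import random
import time


def random_search(objective, space, trials, timeout, rng):
    """Try up to `trials` random configs, stopping early after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        if time.monotonic() >= deadline:
            break
        # Draw one value per parameter, uniformly within its (low, high) range.
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Stand-in for an evaluator score; highest when lr is 0.01."""
    return -abs(params["lr"] - 0.01)


best, score = random_search(
    toy_objective,
    {"lr": (0.0001, 0.1)},
    trials=10,
    timeout=60 * 60 * 8,
    rng=random.Random(0),
)
```

With a cheap objective the trial budget is exhausted long before the timeout; with an expensive generator the timeout is usually the limit that matters.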

### Loading Stored Synthetic Datasets

``` python
from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

config = Config("configs/adult.json")

dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()

# Load previously saved dataset...
generator = AutoDiffusion(dataset)
generator.load_from_disk()

# Do work with previously generated but not tuned dataset...
evaluator = LightGBM(generator)
evaluator.evaluate()
evaluator.evaluate_baseline()

# Load previously tuned and saved dataset...
tuner = AutoDiffusionTuner(evaluator, 0)
tuner.load_from_disk()

# Do work with previously tuned dataset...
evaluator.evaluate()
evaluator.evaluate_baseline()
```

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "gentab",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "data, deep learning, generation, machine learning, synthetic, tabular",
    "author": null,
    "author_email": "\"Omar A. Mures\" <omar.alvarez@udc.es>",
    "download_url": "https://files.pythonhosted.org/packages/6d/56/c39995680f2aef83dd4b69cafa9e0c2739d9afbcaa74cb8666f892572b1d/gentab-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A synthetic tabular data generation library.",
    "version": "0.1.2",
    "project_urls": {
        "Homepage": "https://github.com/omaralvarez/gentab",
        "Issues": "https://github.com/omaralvarez/gentab/issues"
    },
    "split_keywords": [
        "data",
        " deep learning",
        " generation",
        " machine learning",
        " synthetic",
        " tabular"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "946d43666d9314ea9950a73fe7d0f83890921c15f6d59ce5b3c77c120d7e404e",
                "md5": "ff567a6ec347638894b0ed9999876156",
                "sha256": "5e0da06342304f20e469b685f4ce1a3d870dbeae99c6e26d4a4348aaeb044999"
            },
            "downloads": -1,
            "filename": "gentab-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ff567a6ec347638894b0ed9999876156",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 132951,
            "upload_time": "2024-04-25T12:00:56",
            "upload_time_iso_8601": "2024-04-25T12:00:56.708061Z",
            "url": "https://files.pythonhosted.org/packages/94/6d/43666d9314ea9950a73fe7d0f83890921c15f6d59ce5b3c77c120d7e404e/gentab-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6d56c39995680f2aef83dd4b69cafa9e0c2739d9afbcaa74cb8666f892572b1d",
                "md5": "a2595f4506057819ab423756a6db3e1c",
                "sha256": "7036bc206158455bf8fec342b6f5fc4a0fdd032407bed736a8f65189f6cb127b"
            },
            "downloads": -1,
            "filename": "gentab-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "a2595f4506057819ab423756a6db3e1c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 105266,
            "upload_time": "2024-04-25T12:00:58",
            "upload_time_iso_8601": "2024-04-25T12:00:58.871579Z",
            "url": "https://files.pythonhosted.org/packages/6d/56/c39995680f2aef83dd4b69cafa9e0c2739d9afbcaa74cb8666f892572b1d/gentab-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-25 12:00:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "omaralvarez",
    "github_project": "gentab",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "gentab"
}
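
The release entries above list a `sha256` digest for each file. A downloaded archive can be checked against that digest with the standard library, as sketched below. `sha256_of` is a hypothetical helper, and the file path in the final comment is a placeholder for wherever the file was saved.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest listed for gentab-0.1.2.tar.gz in the metadata above.
EXPECTED = "7036bc206158455bf8fec342b6f5fc4a0fdd032407bed736a8f65189f6cb127b"

# Usage (placeholder path):
# assert sha256_of("gentab-0.1.2.tar.gz") == EXPECTED
```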
        