<p align="center">
<a>
<img src='https://img.shields.io/badge/python-3.10%2B-blueviolet' alt='Python' />
</a>
<a>
<img src='https://img.shields.io/badge/code%20style-black-black' />
</a>
<a href="">
<img src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a>
<a href='https://opensource.org/license/gpl-3-0'>
<img src='https://img.shields.io/badge/license-GPLv3-blue' />
</a>
</p>
# GenTab
Synthetic Tabular Data Generation Library
## Overview
This Python library specializes in generating synthetic tabular data. It offers a range of statistical, Machine Learning (ML), and Deep Learning (DL) methods that capture patterns in real datasets and replicate them in synthetic data. Applications include pre-processing tabular datasets, data balancing, and resampling.
## Features
:nut_and_bolt: Pre-process your data.
:clock130: State-of-the-art models.
:recycle: Easy to use and customize.
## Install
The `gentab` library can be installed with pip. We recommend installing it in a virtual environment to avoid conflicts with other software on your machine.
``` bash
pip install gentab
```
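After installing, it can be useful to confirm that the package is visible from the active environment. The following is a minimal sketch that relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and is not part of `gentab`.

``` python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("gentab")
    if found:
        print(f"gentab version: {found}")
    else:
        print("gentab is not installed in this environment")
```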
## Available Generators
Below is the list of the generators currently available in the library.
### Linear
| Model | Example | Paper |
|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|
| Random Over-Sampling | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://link.springer.com/article/10.1007/s10618-012-0295-5)
| SMOTE | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/1106.1813)
| ADASYN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://ieeexplore.ieee.org/document/4633969)
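To give a feel for the simplest method in this table, here is a plain-Python sketch of random over-sampling: rows of the smaller classes are duplicated at random until every class is as large as the biggest one. This is an illustration of the general technique, not `gentab`'s implementation; the function name `random_oversample` and the toy data are assumptions made for the example.

``` python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, members in by_class.items():
        # Sample with replacement from the existing rows of this class.
        extra = rng.choices(members, k=target - len(members))
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy data: three rows of class "a" and one row of class "b".
rows = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [5.0, 4.0]]
labels = ["a", "a", "a", "b"]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

SMOTE and ADASYN differ in that they synthesize new points between neighboring minority rows instead of copying existing rows.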
### PDF
| Model | Example | Paper |
|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|
| Gaussian Copula | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://ieeexplore.ieee.org/abstract/document/7796926)
### AE
| Model | Example | Paper |
|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|
| TVAE | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/1907.00503)
### GAN
| Model | Example | Paper |
|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|
| CTGAN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/1907.00503)
| CTAB-GAN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://proceedings.mlr.press/v157/zhao21a.html)
| CTAB-GAN+ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2204.00401)
### Diffusion
| Model | Example | Paper |
|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|
| ForestDiffusion | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2309.09968)
### LLM
| Model | Example | Paper |
|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|
| GReaT | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2210.06280)
| Tabula | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2310.12746)
### Hybrid
| Model | Example | Papers |
|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|
| Copula GAN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://ieeexplore.ieee.org/abstract/document/7796926) [link](https://arxiv.org/abs/1907.00503)
| AutoDiffusion | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2310.15479)
## Examples
### Generation
``` python
from gentab.generators import AutoDiffusion
from gentab.evaluators import MLP
from gentab.data import Config, Dataset
from gentab.utils import console

# Load the dataset described by the configuration file.
config = Config("configs/playnet.json")
dataset = Dataset(config)

# Reduce the number of rows in the listed classes.
dataset.reduce_size(
    {
        "left_attack": 0.97,
        "right_attack": 0.97,
        "right_transition": 0.9,
        "left_transition": 0.9,
        "time_out": 0.8,
        "left_penal": 0.5,
        "right_penal": 0.5,
    }
)

# Merge the left/right variants of each class into a single class.
dataset.merge_classes(
    {
        "attack": ["left_attack", "right_attack"],
        "transition": ["left_transition", "right_transition"],
        "penalty": ["left_penal", "right_penal"],
    }
)
dataset.reduce_mem()

# Class counts and row count before generation.
console.print(dataset.class_counts(), dataset.row_count())

# Generate synthetic data and print the resulting counts.
generator = AutoDiffusion(dataset)
generator.generate()
console.print(dataset.generated_class_counts(), dataset.generated_row_count())

# Evaluate the generated data with the MLP evaluator.
evaluator = MLP(generator)
evaluator.evaluate()

# Save the generated dataset to disk.
dataset.save_to_disk(generator)
```
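The dictionary passed to `merge_classes` maps each new class name to the list of original classes it absorbs. One plausible reading of that mapping is sketched below in plain Python. It is an assumption about the intent, not the library's code, and the helper `merge_labels` is hypothetical.

``` python
from collections import Counter


def merge_labels(labels, mapping):
    """Relabel `labels` using a {new_class: [old_class, ...]} mapping."""
    # Invert the mapping so that each old class points at its new class.
    old_to_new = {old: new for new, olds in mapping.items() for old in olds}
    # Classes that are not mentioned in the mapping keep their original label.
    return [old_to_new.get(label, label) for label in labels]


labels = ["left_attack", "right_attack", "time_out", "left_penal", "right_penal"]
mapping = {
    "attack": ["left_attack", "right_attack"],
    "transition": ["left_transition", "right_transition"],
    "penalty": ["left_penal", "right_penal"],
}
merged = merge_labels(labels, mapping)
print(Counter(merged))
```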
### Tuning
``` python
from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset
config = Config("configs/adult.json")
dataset = Dataset(config)
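# Fold the labels that carry a trailing period into their canonical class.
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()

generator = AutoDiffusion(dataset)
evaluator = LightGBM(generator)

trials = 10
# 8 hours, assuming the timeout is expressed in seconds.
# Named `timeout` to avoid shadowing the standard `time` module.
timeout = 60 * 60 * 8
tuner = AutoDiffusionTuner(evaluator, trials, timeout=timeout)
tuner.tune()
tuner.save_to_disk()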
```
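The tuner above receives two budgets: a number of trials and a timeout. The sketch below illustrates how such a pair of limits typically interacts, assuming the search stops at whichever limit is reached first and that the timeout is in seconds. It uses a plain random search over a made-up `lr` parameter; `run_trials`, the search space, and the toy objective are all hypothetical, and `gentab`'s tuner may work differently.

``` python
import math
import random
import time


def run_trials(objective, n_trials, timeout_s, seed=0):
    """Random search that stops after `n_trials` or `timeout_s` seconds, whichever comes first."""
    rng = random.Random(seed)
    deadline = time.monotonic() + timeout_s
    best_score, best_params, done = float("-inf"), None, 0
    for _ in range(n_trials):
        # Stop early if the time budget is exhausted.
        if time.monotonic() >= deadline:
            break
        params = {"lr": 10 ** rng.uniform(-4, -1)}  # hypothetical search space
        score = objective(params)
        done += 1
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score, done


def toy_objective(params):
    """Toy score that peaks when lr is 1e-3."""
    return -abs(math.log10(params["lr"]) + 3)


best_params, best_score, done = run_trials(toy_objective, n_trials=10, timeout_s=60 * 60 * 8)
print(best_params, best_score, done)
```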
### Loading Stored Synthetic Datasets
``` python
from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset
config = Config("configs/adult.json")
dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()
# Load previously saved dataset...
generator = AutoDiffusion(dataset)
generator.load_from_disk()
# Do work with previously generated but not tuned dataset...
evaluator = LightGBM(generator)
evaluator.evaluate()
evaluator.evaluate_baseline()
# Load previously tuned and saved dataset...
tuner = AutoDiffusionTuner(evaluator, 0)
tuner.load_from_disk()
# Do work with previously tuned dataset...
evaluator.evaluate()
evaluator.evaluate_baseline()
```
Raw data
{
"_id": null,
"home_page": null,
"name": "gentab",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "data, deep learning, generation, machine learning, synthetic, tabular",
"author": null,
"author_email": "\"Omar A. Mures\" <omar.alvarez@udc.es>",
"download_url": "https://files.pythonhosted.org/packages/6d/56/c39995680f2aef83dd4b69cafa9e0c2739d9afbcaa74cb8666f892572b1d/gentab-0.1.2.tar.gz",
"platform": null,
"description": "<p align=\"center\">\n <a>\n\t <img src='https://img.shields.io/badge/python-3.10%2B-blueviolet' alt='Python' />\n\t</a>\n <a>\n\t <img src='https://img.shields.io/badge/code%20style-black-black' />\n\t</a>\n\t<a href=\"\">\n \t\t<img src=\"https://colab.research.google.com/assets/colab-badge.svg\"/>\n\t</a>\n <a href='https://opensource.org/license/gpl-3-0'>\n\t <img src='https://img.shields.io/badge/license-GPLv3-blue' />\n\t</a>\n</p>\n\n# GenTab\n\nSynthetic Tabular Data Generation Library\n\n## Overview\n\nThis Python library specializes in the generation of synthetic tabular data. It has a diverse range of statistical, Machine Learning (ML) and Deep Learning (DL) methods to accurately capture patterns in real datasets and replicate them in a synthetic context. It has multiple applications including pre-processing of tabular datasets, data balancing, resampling...\n\n## Features\n\n:nut_and_bolt: Pre-process your data.\n\n:clock130: State-of-the-art models.\n\n:recycle: Easy to use and customize. \n\n## Install\n\nThe `gentab` library is available using pip. 
We recommend using a virtual environment to avoid conflicts with other software on your machine.\n\n``` bash\npip install gentab\n```\n\n## Available Generators\n\nBelow is the list of the generators currently available in the library.\n\n### Linear\n\n| Model | Example | Paper |\n|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|\n| Random Over-Sampling | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://link.springer.com/article/10.1007/s10618-012-0295-5)\n| SMOTE | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/1106.1813) | |\n| ADASYN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://ieeexplore.ieee.org/document/4633969)\n\n### PDF\n| Model | Example | Paper |\n|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|\n| Gaussian Copula | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://ieeexplore.ieee.org/abstract/document/7796926)\n\n\n### AE\n\n| Model | Example | Paper |\n|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|\n| TVAE | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/1907.00503)\n\n### GAN\n\n| Model | Example | Paper 
|\n|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|\n| CTGAN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/1907.00503)\n| CTAB-GAN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://proceedings.mlr.press/v157/zhao21a.html)\n| CTAB-GAN+ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2204.00401)\n\n### Diffusion\n\n| Model | Example | Paper |\n|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|\n| ForestDiffusion | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2309.09968)\n\n### LLM\n\n| Model | Example | Paper |\n|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|\n| GReaT | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2210.06280)\n| Tabula | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2310.12746)\n\n### Hybrid\n\n| Model | Example | Papers 
|\n|:--------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|\n| Copula GAN | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://ieeexplore.ieee.org/abstract/document/7796926) [link](https://arxiv.org/abs/1907.00503)\n| AutoDiffusion | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]() | [link](https://arxiv.org/abs/2310.15479)\n\n## Examples\n\n### Generation\n\n``` python\nfrom gentab.generators import AutoDiffusion\nfrom gentab.evaluators import MLP\nfrom gentab.data import Config, Dataset\nfrom gentab.utils import console\n\nconfig = Config(\"configs/playnet.json\")\n\ndataset = Dataset(config)\ndataset.reduce_size(\n {\n \"left_attack\": 0.97,\n \"right_attack\": 0.97,\n \"right_transition\": 0.9,\n \"left_transition\": 0.9,\n \"time_out\": 0.8,\n \"left_penal\": 0.5,\n \"right_penal\": 0.5,\n }\n)\ndataset.merge_classes(\n {\n \"attack\": [\"left_attack\", \"right_attack\"],\n \"transition\": [\"left_transition\", \"right_transition\"],\n \"penalty\": [\"left_penal\", \"right_penal\"],\n }\n)\ndataset.reduce_mem()\n\nconsole.print(dataset.class_counts(), dataset.row_count())\ngenerator = AutoDiffusion(dataset)\ngenerator.generate()\nconsole.print(dataset.generated_class_counts(), dataset.generated_row_count())\n\nevaluator = MLP(generator)\nevaluator.evaluate()\n\ndataset.save_to_disk(generator)\n```\n\n### Tuning\n\n``` python\nfrom gentab.generators import AutoDiffusion\nfrom gentab.evaluators import LightGBM\nfrom gentab.tuners import AutoDiffusionTuner\nfrom gentab.data import Config, Dataset\n\nconfig = Config(\"configs/adult.json\")\n\ndataset = Dataset(config)\ndataset.merge_classes({\n \"<=50K\": [\"<=50K.\"], \">50K\": 
[\">50K.\"]\n})\ndataset.reduce_mem()\n\ngenerator = AutoDiffusion(dataset)\n\nevaluator = LightGBM(generator)\n\ntrials = 10\ntime = 60 * 60 * 8\ntuner = AutoDiffusionTuner(evaluator, trials, timeout=time)\ntuner.tune()\ntuner.save_to_disk()\n```\n\n### Loading Stored Synthetic Datasets\n\n``` python\nfrom gentab.generators import AutoDiffusion\nfrom gentab.evaluators import LightGBM\nfrom gentab.tuners import AutoDiffusionTuner\nfrom gentab.data import Config, Dataset\n\nconfig = Config(\"configs/adult.json\")\n\ndataset = Dataset(config)\ndataset.merge_classes({\n \"<=50K\": [\"<=50K.\"], \">50K\": [\">50K.\"]\n})\ndataset.reduce_mem()\n\n# Load previously saved dataset...\ngenerator = AutoDiffusion(dataset)\ngenerator.load_from_disk()\n\n# Do work with previously generated but not tuned dataset...\nevaluator = LightGBM(generator)\nevaluator.evaluate()\nevaluator.evaluate_baseline()\n\n# Load previously tuned and saved dataset...\ntuner = AutoDiffusionTuner(evaluator, 0)\ntuner.load_from_disk()\n\n# Do work with previously tuned dataset...\nevaluator.evaluate()\nevaluator.evaluate_baseline()\n```\n",
"bugtrack_url": null,
"license": null,
"summary": "A synthetic tabular data generation library.",
"version": "0.1.2",
"project_urls": {
"Homepage": "https://github.com/omaralvarez/gentab",
"Issues": "https://github.com/omaralvarez/gentab/issues"
},
"split_keywords": [
"data",
" deep learning",
" generation",
" machine learning",
" synthetic",
" tabular"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "946d43666d9314ea9950a73fe7d0f83890921c15f6d59ce5b3c77c120d7e404e",
"md5": "ff567a6ec347638894b0ed9999876156",
"sha256": "5e0da06342304f20e469b685f4ce1a3d870dbeae99c6e26d4a4348aaeb044999"
},
"downloads": -1,
"filename": "gentab-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ff567a6ec347638894b0ed9999876156",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 132951,
"upload_time": "2024-04-25T12:00:56",
"upload_time_iso_8601": "2024-04-25T12:00:56.708061Z",
"url": "https://files.pythonhosted.org/packages/94/6d/43666d9314ea9950a73fe7d0f83890921c15f6d59ce5b3c77c120d7e404e/gentab-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "6d56c39995680f2aef83dd4b69cafa9e0c2739d9afbcaa74cb8666f892572b1d",
"md5": "a2595f4506057819ab423756a6db3e1c",
"sha256": "7036bc206158455bf8fec342b6f5fc4a0fdd032407bed736a8f65189f6cb127b"
},
"downloads": -1,
"filename": "gentab-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "a2595f4506057819ab423756a6db3e1c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 105266,
"upload_time": "2024-04-25T12:00:58",
"upload_time_iso_8601": "2024-04-25T12:00:58.871579Z",
"url": "https://files.pythonhosted.org/packages/6d/56/c39995680f2aef83dd4b69cafa9e0c2739d9afbcaa74cb8666f892572b1d/gentab-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-25 12:00:58",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "omaralvarez",
"github_project": "gentab",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "gentab"
}
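The `digests` entries in the metadata above can be used to check a downloaded file before installing it. The following is a minimal sketch using the standard library's `hashlib`; the helper names `sha256_of` and `matches_digest` are hypothetical, and the archive is assumed to have been downloaded into the current directory already.

``` python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare the file's SHA-256 digest with the expected hex string."""
    return sha256_of(path) == expected_hex.lower()


if __name__ == "__main__":
    archive = Path("gentab-0.1.2.tar.gz")  # assumed local download location
    # sha256 value listed for the sdist in the metadata above.
    expected = "7036bc206158455bf8fec342b6f5fc4a0fdd032407bed736a8f65189f6cb127b"
    if archive.exists():
        print("digest OK" if matches_digest(archive, expected) else "digest MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```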