# NOGAN SYNTHESIZER
[![PyPI version](https://badge.fury.io/py/nogan-synthesizer.svg)](https://badge.fury.io/py/nogan-synthesizer)
[![Documentation](https://img.shields.io/badge/Documentation-%20-blue)](https://rajiviyer.github.io/nogan_synthesizer/)
NoGANSynthesizer is a library that generates synthetic tabular data using multivariate binning methods. It offers a faster, more accurate, and less complex alternative to GANs.
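To give a rough idea of what synthesis by multivariate binning can look like, here is a minimal NumPy sketch. It is not the library's implementation: the function names `fit_bins` and `sample_from_bins`, the quantile-based bin edges, and the uniform sampling inside each bin are all assumptions made for illustration.

```python
import numpy as np

def fit_bins(data, n_bins=4):
    """Compute per-column quantile edges and count rows per multivariate bin."""
    data = np.asarray(data, dtype=float)
    n_cols = data.shape[1]
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    # edges[j] holds the n_bins + 1 quantile edges of column j
    edges = [np.quantile(data[:, j], qs) for j in range(n_cols)]
    # Map every value to a bin index in 0..n_bins-1 using the interior edges
    idx = np.column_stack([
        np.searchsorted(edges[j][1:-1], data[:, j], side="right")
        for j in range(n_cols)
    ])
    # Each distinct row of bin indices is one occupied multivariate bin
    keys, counts = np.unique(idx, axis=0, return_counts=True)
    return edges, keys, counts

def sample_from_bins(edges, keys, counts, n_rows, seed=0):
    """Pick bins in proportion to their counts, then sample uniformly inside."""
    rng = np.random.default_rng(seed)
    picked = rng.choice(len(keys), size=n_rows, p=counts / counts.sum())
    chosen = keys[picked]
    out = np.empty((n_rows, len(edges)))
    for j, e in enumerate(edges):
        low = e[chosen[:, j]]        # lower edge of the chosen bin
        high = e[chosen[:, j] + 1]   # upper edge of the chosen bin
        out[:, j] = rng.uniform(low, high)
    return out

# Toy usage on random numeric data
rng = np.random.default_rng(1)
real = rng.normal(size=(500, 2))
edges, keys, counts = fit_bins(real, n_bins=4)
synth = sample_from_bins(edges, keys, counts, n_rows=200, seed=0)
```

Because bins are sampled jointly rather than column by column, the dependence between columns is preserved at the resolution of the bins.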
## Class
- **NoGANSynth**: Synthetic data generator that is fitted on a tabular dataset
## Functions
- **wrap_category_columns**: Function to compress all specified categorical columns into one
- **unwrap_category_columns**: Function to expand all wrapped categorical columns
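The idea of compressing several categorical columns into one can be sketched with plain pandas. This is only an illustration under assumptions, not the package's code: the helper names `wrap_sketch` and `unwrap_sketch` and the column name `cat_label` are hypothetical, and the real functions may encode the combined column differently.

```python
import pandas as pd

def wrap_sketch(df, cat_cols):
    """Replace cat_cols with one integer column, 'cat_label' (hypothetical name)."""
    # Each row's combination of category values becomes one tuple key
    keys = list(df[cat_cols].itertuples(index=False, name=None))
    key_to_idx = {}
    for k in keys:
        # Assign the next free integer to each new combination
        key_to_idx.setdefault(k, len(key_to_idx))
    idx_to_key = {i: k for k, i in key_to_idx.items()}
    wrapped = df.drop(columns=cat_cols).copy()
    wrapped["cat_label"] = [key_to_idx[k] for k in keys]
    return wrapped, idx_to_key, key_to_idx

def unwrap_sketch(wrapped, idx_to_key, cat_cols):
    """Expand 'cat_label' back into the original categorical columns."""
    out = wrapped.drop(columns="cat_label").copy()
    for pos, col in enumerate(cat_cols):
        out[col] = [idx_to_key[i][pos] for i in wrapped["cat_label"]]
    return out

# Toy usage: two categorical columns collapse into one label column
df = pd.DataFrame({
    "age": [30, 41, 30],
    "sex": ["F", "M", "F"],
    "smoker": ["no", "yes", "no"],
})
wrapped, idx_to_key, key_to_idx = wrap_sketch(df, ["sex", "smoker"])
restored = unwrap_sketch(wrapped, idx_to_key, ["sex", "smoker"])
```

Keeping both dictionaries makes the round trip lossless: the index-to-key mapping is all that is needed to restore the original columns.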
## Authors
- [Dr. Vincent Granville](mailto:vincentg@mltechniques.com) - Research
- [Rajiv Iyer](mailto:raju.rgi@gmail.com) - Development/Maintenance
## Installation
The package can be installed with
```
pip install nogan_synthesizer
```
## Tests
The tests can be run by cloning the repo and running:
```
pytest tests
```
If the tests fail to run, install the package locally first and then run them again:
```
pip install -e .
```
## Usage
Start by importing the class and the helper functions:
```Python
from nogan_synthesizer import NoGANSynth
from nogan_synthesizer.preprocessing import wrap_category_columns, unwrap_category_columns
from genai_evaluation import multivariate_ecdf, ks_statistic
```
Assume we have a pandas DataFrame (`real_data`) with some categorical columns, and we want to generate synthetic data based on it.
We first preprocess the categorical columns. `wrap_category_columns` returns the preprocessed (wrapped) real dataset together with two dictionaries: one mapping each flag vector index to its key (`idx_to_key`) and one mapping each key back to its index (`key_to_idx`):
```Python
cat_cols = [...]  # names of the categorical columns in real_data
wrapped_real_data, idx_to_key, key_to_idx = \
    wrap_category_columns(real_data, cat_cols)
```
We then fit the NoGANSynth model on the wrapped dataset and generate synthetic data:
```Python
# Fit on the wrapped data so the output can be evaluated and unwrapped later
nogan = NoGANSynth(wrapped_real_data)
nogan.fit()
n_synth_rows = len(real_data)
synth_data = nogan.generate_synthetic_data(no_of_rows=n_synth_rows)
```
We can then compare the synthetic and real data distributions using the `genai_evaluation` package:
```Python
_, ecdf_val1, ecdf_synth = \
    multivariate_ecdf(wrapped_real_data,
                      synth_data,
                      n_nodes=1000,
                      verbose=True,
                      random_seed=42)

ks_stat = ks_statistic(ecdf_val1, ecdf_synth)
```
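To make the evaluation step more concrete, the following NumPy sketch computes a multivariate empirical CDF at a set of nodes and takes the largest absolute gap between two such ECDFs. It only illustrates the general idea and makes no claim about how `genai_evaluation` works internally: the function names `ecdf_at` and `ks_distance`, and the choice of drawing nodes from the real rows, are assumptions.

```python
import numpy as np

def ecdf_at(data, nodes):
    """Fraction of rows in data that are <= each node in every coordinate."""
    data = np.asarray(data, dtype=float)
    nodes = np.asarray(nodes, dtype=float)
    # below[i, k] is True when row k of data lies below node i in all columns
    below = (data[None, :, :] <= nodes[:, None, :]).all(axis=2)
    return below.mean(axis=1)

def ks_distance(ecdf_a, ecdf_b):
    """Largest absolute difference between two ECDFs evaluated at the same nodes."""
    return float(np.max(np.abs(np.asarray(ecdf_a) - np.asarray(ecdf_b))))

# Toy usage: a close sample and a shifted sample compared against the real data
rng = np.random.default_rng(42)
real = rng.normal(size=(300, 2))
good = rng.normal(size=(300, 2))           # same distribution as real
bad = rng.normal(loc=2.0, size=(300, 2))   # shifted distribution
nodes = real[rng.choice(len(real), size=100, replace=False)]

ks_good = ks_distance(ecdf_at(real, nodes), ecdf_at(good, nodes))
ks_bad = ks_distance(ecdf_at(real, nodes), ecdf_at(bad, nodes))
```

A value near 0 means the two datasets have similar joint distributions at the chosen nodes, while a value near 1 means they differ strongly.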
Once we are satisfied with the evaluation results, we can unwrap the categorical columns of the generated synthetic dataset using the flag vector index to key dictionary (`idx_to_key`) created earlier:
```Python
unwrapped_synth_data = unwrap_category_columns(synth_data, idx_to_key, cat_cols)
```
## Motivation
The motivation for this package comes from Dr. Vincent Granville's paper [Generative AI Technology Break-through: Spectacular Performance of New Synthesizer](https://mltechniques.com/2023/08/02/generative-ai-technology-break-through-spectacular-performance-of-new-synthesizer/).
If you have any tips or suggestions, please contact us by email.
# History
## 0.1.0 (2023-09-19)
- First release on PyPI.
## 0.1.1 (2023-09-27)
### Fixed
- Resolved issues with single categorical columns
## 0.1.2 (2023-09-27)
### Feature
- Added flexible Uniform & Gaussian sampling for columns in the `generate_synthetic_data` method
## 0.1.3 (2023-10-10)
### Fixed
- Resolved issues with float columns when selected as category columns
## 0.1.4 (2023-10-16)
### Fixed
- Resolved issues with brackets "(" & ")" in category column values
## 0.1.5 (2023-10-24)
### Feature
- Added the option to set a random seed during generation
Raw data
{
"_id": null,
"home_page": "https://github.com/rajiviyer/nogan_synthesizer",
"name": "nogan-synthesizer",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "nogan_synthesizer",
"author": "Rajiv Iyer",
"author_email": "raju.rgi@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/98/39/e286df2c2103f448e1a641399a6248bd4b8ca313d36c1dfa31c178ab4ad8/nogan_synthesizer-0.1.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT license",
"summary": "NoGAN Tabular Synthetic Data Generation",
"version": "0.1.5",
"project_urls": {
"Homepage": "https://github.com/rajiviyer/nogan_synthesizer"
},
"split_keywords": [
"nogan_synthesizer"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "b33de518f55d8b156b80edf40235325276a35886ccd28ac7520a65c934a4e7d9",
"md5": "2d58fe4e47289380aa3fec7622ca8c63",
"sha256": "226ab9dca9c392b1a561ef36873b773f95b43e534a0be92e22da171590f78de0"
},
"downloads": -1,
"filename": "nogan_synthesizer-0.1.5-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "2d58fe4e47289380aa3fec7622ca8c63",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.6",
"size": 8113,
"upload_time": "2023-10-24T06:38:25",
"upload_time_iso_8601": "2023-10-24T06:38:25.585336Z",
"url": "https://files.pythonhosted.org/packages/b3/3d/e518f55d8b156b80edf40235325276a35886ccd28ac7520a65c934a4e7d9/nogan_synthesizer-0.1.5-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "9839e286df2c2103f448e1a641399a6248bd4b8ca313d36c1dfa31c178ab4ad8",
"md5": "969d2ff46b192de68c56162cbfcff488",
"sha256": "44e9d5893c94ae8667c38e231d539a1cdf0ccc6a3cb6c25ff4efa4aecda92bc9"
},
"downloads": -1,
"filename": "nogan_synthesizer-0.1.5.tar.gz",
"has_sig": false,
"md5_digest": "969d2ff46b192de68c56162cbfcff488",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 9250,
"upload_time": "2023-10-24T06:38:27",
"upload_time_iso_8601": "2023-10-24T06:38:27.177785Z",
"url": "https://files.pythonhosted.org/packages/98/39/e286df2c2103f448e1a641399a6248bd4b8ca313d36c1dfa31c178ab4ad8/nogan_synthesizer-0.1.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-24 06:38:27",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "rajiviyer",
"github_project": "nogan_synthesizer",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "nogan-synthesizer"
}