nogan-synthesizer


- **Name**: nogan-synthesizer
- **Version**: 0.1.5
- **Home page**: https://github.com/rajiviyer/nogan_synthesizer
- **Summary**: NoGAN Tabular Synthetic Data Generation
- **Upload time**: 2023-10-24 06:38:27
- **Author**: Rajiv Iyer
- **Requires Python**: >=3.6
- **License**: MIT license
- **Keywords**: nogan_synthesizer
- **Requirements**: No requirements were recorded.

# NOGAN SYNTHESIZER
[![PyPI version](https://badge.fury.io/py/nogan-synthesizer.svg)](https://badge.fury.io/py/nogan-synthesizer)
[![Documentation](https://img.shields.io/badge/Documentation-%20-blue)](https://rajiviyer.github.io/nogan_synthesizer/)


NoGANSynthesizer is a library that generates synthetic tabular data using multivariate binning methods. It offers a faster, more accurate, and less complex alternative to GANs.
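
To give a feel for what "multivariate binning" can mean, here is a minimal sketch using only the standard library. It is not the NoGAN algorithm or this package's implementation: the function names `fit_bins` and `sample_rows` are hypothetical, and equal-width bins with uniform sampling inside each bin are simplifying assumptions made for illustration.

```python
import random
from collections import Counter


def fit_bins(rows, n_bins):
    """Assign each numeric row to a multivariate bin and count rows per bin.

    Assumption: equal-width bins per column (chosen for simplicity).
    """
    n_cols = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_cols)]
    highs = [max(r[j] for r in rows) for j in range(n_cols)]
    # Fall back to width 1.0 for constant columns to avoid dividing by zero.
    widths = [(highs[j] - lows[j]) / n_bins or 1.0 for j in range(n_cols)]

    def bin_key(row):
        # Clamp the column maximum into the last bin.
        return tuple(
            min(int((row[j] - lows[j]) / widths[j]), n_bins - 1)
            for j in range(n_cols)
        )

    counts = Counter(bin_key(r) for r in rows)
    return counts, lows, widths


def sample_rows(counts, lows, widths, n_rows, seed=None):
    """Pick bins in proportion to their counts, then sample uniformly inside."""
    rng = random.Random(seed)
    keys = list(counts)
    weights = [counts[k] for k in keys]
    synth = []
    for key in rng.choices(keys, weights=weights, k=n_rows):
        synth.append(tuple(
            lows[j] + (b + rng.random()) * widths[j]
            for j, b in enumerate(key)
        ))
    return synth
```

Because bins are sampled jointly across columns, dependencies between columns are preserved at the bin level, which is the intuition behind binning-based synthesizers.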

## Class
- **NoGANSynth**: Synthetic data generator that is fitted to a tabular dataset

## Functions
- **wrap_category_columns**: Function to compress all specified categorical columns into one
- **unwrap_category_columns**: Function to expand all wrapped categorical columns

## Authors
- [Dr. Vincent Granville](mailto:vincentg@mltechniques.com) - Research
- [Rajiv Iyer](mailto:raju.rgi@gmail.com) - Development/Maintenance

## Installation
The package can be installed with
```
pip install nogan_synthesizer
```

## Tests
The tests can be run by cloning the repo and running:
```
pytest tests
```
If you run into issues running the tests, install the package locally first and then rerun them:

```
pip install -e .
```

## Usage

Start by importing the class and the helper functions:
```Python
from nogan_synthesizer import NoGANSynth
from nogan_synthesizer.preprocessing import wrap_category_columns, unwrap_category_columns
from genai_evaluation import multivariate_ecdf, ks_statistic
```

Assume we have a pandas DataFrame of real data (`real_data`) that contains some categorical columns, and we want to generate synthetic data from it.
We first preprocess the categorical columns. This returns the wrapped real dataset, along with dictionaries that map each flag vector index to its key values and back:
```Python
cat_cols = [...]  # replace with the list of categorical column names
wrapped_real_data, idx_to_key, key_to_idx = wrap_category_columns(real_data, cat_cols)
```
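
The wrapping step can be pictured with the following sketch, which works on plain lists of dictionaries instead of a DataFrame. It is an illustration of the idea only, not the library's implementation: `wrap_columns`, `unwrap_columns`, and the `cat_label` column name are hypothetical names chosen for this example.

```python
def wrap_columns(rows, cat_cols):
    """Replace the columns in cat_cols with a single integer-coded column."""
    key_to_idx, idx_to_key = {}, {}
    wrapped = []
    for row in rows:
        # Each distinct combination of category values gets its own index.
        key = tuple(row[c] for c in cat_cols)
        if key not in key_to_idx:
            idx = len(key_to_idx)
            key_to_idx[key] = idx
            idx_to_key[idx] = key
        new_row = {c: v for c, v in row.items() if c not in cat_cols}
        new_row["cat_label"] = key_to_idx[key]  # hypothetical column name
        wrapped.append(new_row)
    return wrapped, idx_to_key, key_to_idx


def unwrap_columns(rows, idx_to_key, cat_cols):
    """Expand the integer-coded column back into the original columns."""
    unwrapped = []
    for row in rows:
        new_row = {c: v for c, v in row.items() if c != "cat_label"}
        new_row.update(zip(cat_cols, idx_to_key[row["cat_label"]]))
        unwrapped.append(new_row)
    return unwrapped
```

Encoding each combination of category values as one index means the synthesizer only has to handle a single extra numeric column, and only combinations seen in the real data can be produced.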

We then fit the NoGANSynth model on the wrapped dataset and generate synthetic data:
```Python
nogan = NoGANSynth(wrapped_real_data)  # fit on the wrapped data, not the raw DataFrame
nogan.fit()

n_synth_rows = len(real_data)
synth_data = nogan.generate_synthetic_data(no_of_rows=n_synth_rows)
```

We can then compare the real and synthetic data distributions using the genai_evaluation package:
```Python
_, ecdf_val1, ecdf_synth = \
            multivariate_ecdf(wrapped_real_data, 
                              synth_data, 
                              n_nodes = 1000,
                              verbose = True,
                              random_seed=42)

ks_stat = ks_statistic(ecdf_val1, ecdf_synth)                              
```
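
As a rough picture of what this kind of evaluation measures, the sketch below computes a Kolmogorov-Smirnov style distance between two multivariate empirical CDFs using only the standard library. It is not the genai_evaluation implementation: `ecdf_at` and `ks_distance` are hypothetical names, and drawing the evaluation nodes from the real rows is an assumption made for this example.

```python
import random


def ecdf_at(rows, node):
    """Fraction of rows that are <= node in every coordinate."""
    hits = sum(all(v <= n for v, n in zip(row, node)) for row in rows)
    return hits / len(rows)


def ks_distance(real, synth, n_nodes=100, seed=None):
    """Largest absolute gap between the two ECDFs over sampled nodes."""
    rng = random.Random(seed)
    # Assumption: evaluation nodes are drawn from the real rows.
    nodes = [rng.choice(real) for _ in range(n_nodes)]
    return max(abs(ecdf_at(real, z) - ecdf_at(synth, z)) for z in nodes)
```

A value near 0 means the two datasets have similar joint distributions at the sampled nodes, while a value near 1 means they differ strongly.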

Once we are satisfied with the evaluation results, we can unwrap the categorical columns of the generated synthetic dataset using the index-to-key dictionary (`idx_to_key`) returned earlier:
```Python
unwrapped_synth_data = unwrap_category_columns(synth_data, idx_to_key, cat_cols)
```
## Motivation
The motivation for this package comes from Dr. Vincent Granville's paper [Generative AI Technology Break-through: Spectacular Performance of New Synthesizer](https://mltechniques.com/2023/08/02/generative-ai-technology-break-through-spectacular-performance-of-new-synthesizer/)

If you have any tips or suggestions, please contact us by email.

# History

## 0.1.0 (2023-09-19)
- First release on PyPI.

## 0.1.1 (2023-09-27)
### Fixed
- Resolved issues with single categorical columns

## 0.1.2 (2023-09-27)
### Feature
- Added flexible Uniform & Gaussian sampling for columns in the generate_synthetic_data method

## 0.1.3 (2023-10-10)
### Fixed
- Resolved issues with float column when selected as category column

## 0.1.4 (2023-10-16)
### Fixed
- Resolved issues with brackets "(" & ")" in category column values

## 0.1.5 (2023-10-24)
### Feature
- Added a random seed that can be set during generation

            

        