# Auto Synthetic Data Platform
### Auto Synthetic Data Platform is a Python package that helps advertisers create a synthetic version of their first-party datasets (1PD) using AI/ML.
##### This is not an official Google product.
[Introduction](#introduction) •
[Limitations](#limitations) •
[Getting started](#getting-started) •
[References](#references)
## Introduction
Organizations increasingly rely on first-party data (1PD) due to government regulations and the deprecation and fragmentation of AdTech solutions, e.g. the removal of third-party cookies (3PC). However, advertisers typically are not allowed to share 1PD externally. This blocks leveraging third-party data science expertise in new AI/ML model development and limits realizing the full revenue-generating potential of owned 1PD.
The package is a self-served solution to help advertisers create a synthetic version of their first-party datasets (1PD). The solution automatically preprocesses a real dataset and identifies the best synthetic data model through hyperparameter tuning.
The solution optimizes the model against a set of objectives specified by the user (see [Getting started](#getting-started)). Typically, users optimize their model to generate synthetic data with:
- Statistical properties as close as possible to those of the real data
- High privacy standards, to protect the sensitivity of the real data
- High suitability, i.e. the synthetic data can be used to train AI/ML models that later make accurate predictions on real data
The solution relies on the [SynthCity](https://github.com/vanderschaarlab/synthcity) library developed by The van der Schaar Lab at the University of Cambridge.
## Limitations
Users take full responsibility for synthetic data created using this solution.
The solution can protect user privacy in the case where each row in the original dataset represents a single user. However, it does not protect against leaking sensitive aggregate company data, such as total revenue per day, since by design it maintains the statistical resemblance of the original data.
## Getting started
### Installation
The *torchaudio* and *torchdata* packages must be uninstalled if you already have PyTorch >= 2.0. This keeps the environment compatible with the *synthcity* package.
The *synthcity* team is already working on upgrading their requirements to PyTorch 2.0. See [here](https://github.com/vanderschaarlab/synthcity/issues/234).
```bash
pip uninstall -q -y torchaudio torchdata
```
The *auto_synthetic_data_platform* package can be installed from a file.
```bash
pip install auto_synthetic_data_platform.tar.gz
```
### Preprocessing & loading data
#### Preprocessing
The *auto_synthetic_data_platform* package provides the *Preprocessor* class, which helps preprocess the real dataset according to industry best practices. It also logs all information about the dataset that can impact synthetic data model training. The logs can be shared externally if your organization cannot grant external people access to your data and remote debugging of synthetic data model training is required.
First, load your real data to a Pandas dataframe.
```python
import pandas as pd
_REAL_DATAFRAME_PATH = "/content/real_dataset.csv"
real_dataframe = pd.read_csv(_REAL_DATAFRAME_PATH)
```
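If you do not have a dataset at hand yet, a small dummy dataframe is enough to walk through the rest of this guide. The column names below are purely illustrative and match the metadata used in the next step:

```python
import numpy as np
import pandas as pd

# Generate a small, reproducible toy dataset with two numerical and
# two categorical columns (names are illustrative only).
rng = np.random.default_rng(seed=0)
real_dataframe = pd.DataFrame({
    "column_a": rng.normal(size=100),
    "column_b": rng.uniform(size=100),
    "column_c": rng.integers(0, 3, size=100),
    "column_d": rng.integers(0, 2, size=100),
})
```
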
Second, specify column and preprocessing metadata.
```python
_EXPERIMENT_DIRECTORY = "/content/"
_COLUMN_METADATA = {
    "numerical": [
        "column_a",
        "column_b",
    ],
    "categorical": [
        "column_c",
        "column_d",
    ],
}
_PREPROCESS_METADATA = {
    "remove_numerical_outliers": False,
    "remove_duplicates": True,
    "preprocess_missing_values": True,
    "missing_values_method": "drop",
}
```
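The *Preprocessor* encapsulates these steps for you; for intuition, the options above roughly correspond to standard pandas operations (an illustrative sketch, not the package's internals):

```python
import pandas as pd

# A tiny frame with one duplicate row and one missing value.
df = pd.DataFrame({
    "column_a": [1.0, 1.0, None, 4.0],
    "column_c": ["x", "x", "y", "y"],
})

# "remove_duplicates": True -> drop identical rows.
df = df.drop_duplicates()

# "preprocess_missing_values": True with "missing_values_method": "drop"
# -> remove rows that contain missing values.
df = df.dropna()
```
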
Third, initialize the *Preprocessor* class and preprocess the data.
```python
from auto_synthetic_data_platform import preprocessing
import pathlib
preprocessor = preprocessing.Preprocessor(
    dataframe_path=pathlib.Path(_REAL_DATAFRAME_PATH),
    experiment_directory=pathlib.Path(_EXPERIMENT_DIRECTORY),
    column_metadata=_COLUMN_METADATA,
    preprocess_metadata=_PREPROCESS_METADATA,
)
preprocessed_real_dataset = preprocessor.output_dataframe
```
#### Loading
Lastly, load the preprocessed real dataset using a dataloader from the *SynthCity* package.
```python
from synthcity.plugins.core import dataloader
data_loader = dataloader.GenericDataLoader(
    data=preprocessed_real_dataset,
    target_column="column_d",
)
```
### Synthetic data model training
#### Model selection
At this stage, a synthetic data model instance from *SynthCity* needs to be initialized. The list of all available models can be found [here](https://github.com/vanderschaarlab/synthcity#-methods).
```python
from synthcity import plugins
_TVAE_MODEL = "tvae"
tvae_synthetic_data_model = plugins.Plugins().get(_TVAE_MODEL)
```
#### Objective/s selection
Then, a mapping of evaluation metrics compatible with *SynthCity* is required to proceed further. Each model in the hyperparameter tuning process will be evaluated against these criteria to identify the best synthetic data model. The list of available metrics can be found [here](https://github.com/vanderschaarlab/synthcity#zap-evaluation-metrics).
```python
_EVALUATION_METRICS = {
    "sanity": ["close_values_probability"],
    "stats": ["inv_kl_divergence"],
    "performance": ["xgb"],
    "privacy": ["k-anonymization"],
}
```
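For intuition on the *privacy* entry: k-anonymization relates to k-anonymity, the size of the smallest group of records sharing the same quasi-identifier values (higher is safer). A rough, self-contained check with pandas, not SynthCity's actual implementation:

```python
import pandas as pd

def k_anonymity(df, quasi_identifiers):
    """Smallest group size over the given quasi-identifier columns."""
    return int(df.groupby(quasi_identifiers).size().min())

# Illustrative records: one combination appears only once,
# so the dataset is only 1-anonymous.
df = pd.DataFrame({
    "age_bucket": ["20-30", "20-30", "20-30", "30-40"],
    "region": ["EU", "EU", "EU", "EU"],
})
print(k_anonymity(df, ["age_bucket", "region"]))  # -> 1
```
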
#### Tuner setup
A synthetic data model tuner from the *auto_synthetic_data_platform* package needs to be set up.
```python
from auto_synthetic_data_platform import synthetic_data_model_tuning
_NUMBER_OF_TRIALS = 2
_OPTIMIZATION_DIRECTION = "maximize"
_TASK_TYPE = "regression"
tvae_synthetic_data_model_optimizer = synthetic_data_model_tuning.SyntheticDataModelTuner(
    data_loader=data_loader,
    synthetic_data_model=tvae_synthetic_data_model,
    experiment_directory=pathlib.Path(_EXPERIMENT_DIRECTORY),
    number_of_trials=_NUMBER_OF_TRIALS,
    optimization_direction=_OPTIMIZATION_DIRECTION,
    evaluation_metrics=_EVALUATION_METRICS,
    task_type=_TASK_TYPE,
)
```
#### Hyperparameter tuning
The tuner runs a hyperparameter search in the background and then outputs the best synthetic data model. The processing time depends on the number of trials.
**WARNING:** The process is resource & time consuming.
```python
best_tvae_network_synthetic_data_model = (
    tvae_synthetic_data_model_optimizer.best_synthetic_data_model
)
```
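Conceptually, tuners like this run a trial loop: sample hyperparameters, train a model, score it against the evaluation metrics, and keep the best candidate. A minimal, self-contained sketch of that pattern with a toy objective (not the package's actual code):

```python
import random

def score_model(hyperparameters):
    """Toy stand-in for 'train a synthetic data model and evaluate it'."""
    learning_rate = hyperparameters["learning_rate"]
    layers = hyperparameters["layers"]
    # Peak score at learning_rate=0.01 and layers=3.
    return -((learning_rate - 0.01) ** 2) - 0.1 * abs(layers - 3)

random.seed(0)
best_score = float("-inf")
best_hyperparameters = None
for trial in range(20):  # analogous to _NUMBER_OF_TRIALS
    candidate = {
        "learning_rate": 10 ** random.uniform(-4, -1),
        "layers": random.randint(1, 6),
    }
    candidate_score = score_model(candidate)
    if candidate_score > best_score:  # "maximize" optimization direction
        best_score = candidate_score
        best_hyperparameters = candidate
```
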
The tuner class provides easy access to:
- The best hyperparameters for the given model.
```python
best_hyperparameters = tvae_synthetic_data_model_optimizer.best_hyperparameters
```
- A plot of parallel hyperparameter coordinates for later analysis.
```python
tvae_synthetic_data_model_optimizer.display_parallel_hyperparameter_coordinates()
```
- A plot of hyperparameter importances during training.
```python
tvae_synthetic_data_model_optimizer.display_hyperparameter_importances()
```
- A full evaluation report on the identified best synthetic data model using all the available evaluation metrics from *SynthCity*.
```python
best_tvae_model_full_evaluation_report = tvae_synthetic_data_model_optimizer.best_synthetic_data_model_full_evaluation_report
```
- An evaluation report on the identified best synthetic data model using only the evaluation metrics specified at class initialization.
```python
best_tvae_model_evaluation_report = tvae_synthetic_data_model_optimizer.best_synthetic_data_model_evaluation_report
```
The best synthetic data model can be easily saved using the tuner class.
```python
tvae_synthetic_data_model_optimizer.save_best_synthetic_data_model()
```
### Synthetic data generation
Synthetic data can be easily generated with the tuner.
```python
synthetic_data_10_examples = tvae_synthetic_data_model_optimizer.generate_synthetic_data_with_the_best_synthetic_data_model(
    count=10,
)
```
### Comparing multiple tuned synthetic data models
Another likely step is to compare two or more tuned synthetic data models, and the *auto_synthetic_data_platform* package can help with that as well.
For demonstration purposes, we tune an alternative synthetic data model following the exact same steps as above.
```python
_CTGAN_MODEL = "ctgan"
ctgan_synthetic_data_model = plugins.Plugins().get(_CTGAN_MODEL)
ctgan_synthetic_data_model_optimizer = synthetic_data_model_tuning.SyntheticDataModelTuner(
    data_loader=data_loader,
    synthetic_data_model=ctgan_synthetic_data_model,
    experiment_directory=pathlib.Path(_EXPERIMENT_DIRECTORY),
    number_of_trials=_NUMBER_OF_TRIALS,
    optimization_direction=_OPTIMIZATION_DIRECTION,
    evaluation_metrics=_EVALUATION_METRICS,
    task_type=_TASK_TYPE,
)
best_ctgan_synthetic_data_model = (
    ctgan_synthetic_data_model_optimizer.best_synthetic_data_model
)
best_ctgan_model_evaluation_report = ctgan_synthetic_data_model_optimizer.best_synthetic_data_model_evaluation_report
```
The last step is to compare two or more of the created evaluation reports.
```python
evaluation_reports_to_compare = {
    "tvae": best_tvae_model_evaluation_report,
    "ctgan": best_ctgan_model_evaluation_report,
}
synthetic_data_model_tuning.compare_synthetic_data_models_full_evaluation_reports(
    evaluation_results_mapping=evaluation_reports_to_compare
)
```
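If you prefer a quick manual comparison, and assuming the reports behave like metric-to-score mappings (an assumption made here for illustration, with made-up numbers), a side-by-side view can also be built with pandas:

```python
import pandas as pd

# Hypothetical metric scores; real reports come from the tuner.
reports = {
    "tvae": {"inv_kl_divergence": 0.91, "k-anonymization": 4},
    "ctgan": {"inv_kl_divergence": 0.87, "k-anonymization": 6},
}
# One column per model, one row per metric.
comparison = pd.DataFrame(reports)
print(comparison)
```
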
## References
The *auto_synthetic_data_platform* package is a wrapper around the *synthcity* library. Credits to Zhaozhi Qian, Bogdan-Constantin Cebere, and Mihaela van der Schaar.
The *auto_synthetic_data_platform* package is distributed under the Apache License, Version 2.0 (January 2004); see the package's LICENSE file for details.