# TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Evaluation - Utility Metrics
Extensive evaluation of the utility of synthetic data sets. The original and synthetic data are compared on distinguishability and on a univariate, bivariate and multivariate level. All four metrics are visualized in one plot with a spiderplot. Where one equals 'complete overlap' and zero equals 'no overlap' between original and synthetic data. This plot can depict multiple synthetic data sets. Therefore it can be used to evaluate different levels of privacy protection in synthetic data sets, varying parameter settings in synthetic data generators, or completely different synthetic data generators.
All individual metrics depicted in the spiderplot can be visualized as well. The example_script.py shows you step by step how to generate all visualizations. The main functionalities of the scripts are:
- Univariate distributions: shows the distributions of one variable for the original and synthetic data.
- Bivariate correlations: visualizes a Pearson-r correlation matrix for all variables.
- Multivariate predictions: shows an SVM classifier predicts accuracies for each variable training on either original or synthetic data tested on original data.
- Distinguishability: shows the AUC of a logistic classifier that classifies samples as either original or synthetic.
- Spiderplot: generates spiderplot for these four metrics.
Note that any required pre-processing of the (synthetic) data sets should be done prior. Take into account addressing NANs, missing values, outliers and scaling the data.
For more information on the selected metrics, please refer to the paper (link will be added upon publication) or contact madelon.molhoek@tno.nl. As we aim to keep developing our code feedback and tips are welcome.
![Utility depicted in spider plot for adult data set, for different values of epsilon. Data are generated with CTGAN and can be found in scripts/datasets.](https://raw.githubusercontent.com/TNO-SDG/tabular.eval.utility_metrics/main/scripts/spider_plot.png)
### PET Lab
The TNO PET Lab consists of generic software components, procedures, and functionalities developed and maintained on a regular basis to facilitate and aid in the development of PET solutions. The lab is a cross-project initiative allowing us to integrate and reuse previously developed PET functionalities to boost the development of new protocols and solutions.
The package `tno.sdg.tabular.eval.utility_metrics` is part of the [TNO Python Toolbox](https://github.com/TNO-PET).
_Limitations in (end-)use: the content of this software package may solely be used for applications that comply with international export control laws._
_This implementation of cryptographic software has not been audited. Use at your own risk._
## Documentation
Documentation of the `tno.sdg.tabular.eval.utility_metrics` package can be found
[here](https://docs.pet.tno.nl/sdg/tabular/eval/utility_metrics/0.4.1).
## Install
Easily install the `tno.sdg.tabular.eval.utility_metrics` package using `pip`:
```console
$ python -m pip install tno.sdg.tabular.eval.utility_metrics
```
_Note:_ If you are cloning the repository and wish to edit the source code, be
sure to install the package in editable mode:
```console
$ python -m pip install -e 'tno.sdg.tabular.eval.utility_metrics'
```
If you wish to run the tests you can use:
```console
$ python -m pip install 'tno.sdg.tabular.eval.utility_metrics[tests]'
```
## Usage
See the script in the `scripts` directory.
Raw data
{
"_id": null,
"home_page": null,
"name": "tno.sdg.tabular.eval.utility-metrics",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "TNO PET Lab <petlab@tno.nl>",
"keywords": "TNO, SDG, synthetic data, synthetic data generation, tabular, evaluation, utility",
"author": null,
"author_email": "TNO PET Lab <petlab@tno.nl>",
"download_url": "https://files.pythonhosted.org/packages/5b/e5/602559c5a760c24afdef56933cceb8d958806540b74c14714fb28f6bdd4e/tno_sdg_tabular_eval_utility_metrics-0.4.1.tar.gz",
"platform": "any",
"description": "# TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Evaluation - Utility Metrics\n\nExtensive evaluation of the utility of synthetic data sets. The original and synthetic data are compared on distinguishability and on a univariate, bivariate and multivariate level. All four metrics are visualized in one plot with a spiderplot. Where one equals 'complete overlap' and zero equals 'no overlap' between original and synthetic data. This plot can depict multiple synthetic data sets. Therefore it can be used to evaluate different levels of privacy protection in synthetic data sets, varying parameter settings in synthetic data generators, or completely different synthetic data generators.\n\nAll individual metrics depicted in the spiderplot can be visualized as well. The example_script.py shows you step by step how to generate all visualizations. The main functionalities of the scripts are:\n\n- Univariate distributions: shows the distributions of one variable for the original and synthetic data.\n- Bivariate correlations: visualizes a Pearson-r correlation matrix for all variables.\n- Multivariate predictions: shows an SVM classifier predicts accuracies for each variable training on either original or synthetic data tested on original data.\n- Distinguishability: shows the AUC of a logistic classifier that classifies samples as either original or synthetic.\n- Spiderplot: generates spiderplot for these four metrics.\n\nNote that any required pre-processing of the (synthetic) data sets should be done prior. Take into account addressing NANs, missing values, outliers and scaling the data.\n\nFor more information on the selected metrics, please refer to the paper (link will be added upon publication) or contact madelon.molhoek@tno.nl. As we aim to keep developing our code feedback and tips are welcome.\n\n![Utility depicted in spider plot for adult data set, for different values of epsilon. Data are generated with CTGAN and can be found in scripts/datasets.](https://raw.githubusercontent.com/TNO-SDG/tabular.eval.utility_metrics/main/scripts/spider_plot.png)\n\n### PET Lab\n\nThe TNO PET Lab consists of generic software components, procedures, and functionalities developed and maintained on a regular basis to facilitate and aid in the development of PET solutions. The lab is a cross-project initiative allowing us to integrate and reuse previously developed PET functionalities to boost the development of new protocols and solutions.\n\nThe package `tno.sdg.tabular.eval.utility_metrics` is part of the [TNO Python Toolbox](https://github.com/TNO-PET).\n\n_Limitations in (end-)use: the content of this software package may solely be used for applications that comply with international export control laws._ \n_This implementation of cryptographic software has not been audited. Use at your own risk._\n\n## Documentation\n\nDocumentation of the `tno.sdg.tabular.eval.utility_metrics` package can be found\n[here](https://docs.pet.tno.nl/sdg/tabular/eval/utility_metrics/0.4.1).\n\n## Install\n\nEasily install the `tno.sdg.tabular.eval.utility_metrics` package using `pip`:\n\n```console\n$ python -m pip install tno.sdg.tabular.eval.utility_metrics\n```\n\n_Note:_ If you are cloning the repository and wish to edit the source code, be\nsure to install the package in editable mode:\n\n```console\n$ python -m pip install -e 'tno.sdg.tabular.eval.utility_metrics'\n```\n\nIf you wish to run the tests you can use:\n\n```console\n$ python -m pip install 'tno.sdg.tabular.eval.utility_metrics[tests]'\n```\n\n## Usage\n\nSee the script in the `scripts` directory.\n",
"bugtrack_url": null,
"license": "Apache License, Version 2.0",
"summary": "Utility metrics for tabular data",
"version": "0.4.1",
"project_urls": {
"Documentation": "https://docs.pet.tno.nl/sdg/tabular/eval/utility_metrics/0.4.1",
"Homepage": "https://pet.tno.nl/",
"Source": "https://github.com/TNO-SDG/tabular.eval.utility_metrics"
},
"split_keywords": [
"tno",
" sdg",
" synthetic data",
" synthetic data generation",
" tabular",
" evaluation",
" utility"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "12059fa3367f90bb81074a8263ef845fc8cc94d394f6be8bb144d8584a1b0003",
"md5": "6a9841cbf7e2fcdbbc7d5814f0dc8bcf",
"sha256": "f9612930152b4d3877d94ebfa1ac83af6dcd35e762721c2930dbe334bd4fd27b"
},
"downloads": -1,
"filename": "tno.sdg.tabular.eval.utility_metrics-0.4.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "6a9841cbf7e2fcdbbc7d5814f0dc8bcf",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 35366,
"upload_time": "2024-12-10T13:24:10",
"upload_time_iso_8601": "2024-12-10T13:24:10.584074Z",
"url": "https://files.pythonhosted.org/packages/12/05/9fa3367f90bb81074a8263ef845fc8cc94d394f6be8bb144d8584a1b0003/tno.sdg.tabular.eval.utility_metrics-0.4.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "5be5602559c5a760c24afdef56933cceb8d958806540b74c14714fb28f6bdd4e",
"md5": "7ccdfe493427ccc5f9d224015abdb413",
"sha256": "0cf27d73c74457c7df1bb4c17f85ff5cf648e7e7f37a4925f57d145fc758bcdb"
},
"downloads": -1,
"filename": "tno_sdg_tabular_eval_utility_metrics-0.4.1.tar.gz",
"has_sig": false,
"md5_digest": "7ccdfe493427ccc5f9d224015abdb413",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 264230,
"upload_time": "2024-12-10T13:24:13",
"upload_time_iso_8601": "2024-12-10T13:24:13.771286Z",
"url": "https://files.pythonhosted.org/packages/5b/e5/602559c5a760c24afdef56933cceb8d958806540b74c14714fb28f6bdd4e/tno_sdg_tabular_eval_utility_metrics-0.4.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-10 13:24:13",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "TNO-SDG",
"github_project": "tabular.eval.utility_metrics",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "tno.sdg.tabular.eval.utility-metrics"
}