tno.sdg.tabular.eval.utility-metrics


Name: tno.sdg.tabular.eval.utility-metrics
Version: 0.3.0
Summary: Utility metrics for tabular data
Upload time: 2024-02-28 13:23:02
Requires Python: >=3.8
License: Apache License, Version 2.0
Keywords: tno, sdg, synthetic data, synthetic data generation, tabular, evaluation, utility
# TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Evaluation - Utility Metrics

The TNO PET Lab consists of generic software components, procedures, and
functionalities developed and maintained on a regular basis to facilitate and
aid in the development of PET solutions. The lab is a cross-project initiative
allowing us to integrate and reuse previously developed PET functionalities to
boost the development of new protocols and solutions.

The package `tno.sdg.tabular.eval.utility_metrics` is part of the
[TNO Python Toolbox](https://github.com/TNO-PET).

This package provides extensive evaluation of the utility of synthetic data
sets. The original and synthetic data are compared on distinguishability and on
the univariate, bivariate, and multivariate level.

The main functionalities are:

- Univariate distributions: shows the distribution of each variable for the
  original and synthetic data.
- Bivariate correlations: visualises a Pearson-r correlation matrix for all
  variables.
- Multivariate predictions: shows the prediction accuracy of an SVM classifier
  for each variable, trained on either original or synthetic data and tested on
  original data.
- Distinguishability: shows the AUC of a logistic classifier that classifies
  samples as either original or synthetic.
- Spider plot: generates a spider plot summarising these four metrics.
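
The distinguishability metric can be illustrated with a minimal, self-contained
sketch using scikit-learn (this is not the package's internal implementation,
and the toy data below is made up for illustration): train a logistic classifier
to separate "original" from "synthetic" rows and report its AUC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
original = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
synthetic = rng.normal(loc=0.1, scale=1.1, size=(500, 3))  # slightly off-distribution

X = np.vstack([original, synthetic])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 0 = original, 1 = synthetic

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# An AUC near 0.5 means the classifier cannot tell the sets apart (high
# utility); an AUC near 1.0 means they are easily distinguished (low utility).
print(f"AUC: {auc:.2f}")
```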

_Limitations in (end-)use: the content of this software package may solely be
used for applications that comply with international export control laws._  
_This implementation of cryptographic software has not been audited. Use at your
own risk._

## Documentation

Documentation of the `tno.sdg.tabular.eval.utility_metrics` package can be found
[here](https://docs.pet.tno.nl/sdg/tabular/eval/utility_metrics/0.3.0).

## Install

Easily install the `tno.sdg.tabular.eval.utility_metrics` package using `pip`:

```console
$ python -m pip install tno.sdg.tabular.eval.utility_metrics
```

If you wish to run the tests you can use:

```console
$ python -m pip install 'tno.sdg.tabular.eval.utility_metrics[tests]'
```

## Usage

All four metrics are visualised together in a spider plot, where `1` represents
'complete overlap' and `0` represents 'no overlap' between the original and
synthetic data. The plot can depict multiple synthetic data sets, so it can be
used to compare different levels of privacy protection in synthetic data sets,
varying parameter settings of a synthetic data generator, or entirely different
synthetic data generators.

All individual metrics depicted in the spider plot can be visualised separately
as well. The examples below show step by step how to generate all
visualisations.

Note that any required pre-processing of the (synthetic) data sets should be
done beforehand. This includes handling NaNs and missing values, treating
outliers, and scaling the data.

For more information on the selected metrics, please refer to the paper (link
will be added upon publication). As we aim to keep developing our code,
feedback and tips are welcome.

![Utility depicted in spider plot for adult data set, for different values of epsilon. Data are generated with CTGAN and can be found in example/datasets.](example/spider_plot.png)

This example evaluates the utility of synthetic data. We evaluate univariate,
bivariate, and multivariate utility as well as distinguishability (record-level
utility). All four dimensions are depicted in a radar chart (spider plot). All
required computations are performed with the function `compute_all_metrics`.
Multiple synthetic data sets can be used as input.

To run the example you need the data found in `example/datasets`.

### Loading the data

To load the example data you can use the following snippet.

```python
import pandas as pd

original_data = pd.read_csv("original_dataset.csv", index_col=0)
synthetic_data_1 = pd.read_csv("synthetic_dataset_1.csv", index_col=0)
synthetic_data_2 = pd.read_csv("synthetic_dataset_2.csv", index_col=0)

# Specify numerical and categorical columns
numerical_column_names = ["col1", "col2"]
categorical_column_names = ["col3", "col4"]
```

### Computing metrics and generating a spider plot

Using the loaded data you can compute all four utility metrics and generate a
spider plot (written to `spiderplot.html`).

```python
from tno.sdg.tabular.eval.utility_metrics import (
    compute_all_metrics,
    compute_spider_plot,
)


computed_metrics = compute_all_metrics(
    original_data,
    {"Dataset 1": synthetic_data_1, "Dataset 2": synthetic_data_2},
    categorical_column_names,
)

spider_plot = compute_spider_plot(computed_metrics)
spider_plot.write_html(file="spiderplot.html", auto_open=True)
```

### Plotting univariate distributions

The univariate distributions can be plotted as follows. Plots are stored in
`distribution_<column>.png`, where `<column>` refers to the respective column.

```python
from tno.sdg.tabular.eval.utility_metrics import visualise_distributions


figures = visualise_distributions(
    original_data,
    synthetic_data_1,
    ["col1", "col2", "col3", "col4"],
    cat_col_names=categorical_column_names,
)
for col, fig in figures.items():
    fig.savefig(f"distribution_{col}.png", dpi=300, bbox_inches="tight")
```

### Bivariate plots

The correlations between columns can be plotted using the snippet below; the
resulting figure is written to `correlation_matrices.png`.

```python
from tno.sdg.tabular.eval.utility_metrics import visualise_correlation_matrices


(
    correlation_matrix_original,
    correlation_matrix_synthetic,
    cor_fig,
) = visualise_correlation_matrices(
    original_data, synthetic_data_1, categorical_column_names
)
cor_fig.savefig("correlation_matrices.png", dpi=300, bbox_inches="tight")
```

### Multivariate plots

The snippet below plots classification accuracies for all columns, converting
numerical columns to categories.  
Output is stored in `svm_classification.png`.

```python
from tno.sdg.tabular.eval.utility_metrics import visualise_accuracies_scores


(
    accuracies_scores_original,
    accuracies_score_synthetic,
    pred_fig,
) = visualise_accuracies_scores(
    original_data, synthetic_data_1, categorical_column_names
)
pred_fig.savefig("svm_classification.png", dpi=300, bbox_inches="tight")
```

The snippet below plots regression and classification scores for numerical and
categorical columns, respectively.  
Output is stored in `svm_regression_classification.png`.

```python
from tno.sdg.tabular.eval.utility_metrics import visualise_prediction_scores


(
    prediction_scores_original,
    prediction_score_synthetic,
    pred_fig_regr,
) = visualise_prediction_scores(
    original_data, synthetic_data_1, categorical_column_names
)
pred_fig_regr.savefig("svm_regression_classification.png", dpi=300, bbox_inches="tight")
```

### Distinguishability

The snippet below visualises distinguishability by classifying synthetic and
original samples via propensity scores.  
Output is stored in `propensity_scores.png`.

```python
from tno.sdg.tabular.eval.utility_metrics import visualise_propensity_scores


propensity_scores, prop_fig = visualise_propensity_scores(
    original_data, synthetic_data_1, categorical_column_names
)
prop_fig.savefig("propensity_scores.png", dpi=300, bbox_inches="tight")
```

The snippet below plots the ROC curve (true-positive rate against
false-positive rate) and its area under the curve (AUC) for classifying
synthetic versus original samples.  
Output is stored in `AUC_curves.png`.

```python
from tno.sdg.tabular.eval.utility_metrics import visualise_logistical_regression_auc


predictions, utility_disth, auc_fig = visualise_logistical_regression_auc(
    original_data, synthetic_data_1, categorical_column_names
)
auc_fig.savefig("AUC_curves.png", dpi=300, bbox_inches="tight")
```

            
