<a href="https://dario-brunelli-clearbox-ai.notion.site/SURE-Documentation-2c17db370641488a8db5bce406032c1f"><img src="https://img.shields.io/badge/SURE-docs-blue?logo=mdbook" /></a>
[![Documentation Status](https://readthedocs.org/projects/clearbox-sure/badge/?version=latest)](https://clearbox-sure.readthedocs.io/en/latest/?badge=latest)
[![PyPI version](https://badge.fury.io/py/clearbox-sure.svg)](https://badge.fury.io/py/clearbox-sure)
[![Downloads](https://static.pepy.tech/badge/clearbox-sure)](https://pepy.tech/project/clearbox-sure)
[![GitHub stars](https://img.shields.io/github/stars/Clearbox-AI/SURE)](https://github.com/Clearbox-AI/SURE)
<img src="docs/source/img/sure_logo_nobg.png" width="250">
### Synthetic Data: Utility, Regulatory compliance, and Ethical privacy
The SURE package is an open-source Python library for assessing the utility and privacy of any tabular synthetic dataset.
The library is organized into Python modules that can be imported individually and integrated into any Python script once the library is installed.
> [!WARNING]
> This is a beta version of the library and currently runs only on Linux and macOS.

> [!IMPORTANT]
> Requires Python >= 3.10
# Installation
To install the library run the following command in your terminal:
```shell
$ pip install clearbox-sure
```
# Modules overview
The SURE library features the following modules:
1. Preprocessor
2. Statistical similarity metrics
3. Model garden
4. ML utility metrics
5. Distance metrics
6. Privacy attack sandbox
7. Report generator
**Preprocessor**
The preprocessor module transforms the input datasets into the standard structure used by all subsequent modules. Because the preprocessor is built on the Polars library, this step is significantly faster than with other data processing libraries.
**Utility**
The statistical similarity metrics, the ML utility metrics and the model garden modules constitute the data **utility evaluation** part.
The statistical similarity module and the distance metrics module take the pre-processed datasets as input and assess how statistically similar the two datasets are and how much the content of the synthetic dataset differs from that of the original one. In particular, the statistical similarity metrics module measures how close the real and synthetic datasets are in terms of statistical properties such as mean, correlation, and distribution.
The model garden runs a classification or regression task on the given dataset with multiple machine learning models, returning each model's performance metrics on that task and dataset.
The best-performing models from the model garden are then used in the ML utility metrics module to measure how useful the synthetic data is for a given ML task (classification or regression).
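The utility evaluation above follows the TSTR idea (Train on Synthetic, Test on Real): a model trained on synthetic data should score close to a model trained on real data when both are evaluated on a real holdout. A minimal sketch with scikit-learn, where the data, model choice, and variable names are illustrative and not part of the SURE API:

```python
# Conceptual TSTR sketch (not the SURE implementation): compare a model
# trained on real data with one trained on synthetic data, both tested
# on the same real holdout set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Stand-ins for real training, synthetic, and real holdout data
X_real, y_real = make_classification(n_samples=1000, random_state=0)
X_synth, y_synth = make_classification(n_samples=1000, random_state=1)
X_test, y_test = make_classification(n_samples=300, random_state=2)

baseline = RandomForestClassifier(random_state=0).fit(X_real, y_real)   # TRTR
tstr = RandomForestClassifier(random_state=0).fit(X_synth, y_synth)     # TSTR

acc_trtr = accuracy_score(y_test, baseline.predict(X_test))
acc_tstr = accuracy_score(y_test, tstr.predict(X_test))
print(f"TRTR accuracy: {acc_trtr:.3f}  TSTR accuracy: {acc_tstr:.3f}")
```

The closer the TSTR score is to the TRTR baseline, the more useful the synthetic data is for that task.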
**Privacy**
The distance metrics and the privacy attack sandbox make up the synthetic data **privacy assessment** modules.
The distance metrics module computes the Gower distance between the two input datasets and, for each record of the first dataset, the distance to its closest record in the second (DCR).
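To make the idea concrete, here is a toy computation of the Gower distance and the distance to the closest record. The helper function and data are hypothetical; SURE's `distance_to_closest_record` is optimized and may differ in details.

```python
# Illustrative Gower distance / DCR computation (not the SURE implementation).
import numpy as np
import pandas as pd

def gower_distance(row, df, num_cols, cat_cols, num_ranges):
    """Mean of per-feature distances: |a-b|/range for numeric features,
    0/1 mismatch for categorical features."""
    num_part = np.abs(df[num_cols].to_numpy() - row[num_cols].to_numpy(dtype=float))
    num_part = num_part / num_ranges  # scale each numeric feature by its range
    cat_part = (df[cat_cols].to_numpy() != row[cat_cols].to_numpy()).astype(float)
    return np.hstack([num_part, cat_part]).mean(axis=1)

synth = pd.DataFrame({"age": [30, 52], "job": ["nurse", "clerk"]})
real = pd.DataFrame({"age": [30, 45, 60], "job": ["nurse", "clerk", "nurse"]})

num_cols, cat_cols = ["age"], ["job"]
ranges = real[num_cols].max().to_numpy() - real[num_cols].min().to_numpy()

# DCR: for each synthetic row, the smallest Gower distance to any real row
dcr = np.array([gower_distance(r, real, num_cols, cat_cols, ranges).min()
                for _, r in synth.iterrows()])
print(dcr)  # a DCR of 0 means the synthetic row is an exact copy of a real record
```

Synthetic records with a DCR of (or near) zero are the vulnerable ones that the privacy attack sandbox then examines.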
The ML privacy attack sandbox lets you simulate a Membership Inference Attack aimed at re-identifying the vulnerable records found with the distance metrics module, and evaluate how exposed the synthetic dataset is to this kind of attack.
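The intuition behind a distance-based membership inference test can be sketched as follows: the adversary guesses that records lying very close to some synthetic record were part of the training set, and the attack is scored against the ground truth. The threshold and the simulated data below are hypothetical; SURE's `membership_inference_test` encapsulates this logic.

```python
# Toy membership inference sketch (not the SURE implementation).
import numpy as np

rng = np.random.default_rng(0)

# Simulated distance from each adversary record to its closest synthetic record
dcr_adversary = rng.uniform(0, 1, size=200)
# Ground truth: whether each adversary record was actually in the training set
is_training = rng.random(200) < 0.5

threshold = 0.2                      # assumed attack threshold
guess = dcr_adversary < threshold    # "close to a synthetic record" -> guess member

accuracy = (guess == is_training).mean()
print(f"attack accuracy: {accuracy:.2f}")  # close to 0.5 means the attack learned nothing
```

An attack accuracy well above chance indicates that proximity to synthetic records leaks membership information, i.e., a privacy risk.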
**Report**
Finally, the report generator summarizes the utility and privacy metrics computed by the previous modules in a visual digest with charts and tables.
The following diagram shows how each module contributes to the utility-privacy assessment process and how the individual blocks interconnect.
<img src="docs/source/img/SURE_workflow_.png" alt="drawing" width="500"/>
# Usage
The library leverages Polars, which makes computations faster than with many other data manipulation libraries. It accepts both Polars and pandas DataFrames.
The user must provide three datasets: the original real training dataset (used to train the generative model that produced the synthetic data), a real holdout dataset (NOT used to train the generative model), and the corresponding synthetic dataset. All three are needed for the library's modules to perform the evaluation.
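If you only have a single real dataset, hold out part of it *before* training the generative model, so that the holdout can serve as the validation set here. One common way to do this, sketched with scikit-learn on hypothetical data:

```python
# Illustrative train/holdout split; column names and sizes are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"age": range(100), "label": [i % 2 for i in range(100)]})

# 80% to train the generator, 20% held out and never shown to it
real_data, valid_data = train_test_split(df, test_size=0.2, random_state=42)
# real_data  -> train the generator, producing synth_data
# valid_data -> reserved for the privacy tests below
```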
Below is a code snippet example for the usage of the library:
```python
# Import the necessary modules from the SURE library
from sure import Preprocessor, report
from sure.utility import (compute_statistical_metrics, compute_mutual_info,
                          compute_utility_metrics_class)
from sure.privacy import (distance_to_closest_record, dcr_stats, number_of_dcr_equal_to_zero,
                          validation_dcr_test, adversary_dataset, membership_inference_test)
# Assuming real_data, valid_data and synth_data are three pandas DataFrames
# Preprocessor initialization and query execution on the real, synthetic and validation datasets
preprocessor = Preprocessor(real_data, get_discarded_info=False, num_fill_null='forward', scaling='standardize')
real_data_preprocessed = preprocessor.transform(real_data)
valid_data_preprocessed = preprocessor.transform(valid_data)
synth_data_preprocessed = preprocessor.transform(synth_data)
# Statistical properties and mutual information
num_features_stats, cat_features_stats, temporal_feat_stats = compute_statistical_metrics(real_data, synth_data)
corr_real, corr_synth, corr_difference = compute_mutual_info(real_data_preprocessed, synth_data_preprocessed)
# ML utility: TSTR - Train on Synthetic, Test on Real
X_train = real_data_preprocessed.drop("label", axis=1) # Assuming the datasets have a "label" column for the machine learning task they are intended for
y_train = real_data_preprocessed["label"]
X_synth = synth_data_preprocessed.drop("label", axis=1)
y_synth = synth_data_preprocessed["label"]
X_test = valid_data_preprocessed.drop("label", axis=1).limit(10000) # Test the trained models on a portion of the original real dataset (first 10k rows)
y_test = valid_data_preprocessed["label"].limit(10000)
TSTR_metrics = compute_utility_metrics_class(X_train, X_synth, X_test, y_train, y_synth, y_test)
# Distance to closest record
dcr_synth_train = distance_to_closest_record("synth_train", synth_data, real_data)
dcr_synth_valid = distance_to_closest_record("synth_val", synth_data, valid_data)
dcr_stats_synth_train = dcr_stats("synth_train", dcr_synth_train)
dcr_stats_synth_valid = dcr_stats("synth_val", dcr_synth_valid)
dcr_zero_synth_train = number_of_dcr_equal_to_zero("synth_train", dcr_synth_train)
dcr_zero_synth_valid = number_of_dcr_equal_to_zero("synth_val", dcr_synth_valid)
share = validation_dcr_test(dcr_synth_train, dcr_synth_valid)
# ML privacy attack sandbox initialization and simulation
adversary_df = adversary_dataset(real_data_preprocessed, valid_data_preprocessed)
# The function adversary_dataset adds a column "privacy_test_is_training" to the adversary dataset, indicating whether the record was part of the training set or not
adversary_guesses_ground_truth = adversary_df["privacy_test_is_training"]
MIA = membership_inference_test(adversary_df, synth_data_preprocessed, adversary_guesses_ground_truth)
# Report generation as HTML page
report(real_data, synth_data)
```
Follow the step-by-step [guide](https://github.com/Clearbox-AI/SURE/tree/main/examples) to test the library.
<!-- Review the dedicated [documentation](https://dario-brunelli-clearbox-ai.notion.site/SURE-Documentation-2c17db370641488a8db5bce406032c1f) to learn how to further customize your synthetic data assessment pipeline. -->