syndat

Name	syndat
Version	0.12.2
home_page	https://github.com/SCAI-BIO/syndat
Summary	A library for evaluation & visualization of synthetic data.
upload_time	2025-07-16 16:54:58
maintainer	None
docs_url	None
author	Tim Adams
requires_python	>=3.9
license	MIT
keywords	synthetic-data data-quality data-visualization

# Syndat
![tests](https://github.com/SCAI-BIO/syndat/actions/workflows/tests.yaml/badge.svg) ![docs](https://readthedocs.org/projects/syndat/badge/?version=latest&style=flat) ![version](https://img.shields.io/github/v/release/SCAI-BIO/syndat)

Syndat is a software package that provides basic functionality for the evaluation and visualisation of synthetic data. Quality scores can be computed on three base metrics (Discrimination, Correlation and Distribution), and the data can be visualised to inspect correlation structures or statistical distributions.

Syndat also allows users to generate stratified and interpretable visualisations, including raincloud plots, GOF plots, and trajectory comparisons, offering deeper insights into the quality of synthetic clinical data across different subgroups.

# Installation

Install via pip:

```bash
pip install syndat
```

# Usage

## Fidelity metrics

### Jensen-Shannon Distance

The Jensen-Shannon distance is a measure of similarity between two probability distributions. In our case, we compute
a probability distribution for each feature in both datasets and can thus compare the statistical similarity of two
dataframes feature by feature.

It is bounded between 0 and 1, with 0 indicating identical distributions.
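As an illustration of the idea, here is a minimal standard-library sketch of the Jensen-Shannon distance for a single categorical feature. It assumes base-2 logarithms, which is what gives the bound of 1. The helper name `js_distance` is hypothetical and not part of Syndat's API, and the sketch does not cover the binning that numerical features would need.

```python
import math
from collections import Counter


def js_distance(real_values, synthetic_values):
    """Jensen-Shannon distance (base-2 log) between two samples of one categorical feature."""
    categories = sorted(set(real_values) | set(synthetic_values))
    real_counts, synthetic_counts = Counter(real_values), Counter(synthetic_values)
    # Relative frequencies over the shared set of categories.
    p = [real_counts[c] / len(real_values) for c in categories]
    q = [synthetic_counts[c] / len(synthetic_values) for c in categories]
    # Mixture distribution M = (P + Q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


real_feature = ["A", "B", "A", "B", "C"]
synthetic_feature = ["A", "B", "A", "C", "C"]
print(round(js_distance(real_feature, synthetic_feature), 4))
```

Identical samples give a distance of 0 and samples with no category in common give 1. For the two small samples above the sketch yields roughly 0.2214.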

### (Normalized) Correlation Difference

In addition to the statistical similarity of individual features, we also want to preserve the correlations
between different features. The normalized correlation difference measures how similar the correlation matrices of
two dataframes are.

A correlation difference near zero indicates that the correlation structure of the synthetic data is similar to that of
the real data.
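The following sketch shows one plausible way to turn this idea into a number. It assumes Pearson correlations and a normalization by the Frobenius norm of the real correlation matrix; both choices are assumptions made for illustration and are not taken from the Syndat source. All function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences (undefined for constant columns)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns, names):
    """Pairwise correlations of the named columns, as a list of rows."""
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]


def frobenius_norm(matrix):
    return math.sqrt(sum(v * v for row in matrix for v in row))


def correlation_difference(real, synthetic):
    """Norm of the difference of both correlation matrices, relative to the real one."""
    names = list(real)  # use the same column order for both datasets
    corr_real = correlation_matrix(real, names)
    corr_syn = correlation_matrix(synthetic, names)
    diff = [[r - s for r, s in zip(row_r, row_s)]
            for row_r, row_s in zip(corr_real, corr_syn)]
    return frobenius_norm(diff) / frobenius_norm(corr_real)


real_columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]}
flipped_columns = {"x": [1, 2, 3, 4], "y": [8, 6, 4, 2]}  # correlation sign reversed
print(correlation_difference(real_columns, real_columns))
print(correlation_difference(real_columns, flipped_columns))
```

Comparing a dataset with itself gives 0, while reversing the sign of a correlation produces a clearly non-zero value.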

### Discriminator AUC

A classifier is trained to discriminate between real and synthetic data. Based on the Receiver Operating Characteristic 
(ROC) curve, we compute the area under the curve (AUC) as a measure of how well the classifier can distinguish between 
the two datasets. 

An AUC of 0.5 indicates that the classifier is unable to distinguish between the two datasets, while an AUC of 1.0 
indicates perfect discrimination.
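To make the interpretation of the AUC concrete, the sketch below computes it directly from its rank-based definition: the probability that a randomly chosen synthetic row receives a higher discriminator score than a randomly chosen real row, with ties counted as one half. It assumes that some classifier has already produced a score per row and does not train one. The function name `roc_auc` and the label convention (1 = synthetic, 0 = real) are assumptions for illustration only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # synthetic row correctly ranked above real row
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 0, 1, 1, 1]  # 0 = real, 1 = synthetic (assumed convention)
uninformative = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # classifier cannot tell rows apart
separating = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]  # classifier separates perfectly
print(roc_auc(labels, uninformative))
print(roc_auc(labels, separating))
```

A classifier that assigns the same score to every row ends up at 0.5, and one that ranks every synthetic row above every real row ends up at 1.0.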

Example usage:

```python
import pandas as pd
from syndat.metrics import (
    jensen_shannon_distance,
    normalized_correlation_difference,
    discriminator_auc
)

real = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': ['A', 'B', 'A', 'B', 'C']
})

synthetic = pd.DataFrame({
    'feature1': [1, 2, 2, 3, 3],
    'feature2': ['A', 'B', 'A', 'C', 'C']
})

print(jensen_shannon_distance(real, synthetic))
# >> {'feature1': 0.4990215421876156, 'feature2': 0.22141025172133794}

print(normalized_correlation_difference(real, synthetic))
# >> 0.24571345029108108

print(discriminator_auc(real, synthetic))
# >> 0.6
```

### Scoring Functions

For convenience and easier interpretation, a normalized score can be computed for each of the 
metrics instead:

```python
import syndat

# The JSD score is aggregated over all features
distribution_similarity_score = syndat.scores.distribution(real, synthetic)
discrimination_score = syndat.scores.discrimination(real, synthetic)
correlation_score = syndat.scores.correlation(real, synthetic)
```

Scores are defined in a range of 0-100, with a higher score corresponding to better data fidelity.
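How a raw metric could be mapped onto such a 0-100 scale is sketched below. The formulas are one plausible choice and are not taken from the Syndat source; the three function names are hypothetical and merely mirror the metric names used above.

```python
def distribution_score(jsd_per_feature):
    """Average the per-feature JS distances (each in [0, 1]) and invert: 0 distance -> 100."""
    mean_jsd = sum(jsd_per_feature.values()) / len(jsd_per_feature)
    return 100.0 * (1.0 - mean_jsd)


def correlation_score(normalized_difference):
    """A difference of 0 maps to 100; differences of 1 or more are clipped to 0."""
    return 100.0 * max(0.0, 1.0 - normalized_difference)


def discrimination_score(auc):
    """AUC 0.5 (indistinguishable) maps to 100, AUC 1.0 to 0; values below 0.5 count as 0.5."""
    return 200.0 * (1.0 - max(auc, 0.5))


print(distribution_score({"feature1": 0.0, "feature2": 1.0}))
print(correlation_score(0.25))
print(discrimination_score(0.5))
```

Under these assumptions, all three scores grow as the synthetic data becomes harder to tell apart from the real data, which matches the "higher is better" reading of the scores.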

## Visualization

Visualize real vs. synthetic data distributions, summary statistics and discriminating features:

```python
import pandas as pd
import syndat

real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

# plot all feature distributions and store the image files
syndat.visualization.plot_distributions(real, synthetic, store_destination="results/plots")
syndat.visualization.plot_correlations(real, synthetic, store_destination="results/plots")

# plot and display the distribution of a single feature
syndat.visualization.plot_numerical_feature("feature_xy", real, synthetic)

# plot a SHAP plot of the features that differentiate real from synthetic data
syndat.visualization.plot_shap_discrimination(real, synthetic)
```


## Postprocessing

Postprocess synthetic data to improve data fidelity:

```python
import pandas as pd
import syndat

real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

# postprocess synthetic data; the second step operates on the output of the first
synthetic_post = syndat.postprocessing.assert_minmax(real, synthetic)
synthetic_post = syndat.postprocessing.normalize_float_precision(real, synthetic_post)
```

# Evaluation and Visualization of Synthetic Clinical Trial Data

An example that demonstrates how to compute distribution, discrimination, and correlation scores, and how to generate stratified visualizations (GOF, raincloud, and other plots), is available in `examples/rct_example.py`.

Raw data

{
    "_id": null,
    "home_page": "https://github.com/SCAI-BIO/syndat",
    "name": "syndat",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "synthetic-data, data-quality, data-visualization",
    "author": "Tim Adams",
    "author_email": "tim.adams@scai.fraunhofer.de",
    "download_url": "https://files.pythonhosted.org/packages/8e/0d/51ff1656f5502abd19e146316156a9d3227a8863b2457001a34d7e269736/syndat-0.12.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A library for evaluation & visualization of synthetic data.",
    "version": "0.12.2",
    "project_urls": {
        "Documentation": "https://github.com/SCAI-BIO/syndat#readme",
        "Homepage": "https://github.com/SCAI-BIO/syndat",
        "Repository": "https://github.com/SCAI-BIO/syndat",
        "documentation": "https://github.com/SCAI-BIO/syndat#readme",
        "source": "https://github.com/SCAI-BIO/syndat",
        "tracker": "https://github.com/SCAI-BIO/syndat/issues"
    },
    "split_keywords": [
        "synthetic-data",
        " data-quality",
        " data-visualization"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "38b896ffc1d3ca31408829a5bfcc89e5357346fbfdc851aefa8d63e7f79d2318",
                "md5": "da0174a69f1c9788805ac8b65f0af0fc",
                "sha256": "e601cb579d2b0a8c0e08bc90592f4455ba1385cc3b4359b4092eee0b983ce99d"
            },
            "downloads": -1,
            "filename": "syndat-0.12.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "da0174a69f1c9788805ac8b65f0af0fc",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 21129,
            "upload_time": "2025-07-16T16:54:56",
            "upload_time_iso_8601": "2025-07-16T16:54:56.963899Z",
            "url": "https://files.pythonhosted.org/packages/38/b8/96ffc1d3ca31408829a5bfcc89e5357346fbfdc851aefa8d63e7f79d2318/syndat-0.12.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8e0d51ff1656f5502abd19e146316156a9d3227a8863b2457001a34d7e269736",
                "md5": "a367ee2913705c7356a3c8635bb41b3a",
                "sha256": "4b63b0b1d7156fb94e8a74b0386ac4ab9c63119f0741ba95d8ffaf4afdbfd3ed"
            },
            "downloads": -1,
            "filename": "syndat-0.12.2.tar.gz",
            "has_sig": false,
            "md5_digest": "a367ee2913705c7356a3c8635bb41b3a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 19620,
            "upload_time": "2025-07-16T16:54:58",
            "upload_time_iso_8601": "2025-07-16T16:54:58.251049Z",
            "url": "https://files.pythonhosted.org/packages/8e/0d/51ff1656f5502abd19e146316156a9d3227a8863b2457001a34d7e269736/syndat-0.12.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-16 16:54:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "SCAI-BIO",
    "github_project": "syndat",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "syndat"
}
