# Syndat
  
Syndat is a software package that provides basic functionality for evaluating and visualizing synthetic data. Quality scores can be computed on three base metrics (discrimination, correlation and distribution), and data can be visualized to inspect correlation structures or statistical distributions.
# Installation
Install via pip:
```bash
pip install syndat
```
# Usage
## Quality metrics
Compute data quality metrics by comparing real and synthetic data in terms of their separation complexity,
distribution similarity or pairwise feature correlations:
```python
import pandas as pd
import syndat
real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")
# How similar are the statistical distributions of real and synthetic features
distribution_similarity_score = syndat.scores.distribution(real, synthetic)
# How hard is it for a classifier to discriminate real and synthetic data
discrimination_score = syndat.scores.discrimination(real, synthetic)
# How well are pairwise feature correlations preserved
correlation_score = syndat.scores.correlation(real, synthetic)
```
Each score ranges from 0 to 100; a higher score indicates better data fidelity.
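
To give an intuition for what a discrimination-style score measures, here is a minimal sketch built on scikit-learn. It is not Syndat's implementation: the helper name `discrimination_score_sketch`, the choice of a random forest, and the mapping from AUC to a 0-100 scale are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def discrimination_score_sketch(real: pd.DataFrame, synthetic: pd.DataFrame, seed: int = 0) -> float:
    """Hypothetical helper: map a real-vs-synthetic classifier AUC to a 0-100 score."""
    features = pd.concat([real, synthetic], ignore_index=True)
    # Label real rows 0 and synthetic rows 1.
    labels = np.concatenate([np.zeros(len(real), dtype=int), np.ones(len(synthetic), dtype=int)])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # Out-of-fold probabilities avoid rewarding a classifier that memorizes rows.
    proba = cross_val_predict(clf, features, labels, cv=5, method="predict_proba")[:, 1]
    auc = roc_auc_score(labels, proba)
    # Assumed mapping: AUC 0.5 (indistinguishable) -> 100, AUC 1.0 (fully separable) -> 0.
    return float(100.0 * (1.0 - 2.0 * abs(auc - 0.5)))


# Toy data: one synthetic set from the same distribution, one clearly shifted.
rng = np.random.default_rng(0)
cols = ["a", "b", "c"]
real_demo = pd.DataFrame(rng.normal(0.0, 1.0, size=(200, 3)), columns=cols)
similar = pd.DataFrame(rng.normal(0.0, 1.0, size=(200, 3)), columns=cols)
shifted = pd.DataFrame(rng.normal(5.0, 1.0, size=(200, 3)), columns=cols)

score_similar = discrimination_score_sketch(real_demo, similar)
score_shifted = discrimination_score_sketch(real_demo, shifted)
print(f"similar: {score_similar:.1f}, shifted: {score_shifted:.1f}")
```

In this sketch, the shifted data set should score close to 0 because the classifier separates it from the real data almost perfectly, while the data set drawn from the same distribution should score much higher.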
## Visualization
Visualize real vs. synthetic data distributions, summary statistics and discriminating features:
```python
import pandas as pd
import syndat
real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")
# Plot *all* feature distributions and correlations, and store the image files
syndat.visualization.plot_distributions(real, synthetic, store_destination="results/plots")
syndat.visualization.plot_correlations(real, synthetic, store_destination="results/plots")
# Plot and display the distribution of a specific feature
syndat.visualization.plot_numerical_feature("feature_xy", real, synthetic)
# Plot a SHAP plot of the features that differentiate real from synthetic data
syndat.visualization.plot_shap_discrimination(real, synthetic)
```
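
For readers who want to see what a per-feature comparison can look like outside of Syndat, the following sketch overlays the real and synthetic histograms of one numeric column using matplotlib directly. The helper `plot_feature_overlay` and the demo data are hypothetical, and the plot is not meant to reproduce the output of `plot_numerical_feature`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_feature_overlay(feature: str, real: pd.DataFrame, synthetic: pd.DataFrame):
    """Hypothetical helper: overlay real and synthetic histograms of one numeric column."""
    real_values = real[feature].dropna()
    synthetic_values = synthetic[feature].dropna()
    # Shared bin edges keep the two histograms directly comparable.
    bins = np.histogram_bin_edges(np.concatenate([real_values, synthetic_values]), bins=30)
    fig, ax = plt.subplots()
    ax.hist(real_values, bins=bins, alpha=0.5, density=True, label="real")
    ax.hist(synthetic_values, bins=bins, alpha=0.5, density=True, label="synthetic")
    ax.set_xlabel(feature)
    ax.set_ylabel("density")
    ax.legend()
    return fig


# Toy data standing in for real.csv and synthetic.csv.
rng = np.random.default_rng(0)
real_demo = pd.DataFrame({"feature_xy": rng.normal(0.0, 1.0, 500)})
synthetic_demo = pd.DataFrame({"feature_xy": rng.normal(0.2, 1.1, 500)})
fig = plot_feature_overlay("feature_xy", real_demo, synthetic_demo)
```

The returned figure can be written to disk with `fig.savefig(...)` if an image file is needed.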
## Postprocessing
Postprocess synthetic data to improve data fidelity:
```python
import pandas as pd
import syndat
real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")
# Postprocess the synthetic data; pass the result of the first step into the second
synthetic_post = syndat.postprocessing.assert_minmax(real, synthetic)
synthetic_post = syndat.postprocessing.normalize_float_precision(real, synthetic_post)
```
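
As an illustration of what min/max postprocessing can mean, the sketch below clips each numeric synthetic column to the value range observed in the real data, using plain pandas. The helper `clip_to_real_range` is hypothetical; Syndat's `assert_minmax` may behave differently (for example, by dropping rows instead of clipping values).

```python
import pandas as pd


def clip_to_real_range(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: clip numeric synthetic columns to the real min/max."""
    clipped = synthetic.copy()
    # Only consider numeric columns that exist in both data sets.
    numeric_cols = real.select_dtypes(include="number").columns.intersection(synthetic.columns)
    for col in numeric_cols:
        clipped[col] = synthetic[col].clip(lower=real[col].min(), upper=real[col].max())
    return clipped


real_demo = pd.DataFrame({"age": [20, 35, 50], "group": ["a", "b", "a"]})
synthetic_demo = pd.DataFrame({"age": [18, 40, 63], "group": ["b", "b", "a"]})
clipped_demo = clip_to_real_range(real_demo, synthetic_demo)
print(clipped_demo["age"].tolist())  # [20, 40, 50]
```

Non-numeric columns such as `group` are left untouched, and the input frame is copied rather than modified in place.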
# Raw data
{
"_id": null,
"home_page": "https://github.com/SCAI-BIO/syndat",
"name": "syndat",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "synthetic-data, data-quality, data-visualization",
"author": "Tim Adams",
"author_email": "tim.adams@scai.fraunhofer.de",
"download_url": "https://files.pythonhosted.org/packages/9f/38/c7db98a9e500a21bdcc7cad206b870bcd00a4d009b01dbce7d580e30c59c/syndat-0.10.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A library for evaluation & visualization of synthetic data.",
"version": "0.10.5",
"project_urls": {
"Documentation": "https://github.com/SCAI-BIO/syndat#readme",
"Homepage": "https://github.com/SCAI-BIO/syndat",
"Source": "https://github.com/SCAI-BIO/syndat",
"Tracker": "https://github.com/SCAI-BIO/syndat/issues"
},
"split_keywords": [
"synthetic-data",
" data-quality",
" data-visualization"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "5c114f63d4dc0796886677aa2eb0285f46f59feecf18eab3ca4f89a9c8933e0b",
"md5": "dac79abdfa656e1f356a69afa2ad32d4",
"sha256": "8e71d612efcf600e51fbc4602d791a4e28af8c5daed0772b8edce010993150f8"
},
"downloads": -1,
"filename": "syndat-0.10.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "dac79abdfa656e1f356a69afa2ad32d4",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 12154,
"upload_time": "2025-02-06T21:37:01",
"upload_time_iso_8601": "2025-02-06T21:37:01.068767Z",
"url": "https://files.pythonhosted.org/packages/5c/11/4f63d4dc0796886677aa2eb0285f46f59feecf18eab3ca4f89a9c8933e0b/syndat-0.10.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "9f38c7db98a9e500a21bdcc7cad206b870bcd00a4d009b01dbce7d580e30c59c",
"md5": "e01381b8817c7ca3ccf4adb2e3aac6c5",
"sha256": "1cd02f6531f086b72aba3b7a2976efb3daf61c4c2c49a0f3ab40085cbb796a59"
},
"downloads": -1,
"filename": "syndat-0.10.5.tar.gz",
"has_sig": false,
"md5_digest": "e01381b8817c7ca3ccf4adb2e3aac6c5",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 14424,
"upload_time": "2025-02-06T21:37:02",
"upload_time_iso_8601": "2025-02-06T21:37:02.906819Z",
"url": "https://files.pythonhosted.org/packages/9f/38/c7db98a9e500a21bdcc7cad206b870bcd00a4d009b01dbce7d580e30c59c/syndat-0.10.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-06 21:37:02",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "SCAI-BIO",
"github_project": "syndat",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "pandas",
"specs": [
[
"~=",
"2.1.4"
]
]
},
{
"name": "numpy",
"specs": [
[
"~=",
"1.26.2"
]
]
},
{
"name": "scipy",
"specs": [
[
"~=",
"1.11.4"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
"~=",
"1.5.1"
]
]
},
{
"name": "matplotlib",
"specs": [
[
"~=",
"3.8.2"
]
]
},
{
"name": "seaborn",
"specs": [
[
"~=",
"0.13.0"
]
]
},
{
"name": "setuptools",
"specs": [
[
"==",
"70.0.0"
]
]
},
{
"name": "shap",
"specs": [
[
"~=",
"0.42.0"
]
]
}
],
"lcname": "syndat"
}