# Overview
This library contains tools for evaluating the fidelity and privacy of synthetic data.
## Usage
Import the desired modules from the library:
```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tonic_reporting import univariate, multivariate, privacy
```
**Preface**
*Numeric* columns refer to columns *encoded* as numeric. Numerical data types in the schema underlying a model may be encoded as other types.
*Categorical* columns refer to columns *encoded* as categorical.
*source_df* is a pandas DataFrame of original data from the source database.
*synth_df* is a pandas DataFrame of data sampled from trained models.
The source and synthetic DataFrames should have the same row count and the same schema.
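The requirement above can be checked before calling any reporting function. The following is a minimal sketch using only pandas; `check_comparable` and the toy columns (`age`, `income`, `plan`) are hypothetical names for illustration and are not part of `tonic_reporting`.

```python
import pandas as pd


def check_comparable(source_df, synth_df):
    """Return a list of problems; an empty list means the frames line up."""
    problems = []
    if len(source_df) != len(synth_df):
        problems.append(f"row count differs: {len(source_df)} vs {len(synth_df)}")
    # Columns present in one frame but not the other
    unshared = set(source_df.columns) ^ set(synth_df.columns)
    if unshared:
        problems.append(f"columns not shared: {sorted(unshared)}")
    # For shared columns, the dtypes should agree as well
    for col in source_df.columns.intersection(synth_df.columns):
        if source_df[col].dtype != synth_df[col].dtype:
            problems.append(
                f"dtype differs for {col!r}: {source_df[col].dtype} vs {synth_df[col].dtype}"
            )
    return problems


# Toy data standing in for real source and synthetic samples (illustrative only)
source_df = pd.DataFrame(
    {"age": [34, 51, 27, 45], "income": [52.0, 88.5, 41.2, 67.9], "plan": ["basic", "pro", "basic", "free"]}
)
synth_df = pd.DataFrame(
    {"age": [36, 48, 29, 44], "income": [50.1, 90.3, 39.8, 70.2], "plan": ["basic", "pro", "free", "basic"]}
)
numeric_cols = ["age", "income"]
categorical_cols = ["plan"]

problems = check_comparable(source_df, synth_df)
print(problems)
```

The `numeric_cols` and `categorical_cols` lists defined here are the kind of arguments the calls in the following sections expect.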
**Numeric Column Statistics**
`univariate.summarize_numeric(source_df, synth_df, numeric_cols)`
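To give a sense of what a side-by-side numeric summary looks like, here is a rough pandas-only sketch. It is not the implementation of `summarize_numeric`, whose exact output is not documented here; `compare_numeric` and the choice of statistics are assumptions made for illustration.

```python
import pandas as pd


def compare_numeric(source_df, synth_df, numeric_cols):
    """Build a table with one row per column and source/synth statistics side by side."""
    stats = ["mean", "std", "min", "max"]
    # agg() returns statistics as rows, so transpose to get one row per column
    source_stats = source_df[numeric_cols].agg(stats).T.add_prefix("source_")
    synth_stats = synth_df[numeric_cols].agg(stats).T.add_prefix("synth_")
    return pd.concat([source_stats, synth_stats], axis=1)


# Toy data (illustrative only)
source_df = pd.DataFrame({"age": [30, 40, 50]})
synth_df = pd.DataFrame({"age": [32, 41, 47]})

summary = compare_numeric(source_df, synth_df, ["age"])
print(summary)
```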
**Categorical Column Statistics**
`univariate.summarize_categorical(source_df, synth_df, categorical_cols)`
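For categorical columns, fidelity is often judged by how closely category frequencies match. The sketch below shows one way to compute that with pandas alone; it is an assumption about what such a comparison involves, not the library's own code, and `compare_frequencies` is a hypothetical helper.

```python
import pandas as pd


def compare_frequencies(source_df, synth_df, col):
    """Relative frequency of each category in both frames, plus the absolute gap."""
    source_freq = source_df[col].value_counts(normalize=True).rename("source")
    synth_freq = synth_df[col].value_counts(normalize=True).rename("synth")
    # Outer join on category; a category missing from one frame gets frequency 0
    table = pd.concat([source_freq, synth_freq], axis=1).fillna(0.0)
    table["abs_diff"] = (table["source"] - table["synth"]).abs()
    return table.sort_index()


# Toy data (illustrative only)
source_df = pd.DataFrame({"plan": ["basic", "basic", "pro", "free"]})
synth_df = pd.DataFrame({"plan": ["basic", "pro", "pro", "pro"]})

freq_table = compare_frequencies(source_df, synth_df, "plan")
print(freq_table)
```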
**Numeric Column Comparative Histograms**
```
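# One subplot per numeric column; squeeze=False keeps axarr an array even for a single column
fig, axarr = plt.subplots(1, len(numeric_cols), figsize=(9, 12), squeeze=False)
axarr = axarr.ravel()
for col, ax in zip(numeric_cols, axarr):
    univariate.plot_histogram(source_df, synth_df, col, ax)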
```
**Categorical Column Comparative Frequency Tables**
```
for col in categorical_cols:
    # Create a fresh axes for each column's frequency table
    fig, ax = plt.subplots()
    univariate.plot_frequency_table(source_df, synth_df, col, ax)
```
**Numeric Column Aggregates Over Time**
If the data represents a time series, we can visualize the means and confidence intervals of numeric features over time:
```
for col in numeric_cols:
    fig, ax = plt.subplots(figsize=(10, 8))
    univariate.plot_events_means(source_df, synth_df, col, order_col, ax=ax)
```
and
```
for col in numeric_cols:
    fig, ax = plt.subplots(figsize=(12, 10))
    univariate.plot_events_confidence_intervals(source_df, synth_df, col, order_col, ax=ax)
```
where `order_col` denotes the time/order column.
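As a rough illustration of what a mean-over-time comparison involves, the sketch below groups each frame by the order column and averages a numeric feature per step. This is a pandas-only approximation under the assumption that `order_col` holds a step or timestamp value shared by many rows; `mean_over_time` and the column names `t` and `value` are hypothetical.

```python
import pandas as pd


def mean_over_time(source_df, synth_df, col, order_col):
    """Mean of `col` at each value of `order_col`, for source and synthetic data."""
    source_mean = source_df.groupby(order_col)[col].mean().rename("source")
    synth_mean = synth_df.groupby(order_col)[col].mean().rename("synth")
    return pd.concat([source_mean, synth_mean], axis=1)


# Toy event data with two time steps (illustrative only)
source_df = pd.DataFrame({"t": [1, 1, 2, 2], "value": [10.0, 20.0, 30.0, 50.0]})
synth_df = pd.DataFrame({"t": [1, 1, 2, 2], "value": [12.0, 18.0, 35.0, 45.0]})

means = mean_over_time(source_df, synth_df, "value", "t")
print(means)
```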
**Numeric Column Multivariate Correlations Table**
`multivariate.summarize_correlations(source_df, synth_df, numeric_cols)`
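One simple way to think about correlation fidelity is to compare the pairwise correlation matrices of the two frames. The sketch below does this with `DataFrame.corr` (Pearson by default); it is an illustration of the idea rather than the method `summarize_correlations` uses, and `correlation_gap` is a hypothetical helper.

```python
import pandas as pd


def correlation_gap(source_df, synth_df, numeric_cols):
    """Absolute difference between source and synthetic Pearson correlation matrices."""
    source_corr = source_df[numeric_cols].corr()
    synth_corr = synth_df[numeric_cols].corr()
    return (source_corr - synth_corr).abs()


# Toy data: x and y move together in the source but in opposite directions in the synthetic data
source_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})
synth_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [8.0, 6.0, 4.0, 2.0]})

gap = correlation_gap(source_df, synth_df, ["x", "y"])
print(gap)
```

Values near 0 mean the synthetic data preserves that pairwise relationship; the toy data is built so that the `x`/`y` entry reaches the maximum possible gap.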
**Numeric Column Multivariate Correlations Heat Map**
```
fig, axarr = plt.subplots(1, 2, figsize=(13, 8))
multivariate.plot_correlations(source_df, synth_df, numeric_cols, axarr=axarr)
fig.tight_layout()
```
**Distance to Closest Record Comparison**
```
syn_dcr, real_dcr = privacy.compute_dcr(source_df, synth_df, numeric_cols, categorical_cols)
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.hist(real_dcr, bins=300, label='Real vs. real', color='mediumpurple')
ax.hist(syn_dcr, bins=300, label='Synthetic vs. real', color='mediumturquoise')
ax.tick_params(axis='both', which='major', labelsize=14)
ax.set_title('Distances to closest record', fontsize=22)
ax.legend(fontsize=16)
```
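The idea behind the two distributions can be sketched with NumPy. The version below handles numeric features only and uses plain Euclidean distance, which is an assumption; how `compute_dcr` scales features or treats categorical columns is not described here. `closest_distances` is a hypothetical helper.

```python
import numpy as np


def closest_distances(queries, reference, exclude_self=False):
    """Euclidean distance from each query row to its nearest reference row."""
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    if exclude_self:
        # When comparing a set with itself, ignore each row's zero distance to itself
        np.fill_diagonal(dists, np.inf)
    return dists.min(axis=1)


# Toy records (illustrative only)
real = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])
synth = np.array([[0.0, 1.0], [3.0, 4.0]])

syn_dcr = closest_distances(synth, real)                      # synthetic vs. real
real_dcr = closest_distances(real, real, exclude_self=True)   # real vs. real
print(syn_dcr, real_dcr)
```

A synthetic distance of exactly 0, as for the second toy synthetic row, means a real record was reproduced verbatim. A common reading of the histograms is that synthetic distances concentrated well below the real-vs-real distances suggest the model is copying source records too closely.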
Raw data
{
"_id": null,
"home_page": "https://www.tonic.ai/",
"name": "tonic-reporting",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "tonic.ai,tonic",
"author": "Eric Timmerman",
"author_email": "eric@tonic.ai",
"download_url": "https://files.pythonhosted.org/packages/5f/91/e9a25877b3cb80289b2a7e7baf9326e67802478ddfef95b8c70accfc8ec8/tonic-reporting-1.4.4.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Tools for evaluating fidelity and privacy of synthetic data",
"version": "1.4.4",
"split_keywords": [
"tonic.ai",
"tonic"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6519118343c15c271d8003737af94b0d3a619680396fb5fdf085bf649d1ae26f",
"md5": "cd073395cfb26fc348fe39b20b75fd0a",
"sha256": "936cddf55bfdd7d64ba149443f1822d2cda6952f1a3db254d2bf85e752523556"
},
"downloads": -1,
"filename": "tonic_reporting-1.4.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "cd073395cfb26fc348fe39b20b75fd0a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 11381,
"upload_time": "2023-02-02T23:10:53",
"upload_time_iso_8601": "2023-02-02T23:10:53.994691Z",
"url": "https://files.pythonhosted.org/packages/65/19/118343c15c271d8003737af94b0d3a619680396fb5fdf085bf649d1ae26f/tonic_reporting-1.4.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "5f91e9a25877b3cb80289b2a7e7baf9326e67802478ddfef95b8c70accfc8ec8",
"md5": "bac30ded2d320b343f7366373805cd20",
"sha256": "8337b7cd61aeaf40152a2d0994a620d68240beca66e3d82cc8c71b265c928286"
},
"downloads": -1,
"filename": "tonic-reporting-1.4.4.tar.gz",
"has_sig": false,
"md5_digest": "bac30ded2d320b343f7366373805cd20",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 10006,
"upload_time": "2023-02-02T23:10:51",
"upload_time_iso_8601": "2023-02-02T23:10:51.208380Z",
"url": "https://files.pythonhosted.org/packages/5f/91/e9a25877b3cb80289b2a7e7baf9326e67802478ddfef95b8c70accfc8ec8/tonic-reporting-1.4.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-02-02 23:10:51",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "tonic-reporting"
}