# SyntheRela - Synthetic Relational Data Generation Benchmark
<h2 align="center">
<img src="https://raw.githubusercontent.com/martinjurkovic/syntherela/refs/heads/main/docs/SyntheRela.png" height="150px">
<div align="center">
<a href="https://pypi.org/project/syntherela/">
<img src="https://img.shields.io/pypi/v/syntherela" alt="PyPI">
</a>
<a href="https://github.com/martinjurkovic/syntherela/blob/main/LICENSE">
<img alt="MIT License" src="https://img.shields.io/badge/License-MIT-yellow.svg">
</a>
<a href="https://arxiv.org/abs/2410.03411">
<img alt="Paper URL" src="https://img.shields.io/badge/cs.DB-2410.03411-B31B1B.svg">
</a>
<a href="https://pypi.org/pypi/syntherela/">
<img src="https://img.shields.io/pypi/pyversions/syntherela" alt="PyPI pyversions">
</a>
</div>
</h2>
Our paper **Benchmarking the Fidelity and Utility of Synthetic Relational Data** is available on [arXiv](https://arxiv.org/abs/2410.03411).
## Installation
To install only the benchmark package, run the following command:
```bash
pip install syntherela
```
## Replicating the paper's results
Reproducing the experiments has two parts: generating the synthetic data and evaluating the generated data. The following sections describe how to reproduce each part.
> To reproduce some of the figures, the synthetic data needs to be downloaded first. The tables can be reproduced either from the results provided in the repository or by re-running the benchmark.

First, create a `.env` file in the project root that contains the path to the project root: copy `.env.example`, rename the copy to `.env`, and update the path.
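The copy step can be sketched as a small shell function. This is only an illustration, assuming it is run from the repository root; the function name `setup_env` is hypothetical and not part of the repository, and the path inside `.env` still has to be edited by hand afterwards.

```bash
# Hypothetical helper (not part of the repository): create .env from the template.
# Assumes the current directory is the repository root.
setup_env() {
    if [ ! -f .env.example ]; then
        echo "no .env.example found in $(pwd)" >&2
        return 1
    fi
    if [ -f .env ]; then
        # Never overwrite an existing configuration.
        echo ".env already exists, leaving it untouched"
        return 0
    fi
    cp .env.example .env
    echo "created .env; now update the path inside it"
}
# Usage: setup_env
```

Refusing to overwrite an existing `.env` keeps a previously configured path intact if the function is run twice.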
### Download synthetic data and results
The data and results can be downloaded and extracted with the script below, or downloaded manually from [Google Drive](https://drive.google.com/drive/folders/1L9KarR20JqzU0p8b3G_KU--h2b8sz6ky).
```bash
conda activate reproduce_benchmark
./experiments/reproducibility/download_data_and_results.sh
```
### Evaluation of synthetic data
To run the benchmark and get the results of the metrics, run:
```bash
conda activate reproduce_benchmark
./experiments/reproducibility/evaluate_relational.sh
./experiments/reproducibility/evaluate_tabular.sh
./experiments/reproducibility/evaluate_utility.sh
```
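The three evaluation scripts run one after another, so a failure in an early script can go unnoticed in a long run. A minimal sketch of a wrapper that stops at the first failing script is shown below; the function name `run_all` is hypothetical and not part of the repository.

```bash
# Hypothetical helper (not part of the repository): run each script in order
# and stop as soon as one of them exits with a non-zero status.
run_all() {
    for script in "$@"; do
        echo "running $script"
        if ! "$script"; then
            echo "failed: $script" >&2
            return 1
        fi
    done
}
# Usage (from the repository root, in the reproduce_benchmark environment):
# run_all ./experiments/reproducibility/evaluate_relational.sh \
#         ./experiments/reproducibility/evaluate_tabular.sh \
#         ./experiments/reproducibility/evaluate_utility.sh
```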
### Generation of synthetic data
Each synthetic data generation method may require its own Python environment. Instructions for installing the required environment for each method are provided in [docs/INSTALLATION.md](/docs/INSTALLATION.md).
After installing the required environment, the synthetic data can be generated by running the following commands:
```bash
conda activate reproduce_benchmark
./experiments/reproducibility/generation/generate_sdv.sh
conda activate rctgan
./experiments/reproducibility/generation/generate_rctgan.sh
conda activate realtabformer
./experiments/reproducibility/generation/generate_realtabformer.sh
conda activate tabular
./experiments/reproducibility/generation/generate_tabular.sh
conda activate gretel
python experiments/generation/gretel/generate_gretel.py --connection-uid <connection-uid> --model lstm
python experiments/generation/gretel/generate_gretel.py --connection-uid <connection-uid> --model actgan
cd experiments/generation/clavaddpm
./generate_clavaddpm.sh <dataset-name> <real-data-path> <synthetic-data-path>
```
Instructions for generating data with MOSTLYAI are provided in [experiments/generation/mostlyai/README.md](experiments/generation/mostlyai/README.md). <br>
Further instructions for GRETELAI are provided in [experiments/generation/gretel/README.md](experiments/generation/gretel/README.md).
### Visualising Results
To visualize the results after running the benchmark, run the script below. The figures are saved to `results/figures/`:
```bash
conda activate reproduce_benchmark
./experiments/reproducibility/generate_figures.sh
```
### Reproducing Tables
To reproduce the tables, run the script below. The tables are saved as `.tex` files in `results/tables/`:
```bash
conda activate reproduce_benchmark
./experiments/reproducibility/generate_tables.sh
```
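To confirm that a run produced output, the generated files can be listed. The sketch below is an assumption-laden illustration: the function name `list_outputs` is hypothetical, and it simply searches a directory recursively for files with a given extension.

```bash
# Hypothetical helper (not part of the repository): list generated files
# with a given extension, sorted by path.
list_outputs() {
    dir=$1
    ext=$2
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    find "$dir" -type f -name "*.$ext" | sort
}
# Usage (from the repository root): list_outputs results/tables tex
```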
## Adding a new metric
The documentation for adding a new metric can be found in [docs/ADDING_A_METRIC.md](/docs/ADDING_A_METRIC.md).
## Synthetic Data Methods
### Open Source Methods
- SDV: [The Synthetic Data Vault](https://ieeexplore.ieee.org/document/7796926)
- RCTGAN: [Row Conditional-TGAN for Generating Synthetic Relational Databases](https://ieeexplore.ieee.org/abstract/document/10096001)
- REaLTabFormer: [Generating Realistic Relational and Tabular Data using Transformers](https://arxiv.org/abs/2302.02041)
- ClavaDDPM: [Multi-relational Data Synthesis with Cluster-guided Diffusion Models](https://arxiv.org/html/2405.17724v1)
- IRG: [Generating Synthetic Relational Databases using GANs](https://arxiv.org/abs/2312.15187)
- [Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders](https://arxiv.org/abs/2211.16889)*
- [Generative Modeling of Complex Data](https://arxiv.org/abs/2202.02145)*
- BayesM2M & NeuralM2M: [Synthetic Data Generation of Many-to-Many Datasets via Random Graph Generation](https://iclr.cc/virtual/2023/poster/10982)*

\* Denotes a method without a publicly available implementation.
### Commercial Providers
A list of commercial synthetic relational data providers is available in [docs/SYNTHETIC_DATA_TOOLS.md](/docs/SYNTHETIC_DATA_TOOLS.md).
## Conflicts of Interest
The authors declare no conflict of interest and are not associated with any of the evaluated commercial synthetic data providers.
## Citation
If you use SyntheRela in your work, please cite our paper:
```
@misc{hudovernik2024benchmarkingsyntheticrelationaldata,
title={Benchmarking the Fidelity and Utility of Synthetic Relational Data},
author={Valter Hudovernik and Martin Jurkovič and Erik Štrumbelj},
year={2024},
eprint={2410.03411},
archivePrefix={arXiv},
primaryClass={cs.DB},
url={https://arxiv.org/abs/2410.03411},
}
```