syntherela


- Name: syntherela
- Version: 0.0.3
- Summary: SyntheRela - Synthetic Relational Data Generation Benchmark
- Upload time: 2024-11-21 06:20:37
- Requires Python: >=3.8
- Keywords: python, syntherela, synthetic data, relational data, evaluation, benchmark
- Requirements: none recorded

# SyntheRela - Synthetic Relational Data Generation Benchmark

<h2 align="center">
    <img src="https://raw.githubusercontent.com/martinjurkovic/syntherela/refs/heads/main/docs/SyntheRela.png" height="150px">
    <div align="center">
      <a href="https://pypi.org/project/syntherela/">
        <img src="https://img.shields.io/pypi/v/syntherela" alt="PyPI">
      </a>
      <a href="https://github.com/martinjurkovic/syntherela/blob/main/LICENSE">
        <img alt="MIT License" src="https://img.shields.io/badge/License-MIT-yellow.svg">
      </a>
      <a href="https://arxiv.org/abs/2410.03411">
        <img alt="Paper URL" src="https://img.shields.io/badge/cs.DB-2410.03411-B31B1B.svg">
      </a>
      <a href="https://pypi.org/pypi/syntherela/">
        <img src="https://img.shields.io/pypi/pyversions/syntherela" alt="PyPI pyversions">
      </a>
  </div>
</h2>

Our paper **Benchmarking the Fidelity and Utility of Synthetic Relational Data** is available on [arXiv](https://arxiv.org/abs/2410.03411).

## Installation
To install only the benchmark package, run the following command:

```bash
pip install syntherela
```
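
To confirm that the install succeeded, the installed version can be read back with the standard library. This is a minimal sketch: the helper name `installed_version` is ours, and it relies only on `importlib.metadata`, not on any SyntheRela API.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

Calling `installed_version("syntherela")` should return a version string such as the one shown on the package page, or `None` if the package is missing from the active environment.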

## Replicating the paper's results
Reproducing the experiments has two parts: generating the synthetic data and evaluating the generated data. The following sections describe how to reproduce each part.
> To reproduce some of the figures, the synthetic data needs to be downloaded first. The tables can be reproduced either with the results provided in the repository or by re-running the benchmark.

First, create a `.env` file in the project root that contains the path to the project root: copy `.env.example` to `.env` and update the path in it.
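
The copy step can also be scripted. The sketch below is only an illustration: the helper name `ensure_env_file` is hypothetical, it copies `.env.example` to `.env` without overwriting an existing file, and it does not edit the path for you because the variable name used inside `.env.example` is not stated here.

```python
import shutil
from pathlib import Path


def ensure_env_file(project_root: Path) -> Path:
    """Copy .env.example to .env unless .env already exists; return the .env path."""
    example = project_root / ".env.example"
    target = project_root / ".env"
    if target.exists():
        return target  # never overwrite an existing configuration
    if not example.is_file():
        raise FileNotFoundError(f"{example} not found; is this the project root?")
    shutil.copyfile(example, target)
    return target
```

Run `ensure_env_file(Path.cwd())` from the project root, then open `.env` and set the path by hand.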

### Download synthetic data and results

The data and results can be downloaded and extracted with the script below. They are also available on [Google Drive](https://drive.google.com/drive/folders/1L9KarR20JqzU0p8b3G_KU--h2b8sz6ky).

```bash
conda activate reproduce_benchmark
./experiments/reproducibility/download_data_and_results.sh
```

### Evaluation of synthetic data
To run the benchmark and get the results of the metrics, run:

```bash
conda activate reproduce_benchmark
./experiments/reproducibility/evaluate_relational.sh

./experiments/reproducibility/evaluate_tabular.sh

./experiments/reproducibility/evaluate_utility.sh
```
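
If you prefer to drive the three evaluation scripts from Python and stop at the first failure, a wrapper along these lines may help. It is a sketch under assumptions: the `reproduce_benchmark` environment is already active in the calling shell, the working directory is the project root, and the names `EVALUATION_SCRIPTS` and `run_in_order` are ours.

```python
import subprocess
import sys
from typing import List, Sequence

# Script paths taken from the commands above.
EVALUATION_SCRIPTS = [
    ["./experiments/reproducibility/evaluate_relational.sh"],
    ["./experiments/reproducibility/evaluate_tabular.sh"],
    ["./experiments/reproducibility/evaluate_utility.sh"],
]


def run_in_order(commands: Sequence[Sequence[str]]) -> List[str]:
    """Run each command in turn and return the ones that completed.

    subprocess.run(check=True) raises CalledProcessError on a non-zero
    exit status, so later commands are skipped after a failure.
    """
    completed = []
    for command in commands:
        print(f"running: {' '.join(command)}", file=sys.stderr)
        subprocess.run(list(command), check=True)
        completed.append(" ".join(command))
    return completed
```

Calling `run_in_order(EVALUATION_SCRIPTS)` runs the scripts in the same order as the shell commands above.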

### Generation of synthetic data
Each synthetic data generation method may need its own Python environment. Instructions for installing the required environment for each method are provided in [docs/INSTALLATION.md](/docs/INSTALLATION.md).

After installing the required environment, the synthetic data can be generated by running the following commands:

```bash
conda activate reproduce_benchmark
./experiments/reproducibility/generation/generate_sdv.sh

conda activate rctgan
./experiments/reproducibility/generation/generate_rctgan.sh

conda activate realtabformer
./experiments/reproducibility/generation/generate_realtabformer.sh

conda activate tabular
./experiments/reproducibility/generation/generate_tabular.sh

conda activate gretel
python experiments/generation/gretel/generate_gretel.py --connection-uid <connection-uid> --model lstm
python experiments/generation/gretel/generate_gretel.py --connection-uid <connection-uid> --model actgan

cd experiments/generation/clavaddpm
./generate_clavaddpm.sh <dataset-name> <real-data-path> <synthetic-data-path>
```

Instructions for generating data with MOSTLYAI are provided in [experiments/generation/mostlyai/README.md](experiments/generation/mostlyai/README.md). <br>
Further instructions for GRETELAI are provided in [experiments/generation/gretel/README.md](experiments/generation/gretel/README.md).

### Visualizing Results
To visualize the results after running the benchmark, run the script below. The figures are saved to `results/figures/`:
```bash
conda activate reproduce_benchmark
./experiments/reproducibility/generate_figures.sh
```
### Reproducing Tables
To reproduce the tables, run the script below. The tables are saved as `.tex` files in `results/tables/`:
```bash
conda activate reproduce_benchmark
./experiments/reproducibility/generate_tables.sh
```

## Adding a new metric
The documentation for adding a new metric can be found in [docs/ADDING_A_METRIC.md](/docs/ADDING_A_METRIC.md).

## Synthetic Data Methods
### Open Source Methods
- SDV: [The Synthetic Data Vault](https://ieeexplore.ieee.org/document/7796926)
- RCTGAN: [Row Conditional-TGAN for Generating Synthetic Relational Databases](https://ieeexplore.ieee.org/abstract/document/10096001)
- REaLTabFormer: [Generating Realistic Relational and Tabular Data using Transformers](https://arxiv.org/abs/2302.02041)
- ClavaDDPM: [Multi-relational Data Synthesis with Cluster-guided Diffusion Models](https://arxiv.org/html/2405.17724v1)
- IRG: [Generating Synthetic Relational Databases using GANs](https://arxiv.org/abs/2312.15187)
- [Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders](https://arxiv.org/abs/2211.16889)*
- [Generative Modeling of Complex Data](https://arxiv.org/abs/2202.02145)*
- BayesM2M & NeuralM2M: [Synthetic Data Generation of Many-to-Many Datasets via Random Graph Generation](https://iclr.cc/virtual/2023/poster/10982)*


\* Denotes a method without a publicly available implementation.

### Commercial Providers
A list of commercial synthetic relational data providers is available in [docs/SYNTHETIC_DATA_TOOLS.md](/docs/SYNTHETIC_DATA_TOOLS.md).

## Conflicts of Interest
The authors declare no conflict of interest and are not associated with any of the evaluated commercial synthetic data providers.

## Citation
If you use SyntheRela in your work, please cite our paper:
```
@misc{hudovernik2024benchmarkingsyntheticrelationaldata,
      title={Benchmarking the Fidelity and Utility of Synthetic Relational Data}, 
      author={Valter Hudovernik and Martin Jurkovič and Erik Štrumbelj},
      year={2024},
      eprint={2410.03411},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2410.03411}, 
}
```

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "syntherela",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "python, syntherela, synthetic data, relational data, evaluation, benchmark",
    "author": null,
    "author_email": "Martin Jurkovic <martin.jurkovic19@gmail.com>, Valter Hudovernik <valter.hudovernik@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/27/e0/0328b2c0e74e0ddd400e2c9ad8353234a1860d2539ae0cc461917bf4e07d/syntherela-0.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "SyntheRela - Synthetic Relational Data Generation Benchmark",
    "version": "0.0.3",
    "project_urls": {
        "Homepage": "https://github.com/martinjurkovic/syntherela",
        "Issues": "https://github.com/martinjurkovic/syntherela/issues"
    },
    "split_keywords": [
        "python",
        " syntherela",
        " synthetic data",
        " relational data",
        " evaluation",
        " benchmark"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "70178f42d95129507f0dfabb4fed25a83ec421129079a564b3a6aba1e8a6a738",
                "md5": "e0a4f24cb1e5097f8cfd374b50ec6429",
                "sha256": "881b26c234a81184ea3af6a99716dd9b9c6258f1006fa7e7faa2357128d5b920"
            },
            "downloads": -1,
            "filename": "syntherela-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e0a4f24cb1e5097f8cfd374b50ec6429",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 370068,
            "upload_time": "2024-11-21T06:20:35",
            "upload_time_iso_8601": "2024-11-21T06:20:35.485431Z",
            "url": "https://files.pythonhosted.org/packages/70/17/8f42d95129507f0dfabb4fed25a83ec421129079a564b3a6aba1e8a6a738/syntherela-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "27e00328b2c0e74e0ddd400e2c9ad8353234a1860d2539ae0cc461917bf4e07d",
                "md5": "73e12daca7999fb878c0c2fcf10f37fc",
                "sha256": "ac084c63641676c3aa13a5508498dbd2ded0dd117613979d16d71f4e59d7979d"
            },
            "downloads": -1,
            "filename": "syntherela-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "73e12daca7999fb878c0c2fcf10f37fc",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 272780,
            "upload_time": "2024-11-21T06:20:37",
            "upload_time_iso_8601": "2024-11-21T06:20:37.882942Z",
            "url": "https://files.pythonhosted.org/packages/27/e0/0328b2c0e74e0ddd400e2c9ad8353234a1860d2539ae0cc461917bf4e07d/syntherela-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-21 06:20:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "martinjurkovic",
    "github_project": "syntherela",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "syntherela"
}
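
The `sha256` values under `digests` above can be used to check a downloaded archive. The following is a minimal sketch using only `hashlib`; the helper names `sha256_of` and `matches_digest` are ours and are not part of any packaging tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA-256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_digest(Path("syntherela-0.0.3.tar.gz"), "<sha256 from the sdist entry>")` should return `True` for an intact download of the source archive.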
        