# copula-tabular
Generate tabular synthetic data using Gaussian copulas
<div align="center">
[Overview](#overview) | [Documentation](#documentation) | [Contributing](#contributing) | [Development notes](#development-notes) | [Copyright and license](#copyright-and-license)
<!--
[Overview](#overview) | [Documentation](#documentation) | [How to cite](#how-to-cite) | [Contributing](#contributing) | [Development notes](#development-notes) | [Copyright and license](#copyright-and-license) | [Acknowledgements](#acknowledgements) -->
</div>
## Overview
Advances in synthetic data generation have made it a viable option in fields such as finance, biomedical research, and data science. Synthetic data is generated artificially, yet aims to replicate the joint probability distribution of its real-world counterpart. Because it mimics the statistical behaviour of real data, it is useful for testing algorithms and systems and for training machine learning models, and it can serve as an economical substitute when real data is unavailable, too sensitive to release, or too costly to acquire. Copula-based methods have been shown to produce reliable and accurate synthetic tabular data.
In this package, we present a tool for generating multivariate synthetic data using a Gaussian copula. The model incorporates conditional joint distributions, which allow a single variable to be split into multiple component marginal distributions. These conditional enhancements make it easier to synthesise complex, non-linear sample distributions, so a wider range of datasets can be replicated.
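
To make the basic idea concrete, here is a minimal sketch of a standard (unconditional) two-variable Gaussian copula using only the Python standard library. It is not the package's API, and it does not cover the conditional extension described above. The function names (`normal_scores`, `fit_and_sample`, and so on) and the toy data are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def normal_scores(values):
    """Map each value to a standard-normal score via its rank (empirical CDF)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1)
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: the observed value at probability level u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def fit_and_sample(x, y, n_samples, seed=0):
    """Fit a bivariate Gaussian copula to (x, y) and draw synthetic pairs."""
    rng = random.Random(seed)
    # 1. Fit: correlation of the normal scores is the copula parameter.
    rho = pearson(normal_scores(x), normal_scores(y))
    residual_sd = math.sqrt(max(0.0, 1.0 - rho * rho))
    xs, ys = sorted(x), sorted(y)
    samples = []
    for _ in range(n_samples):
        # 2. Sample correlated standard normals.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + residual_sd * rng.gauss(0, 1)
        # 3. Map to uniforms, then back through each empirical marginal.
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        samples.append((empirical_quantile(xs, u1), empirical_quantile(ys, u2)))
    return rho, samples


# Toy "real" data: a skewed variable and a noisy copy of it.
data_rng = random.Random(42)
real_x = [data_rng.expovariate(1.0) for _ in range(500)]
real_y = [xi + data_rng.gauss(0, 0.5) for xi in real_x]

rho, synthetic = fit_and_sample(real_x, real_y, n_samples=500)
print(f"fitted copula correlation: {rho:.3f}")
```

Because the marginals are resampled from the empirical quantiles, every synthetic value in this sketch is one of the observed values; only the pairing between columns is newly generated. A production implementation would typically smooth or model the marginals instead.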
The tool is designed to work with a data dictionary (a file describing the metadata of the input dataset). The package also includes class-based implementations of data cleaning, visualisation, transformation, and privacy leakage evaluation, along with sample wrapper scripts for generating synthetic data from start to finish.
### Example Result:
![Figure showing correlation plots of a simulated multivariate dataset, containing non-trivial, non-linear and non-monotonic relationships. The left plot shows the original Pearson correlation between variables, while the middle and right plots show the correlation for synthetic data generated using standard copula and conditional copula respectively.](docs/assets/img/tabulaCopula_example_socialdata_correlation_matrix_three.svg)
*Figure showing correlation plots of a simulated multivariate dataset, containing non-trivial, non-linear and non-monotonic relationships. The left plot shows the original Pearson correlation between variables, while the middle and right plots show the correlation for synthetic data generated using standard copula and conditional copula respectively.*
![Figure showing superimposed scatterplots of the same simulated multivariate dataset, containing non-trivial, non-linear and non-monotonic relationships. The training, synthetic (standard copula), synthetic (conditional copula) data points are in blue, grey, and red respectively.](docs/assets/img/tabulaCopula_example_socialdata_scatterplot_lowsampling_six.svg)
*Figure showing superimposed scatterplots of the same simulated multivariate dataset, containing non-trivial, non-linear and non-monotonic relationships. The training, synthetic (standard copula), synthetic (conditional copula) data points are in blue, grey, and red respectively.*
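
The kind of comparison shown in the correlation plots can be approximated numerically. The sketch below, which uses only the standard library and made-up toy tables, computes pairwise Pearson correlations for a real and a synthetic table and reports the largest absolute gap between them. The helper names and column names are hypothetical and are not part of the package.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(table):
    """Pairwise Pearson correlations for a dict of column name -> values."""
    return {(p, q): pearson(table[p], table[q]) for p in table for q in table}


def max_abs_difference(corr_a, corr_b):
    """Largest absolute gap between two correlation matrices with the same keys."""
    return max(abs(corr_a[key] - corr_b[key]) for key in corr_a)


# Toy tables (made-up numbers, purely illustrative).
real_table = {
    "age": [23, 35, 41, 52, 60],
    "income": [21, 34, 45, 50, 62],
    "visits": [9, 7, 6, 4, 2],
}
synthetic_table = {
    "age": [25, 33, 44, 50, 61],
    "income": [24, 30, 47, 52, 58],
    "visits": [8, 8, 5, 4, 3],
}

real_corr = correlation_matrix(real_table)
synthetic_corr = correlation_matrix(synthetic_table)
gap = max_abs_difference(real_corr, synthetic_corr)
print(f"largest correlation gap: {gap:.3f}")
```

A small gap suggests the linear dependence structure is preserved, but Pearson correlation alone does not capture the non-linear and non-monotonic relationships mentioned in the captions, which is why the scatterplots are shown as well.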
## Documentation
For installation instructions, getting started guides and tutorials, background information, and API reference summaries, please see the
[website](https://biomeddar.github.io/copula-tabular/).
<!-- ## How to cite -->
## Contributing
Thank you for considering contributing to copula-tabular. Please follow this [link](https://biomeddar.github.io/copula-tabular/help/contri.html) for more details.
## Development notes
Please visit the [website](https://biomeddar.github.io/copula-tabular/help/developmentNotes.html) for more details.
## Copyright and license
Copyright 2023 BiomedDAR, BII, A*STAR. Licensed under [MIT](https://biomeddar.github.io/copula-tabular/help/copyright.html).
<!-- ## Acknowledgements -->
Raw data
{
"_id": null,
"home_page": "https://biomeddar.github.io/copula-tabular/",
"name": "bdarpack",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3",
"maintainer_email": null,
"keywords": "synthetic data copula",
"author": "MZ Tan",
"author_email": "tan_ming_zhen@bii.a-star.edu.sg",
"download_url": "https://files.pythonhosted.org/packages/54/88/78ab39466ead5992601f297b0ef5e32bada5295925ad62d0f842ade03624/bdarpack-0.1.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python package for tabular synthetic data",
"version": "0.1.5",
"project_urls": {
"Homepage": "https://biomeddar.github.io/copula-tabular/"
},
"split_keywords": [
"synthetic",
"data",
"copula"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "64d7a56dc2269af41d2db0c71e52af4dc1643a59862400f561906f7d05c8d0a0",
"md5": "8155198db45cfdde1c8d4e8244787a3c",
"sha256": "e78ed9fab18611d611d0125ff3c476c200f137fe830b5193308e60448622b34f"
},
"downloads": -1,
"filename": "bdarpack-0.1.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "8155198db45cfdde1c8d4e8244787a3c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3",
"size": 72375,
"upload_time": "2024-04-02T02:10:25",
"upload_time_iso_8601": "2024-04-02T02:10:25.188922Z",
"url": "https://files.pythonhosted.org/packages/64/d7/a56dc2269af41d2db0c71e52af4dc1643a59862400f561906f7d05c8d0a0/bdarpack-0.1.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "548878ab39466ead5992601f297b0ef5e32bada5295925ad62d0f842ade03624",
"md5": "3d596ca7bedd189c59ef52d00d39875a",
"sha256": "b6c4ae1aaaecb652a214fe5ecb488080b9680bf18b86e6fa5710cb803967abe4"
},
"downloads": -1,
"filename": "bdarpack-0.1.5.tar.gz",
"has_sig": false,
"md5_digest": "3d596ca7bedd189c59ef52d00d39875a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3",
"size": 68435,
"upload_time": "2024-04-02T02:10:26",
"upload_time_iso_8601": "2024-04-02T02:10:26.477377Z",
"url": "https://files.pythonhosted.org/packages/54/88/78ab39466ead5992601f297b0ef5e32bada5295925ad62d0f842ade03624/bdarpack-0.1.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-02 02:10:26",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "bdarpack"
}