<div align="center">
# synloc: An Algorithm to Create Synthetic Tabular Data
<img src="https://raw.githubusercontent.com/alfurka/synloc/main/assets/logo_white_bc.png" alt = 'synloc'>
</div>
## Overview
`synloc` is an algorithm that sequentially and locally estimates distributions to create synthetic versions of tabular data. The methodology can be combined with both parametric and nonparametric distributions.
## Installation
`synloc` can be installed through [PyPI](https://pypi.org/):
```
pip install synloc
```
## A Quick Example
Assume that we have a sample with three variables with the following distributions:
$$x \sim \mathrm{Beta}(0.1,\,0.1)$$
$$y \sim \mathrm{Beta}(0.1,\,0.5)$$
$$z \sim 10 y + \mathrm{Normal}(0,\,1)$$
A sample from these distributions can be generated with the `tools` module in `synloc`:
```python
from synloc.tools import sample_trivariate_xyz
data = sample_trivariate_xyz() # Generates a sample with size 1000 by default.
```
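For readers who want to see what such a sample looks like without installing `synloc`, the three distributions stated above can be simulated with the Python standard library alone. This is a minimal sketch under that assumption; the function name `sample_xyz_sketch` and its `seed` parameter are hypothetical and are not part of `synloc`, and the sketch returns a list of dictionaries rather than a DataFrame.

```python
import random

def sample_xyz_sketch(size=1000, seed=0):
    """Draw `size` rows of (x, y, z) from the distributions stated above.

    Hypothetical helper for illustration only; not synloc's implementation.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        x = rng.betavariate(0.1, 0.1)   # x ~ Beta(0.1, 0.1)
        y = rng.betavariate(0.1, 0.5)   # y ~ Beta(0.1, 0.5)
        z = 10 * y + rng.gauss(0, 1)    # z ~ 10*y + Normal(0, 1)
        rows.append({"x": x, "y": y, "z": z})
    return rows

sample = sample_xyz_sketch()
```

Because both Beta distributions have shape parameters below 1, most `x` and `y` values pile up near 0 and 1, which makes the sample a demanding test case for a synthesizer.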
Initializing the resampler:
```python
from synloc import LocalCov
resampler = LocalCov(data = data, K = 30)
```
The **subsample** size is set to `K=30`. We now estimate a multivariate normal distribution locally on each subsample and draw "synthetic values" from each estimated distribution.
```python
syn_data = resampler.fit()
```
    100%|██████████| 1000/1000 [00:01<00:00, 687.53it/s]
`syn_data` is a [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) in which all variables are synthesized. The synthetic sample can be compared with the original sample in a 3-D scatter plot:
```python
resampler.comparePlots(['x','y','z'])
```
![](https://raw.githubusercontent.com/alfurka/synloc/v.0.0.2/assets/README_7_0.png)
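The local estimation step used in this example can be illustrated with a short NumPy sketch. It assumes that each subsample consists of the `K` nearest neighbors of an observation under Euclidean distance, and that one synthetic row is drawn per observation; the README does not spell out these details, so treat this as an illustration of the idea rather than the code behind `LocalCov`. The function name `local_gaussian_resample` is hypothetical.

```python
import numpy as np

def local_gaussian_resample(data, k, seed=0):
    """For each row, fit a multivariate normal to its k nearest neighbors
    and draw one synthetic row from it.

    Hypothetical sketch; distance metric and sampling scheme are assumptions.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    synthetic = np.empty_like(data)
    for i, row in enumerate(data):
        # Euclidean distance from this row to every row (itself included).
        dist = np.linalg.norm(data - row, axis=1)
        neighbors = data[np.argsort(dist)[:k]]
        # Local mean vector and covariance matrix of the subsample.
        mean = neighbors.mean(axis=0)
        cov = np.cov(neighbors, rowvar=False)
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return synthetic

# Toy data following the distributions from the quick example (200 rows).
rng = np.random.default_rng(42)
y = rng.beta(0.1, 0.5, size=200)
toy = np.column_stack([rng.beta(0.1, 0.1, size=200), y, 10 * y + rng.normal(size=200)])
syn = local_gaussian_resample(toy, k=30)
```

A smaller `k` makes each local fit follow the data more closely, while a larger `k` smooths the synthetic sample toward a single global normal distribution.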
## How to cite?
If you use `synloc` in your research, please cite the following paper:
```bibtex
@article{kalay2022generating,
title={Generating Synthetic Data with The Nearest Neighbors Algorithm},
author={Kalay, Ali Furkan},
journal={arXiv preprint arXiv:2210.00884},
year={2022}
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/alfurka/synloc",
"name": "synloc",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "copulas,distributions,sampling,synthetic-data,oversampling,nonparametric-distributions,semiparametric,nonparametric,knn,clustering,k-means,multivariate-distributions",
"author": "Ali Furkan Kalay",
"author_email": "alfurka@gmail.com",
"download_url": "",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "A Python package to create synthetic data from a locally estimated distributions.",
"version": "0.1.2",
"split_keywords": [
"copulas",
"distributions",
"sampling",
"synthetic-data",
"oversampling",
"nonparametric-distributions",
"semiparametric",
"nonparametric",
"knn",
"clustering",
"k-means",
"multivariate-distributions"
],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "bce87de03018aafd393d5f8f9784a97e",
"sha256": "4a9883f1b6197ab0a9fc339674c68155cdbf348b814e8e4db628bfcec8a91466"
},
"downloads": -1,
"filename": "synloc-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "bce87de03018aafd393d5f8f9784a97e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 10159,
"upload_time": "2022-12-29T12:46:50",
"upload_time_iso_8601": "2022-12-29T12:46:50.355076Z",
"url": "https://files.pythonhosted.org/packages/95/9f/d0665adfc49c0e4b4bed32b41eb65ffe5e99862658d24f49ba9351f09cc9/synloc-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-12-29 12:46:50",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "alfurka",
"github_project": "synloc",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "synloc"
}