synloc


| Field | Value |
| --- | --- |
| Name | synloc |
| Version | 0.1.2 |
| Home page | https://github.com/alfurka/synloc |
| Summary | A Python package to create synthetic data from locally estimated distributions. |
| Upload time | 2022-12-29 12:46:50 |
| Author | Ali Furkan Kalay |
| Requires Python | >=3.8 |
| Keywords | copulas, distributions, sampling, synthetic-data, oversampling, nonparametric-distributions, semiparametric, nonparametric, knn, clustering, k-means, multivariate-distributions |
| Requirements | No requirements were recorded. |

<div align="center">

# synloc: An Algorithm to Create Synthetic Tabular Data

<img src="https://raw.githubusercontent.com/alfurka/synloc/main/assets/logo_white_bc.png" alt = 'synloc'>

</div>

## Overview

`synloc` is an algorithm that sequentially and locally estimates distributions to create synthetic versions of tabular data. The proposed methodology can be combined with parametric and nonparametric distributions.

## Installation

`synloc` can be installed through [PyPI](https://pypi.org/):

```
pip install synloc
```

## A Quick Example

Assume that we have a sample with three variables with the following distributions:

$$x \sim Beta(0.1,\,0.1)$$
$$y \sim Beta(0.1,\, 0.5)$$
$$z \sim 10 y + Normal(0,\,1)$$

A sample from these distributions can be generated with the `tools` module in `synloc`:


```python
from synloc.tools import sample_trivariate_xyz
data = sample_trivariate_xyz() # Generates a sample with size 1000 by default. 
```
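
For readers who want to see what such a sample looks like without the helper, the three distributions above can be drawn directly with NumPy. This is only a sketch of the stated distributions, not the implementation of `sample_trivariate_xyz`; the function name `draw_xyz` and the seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def draw_xyz(n=1000, seed=0):
    """Draw n observations of (x, y, z) from the distributions stated above.

    Hypothetical helper for illustration; not part of synloc.
    """
    rng = np.random.default_rng(seed)
    x = rng.beta(0.1, 0.1, size=n)             # x ~ Beta(0.1, 0.1)
    y = rng.beta(0.1, 0.5, size=n)             # y ~ Beta(0.1, 0.5)
    z = 10 * y + rng.normal(0.0, 1.0, size=n)  # z ~ 10 y + Normal(0, 1)
    return pd.DataFrame({"x": x, "y": y, "z": z})


sample = draw_xyz()
print(sample.describe())
```

Because `z` is built from `y`, the two columns are strongly correlated, while `x` is independent of both.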

Initializing the resampler:


```python
from synloc import LocalCov
resampler = LocalCov(data = data, K = 30)
```

The **subsample** size is set to `K = 30`. We then locally estimate multivariate normal distributions and draw "synthetic values" from each estimated distribution.


```python
syn_data = resampler.fit() 
```

    100%|██████████| 1000/1000 [00:01<00:00, 687.53it/s]
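
The idea behind the step above can be sketched in plain NumPy. The sketch assumes that, for every observation, the `K` nearest neighbours (Euclidean distance) form the subsample, that a multivariate normal is fitted to that subsample, and that one synthetic row is drawn from it. This is an illustration of the general technique, not the code inside `LocalCov`; the function name `local_gaussian_resample` and the stand-in data are hypothetical.

```python
import numpy as np


def local_gaussian_resample(values, k=30, seed=0):
    """Fit a normal to each row's k nearest neighbours and draw one point from it.

    Hypothetical illustration; synloc's own implementation may differ.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n, d = values.shape
    synthetic = np.empty((n, d))
    for i in range(n):
        # Euclidean distance from row i to every row (row i itself included).
        dist = np.linalg.norm(values - values[i], axis=1)
        neighbours = values[np.argsort(dist)[:k]]
        # Local estimates of the mean vector and covariance matrix.
        mean = neighbours.mean(axis=0)
        cov = np.cov(neighbours, rowvar=False)
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return synthetic


# Stand-in data so the snippet runs on its own: 200 rows, 3 columns.
demo_rng = np.random.default_rng(1)
original = demo_rng.normal(size=(200, 3))
synthetic = local_gaussian_resample(original, k=30)
print(synthetic.shape)
```

A smaller `k` follows local structure more closely, while a larger `k` moves each local fit towards a single global normal distribution.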
    

`syn_data` is a [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) in which all variables are synthesized. The synthetic data can be compared with the original sample in a 3-D scatter plot:


```python
resampler.comparePlots(['x','y','z'])
```    
![](https://raw.githubusercontent.com/alfurka/synloc/v.0.0.2/assets/README_7_0.png)
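
A numeric check can complement the visual comparison. The sketch below uses only pandas and assumes both data sets are DataFrames with the same columns; `compare_moments` is a hypothetical helper, not a `synloc` function, and the two stand-in frames exist only so that the snippet runs on its own. In practice they would be replaced by `data` and `syn_data`.

```python
import numpy as np
import pandas as pd


def compare_moments(original, synthetic):
    """Return per-column means and standard deviations for both frames,
    plus the largest absolute gap between their correlation matrices.

    Hypothetical helper for illustration; not part of synloc.
    """
    summary = pd.DataFrame({
        "mean_original": original.mean(),
        "mean_synthetic": synthetic.mean(),
        "std_original": original.std(),
        "std_synthetic": synthetic.std(),
    })
    corr_gap = (original.corr() - synthetic.corr()).abs().to_numpy().max()
    return summary, corr_gap


# Stand-in frames: the "synthetic" one is the original plus small noise.
rng = np.random.default_rng(0)
original = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x", "y", "z"])
synthetic = original + rng.normal(scale=0.05, size=original.shape)

summary, corr_gap = compare_moments(original, synthetic)
print(summary.round(3))
print(f"largest absolute correlation gap: {corr_gap:.3f}")
```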

## How to cite?

If you use `synloc` in your research, please cite the following paper:

```bibtex
@article{kalay2022generating,
  title={Generating Synthetic Data with The Nearest Neighbors Algorithm},
  author={Kalay, Ali Furkan},
  journal={arXiv preprint arXiv:2210.00884},
  year={2022}
}
```

            
