# TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Generation - Cluster Based
This package provides a simple synthetic data generator for tabular data. In
short, it works by clustering a given tabular dataset (by default using k-means
clustering), from which per-attribute histograms per cluster are created. These
histograms are sampled to generate synthetic data.
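To make the idea concrete, the following is a minimal, standard-library-only sketch of the cluster-then-histogram approach for categorical columns. It is not the package's implementation: the function names are hypothetical, the cluster labels are supplied by hand instead of coming from k-means, and it assumes that a cluster is picked proportionally to its size and that columns are sampled independently within a cluster.

```python
import random
from collections import Counter, defaultdict


def fit_cluster_histograms(rows, labels):
    """Count, per cluster and per column, how often each value occurs."""
    histograms = defaultdict(lambda: defaultdict(Counter))
    cluster_sizes = Counter(labels)
    for row, label in zip(rows, labels):
        for column, value in row.items():
            histograms[label][column][value] += 1
    return histograms, cluster_sizes


def sample_rows(histograms, cluster_sizes, n, rng):
    """Draw n rows: pick a cluster, then sample every column from its histogram."""
    clusters = list(cluster_sizes)
    weights = [cluster_sizes[c] for c in clusters]
    samples = []
    for _ in range(n):
        cluster = rng.choices(clusters, weights=weights)[0]
        row = {}
        for column, counts in histograms[cluster].items():
            values = list(counts)
            row[column] = rng.choices(values, weights=[counts[v] for v in values])[0]
        samples.append(row)
    return samples


rows = [
    {"sex": "F", "education": "Masters"},
    {"sex": "F", "education": "Bachelors"},
    {"sex": "M", "education": "Bachelors"},
    {"sex": "M", "education": "Bachelors"},
]
labels = [0, 0, 1, 1]  # hypothetical cluster assignments, e.g. produced by k-means
histograms, cluster_sizes = fit_cluster_histograms(rows, labels)
synthetic = sample_rows(histograms, cluster_sizes, n=5, rng=random.Random(0))
```

In this sketch, relationships between columns are preserved only to the extent that the clustering captures them, because each column is sampled on its own once a cluster has been chosen.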
### PET Lab
The TNO PET Lab consists of generic software components, procedures, and functionalities developed and maintained on a regular basis to facilitate and aid in the development of PET solutions. The lab is a cross-project initiative allowing us to integrate and reuse previously developed PET functionalities to boost the development of new protocols and solutions.
The package `tno.sdg.tabular.gen.cluster_based` is part of the [TNO Python Toolbox](https://github.com/TNO-PET).
_Limitations in (end-)use: the content of this software package may solely be used for applications that comply with international export control laws._
_This implementation of cryptographic software has not been audited. Use at your own risk._
## Documentation
Documentation of the `tno.sdg.tabular.gen.cluster_based` package can be found
[here](https://docs.pet.tno.nl/sdg/tabular/gen/cluster_based/0.2.0).
## Install
Easily install the `tno.sdg.tabular.gen.cluster_based` package using `pip`:
```console
$ python -m pip install tno.sdg.tabular.gen.cluster_based
```
_Note:_ If you are cloning the repository and wish to edit the source code, be
sure to install the package in editable mode from the root of the cloned repository:
```console
$ python -m pip install -e .
```
If you wish to run the tests you can use:
```console
$ python -m pip install 'tno.sdg.tabular.gen.cluster_based[tests]'
```
## Usage
The `tno.sdg.tabular.gen.cluster_based` package exposes a single class,
`ClusterBasedGenerator`, which offers a simple interface for synthetic data
generation.
First, the `ClusterBasedGenerator` must be fitted on a real dataset using the
`ClusterBasedGenerator.fit` method. The user must specify the type of each
column of the dataset via the `data_types` parameter. Once fitted, the user can
call `ClusterBasedGenerator.sample` to generate synthetic data samples.
```python
import pandas as pd
from tno.sdg.tabular.gen.cluster_based import ClusterBasedGenerator, DataType

df = pd.read_csv("src/tno/sdg/tabular/gen/cluster_based/test/data/adult.data")
df_subset = df[["age", "sex", "income", "workclass", "education", "marital-status"]]

# One DataType per column, in column order: "age" is continuous, the rest categorical
data_types = [DataType.CONTINUOUS] + [DataType.CATEGORICAL] * 5

generator = ClusterBasedGenerator()
generator.fit(df_subset, data_types)
samples = generator.sample()
```
### Histogram Templates
The generator uses histograms to generate data. A single histogram represents
a single feature. The bins of this histogram are, by default, derived from the
data. If you wish to provide a custom template for the histogram, you can
create one or more `HistogramTemplate` for the desired features and pass these
to the `ClusterBasedGenerator`.
```python
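# Assumption: the template classes are importable from the package root,
# like ClusterBasedGenerator; adjust the import path if they live elsewhere.
from tno.sdg.tabular.gen.cluster_based import (
    CategoricalHistogramTemplate,
    ClusterBasedGenerator,
    ContinuousHistogramTemplate,
)

age_template = ContinuousHistogramTemplate(lims=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
education_template = CategoricalHistogramTemplate(values=['Bachelors', 'Masters'])
generator = ClusterBasedGenerator(
    histogram_templates={
        'age': age_template,
        'education': education_template,
        # marital-status is not listed, so its bins are derived from the data
    }
)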
```
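As an illustration of what fixed bin limits mean for a continuous feature, the sketch below bins a handful of ages using the same `lims` as above and then samples from the resulting histogram. It uses only the standard library and is not the package's code: the helper names are hypothetical, and both the half-open bin convention and the uniform draw inside a bin are assumptions made for this example.

```python
import bisect
import random


def bin_counts(values, lims):
    """Count values per bin; bin i covers the half-open range [lims[i], lims[i + 1])."""
    counts = [0] * (len(lims) - 1)
    for value in values:
        if lims[0] <= value < lims[-1]:
            counts[bisect.bisect_right(lims, value) - 1] += 1
    return counts


def sample_from_histogram(counts, lims, n, rng):
    """Pick a bin proportionally to its count, then draw uniformly inside that bin."""
    bins = rng.choices(range(len(counts)), weights=counts, k=n)
    return [rng.uniform(lims[i], lims[i + 1]) for i in bins]


lims = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
ages = [23, 25, 31, 38, 39, 47, 52, 64]
counts = bin_counts(ages, lims)
synthetic_ages = sample_from_histogram(counts, lims, n=5, rng=random.Random(0))
```

With a fixed template, the bin edges no longer depend on the minimum and maximum of the real data, so the edges themselves do not reveal those values.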
### Clustering
The `ClusterBasedGenerator`, as the name suggests, uses clustering to achieve
synthetic data generation. By default, `sklearn.cluster.KMeans` is used with
parameters `n_clusters=8, init="random", n_init="auto"`. To change the
clusterer, simply pass a clustering algorithm to `ClusterBasedGenerator`. The
clusterer is expected to subclass `BaseEstimator` (the base class of
`scikit-learn` estimators) and implement `fit` and `predict`.
For example, to use `KMeans` with a different number of clusters, you can pass:
```python
from sklearn.cluster import KMeans
generator = ClusterBasedGenerator(clusterer=KMeans(n_clusters=2))
```
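Not every scikit-learn clustering algorithm meets the `fit` and `predict` requirement. The sketch below checks a few candidates against that contract; `looks_like_clusterer` is a hypothetical helper written for this example and is not part of the package, which may validate clusterers differently or not at all.

```python
from sklearn.base import BaseEstimator
from sklearn.cluster import DBSCAN, KMeans
from sklearn.mixture import GaussianMixture


def looks_like_clusterer(estimator):
    """Check the contract described above: a BaseEstimator with fit and predict."""
    return (
        isinstance(estimator, BaseEstimator)
        and callable(getattr(estimator, "fit", None))
        and callable(getattr(estimator, "predict", None))
    )


candidates = {
    "KMeans": KMeans(n_clusters=2),
    "GaussianMixture": GaussianMixture(n_components=2),
    "DBSCAN": DBSCAN(),  # offers fit_predict only, so it cannot label unseen rows
}
results = {name: looks_like_clusterer(est) for name, est in candidates.items()}
```

Here `KMeans` and `GaussianMixture` pass the check, while `DBSCAN` does not because it has no `predict` method.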
### Preprocessing
Depending on the clustering algorithm and input data used, the data may need to
be preprocessed. For `KMeans`, the default clustering algorithm, preprocessing
is required.
The default preprocessor applies the `StandardScaler` to `DataType.CONTINUOUS`
features and the `OneHotEncoder` to `DataType.CATEGORICAL` features.
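k-means relies on Euclidean distances, so it needs numeric inputs on comparable scales. The standard-library sketch below shows what the two transformations compute on a toy example; the helper names are hypothetical, and it only mirrors the behaviour of `StandardScaler` and `OneHotEncoder` with default settings rather than reproducing the package's preprocessor.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale to zero mean and unit variance, using the population standard deviation."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Encode each value as an indicator vector over the sorted categories."""
    categories = sorted(set(values))
    return categories, [[1.0 if v == c else 0.0 for c in categories] for v in values]


ages = [20, 30, 40]
education = ["Masters", "Bachelors", "Masters"]
scaled_ages = standardize(ages)
categories, encoded = one_hot(education)

# One numeric feature vector per row: the scaled age followed by the indicators
features = [[age, *indicators] for age, indicators in zip(scaled_ages, encoded)]
```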
It is possible to provide a custom preprocessor in the same manner as for the
clusterer. The preprocessor should be a `BaseEstimator` that implements `fit`
and `transform`. It is possible to combine multiple existing
preprocessors (such as `OneHotEncoder`), and even to build
a `Pipeline`. See `default_processor` and `ClusterBasedGenerator.fit` for
examples of how to use these `scikit-learn` features.
```python
from sklearn.base import BaseEstimator
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def custom_preprocessor() -> BaseEstimator:
    # Column lists (not bare strings) give each transformer the 2D input it expects
    return make_column_transformer(
        (StandardScaler(), ['age']),
        (OneHotEncoder(), ['education']),
        ('drop', ['marital-status']),
    )


generator = ClusterBasedGenerator(preprocessor=custom_preprocessor())
```
Raw data
{
"_id": null,
"home_page": null,
"name": "tno.sdg.tabular.gen.cluster-based",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "TNO PET Lab <petlab@tno.nl>",
"keywords": "TNO, SDG, synthetic data, synthetic data generation, tabular",
"author": null,
"author_email": "TNO PET Lab <petlab@tno.nl>",
"download_url": "https://files.pythonhosted.org/packages/e5/c5/31fc33469159ae4aa68d97ef6428e5ada1563970dc2a5022f3a330f32a30/tno_sdg_tabular_gen_cluster_based-0.2.0.tar.gz",
"platform": "any",
"bugtrack_url": null,
"license": "Apache License, Version 2.0",
"summary": "Cluster Based Synthetic Data Generation",
"version": "0.2.0",
"project_urls": {
"Documentation": "https://docs.pet.tno.nl/sdg/tabular/gen/cluster_based/0.2.0",
"Homepage": "https://pet.tno.nl/",
"Source": "https://github.com/TNO-SDG/tabular.gen.cluster_based"
},
"split_keywords": [
"tno",
" sdg",
" synthetic data",
" synthetic data generation",
" tabular"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c1a6e3277be2bf95eaa8462d0024d00c0842873fea664b904d3d0dc188291bc5",
"md5": "9db811d13b87cd94255d251d4c43e64b",
"sha256": "fe91ae1b6a4c94739da3c08dda657adb023219a7d888fa03158acbd5e318b341"
},
"downloads": -1,
"filename": "tno.sdg.tabular.gen.cluster_based-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9db811d13b87cd94255d251d4c43e64b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 22342,
"upload_time": "2024-12-10T13:35:38",
"upload_time_iso_8601": "2024-12-10T13:35:38.388670Z",
"url": "https://files.pythonhosted.org/packages/c1/a6/e3277be2bf95eaa8462d0024d00c0842873fea664b904d3d0dc188291bc5/tno.sdg.tabular.gen.cluster_based-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e5c531fc33469159ae4aa68d97ef6428e5ada1563970dc2a5022f3a330f32a30",
"md5": "7fc9e3001ac01c34cfc863fd588825f0",
"sha256": "7e47d044632ae4d6ebbc1951103026fa15651c62f79648b72814748e04b69f90"
},
"downloads": -1,
"filename": "tno_sdg_tabular_gen_cluster_based-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "7fc9e3001ac01c34cfc863fd588825f0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 24254,
"upload_time": "2024-12-10T13:35:39",
"upload_time_iso_8601": "2024-12-10T13:35:39.831954Z",
"url": "https://files.pythonhosted.org/packages/e5/c5/31fc33469159ae4aa68d97ef6428e5ada1563970dc2a5022f3a330f32a30/tno_sdg_tabular_gen_cluster_based-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-10 13:35:39",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "TNO-SDG",
"github_project": "tabular.gen.cluster_based",
"github_not_found": true,
"lcname": "tno.sdg.tabular.gen.cluster-based"
}