tno.sdg.tabular.gen.cluster-based


Name: tno.sdg.tabular.gen.cluster-based
Version: 0.2.0
Summary: Cluster Based Synthetic Data Generation
Upload time: 2024-12-10 13:35:39
Author email: TNO PET Lab <petlab@tno.nl>
Requires Python: >=3.9
License: Apache License, Version 2.0
Keywords: tno, sdg, synthetic data, synthetic data generation, tabular
Source: https://github.com/TNO-SDG/tabular.gen.cluster_based

# TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Generation - Cluster Based

This package provides a simple synthetic data generator for tabular data. In
short, it clusters a given tabular dataset (by default using k-means
clustering) and builds a histogram for every attribute within each cluster.
Synthetic data is generated by sampling from these histograms.
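
The idea described above can be illustrated with a small sketch that uses only the standard library. It is not the package's implementation: the cluster labels are written by hand (standing in for the output of k-means), the values are already discretised, and picking a cluster in proportion to its size is an assumption. The names `fit_cluster_histograms` and `sample_rows` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_cluster_histograms(rows, labels):
    """Count, per cluster and per attribute, how often each value occurs."""
    histograms = defaultdict(lambda: defaultdict(Counter))
    cluster_sizes = Counter(labels)
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            histograms[label][attribute][value] += 1
    return histograms, cluster_sizes


def sample_rows(histograms, cluster_sizes, n, rng):
    """Pick a cluster by size (assumption), then draw every attribute
    independently from that cluster's histogram."""
    clusters = list(cluster_sizes)
    weights = [cluster_sizes[c] for c in clusters]
    samples = []
    for _ in range(n):
        cluster = rng.choices(clusters, weights=weights)[0]
        row = {}
        for attribute, counts in histograms[cluster].items():
            values = list(counts)
            value_weights = [counts[v] for v in values]
            row[attribute] = rng.choices(values, weights=value_weights)[0]
        samples.append(row)
    return samples


# Toy data; the labels stand in for the output of a clusterer such as k-means.
rows = [
    {"age_bin": "20-30", "education": "Bachelors"},
    {"age_bin": "20-30", "education": "Masters"},
    {"age_bin": "50-60", "education": "HS-grad"},
    {"age_bin": "50-60", "education": "HS-grad"},
]
labels = [0, 0, 1, 1]

histograms, cluster_sizes = fit_cluster_histograms(rows, labels)
synthetic = sample_rows(histograms, cluster_sizes, n=5, rng=random.Random(0))
```

Because attributes are drawn independently within a cluster, a synthetic row can combine values that never co-occurred in one real row, while combinations that span different clusters (here, `"50-60"` with `"Masters"`) cannot appear.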

### PET Lab

The TNO PET Lab consists of generic software components, procedures, and functionalities developed and maintained on a regular basis to facilitate and aid in the development of PET solutions. The lab is a cross-project initiative allowing us to integrate and reuse previously developed PET functionalities to boost the development of new protocols and solutions.

The package `tno.sdg.tabular.gen.cluster_based` is part of the [TNO Python Toolbox](https://github.com/TNO-PET).

_Limitations in (end-)use: the content of this software package may solely be used for applications that comply with international export control laws._  
_This implementation of cryptographic software has not been audited. Use at your own risk._

## Documentation

Documentation of the `tno.sdg.tabular.gen.cluster_based` package can be found
[here](https://docs.pet.tno.nl/sdg/tabular/gen/cluster_based/0.2.0).

## Install

Easily install the `tno.sdg.tabular.gen.cluster_based` package using `pip`:

```console
$ python -m pip install tno.sdg.tabular.gen.cluster_based
```

_Note:_ If you are cloning the repository and wish to edit the source code, be
sure to install the package in editable mode from the root of the cloned
repository:

```console
$ python -m pip install -e .
```

If you wish to run the tests, install the package together with its test dependencies:

```console
$ python -m pip install 'tno.sdg.tabular.gen.cluster_based[tests]'
```

## Usage

The `tno.sdg.tabular.gen.cluster_based` package exposes the
`ClusterBasedGenerator` class, which provides a simple interface to the
synthetic data generation.

First, the `ClusterBasedGenerator` must be fitted on a real dataset using the
`ClusterBasedGenerator.fit` method. The user must specify the type of each
column of the dataset via the `data_types` parameter. Once fitted, the user can
call `ClusterBasedGenerator.sample` to generate synthetic data samples.

```python
import pandas as pd
from tno.sdg.tabular.gen.cluster_based import ClusterBasedGenerator, DataType

df = pd.read_csv("src/tno/sdg/tabular/gen/cluster_based/test/data/adult.data")
df_subset = df[["age", "sex", "income", "workclass", "education", "marital-status"]]
# One data type per selected column, in column order: only "age" is continuous.
data_types = [DataType.CONTINUOUS] + [DataType.CATEGORICAL] * 5

generator = ClusterBasedGenerator()
generator.fit(df_subset, data_types)
samples = generator.sample()
```

### Histogram Templates

The generator uses histograms to generate data. A single histogram represents
a single feature. The bins of this histogram are, by default, derived from the
data. If you wish to provide a custom template for the histogram, you can
create one or more `HistogramTemplate` objects for the desired features and
pass these to the `ClusterBasedGenerator` via the `histogram_templates`
argument, keyed by column name.

```python
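# Assumption: the template classes are importable from the package top level;
# check the package documentation for the exact import path.
from tno.sdg.tabular.gen.cluster_based import (
   CategoricalHistogramTemplate,
   ClusterBasedGenerator,
   ContinuousHistogramTemplate,
)

age_template = ContinuousHistogramTemplate(lims=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
education_template = CategoricalHistogramTemplate(values=['Bachelors', 'Masters'])
generator = ClusterBasedGenerator(
   histogram_templates={
      'age': age_template,
      'education': education_template,
      # we let marital-status be derived from the data
   }
)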
```
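
To make the role of fixed bin limits concrete, the sketch below bins a few ages against the same `lims` as above and draws one synthetic value. It uses only the standard library and does not call the package. Two details are assumptions rather than documented behaviour: bins are treated as half-open intervals `[lims[i], lims[i+1])`, and a value is drawn uniformly inside the chosen bin. The helper names `bin_counts` and `sample_from_bins` are hypothetical.

```python
import random
from bisect import bisect_right
from collections import Counter


def bin_counts(values, lims):
    """Count values per bin; bin i covers [lims[i], lims[i + 1])."""
    counts = [0] * (len(lims) - 1)
    for value in values:
        if lims[0] <= value < lims[-1]:
            counts[bisect_right(lims, value) - 1] += 1
    return counts


def sample_from_bins(counts, lims, rng):
    """Pick a bin in proportion to its count, then a uniform value inside it."""
    bins = list(range(len(counts)))
    i = rng.choices(bins, weights=counts)[0]
    return rng.uniform(lims[i], lims[i + 1])


ages = [23, 27, 31, 38, 38, 45, 62]
lims = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
age_counts = bin_counts(ages, lims)  # [0, 0, 2, 3, 1, 0, 1, 0, 0, 0]

# A categorical histogram is simply a count per allowed value.
education_counts = Counter(["Bachelors", "Masters", "Bachelors"])

synthetic_age = sample_from_bins(age_counts, lims, random.Random(0))
```

With a template, the bin limits are the same regardless of the data, so empty bins (such as 0 to 20 here) remain part of the histogram with a count of zero.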

### Clustering

The `ClusterBasedGenerator`, as the name suggests, uses clustering to generate
synthetic data. By default, `sklearn.cluster.KMeans` is used with
parameters `n_clusters=8, init="random", n_init="auto"`. To change the
clusterer, pass a clustering algorithm to `ClusterBasedGenerator`. The
clusterer is expected to subclass `BaseEstimator` (the scikit-learn base class
for estimators) and implement `fit` and `predict`.

For example, to use `KMeans` with a different number of clusters, you can pass:

```python
from sklearn.cluster import KMeans
generator = ClusterBasedGenerator(clusterer=KMeans(n_clusters=2))
```
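
The `fit`/`predict` contract mentioned above can be illustrated with a deliberately tiny, hypothetical clusterer. To stay runnable without scikit-learn, the class below does not subclass `BaseEstimator`; a clusterer intended for `ClusterBasedGenerator` would do so, as stated above. Whether the generator accepts this exact class has not been verified, so treat it as a sketch of the interface only.

```python
class ThresholdClusterer:
    """Hypothetical toy clusterer: two clusters, split at the mean of the first feature."""

    def fit(self, X, y=None):
        first_feature = [row[0] for row in X]
        # Trailing underscore follows the scikit-learn convention for fitted attributes.
        self.threshold_ = sum(first_feature) / len(first_feature)
        return self  # scikit-learn convention: fit returns the estimator itself

    def predict(self, X):
        # Cluster 1 if the first feature lies above the fitted threshold, else cluster 0.
        return [int(row[0] > self.threshold_) for row in X]


X = [[1.0, 0.0], [2.0, 1.0], [9.0, 0.0], [10.0, 1.0]]
clusterer = ThresholdClusterer().fit(X)
labels = clusterer.predict(X)  # [0, 0, 1, 1]
```

The important part is the shape of the interface: `fit` learns from the (preprocessed) data, and `predict` returns one cluster label per row.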

### Preprocessing

Depending on the clustering algorithm and input data used, the data may need to
be preprocessed. For `KMeans`, the default clustering algorithm, preprocessing
is required.

The default preprocessor applies the `StandardScaler` to `DataType.CONTINUOUS`
features and the `OneHotEncoder` to `DataType.CATEGORICAL` features.
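
As an illustration of what these two transformations do, the following standard-library sketch scales a continuous column and one-hot encodes a categorical one. It mimics the behaviour of `StandardScaler` and `OneHotEncoder` on toy data and is not the package's preprocessor; the helper names `standard_scale` and `one_hot` are hypothetical.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Shift to zero mean and scale to unit variance (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Map each category to a 0/1 indicator vector; categories are sorted for a stable order."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


ages = [20, 30, 40]
sexes = ["Male", "Female", "Male"]

scaled_ages = standard_scale(ages)
categories, encoded_sexes = one_hot(sexes)  # categories == ["Female", "Male"]

# One purely numeric feature vector per row, suitable for a distance-based clusterer.
features = [[age, *indicators] for age, indicators in zip(scaled_ages, encoded_sexes)]
```

k-means relies on Euclidean distances, so categorical values need a numeric encoding, and continuous features need a comparable scale to avoid dominating the distance.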

It is possible to provide a custom preprocessor in the same manner as for the
clusterer. The preprocessor should be a `BaseEstimator` with the methods `fit`
and `transform` implemented. It is possible to combine multiple existing
preprocessors (such as `OneHotEncoder`), and even build
a `Pipeline`. See `default_processor` and `ClusterBasedGenerator.fit` for
examples of how to use these scikit-learn features.

```python
from sklearn.base import BaseEstimator
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def custom_preprocessor() -> BaseEstimator:
   # Columns are given as lists so that each transformer receives 2D input.
   return make_column_transformer(
      (StandardScaler(), ['age']),
      (OneHotEncoder(), ['education']),
      ('drop', ['marital-status'])
   )

generator = ClusterBasedGenerator(preprocessor=custom_preprocessor())
```

            
