artificial-data-generator


- Name: artificial-data-generator
- Version: 0.0.3
- Home page: https://github.com/sigrun-may/artificial-data-generator
- Upload time: 2024-03-18 15:29:44
- Author: Sigrun May
- Requires Python: >=3.8,<4.0
- License: MIT
- Keywords: artificial data
- Requirements: none recorded

# Data generator for synthetic data including artificial classes, intraclass correlations, pseudo-classes and random data - [Sphinx Doc](https://sigrun-may.github.io/artificial-data-generator/)

## Table of Contents

- [Purpose](#purpose)
- [Data structure](#data-structure)
  - [Different parts of the data set](#different-parts-of-the-data-set)
  - [Data distribution and effect sizes](#data-distribution-and-effect-sizes)
  - [Correlations](#correlations)
- [Pseudo-classes](#pseudo-classes)
- [Random Features](#random-features)
- [Installation](#installation)
- [Project Setup](#project-setup)
- [Licensing](#licensing)

## Purpose

Developing new feature selection methods, or comparing existing ones, requires reference data in which the dependencies and the importance of the individual features are known. This data generator can be used to simulate biological data, for example artificial high-throughput data including artificial biomarkers. Since not all true biomarkers and internal dependencies of high-dimensional biological datasets are usually known with
certainty, artificial data **makes it possible to know the expected outcome in advance**. In synthetic data, the feature importances and the distribution of each class are known. Irrelevant features can be purely random or belong to a pseudo-class. Such data can be used, for example, to make random effects observable.

## Data structure

### Different parts of the data set

The artificial-data-generator produces data sets consisting of up to three main parts:

1. **Relevant features** belonging to an artificial class (for example artificial biomarkers)
1. \[optional\] **Pseudo-classes** (for example a patient's height or gender, which have no association with a particular disease)
1. \[optional\] **Random data** representing the features (for example biomarker candidates) that are not associated with any class

The number of artificial classes is not limited. Each class is generated individually and then combined with the others.
Because the combined classes together form the artificial biomarkers, all classes have the same number of features.

This is an example of simulated binary biological data including artificial biomarkers:

![Different blocks of the artificial data.](docs/source/imgs/artificial_data.png)
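The layout described above can be sketched in a few lines of plain Python. This is only an illustration of how the three parts fit together, not the package's API: the function name `combine_blocks` and its parameters are hypothetical, and the values are made up by hand.

```python
def combine_blocks(class_blocks, pseudo_rows=None, random_rows=None):
    """Return (labels, rows) for a data set built from the three parts.

    Hypothetical helper: each row holds the relevant features first,
    then the optional pseudo-class columns, then the optional random columns.
    """
    labels, rows = [], []
    # Stack the individually generated classes; the block index is the label.
    for label, block in enumerate(class_blocks):
        for features in block:
            labels.append(label)
            rows.append(list(features))
    # Append the optional parts column-wise.
    for extra in (pseudo_rows, random_rows):
        if extra is not None:
            if len(extra) != len(rows):
                raise ValueError("all parts need the same number of samples")
            rows = [row + list(more) for row, more in zip(rows, extra)]
    return labels, rows


# Two classes with two relevant features each (same number of features per class).
class_0 = [[0.1, 0.2], [0.3, 0.4]]
class_1 = [[2.1, 2.2], [2.3, 2.4]]
pseudo = [[1.0], [0.0], [1.0], [0.0]]              # one pseudo-class column
noise = [[0.5, 0.6, 0.7] for _ in range(4)]        # three random columns

labels, rows = combine_blocks([class_0, class_1], pseudo, noise)
print(labels)   # class label per sample
print(rows[0])  # relevant + pseudo-class + random columns of the first sample
```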

### Data distribution and effect sizes

For each class, either the **normal distribution or the lognormal distribution** can be selected. The **classes can be shifted** relative to one another to regulate the effect sizes and thereby the difficulty of the data analysis.

The normally distributed data could, for example, represent the range of values of healthy individuals.
In the case of a disease, biological systems are in some way out of balance.
Extreme changes in values as well as outliers can then be observed ([Concordet et al., 2009](https://doi.org/10.1016/j.cca.2009.03.057)).
Therefore, the values of a diseased individual could be simulated with a lognormal distribution.

Example of lognormally and normally distributed classes:

![Different distributions of the classes.](docs/source/imgs/distributions.png)
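As a minimal sketch of the idea, the following uses only Python's standard `random` module to draw one normally distributed class and one shifted, lognormally distributed class. The function `generate_class` and its parameters (`distribution`, `shift`, `seed`) are assumptions made for this illustration and are not taken from the package.

```python
import random


def generate_class(n_samples, n_features, distribution="normal", shift=0.0, seed=None):
    """Return n_samples rows of n_features values (hypothetical helper)."""
    rng = random.Random(seed)
    if distribution == "normal":
        def draw():
            return rng.gauss(0.0, 1.0)
    elif distribution == "lognormal":
        def draw():
            return rng.lognormvariate(0.0, 1.0)
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    # The shift moves the whole class and thereby controls the effect size.
    return [[draw() + shift for _ in range(n_features)] for _ in range(n_samples)]


healthy = generate_class(200, 5, "normal", seed=1)
diseased = generate_class(200, 5, "lognormal", shift=2.0, seed=2)


def overall_mean(rows):
    values = [value for row in rows for value in row]
    return sum(values) / len(values)


print(overall_mean(healthy), overall_mean(diseased))
```

A larger `shift` separates the classes more clearly and makes the analysis easier; a smaller one makes it harder.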

### Correlations

**Intra-class correlation can be generated for each artificial class**. Any number of groups
containing correlated features can be combined with any given number of uncorrelated features.

However, a high correlation within a group does not necessarily lead to
a high correlation to other groups or features of the same class. An example of a class with three
highly correlated groups but without high correlations between all groups:

![Correlations within a class with three highly correlated groups.](docs/source/imgs/corr_3_groups.png)

It is likely that biomarkers of healthy individuals usually have a relatively low correlation. On average,
their values are within a usual "normal" range. In one individual, one biomarker may tend to be in the upper normal range and another biomarker in the lower normal range. In another individual it can be exactly the opposite, so that the correlation across healthy individuals would be rather low. Therefore, the **values of healthy people
could be simulated without any special artificially generated correlations**.

In the case of a disease, however, a biological system is brought out of balance in a certain way and must react to it.
This reaction can, for example, happen in a coordinated manner involving several biomarkers,
or corresponding cascades (e.g. pathways) can be activated or blocked. This can result in a **stronger
correlation of biomarkers in patients suffering from a disease**. To simulate these intra-class correlations,
a class is divided into a given number of groups with high internal correlation
(the respective strength can be defined).
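One common way to obtain a group of features with a chosen pairwise correlation is to let all features of the group share a latent factor. The sketch below uses that technique with the standard library only; it is not necessarily how the package generates its groups, and the names `correlated_group`, `pearson` and `column` are hypothetical.

```python
import math
import random


def correlated_group(n_samples, n_features, rho, rng):
    """Features sharing one latent factor; pairwise correlation is about rho."""
    rows = []
    for _ in range(n_samples):
        latent = rng.gauss(0.0, 1.0)  # shared by all features of this sample
        rows.append([
            math.sqrt(rho) * latent + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n_features)
        ])
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def column(rows, j):
    return [row[j] for row in rows]


rng = random.Random(0)
group_a = correlated_group(2000, 3, 0.9, rng)
group_b = correlated_group(2000, 3, 0.9, rng)
# One class made of two groups: columns 0-2 belong to group_a, 3-5 to group_b.
class_rows = [a + b for a, b in zip(group_a, group_b)]

within = pearson(column(class_rows, 0), column(class_rows, 1))
between = pearson(column(class_rows, 0), column(class_rows, 3))
print(within, between)
```

Because each group has its own latent factor, the correlation within a group is high while the correlation between the two groups stays close to zero.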

## Pseudo-classes

One option for an element of the generated data set is a pseudo-class. For example, this could be a
patient's height or gender, which are not related to a specific disease.

The generated pseudo-class contains the same number of classes with identical distributions as the artificial biomarkers.
After the individual classes have been generated, however, all samples (rows) are randomly shuffled.
The shuffled data is then combined with the original, unshuffled class labels, so the pseudo-class no longer
has a valid association with any class label. Consequently, no element of the pseudo-class should be
recognized as relevant by a feature selection algorithm.
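The shuffling step can be sketched as follows. This is an illustration using the standard library, not the package's implementation; the function name `make_pseudo_class` is hypothetical.

```python
import random


def make_pseudo_class(class_blocks, seed=None):
    """Stack the rows of all class blocks and shuffle them (hypothetical helper)."""
    rows = [row for block in class_blocks for row in block]
    random.Random(seed).shuffle(rows)  # detaches the rows from the label order
    return rows


rng = random.Random(0)
block_0 = [[rng.gauss(0.0, 1.0)] for _ in range(5)]  # distribution of class 0
block_1 = [[rng.gauss(3.0, 1.0)] for _ in range(5)]  # shifted distribution of class 1
labels = [0] * 5 + [1] * 5                           # original, unshuffled labels

pseudo = make_pseudo_class([block_0, block_1], seed=1)
# pseudo[i] is now paired with labels[i], although it may stem from either class.
for label, row in zip(labels, pseudo):
    print(label, row)
```

The values still follow the same distributions as the real classes, but their position no longer matches the labels.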

## Random Features

The artificial biomarkers and, if applicable, the optional pseudo-classes can be combined with any number
of random features. Varying the number of random features can be used, for example, to analyze random effects
that occur in small sample sizes with a very large number of features.
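The following sketch illustrates such a random effect with purely random features and the standard library only. The helper names `random_features` and `strongest_chance_effect` are hypothetical and not part of the package. It measures the largest difference between the class means that arises by chance when more and more random features are considered.

```python
import random


def random_features(n_samples, n_features, seed=None):
    """Standard normal noise without any association to a class."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]


def strongest_chance_effect(rows, labels, n_features):
    """Largest absolute difference of class means among the first n_features columns."""
    best = 0.0
    for j in range(n_features):
        class_0 = [row[j] for row, label in zip(rows, labels) if label == 0]
        class_1 = [row[j] for row, label in zip(rows, labels) if label == 1]
        diff = abs(sum(class_0) / len(class_0) - sum(class_1) / len(class_1))
        best = max(best, diff)
    return best


labels = [0] * 10 + [1] * 10              # small sample size: 20 samples
rows = random_features(20, 1000, seed=0)  # very large number of features

few = strongest_chance_effect(rows, labels, 10)
many = strongest_chance_effect(rows, labels, 1000)
print(few, many)
```

With only 20 samples, the strongest apparent class difference among 1000 random features is at least as large as among the first 10, even though none of the features carries any information about the class.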

## Installation

The artificial-data-generator is available at [the Python Package Index (PyPI)](https://pypi.org/project/artificial-data-generator/).
It can be installed with pip:

```bash
pip install artificial-data-generator
```

## Project Setup

We recommend doing the setup in a text console rather than with a GUI tool.
This offers better control and transparency.

We use [Poetry](https://python-poetry.org/docs/) and
[pyenv](https://github.com/pyenv/pyenv). We do not use Conda, Anaconda or pip directly.

### 1. Get Project Source

First, clone the project with Git.
If you want to make a pull request, clone your own fork of the project,
not the original project.
After the project has been cloned, use `cd` to change into the project directory.

### 2. Install Poetry

We use [Poetry](https://python-poetry.org/docs/) for dependency management and packaging in this project.
The next step is the [installation of Poetry](https://python-poetry.org/docs/#installation),
if you do not already have it.
Poetry offers several installation options. We recommend the option "with the official installer",
but the choice is yours.

### 3. Configure Poetry

We suggest the following two config options. These are not mandatory but useful.

Set [`virtualenvs.prefer-active-python`](https://python-poetry.org/docs/configuration/#virtualenvsprefer-active-python-experimental)
to `true`.
With this setting Poetry uses the currently activated Python version to create a new virtual environment.
This makes it possible to determine the exact Python version for development.
If it is set to `false`, the Python version used during the Poetry installation is used instead.
The option can be set [globally or locally](https://python-poetry.org/docs/configuration/#local-configuration).
We suggest setting it globally.

- global setting: `poetry config virtualenvs.prefer-active-python true`
- local setting: `poetry config virtualenvs.prefer-active-python true --local` - this will create or change the `poetry.toml` file

Set [`virtualenvs.options.always-copy`](https://python-poetry.org/docs/configuration/#virtualenvsoptionsalways-copy)
to `true`.
When the new virtual environment is created (later), all needed files are copied into it instead of being symlinked.
The advantage is that you can delete the old globally installed Python version later without breaking the Python in
the local virtual environment.
The disadvantage is that this uses some additional disk space.
The option can be set [globally or locally](https://python-poetry.org/docs/configuration/#local-configuration).
We suggest setting it globally.

- global setting: `poetry config virtualenvs.options.always-copy true`
- local setting: `poetry config virtualenvs.options.always-copy true --local` - this will create or change the `poetry.toml` file

### 4. Set the Python Version (pyenv)

We recommend [pyenv](https://github.com/pyenv/pyenv) to install and manage different Python versions.
First [install pyenv](https://github.com/pyenv/pyenv#installation) if you do not already have it.

Next, install the appropriate Python version.
We recommend developing on the oldest Python version that the project still permits.
This version number can be found in the `pyproject.toml` file in the setting called
`tool.poetry.dependencies.python`. If this is set to something like `python = "^3.8"`,
we use pyenv to install Python 3.8:
`pyenv install 3.8`
This installs the latest 3.8 Python version.

If the Python installation was successful, we use `pyenv versions` to see which exact version is installed.
Make sure that you are still in the project directory.
Then we activate this version with `pyenv local <version>`.
This command will create a `.python-version` file in the project directory.
For example, execute: `pyenv local 3.8.17`
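In Poetry, the caret constraint `^3.8` is equivalent to `>=3.8,<4.0`, which matches the `requires_python` value of this package. As a small, optional sanity check, the following sketch tests whether a Python version falls into that range. The function `python_version_permitted` is a hypothetical helper written for this illustration; it is not part of Poetry, pyenv or this package.

```python
import sys


def python_version_permitted(version=None, minimum=(3, 8), below=(4, 0)):
    """Return True if (major, minor) satisfies minimum <= version < below.

    Hypothetical helper; defaults mirror the range >=3.8,<4.0.
    """
    if version is None:
        version = sys.version_info[:2]  # the interpreter running this code
    return minimum <= tuple(version[:2]) < below


print(python_version_permitted((3, 8)))  # oldest permitted version
print(python_version_permitted((3, 7)))  # too old
print(python_version_permitted())        # the currently active interpreter
```

Running it inside the project directory after `pyenv local <version>` shows whether the activated interpreter lies within the permitted range.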

### 5. Install the Project with Poetry

Execute `poetry install --all-extras` to install the project.
This installs all dependencies, optional (extra) dependencies and
needed linting, testing and documentation dependencies.
With this method, the sources are also implicitly installed in
[editable mode](https://pip.pypa.io/en/latest/cli/pip_install/#cmdoption-e).

## Licensing

Copyright (c) 2022 Sigrun May, Helmholtz-Zentrum für Infektionsforschung GmbH (HZI)<br/>
Copyright (c) 2022 Sigrun May, Ostfalia Hochschule für angewandte Wissenschaften

Licensed under the **MIT License** (the "License"); you may not use this file except in compliance with the License.
You may obtain a copy of the License by reviewing the file
[LICENSE](https://github.com/sigrun-may/artificial-data-generator/blob/main/LICENSE) in the repository.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/sigrun-may/artificial-data-generator",
    "name": "artificial-data-generator",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "artificial data",
    "author": "Sigrun May",
    "author_email": "s.may@ostfalia.de",
    "download_url": "https://files.pythonhosted.org/packages/ff/ae/1c03171f9ec785c9230343a35deb1433baf6776ff658335fcf8056ab6c81/artificial_data_generator-0.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "",
    "version": "0.0.3",
    "project_urls": {
        "Bug Tracker": "https://github.com/sigrun-may/artificial-data-generator/issues",
        "Homepage": "https://github.com/sigrun-may/artificial-data-generator"
    },
    "split_keywords": [
        "artificial",
        "data"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7bc2d84e17d1075a43802b8069f974149fa84e7ab840c4d0635dab79b0354319",
                "md5": "ec3bc240ede46cd31da3dac9d912ead1",
                "sha256": "ccfc70d3927a753af64b0b7302f5ad48c708b077d5494e2df7f9fc6fcead5be6"
            },
            "downloads": -1,
            "filename": "artificial_data_generator-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ec3bc240ede46cd31da3dac9d912ead1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 19319,
            "upload_time": "2024-03-18T15:29:42",
            "upload_time_iso_8601": "2024-03-18T15:29:42.167500Z",
            "url": "https://files.pythonhosted.org/packages/7b/c2/d84e17d1075a43802b8069f974149fa84e7ab840c4d0635dab79b0354319/artificial_data_generator-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ffae1c03171f9ec785c9230343a35deb1433baf6776ff658335fcf8056ab6c81",
                "md5": "480e5f1f85ca81f2c2c0ac6b08bb43cb",
                "sha256": "dcd1acb8f6e19fb6ddfb0fe8dbfc94fe2f5e785a655e127e541fee5be834ad11"
            },
            "downloads": -1,
            "filename": "artificial_data_generator-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "480e5f1f85ca81f2c2c0ac6b08bb43cb",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 19056,
            "upload_time": "2024-03-18T15:29:44",
            "upload_time_iso_8601": "2024-03-18T15:29:44.355815Z",
            "url": "https://files.pythonhosted.org/packages/ff/ae/1c03171f9ec785c9230343a35deb1433baf6776ff658335fcf8056ab6c81/artificial_data_generator-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-18 15:29:44",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "sigrun-may",
    "github_project": "artificial-data-generator",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "artificial-data-generator"
}
        