evalAIRR

Name	evalAIRR
Version	0.0.44
home_page
Summary	Comparison of real and simulated AIRR datasets
upload_time	2023-05-10 20:00:24
maintainer
docs_url	None
author	Lukas Sparnauskas
requires_python
license	MIT
keywords	python airr simulated data ml machine learning
VCS
bugtrack_url
requirements	No requirements were recorded.

# evalAIRR

A tool that allows comparison of real and simulated AIRR datasets by providing different statistical indicators and dataset visualizations in one report.

## Installation

It is recommended to run evalAIRR in a Python virtual environment so that it does not interfere with packages installed in another Python environment. Here is a quick guide on how you can set up a virtual environment:

`https://docs.python.org/3/tutorial/venv.html#creating-virtual-environments`
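The guide linked above covers the `python -m venv` command. As a minimal sketch, the same step can be done from Python with the standard-library `venv` module. The helper name `create_environment` and the directory name used in the usage note are hypothetical and are not part of evalAIRR.

```python
import venv
from pathlib import Path


def create_environment(target, with_pip=True):
    """Create a virtual environment in `target` and return its path.

    `create_environment` is an illustrative helper, not an evalAIRR function.
    """
    target = Path(target)
    # venv.create builds the environment directory; with_pip=True also
    # bootstraps pip so that packages can be installed into it afterwards.
    venv.create(target, with_pip=with_pip)
    return target
```

For example, `create_environment("./evalairr-env")` creates the environment. After activating it, you can install the package with the pip command shown in the next section.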

### Install using pip

Run this command to install the evalAIRR package:

`pip install evalairr`

## Quickstart

evalAIRR uses a YAML file for configuration. If you are unfamiliar with how YAML files are structured, read this guide to the syntax:

`https://docs.fileformat.com/programming/yaml/#syntax`

This is the structure of a sample report configuration file you can use to start off with (it is included in the repository at `./yaml_files/quickstart.yaml`):

```
datasets:
  real:
    path: ./data/real_data.csv
  sim:
    path: ./data/sim_data.csv
reports:
  feature_based:
    report1:
      features:
        - CAS
        - SAS
      report_types:
        - ks
        - distr_densityplot
        - distance
        - statistics
  observation_based:
    report1:
      observations:
        - all
      report_types:
        - distr_densityplot
  general:
    feat_mean_vs_variance:
    obs_mean_vs_variance:
    pca_2d_feat:
    pca_2d_obs:
    corr_feat_hist:
    corr_obs_hist:
output:
  path: ./output/report.html
```

This configuration processes the two provided datasets (real and simulated) of encoded kmer data and creates an HTML report containing several report types. For the features `CAS` and `SAS`, it produces these feature-based reports: a Kolmogorov–Smirnov test (report type `ks`), a feature distribution density plot (report type `distr_densityplot`), Euclidean distance measures (report type `distance`) and descriptive statistics (mean, median, variance and standard deviation; report type `statistics`). It also creates an observation-based report with a distribution density plot (report type `distr_densityplot`) for all observations (indicated by the keyword `all` in the observation list). Finally, these general reports are generated: feature mean compared with feature variance (report type `feat_mean_vs_variance`), observation mean compared with observation variance (report type `obs_mean_vs_variance`), a two-dimensional representation of all features using PCA (report type `pca_2d_feat`), a two-dimensional representation of all observations using PCA (report type `pca_2d_obs`), a feature correlation coefficient distribution histogram (report type `corr_feat_hist`) and an observation correlation coefficient distribution histogram (report type `corr_obs_hist`). The report is exported to the path `./output/report.html`. More details on what reports can be created can be found in the _YAML Configuration Guidelines_ section.
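To make the nesting of the quickstart configuration explicit, here is a minimal sketch that builds the same structure as a Python dictionary and renders it as YAML text. The helper `to_yaml` is hypothetical and only handles nested dictionaries, lists of plain values and empty entries. It is not part of evalAIRR, and a full YAML library would be a better choice for anything beyond illustration.

```python
def to_yaml(node, indent=0):
    """Render nested dicts and lists of plain scalars as block-style YAML lines."""
    pad = " " * indent
    lines = []
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, (dict, list)) and value:
                # Nested section: emit the key, then its children two spaces deeper.
                lines.append(f"{pad}{key}:")
                lines.extend(to_yaml(value, indent + 2))
            elif value is None or value == {} or value == []:
                # Report with no parameters, e.g. `pca_2d_feat:`.
                lines.append(f"{pad}{key}:")
            else:
                lines.append(f"{pad}{key}: {value}")
    elif isinstance(node, list):
        for item in node:
            lines.append(f"{pad}- {item}")
    return lines


quickstart = {
    "datasets": {
        "real": {"path": "./data/real_data.csv"},
        "sim": {"path": "./data/sim_data.csv"},
    },
    "reports": {
        "feature_based": {
            "report1": {
                "features": ["CAS", "SAS"],
                "report_types": ["ks", "distr_densityplot", "distance", "statistics"],
            }
        },
        "observation_based": {
            "report1": {"observations": ["all"], "report_types": ["distr_densityplot"]}
        },
        "general": {
            "feat_mean_vs_variance": None,
            "obs_mean_vs_variance": None,
            "pca_2d_feat": None,
            "pca_2d_obs": None,
            "corr_feat_hist": None,
            "corr_obs_hist": None,
        },
    },
    "output": {"path": "./output/report.html"},
}

yaml_text = "\n".join(to_yaml(quickstart)) + "\n"
```

Writing `yaml_text` to a file gives a configuration with the same sections as the quickstart file: `datasets`, `reports` and `output`.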

The repository contains sample data files and a quickstart YAML configuration file. You can clone the repository and run evalAIRR within it to use the sample data.

Within the cloned repository run the command:

`evalairr -i ./yaml_files/quickstart.yaml`

The report will be generated in the specified output path in the configuration file or, if a specific path is not provided, in `<CURRENT_DIRECTORY>/output/report.html`. The report is exported in the HTML format.
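If you want to start the report from a script, the command above can be wrapped with the standard-library `subprocess` module. This is a sketch only: the helper names `build_command` and `run_report` are hypothetical, and the only command-line option used is the `-i` flag shown above.

```python
import shutil
import subprocess


def build_command(config_path):
    """Return the argument list for the `evalairr -i <config>` command."""
    return ["evalairr", "-i", str(config_path)]


def run_report(config_path):
    """Run evalAIRR on a configuration file and raise if the command fails."""
    if shutil.which("evalairr") is None:
        # The executable is only available after `pip install evalairr`.
        raise FileNotFoundError("evalairr executable not found on PATH")
    return subprocess.run(build_command(config_path), check=True)
```

Calling `run_report("./yaml_files/quickstart.yaml")` from within the cloned repository is then equivalent to typing the command by hand.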

## YAML Configuration Guidelines

The configuration YAML file consists of 3 main sections: `datasets`, `reports` and `output`.

### Datasets

In the `datasets` section, you have to provide the paths to the real and simulated datasets that you are comparing. CSV files with encoded kmer data are supported. Specify the file path of each dataset in the `path` variable under the sections `real` and `sim` respectively. Here is an example of a configured `datasets` section:

```
datasets:
  real:
    path: ./data/real_data.csv
  sim:
    path: ./data/sim_data.csv
```
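Before comparing two files, it can help to check that they describe the same features. The sketch below assumes that the first row of each CSV file is a header holding the feature (kmer) names. That layout is an assumption, not something stated here, and the helper names are hypothetical.

```python
import csv
import io


def read_header(handle):
    """Return the first CSV row of an open text file (assumed to be the header)."""
    return next(csv.reader(handle), [])


def compare_headers(real_header, sim_header):
    """Return (shared, only_in_real, only_in_sim) feature names, each sorted."""
    real, sim = set(real_header), set(sim_header)
    return sorted(real & sim), sorted(real - sim), sorted(sim - real)


# Small in-memory example standing in for the two dataset files.
real_file = io.StringIO("CAS,SAS,TGT\n1,0,2\n")
sim_file = io.StringIO("CAS,SAS,AAA\n0,1,1\n")
shared, only_real, only_sim = compare_headers(read_header(real_file), read_header(sim_file))
```

With real files, open each path from the `datasets` section with `open(path, newline="")` and pass the handle to `read_header`.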

### Reports

In the `reports` section, you can provide the list of report types you want to create and their parameters. There are three report groups, which differ in the parameters they take: `feature_based`, `observation_based` and `general`. Here is the list of reports you can create to compare the real dataset with the simulated dataset:

#### Feature-based reports

- `ks` - Kolmogorov–Smirnov statistic. Parameters: list of features you are creating the report for.
- `distr_histogram` - feature distribution histogram. Parameters: list of features you are creating the report for.
- `distr_boxplot` - feature distribution boxplot. Parameters: list of features you are creating the report for.
- `distr_violinplot` - feature distribution violin plot. Parameters: list of features you are creating the report for.
- `distr_densityplot` - feature distribution density plot. Parameters: list of features you are creating the report for.
- `distance` - Euclidean distance between the real and simulated feature. Parameters: list of features you are creating the report for.
- `statistics` - statistical indicators (mean, median, standard deviation and variance) of a feature in both real and simulated datasets. Parameters: list of features you are creating the report for.
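To illustrate the measures behind the `ks`, `distance` and `statistics` report types, here is a minimal standard-library sketch applied to the values of one feature in each dataset. It is not evalAIRR's implementation, and the function names are hypothetical. The sketch uses the sample standard deviation and variance, and the convention used by the tool may differ. The Euclidean distance assumes both datasets hold the same number of values for the feature.

```python
import math
import statistics
from bisect import bisect_right


def ks_statistic(real, sim):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical cumulative distribution functions of the two samples."""
    real_sorted, sim_sorted = sorted(real), sorted(sim)
    largest_gap = 0.0
    for x in set(real_sorted) | set(sim_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_sim = bisect_right(sim_sorted, x) / len(sim_sorted)
        largest_gap = max(largest_gap, abs(cdf_real - cdf_sim))
    return largest_gap


def euclidean_distance(real, sim):
    """Euclidean distance between two equally long value vectors."""
    if len(real) != len(sim):
        raise ValueError("both vectors must have the same length")
    return math.sqrt(sum((r - s) ** 2 for r, s in zip(real, sim)))


def describe(values):
    """Mean, median, sample standard deviation and sample variance."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
        "variance": statistics.variance(values),
    }
```

A KS statistic of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all.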

#### Observation-based reports

- `ks` - Kolmogorov–Smirnov statistic. Parameters: list of observations you are creating the report for.
- `distr_histogram` - observation distribution histogram. Parameters: list of observations you are creating the report for.
- `distr_boxplot` - observation distribution boxplot. Parameters: list of observations you are creating the report for.
- `distr_violinplot` - observation distribution violin plot. Parameters: list of observations you are creating the report for.
- `distr_densityplot` - observation distribution density plot. Parameters: list of observations you are creating the report for. The observation index `all` can be used to report on all observations in one plot. `with_ml_sim` - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset (only applies in reports with `all` observations). `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets a seed in the machine learning model random number generation.
- `distance` - Euclidean distance between the real and simulated observation. Parameters: list of observations you are creating the report for.
- `statistics` - statistical indicators (mean, median, standard deviation and variance) of an observation in both real and simulated datasets. Parameters: list of observations you are creating the report for.
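Observation-based reports apply the same kinds of measures to observations instead of features. The sketch below assumes that each observation is a row and each feature a column of the dataset, and that observation indices are zero-based. Both are assumptions made for illustration, and the helper names are hypothetical.

```python
def select_observations(matrix, selection):
    """Return the rows named in `selection`; the keyword 'all' selects every row."""
    if "all" in selection:
        return [list(row) for row in matrix]
    return [list(matrix[int(index)]) for index in selection]


def feature_columns(matrix):
    """Transpose the rows so that each returned list holds one feature's values."""
    return [list(column) for column in zip(*matrix)]


# Three observations (rows) described by two features (columns).
data = [[1, 2], [3, 4], [5, 6]]
first_and_last = select_observations(data, [0, 2])
every_row = select_observations(data, ["all"])
columns = feature_columns(data)
```

Under these assumptions, a measure written for a list of values works for either group: pass it a row for an observation-based report or a column for a feature-based one.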

#### General reports

- `ks_feat` - Kolmogorov–Smirnov statistic for all features. Parameters: `output` - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to `./output/ks_feat.csv`). The csv file contains two rows: the first row contains the KS statistics and the second row contains the p-values.
- `ks_obs` - Kolmogorov–Smirnov statistic for all observations. Parameters: `output` - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to `./output/ks_obs.csv`). The csv file contains two rows: the first row contains the KS statistics and the second row contains the p-values.
- `copula_2d` - a 2D scatter plot that displays two features in a Gaussian Multivariate copula space. Parameters: a report section of any name, under which the compared features are specified.
- `copula_3d` - a 3D scatter plot that displays three features in a Gaussian Multivariate copula space. Parameters: a report section of any name, under which the compared features are specified.
- `feat_mean_vs_variance` - a scatter plot that displays the mean value of every feature on one axis and the variance of every feature on the other axis. Parameters: `with_ml_sim` - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets a seed in the machine learning model random number generation.
- `obs_mean_vs_variance` - a scatter plot that displays the mean value of every observation on one axis and the variance of every observation on the other axis. Parameters: `with_ml_sim` - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets a seed in the machine learning model random number generation.
- `corr` - correlation matrix heatmaps of the real and simulated datasets. Parameters: `reduce_to_n_features` - an optional parameter for dimensionality reduction using PCA that sets the number of features to reduce the dataset to (it must satisfy `reduce_to_n_features < min(n_observations, n_features)`). `with_ml_sim` - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets a seed in the machine learning model random number generation.
- `corr_feat_hist` - feature correlation matrix distribution histogram for the real and simulated datasets. Parameters: `n_bins` - an optional parameter that sets the number of bins in the histogram (default value is 30). `reduce_to_n_features` - an optional parameter for dimensionality reduction using PCA that sets the number of features to reduce the dataset to (it must satisfy `reduce_to_n_features < min(n_observations, n_features)`). `with_ml_sim` - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets a seed in the machine learning model random number generation.
- `corr_obs_hist` - observation correlation matrix distribution histogram for the real and simulated datasets. Parameters: `n_bins` - an optional parameter that sets the number of bins in the histogram (default value is 30). `reduce_to_n_obs` - an optional parameter for dimensionality reduction using PCA that sets the number of observations to reduce the dataset to (it must satisfy `reduce_to_n_obs < min(n_observations, n_features)`). `with_ml_sim` - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets a seed in the machine learning model random number generation.
- `corr_csv` - CSV file exporting of the difference between correlation matrices of the real and simulated datasets. Parameters: `output` - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to `./output/corr.csv`).
- `pca_2d_feat` - two feature-level scatter plots with both datasets reduced to two dimensions using PCA. Parameters: `with_ml_sim` - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets a seed in the machine learning model random number generation.
- `pca_2d_obs` - two observation-level scatter plots with both datasets reduced to two dimensions using PCA. Parameters: `with_ml_sim` - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets a seed in the machine learning model random number generation.
- `distance_feat` - Euclidean distance between the real and simulated features. Parameters: `output` - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to `./output/dist.csv`).
- `statistics_feat` - statistical indicators (mean, median, standard deviation and variance) of all features in both real and simulated datasets. Parameters: `output_dir` - optional parameter that specifies the directory to which the csv result files `real_stat.csv` and `sim_stat.csv` will be exported (default value is set to `./output/`). Each csv file contains four rows, each with a different statistic: 1 - mean, 2 - median, 3 - standard deviation, 4 - variance.
- `distance_obs` - Euclidean distance between the real and simulated observations. Parameters: `output` - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to `./output/obs_dist.csv`).
- `statistics_obs` - statistical indicators (mean, median, standard deviation and variance) of all observations in both real and simulated datasets. Parameters: `output_dir` - optional parameter that specifies the directory to which the csv result files `real_obs_stat.csv` and `sim_obs_stat.csv` will be exported (default value is set to `./output/`). Each csv file contains four rows, each with a different statistic: 1 - mean, 2 - median, 3 - standard deviation, 4 - variance.
- `jensen_shannon_feat` - Jensen-Shannon divergence metric between the real and simulated features. Parameters: `output` - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to `./output/jenshan.csv`).
- `jensen_shannon_obs` - Jensen-Shannon divergence metric between the real and simulated observations. Parameters: `output` - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to `./output/obs_jenshan.csv`).
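As an illustration of the measure behind the `jensen_shannon_feat` and `jensen_shannon_obs` report types, here is a minimal standard-library sketch of the Jensen-Shannon divergence between two discrete distributions. It is not evalAIRR's implementation: the function name is hypothetical, the inputs are assumed to be non-negative weights over the same bins, and base-2 logarithms are an arbitrary choice that bounds the result between 0 and 1.

```python
import math


def jensen_shannon_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence: the mean of KL(p || m) and KL(q || m),
    where m is the pointwise average of the two normalised distributions."""
    if len(p) != len(q):
        raise ValueError("both distributions must have the same number of bins")

    def normalise(weights):
        total = sum(weights)
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        return [w / total for w in weights]

    def kl(a, b):
        # Bins with zero probability in `a` contribute nothing to the sum.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    p, q = normalise(p), normalise(q)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give 0, and distributions with no overlapping bins give 1 with base-2 logarithms. Note that some libraries report the square root of this value (the Jensen-Shannon distance) instead, so check which one a given csv file contains before comparing numbers.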

Here is a sample `reports` section of a configuration file containing all of the reports:

```
reports:
  feature_based:
    report1:
      features:
        - CAS
        - SAS
      report_types:
        - ks
        - distr_histogram
        - distr_boxplot
        - distr_violinplot
        - distr_densityplot
        - distance
        - statistics
  observation_based:
    report1:
      observations:
        - 20
      report_types:
        - ks
        - distr_histogram
        - distr_boxplot
        - distr_violinplot
        - distr_densityplot
        - distance
        - statistics
    report2:
      observations:
        - all
      report_types:
        - distr_densityplot
      with_ml_sim: True
      ml_random_state: 0
  general:
    copula_2d:
      report1:
        - CAS
        - SAS
    copula_3d:
      report1:
        - CAS
        - SAS
        - TGT
    feat_mean_vs_variance:
      with_ml_sim: True
      ml_random_state: 0
    obs_mean_vs_variance:
      with_ml_sim: True
      ml_random_state: 0
    corr_hist:
      n_bins: 30
      with_ml_sim: True
      ml_random_state: 0
      reduce_to_n_features: 200
    corr:
      reduce_to_n_features: 200
      with_ml_sim: True
      ml_random_state: 0
    pca_2d_feat:
      with_ml_sim: True
      ml_random_state: 0
    pca_2d_obs:
      with_ml_sim: True
      ml_random_state: 0
    corr_csv:
      output: ./output/corr.csv
    ks_feat:
      output: ./output/ks_feat.csv
    ks_obs:
      output: ./output/ks_obs.csv
    statistics_feat:
      output_dir: ./output/
    statistics_obs:
      output_dir: ./output/
    distance_feat:
      output: ./output/dist.csv
    distance_obs:
      output: ./output/obs_dist.csv
    jensen_shannon_feat:
      output: ./output/jenshan.csv
    jensen_shannon_obs:
      output: ./output/obs_jenshan.csv
```
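With this many report types, a mistyped name is easy to miss. The sketch below checks the names in an already parsed `reports` section against the lists documented above. The function `unknown_report_types` and the two name sets are hypothetical helpers built from this guide, not part of evalAIRR, and how evalAIRR itself treats a name it does not recognise is not covered here.

```python
FEATURE_AND_OBSERVATION_TYPES = {
    "ks", "distr_histogram", "distr_boxplot", "distr_violinplot",
    "distr_densityplot", "distance", "statistics",
}

GENERAL_TYPES = {
    "ks_feat", "ks_obs", "copula_2d", "copula_3d",
    "feat_mean_vs_variance", "obs_mean_vs_variance",
    "corr", "corr_feat_hist", "corr_obs_hist", "corr_csv",
    "pca_2d_feat", "pca_2d_obs",
    "distance_feat", "statistics_feat", "distance_obs", "statistics_obs",
    "jensen_shannon_feat", "jensen_shannon_obs",
}


def unknown_report_types(reports):
    """Return dotted paths of report types that are not in the documented lists."""
    unknown = []
    for group in ("feature_based", "observation_based"):
        for name, report in (reports.get(group) or {}).items():
            for report_type in (report or {}).get("report_types") or []:
                if report_type not in FEATURE_AND_OBSERVATION_TYPES:
                    unknown.append(f"{group}.{name}.{report_type}")
    # General reports are keyed directly by their report type.
    for report_type in reports.get("general") or {}:
        if report_type not in GENERAL_TYPES:
            unknown.append(f"general.{report_type}")
    return unknown
```

An empty list means every name matches the documented lists. Any other entry points at the group, report and name to review.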

### Output

An optional section where you can specify the file path of the generated report. The default path of the generated report is `<CURRENT_DIRECTORY>/output/report.html`. The report is exported in the HTML format. If you set the path to `NONE`, the report will not be created.

An example output section:

```
output:
  path: ./output/report.html
```

For example, this output section would result in a report file not being created:

```
output:
  path: NONE
```
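The rules for the `output` section can be summarised in a short sketch that takes an already parsed configuration. The helper `resolve_report_path` is hypothetical and not part of evalAIRR, and treating only the exact string `NONE` as the disabling value is an assumption.

```python
from pathlib import Path


def resolve_report_path(config, cwd=None):
    """Return the report path implied by the config, or None if disabled."""
    base = Path(cwd) if cwd is not None else Path.cwd()
    path = (config.get("output") or {}).get("path")
    if path is None:
        # No output section or no path: fall back to the documented default.
        return base / "output" / "report.html"
    if str(path) == "NONE":
        # The documented way to skip creating the HTML report.
        return None
    return Path(path)
```

A configuration without an `output` section resolves to `output/report.html` under the current directory, matching the default described above.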

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "evalAIRR",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "python,airr,simulated data,ml,machine learning",
    "author": "Lukas Sparnauskas",
    "author_email": "<lukas.11sp@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/7b/20/1f0019d6e578c7b833a525a4fa7506597c9ed88618943b386bf0f1acb433/evalAIRR-0.0.44.tar.gz",
    "platform": null
            }
output: ./output/jenshan.csv\n    jensen_shannon_obs:\n      output: ./output/obs_jenshan.csv\n```\n\n### Output\n\nAn optional section where you can specify the file path of the generated report. The default path of the generated report is `<CURRENT_DIRECTORY>/output/report.html`. The report is exported in HTML format. If you set the path to 'NONE', the report will not be created.\n\nAn example output section:\n\n```\noutput:\n  path: ./output/report.html\n```\n\nFor example, this output section prevents the report file from being created:\n\n```\noutput:\n  path: NONE\n```\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Comparison of real and simulated AIRR datasets",
    "version": "0.0.44",
    "project_urls": null,
    "split_keywords": [
        "python",
        "airr",
        "simulated data",
        "ml",
        "machine learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "29571bd119b7192359c616cdfd8fa4342382a2bf75989df3ad2a8fbf47b4b4df",
                "md5": "f157129640a74065135d9a4fad3f667e",
                "sha256": "1191185348e52c08a82016edb07504a3e5b3338e96ae87061c2839ef67d9a25a"
            },
            "downloads": -1,
            "filename": "evalAIRR-0.0.44-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f157129640a74065135d9a4fad3f667e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 17516,
            "upload_time": "2023-05-10T20:00:21",
            "upload_time_iso_8601": "2023-05-10T20:00:21.407505Z",
            "url": "https://files.pythonhosted.org/packages/29/57/1bd119b7192359c616cdfd8fa4342382a2bf75989df3ad2a8fbf47b4b4df/evalAIRR-0.0.44-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7b201f0019d6e578c7b833a525a4fa7506597c9ed88618943b386bf0f1acb433",
                "md5": "d05cdcb7660d9e1b348ba6589a336b82",
                "sha256": "71c3b3b88a7fb26c77c0e9b9fb9baee7a244cfbb33478f818da05d5ee28be0bc"
            },
            "downloads": -1,
            "filename": "evalAIRR-0.0.44.tar.gz",
            "has_sig": false,
            "md5_digest": "d05cdcb7660d9e1b348ba6589a336b82",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 18061,
            "upload_time": "2023-05-10T20:00:24",
            "upload_time_iso_8601": "2023-05-10T20:00:24.558512Z",
            "url": "https://files.pythonhosted.org/packages/7b/20/1f0019d6e578c7b833a525a4fa7506597c9ed88618943b386bf0f1acb433/evalAIRR-0.0.44.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-10 20:00:24",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "evalairr"
}

Lukas Sparnauskas