syngen

- Name: syngen
- Version: 0.1.12
- Summary: The tool uncovers patterns, trends, and correlations hidden within your production datasets.
- Home page: https://github.com/tdspora/syngen
- Upload time: 2023-06-01 12:18:21
- Maintainer: Pavel Bobyrev
- Author: EPAM Systems, Inc.
- Requires Python: >3.7
- License: GPLv3
- Keywords: data, generation, synthetic, vae, tabular
            # EPAM Syngen

EPAM Syngen is an unsupervised tabular data generation tool. It is useful for generating test data with a given table as a template. Most data types are supported, including floats, integers, datetime, text, categorical, and binary. Linked tables, i.e., tables sharing a key, can also be generated using a simple statistical approach.
The source data must be a locally stored CSV or Avro file in UTF-8 encoding.

The tool is based on a variational autoencoder (VAE) model. A Bayesian Gaussian Mixture model is used to further disentangle the latent space.

## Getting started

Use pip to install the library:

`pip install syngen`

The training and inference processes are separated into two CLI entry points. The training entry point receives the path to the original table or a metadata file, the table name, and the hyperparameters to use.<br>

To start training with default parameters, run:

```bash
train --source PATH_TO_ORIGINAL_CSV \
    --table_name TABLE_NAME
```

This will train a model and save the model artifacts to disk.

To generate data with default parameters, simply call:

```bash
infer --table_name TABLE_NAME
```

<i>Please note that the table name must match the one you used in the training process.</i><br>
This will create a CSV file with the synthetic table in <i>./model_artifacts/tmp_store/TABLE_NAME/merged_infer_TABLE_NAME.csv</i>.<br>
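Given the artifact layout just described, the output location for a table can be composed like this (an illustrative stdlib sketch; the path layout is taken from the text above, and `infer_output_path` is a hypothetical helper, not part of syngen):

```python
from pathlib import Path

def infer_output_path(table_name: str, base: str = "./model_artifacts") -> Path:
    # Layout described above: ./model_artifacts/tmp_store/TABLE_NAME/merged_infer_TABLE_NAME.csv
    return Path(base) / "tmp_store" / table_name / f"merged_infer_{table_name}.csv"

print(infer_output_path("Housing"))
```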

Here is a quick example:

```bash
pip install syngen
train --source ./example-data/housing.csv --table_name Housing
infer --table_name Housing
```
As an example, you can use the dataset <i>"Housing"</i> in [example-data/housing.csv](example-data/housing.csv).
In this example, the real-world data is <a href="https://www.kaggle.com/datasets/camnugent/california-housing-prices" target="_blank">"Housing"</a> from Kaggle.

## Features

### Training

You can add flexibility to the training and inference processes using additional hyperparameters.<br>
For training of a single table, call:

```bash
train --source PATH_TO_ORIGINAL_CSV \
    --table_name TABLE_NAME \
    --epochs INT \
    --row_limit INT \
    --drop_null BOOL \
    --print_report BOOL \
    --batch_size INT
```

To train one or more tables using a metadata file, you can use the following command:

```bash
train --metadata_path PATH_TO_METADATA_YAML
```

The parameters you can set for the training process:

- <i>source</i> – required parameter for training of a single table; a path to the file you want to use as a reference
- <i>table_name</i> – required parameter for training of a single table; an arbitrary string used to name the directories
- <i>epochs</i> – the number of training epochs. Since an early stopping mechanism is implemented, a bigger number of epochs is generally better
- <i>row_limit</i> – the number of rows to train over. A number less than the original table length will randomly subset the specified number of rows
- <i>drop_null</i> – whether to drop rows with at least one missing value
- <i>batch_size</i> – if specified, the training is split into batches. This can save RAM
- <i>print_report</i> – whether to generate accuracy and sampling reports. Please note that the sampling report is generated only if the <i>row_limit</i> parameter is set
- <i>metadata_path</i> – a path to the metadata file
- <i>column_types</i> – may include the section <i>categorical</i>, which lists the columns defined as categorical by the user

Requirements for the parameters of the training process:
* <i>source</i> – string
* <i>table_name</i> – string
* <i>epochs</i> – integer, must be at least 1; default is 10
* <i>row_limit</i> – integer
* <i>drop_null</i> – boolean; default is False
* <i>batch_size</i> – integer, must be at least 1; default is 32
* <i>print_report</i> – boolean; default is False
* <i>metadata_path</i> – string
* <i>column_types</i> – dictionary with the key <i>categorical</i> mapping to a list of column names (strings)


### Inference (generation)

You can customize the inference process for a single table by calling:

```bash
infer --size INT \
    --table_name STR \
    --run_parallel BOOL \
    --batch_size INT \
    --random_seed INT \
    --print_report BOOL
```
 
To generate one or more tables using a metadata file, you can use the following command:

```bash
infer --metadata_path PATH_TO_METADATA
```

The parameters you can set for the generation process:

- <i>size</i> – the desired number of rows to generate
- <i>table_name</i> – required parameter for inference of a single table; the name of the table, same as in training
- <i>run_parallel</i> – whether to use multiprocessing (worthwhile for tables with more than 5000 rows)
- <i>batch_size</i> – if specified, the generation is split into batches. This can save RAM
- <i>random_seed</i> – if specified, generates a reproducible result
- <i>print_report</i> – whether to generate accuracy and sampling reports. Please note that the sampling report is generated only if the <i>row_limit</i> parameter is set
- <i>metadata_path</i> – a path to the metadata file

Requirements for the parameters of the generation process:
* <i>size</i> – integer, must be at least 1; default is 100
* <i>table_name</i> – string
* <i>run_parallel</i> – boolean; default is False
* <i>batch_size</i> – integer, must be at least 1
* <i>random_seed</i> – integer, must be at least 0
* <i>print_report</i> – boolean; default is False
* <i>metadata_path</i> – string
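The constraints above can be captured in a small validation helper. This is an illustrative sketch of the rules listed in this section; `validate_infer_params` is not part of the syngen API:

```python
def validate_infer_params(params: dict) -> list:
    """Check generation parameters against the constraints listed above.

    Returns a list of error messages; an empty list means the parameters
    are valid. Illustrative only; not part of the syngen API.
    """
    defaults = {"size": 100, "run_parallel": False, "print_report": False}
    merged = {**defaults, **params}
    errors = []
    if not isinstance(merged["size"], int) or merged["size"] < 1:
        errors.append("size must be an integer >= 1")
    if "batch_size" in merged and (not isinstance(merged["batch_size"], int) or merged["batch_size"] < 1):
        errors.append("batch_size must be an integer >= 1")
    if "random_seed" in merged and (not isinstance(merged["random_seed"], int) or merged["random_seed"] < 0):
        errors.append("random_seed must be an integer >= 0")
    for name in ("run_parallel", "print_report"):
        if not isinstance(merged[name], bool):
            errors.append(f"{name} must be a boolean")
    return errors

print(validate_infer_params({"size": 500, "random_seed": 1}))  # []
```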

The metadata can contain any of the arguments above for each table. If so, the duplicated arguments from the CLI 
will be ignored.
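The precedence rule above can be sketched as a plain dictionary merge, with the per-table metadata taking priority over duplicated CLI arguments (`effective_settings` is a hypothetical helper, not syngen's actual implementation):

```python
def effective_settings(cli_args: dict, metadata_settings: dict) -> dict:
    # Start from the CLI arguments, then let any argument that also
    # appears in the metadata file override it, as described above.
    merged = dict(cli_args)
    merged.update(metadata_settings)
    return merged

print(effective_settings({"size": 100, "print_report": True}, {"size": 500}))
# size comes from the metadata; print_report is taken from the CLI
```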

<i>Note:</i> If you want to set the logging level, you can use the parameter <i>log_level</i> in the CLI call:

```bash
train --source STR --table_name STR --log_level STR
train --metadata_path STR --log_level STR
infer --size INT --table_name STR --log_level STR
infer --metadata_path STR --log_level STR
```

where <i>log_level</i> might be one of the following values: <i>DEBUG, INFO, WARNING, ERROR, CRITICAL</i>.


### Linked tables generation

To generate one or more tables, you can provide metadata in YAML format. By describing the relationships
between tables in the metadata, you can manage complex relationships across any number of tables.
You can also specify additional training and inference parameters in the metadata file; in that case,
the corresponding CLI arguments will be ignored.

<i>Note:</i> By using a metadata file, you can also generate tables without relationships.
In this case, the tables will be generated independently.

The YAML metadata file should match the following template:

    CUSTOMER:                                       # Table name. Required parameter
        source: "./files/customer.csv"              # Supported formats include local files in CSV, Avro formats. Required parameter
                 
        train_settings:                             # Settings for training process. Optional parameter
            epochs: 10                              # Number of epochs if different from the default in the command line options. Optional parameter
            drop_null: False                        # Drop rows with NULL values. Optional parameter
            row_limit: None                         # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter
            batch_size: 32                          # If specified, the training is split into batches. This can save the RAM. Optional parameter
            print_report: False                     # Turn on or turn off generation of the report. Optional parameter
            column_types:
                categorical:                        # Force listed columns to have categorical type (use dictionary of values). Optional parameter
                    - gender
                    - marital_status
                 
        infer_settings:                             # Settings for infer process. Optional parameter
            size: 100                               # Size for generated data. Optional parameter
            run_parallel: False                     # Turn on or turn off parallel training process. Optional parameter
            print_report: False                     # Turn on or turn off generation of the report. Optional parameter
            batch_size: None                        # If specified, the generation is split into batches. This can save the RAM. Optional parameter
            random_seed: None                       # If specified, generates a reproducible result. Optional parameter
        keys:                                       # Keys of the table. Optional parameter
            PK_CUSTOMER_ID:                         # Name of a key. Only one PK per table.
                type: "PK"                          # The key type. Supported: PK - primary key, FK - foreign key, TKN - token key
                columns:                            # Array of column names
                    - customer_id
     
            UQ1:                                    # Name of a key
                type: "UQ"                          # One or many unique keys
                columns:
                    - e_mail
     
            FK1:                                    # One or many foreign keys
                type: "FK"
                columns:                            # Array of columns in the current table
                    - e_mail
                    - alias
                references:
                    table: "PROFILE"                # Name of the parent table
                    columns:                        # Array of columns in the parent table
                        - e_mail
                        - alias
       
            FK2:
                type: "FK"
                columns:
                    - address_id
                references:
                    table: "ADDRESS"
                    columns:
                        - address_id

     
    ORDER:                                          # Table name. Required parameter
        source: "./files/order.csv"                 # Supported formats include local files in CSV, Avro formats. Required parameter
     
        train_settings:                             # Settings for training process. Optional parameter
            epochs: 10                              # Number of epochs if different from the default in the command line options. Optional parameter
            drop_null: False                        # Drop rows with NULL values. Optional parameter
            row_limit: None                         # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter
            batch_size: 32                          # If specified, the training is split into batches. This can save the RAM. Optional parameter
            print_report: False                     # Turn on or turn off generation of the report. Optional parameter
            column_types:
                categorical:                        # Force listed columns to have categorical type (use dictionary of values). Optional parameter
                    - gender
                    - marital_status
     
        infer_settings:                             # Settings for infer process. Optional parameter
            size: 100                               # Size for generated data. Optional parameter
            run_parallel: False                     # Turn on or turn off parallel training process. Optional parameter
            print_report: False                     # Turn on or turn off generation of the report. Optional parameter
            batch_size: None                        # If specified, the generation is split into batches. This can save the RAM. Optional parameter
            random_seed: None                       # If specified, generates a reproducible result. Optional parameter
        keys:                                       # Keys of the table. Optional parameter
            pk_order_id:
                type: "PK"
                columns:
                    - order_id
     
            FK1:
                type: "FK"
                columns:
                    - customer_id
                references:
                    table: "CUSTOMER"
                    columns:
                        - customer_id

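Once parsed, a metadata mapping like the template above can be sanity-checked programmatically. The sketch below (`check_keys` is a hypothetical helper, not part of syngen) assumes all linked tables are listed in the same file, and verifies that each table has at most one PK and that every FK points at a table defined in the mapping:

```python
def check_keys(metadata: dict) -> list:
    # Verify "only one PK per table" and that each FK's parent table
    # is defined in the same metadata mapping. Illustrative only.
    errors = []
    table_names = set(metadata)
    for table, config in metadata.items():
        keys = config.get("keys", {})
        pk_count = sum(1 for k in keys.values() if k.get("type") == "PK")
        if pk_count > 1:
            errors.append(f"{table}: only one PK per table is allowed")
        for key_name, key in keys.items():
            if key.get("type") == "FK":
                parent = key.get("references", {}).get("table")
                if parent not in table_names:
                    errors.append(f"{table}.{key_name}: unknown parent table {parent!r}")
    return errors

metadata = {
    "CUSTOMER": {"keys": {"PK_CUSTOMER_ID": {"type": "PK", "columns": ["customer_id"]}}},
    "ORDER": {"keys": {
        "pk_order_id": {"type": "PK", "columns": ["order_id"]},
        "FK1": {"type": "FK", "columns": ["customer_id"],
                "references": {"table": "CUSTOMER", "columns": ["customer_id"]}},
    }},
}
print(check_keys(metadata))  # []
```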
<i>You can find an example metadata file in [example-metadata/housing_metadata.yaml](example-metadata/housing_metadata.yaml).</i><br>

By providing the necessary information through a metadata file, you can initiate training and inference processes using the following commands:

```bash
train --metadata_path=PATH_TO_YAML_METADATA_FILE
infer --metadata_path=PATH_TO_YAML_METADATA_FILE
```
Here is a quick example:

```bash
train --metadata_path="./example-metadata/housing_metadata.yaml"
infer --metadata_path="./example-metadata/housing_metadata.yaml"
```

If `--metadata_path` is present and the metadata contains the necessary parameters, other CLI parameters will be ignored.<br>

### Docker images

The train and inference components of <i>syngen</i> are available as public Docker images:

<https://hub.docker.com/r/tdspora/syngen-train>

<https://hub.docker.com/r/tdspora/syngen-infer>

To run the dockerized code for a single table (see the parameter descriptions in the *Training* and *Inference* sections), call:

```bash
docker pull tdspora/syngen-train:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-train \
  --table_name=TABLE_NAME \
  --source=./model_artifacts/YOUR_CSV_FILE.csv

docker pull tdspora/syngen-infer:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-infer \
  --table_name=TABLE_NAME
```

PATH_TO_LOCAL_FOLDER is the absolute path to the folder where your original CSV file is stored.

You can add any of the arguments listed in the corresponding training and inference sections to the CLI call.

To run the dockerized code with a metadata file, simply call:

```bash
docker pull tdspora/syngen-train:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-train \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML

docker pull tdspora/syngen-infer:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-infer \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML
```

You can add any of the arguments listed in the corresponding training and inference sections to the CLI call; however, they will be
overridden by the corresponding arguments in the metadata file.

#### Logging level

Set the `LOGURU_LEVEL` environment variable to the desired logging level.
For example, to suppress debug messages, add `-e LOGURU_LEVEL=INFO` to the `docker run` command:
```bash
docker pull tdspora/syngen-train:latest
docker run --rm -e LOGURU_LEVEL=INFO \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-train \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML

docker pull tdspora/syngen-infer:latest
docker run --rm -e LOGURU_LEVEL=INFO \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-infer \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML
```

## Contribution

We welcome contributions from the community to help us improve and maintain our public GitHub repository. We appreciate any feedback, bug reports, or feature requests, and we encourage developers to submit fixes or new features via issues and pull requests.

If you have found a bug or have a feature request, please submit an issue to our GitHub repository. Please provide as much detail as possible, including steps to reproduce the issue or a clear description of the feature request. Our team will review the issue and work with you to address any problems or discuss any potential new features.

If you would like to contribute a fix or a new feature, please submit a pull request to our GitHub repository. Please make sure your code follows our coding standards and best practices. Our team will review your pull request and work with you to ensure that it meets our standards and is ready for inclusion in our codebase.

We appreciate your contributions and thank you for your interest in helping us maintain and improve our public GitHub repository.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/tdspora/syngen",
    "name": "syngen",
    "maintainer": "Pavel Bobyrev",
    "docs_url": null,
    "requires_python": ">3.7",
    "maintainer_email": "",
    "keywords": "data,generation,synthetic,vae,tabular",
    "author": "EPAM Systems, Inc.",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/99/67/9cced42915b82289042e5754d593bf6ef9f11110a36fdc8b49c8232b04eb/syngen-0.1.12.tar.gz",
    "platform": null,
    "description": "# EPAM Syngen\r\n\r\nEPAM Syngen is an unsupervised tabular data generation tool. It is useful for generation of test data with a given table as a template. Most datatypes including floats, integers, datetime, text, categorical, binary are supported. The linked tables i.e., tables sharing a key can also be generated using the simple statistical approach. \r\nThe source of data might be in CSV, Avro format and should be located locally and be in UTF-8 encoding.\r\n\r\nThe tool is based on the variational autoencoder model (VAE). The Bayesian Gaussian Mixture model is used to further detangle the latent space.\r\n\r\n## Getting started\r\n\r\nUse pip to install the library:\r\n\r\n`pip install syngen`\r\n\r\nThe training and inference processes are separated with two cli entry points. The training one receives paths to the original table, metadata json file or table name and used hyperparameters.<br>\r\n\r\nTo start training with defaults parameters run:\r\n\r\n```bash\r\ntrain --source PATH_TO_ORIGINAL_CSV \\\r\n    --table_name TABLE_NAME\r\n```\r\n\r\nThis will train a model and save the model artifacts to disk.\r\n\r\nTo generate with defaults parameters data simply call:\r\n\r\n```bash\r\ninfer --table_name TABLE_NAME\r\n```\r\n\r\n<i>Please notice that the name should match the one you used in the training process.</i><br>\r\nThis will create a csv file with the synthetic table in <i>./model_artifacts/tmp_store/TABLE_NAME/merged_infer_TABLE_NAME.csv</i>.<br>\r\n\r\nHere is a quick example:\r\n\r\n```bash\r\npip install syngen\r\ntrain --source ./example-data/housing.csv \u2013-table_name Housing\r\ninfer --table_name Housing\r\n```\r\nAs the example you can use the dataset <i>\"Housing\"</i> in [example-data/housing.csv](example-data/housing.csv).\r\nIn this example, our real-world data is <a href=\"https://www.kaggle.com/datasets/camnugent/california-housing-prices\" target=\"_blank\">\"Housing\"</a> from Kaggle.\r\n\r\n## 
Features\r\n\r\n### Training\r\n\r\nYou can add flexibility to the training and inference processes using additional hyperparameters.<br>\r\nFor training of single table call:\r\n\r\n```bash\r\ntrain --source PATH_TO_ORIGINAL_CSV \\\r\n    --table_name TABLE_NAME \\\r\n    --epochs INT \\\r\n    --row_limit INT \\\r\n    --drop_null BOOL \\\r\n    --print_report BOOL \\\r\n    --batch_size INT\r\n```\r\n\r\nTo train one or more tables using a metadata file, you can use the following command:\r\n\r\n```bash\r\ntrain --metadata_path PATH_TO_METADATA_YAML\r\n```\r\n\r\nThe parameters which you can set up for training process:\r\n\r\n- <i>source</i> \u2013 required parameter for training of single table, a path to the file that you want to use as a reference\r\n- <i>table_name</i> \u2013 required parameter for training of single table, an arbitrary string to name the directories \r\n- <i>epochs</i> \u2013 a number of training epochs. Since the early stopping mechanism is implemented the bigger value of epochs is the better\r\n- <i>row_limit</i> \u2013 a number of rows to train over. A number less than the original table length will randomly subset the specified number of rows\r\n- <i>drop_null</i> \u2013 whether to drop rows with at least one missing value\r\n- <i>batch_size</i> \u2013 if specified, the training is split into batches. This can save the RAM\r\n- <i>print_report</i> - whether to generate accuracy and sampling reports. 
Please note that the sampling report is generated only if the `row_limit` parameter is set.\r\n- <i>metadata_path</i> \u2013 a path to the metadata file containing the metadata\r\n- <i>column_types</i> - might include the section <i>categorical</i> which contains the listed columns defined as categorical by a user\r\n\r\nRequirements for parameters of training process:\r\n* <i>source</i> - data type - string\r\n* <i>table_name</i> - data type - string\r\n* <i>epochs</i> - data type - integer, must be equal to or more than 1, default value is 10\r\n* <i>row_limit</i> - data type - integer\r\n* <i>drop_null</i> - data type - boolean, default value - False\r\n* <i>batch_size</i> - data type - integer, must be equal to or more than 1, default value - 32\r\n* <i>print_report</i> - data type - boolean, default value is False\r\n* <i>metadata_path</i> - data type - string\r\n* <i>column_types</i> - data type - dictionary with the key <i>categorical</i> - the list of columns (data type - string)\r\n\r\n\r\n### Inference (generation)\r\n\r\nYou can customize the inference processes by calling for one table:\r\n\r\n```bash\r\ninfer --size INT \\\r\n    --table_name STR \\\r\n    --run_parallel BOOL \\\r\n    --batch_size INT \\\r\n    --random_seed INT \\\r\n    --print_report BOOL\r\n```\r\n \r\nTo generate one or more tables using a metadata file, you can use the following command:\r\n\r\n```bash\r\ninfer --metadata_path PATH_TO_METADATA\r\n```\r\n\r\nThe parameters which you can set up for generation process:\r\n\r\n- <i>size</i> - the desired number of rows to generate\r\n- <i>table_name</i> \u2013 required parameter for inference of single table, the name of the table, same as in training\r\n- <i>run_parallel</i> \u2013 whether to use multiprocessing (feasible for tables > 5000 rows)\r\n- <i>batch_size</i> \u2013 if specified, the generation is split into batches. 
This can save the RAM\r\n- <i>random_seed</i> \u2013 if specified, generates a reproducible result\r\n- <i>print_report</i> \u2013 whether to generate accuracy and sampling reports. Please note that the sampling report is generated only if the row_limit parameter is set.\r\n- <i>metadata_path</i> \u2013 a path to metadata file\r\n\r\nRequirements for parameters of generation process:\r\n* <i>size</i> - data type - integer, must be equal to or more than 1, default value is 100\r\n* <i>table_name</i> - data type - string\r\n* <i>run_parallel</i> - data type - boolean, default value is False\r\n* <i>batch_size</i> - data type - integer, must be equal to or more than 1\r\n* <i>random_seed</i> - data type - integer, must be equal to or more than 0\r\n* <i>print_report</i> - data type - boolean, default value is False\r\n* <i>metadata_path</i> - data type - string\r\n\r\nThe metadata can contain any of the arguments above for each table. If so, the duplicated arguments from the CLI \r\nwill be ignored.\r\n\r\n<i>Note:</i> If you want to set the logging level, you can use the parameter <i>log_level</i> in the CLI call:\r\n\r\n```bash\r\ntrain --source STR --table_name STR --log_level STR\r\ntrain --metadata_path STR --log_level STR\r\ninfer --size INT --table_name STR --log_level STR\r\ninfer --metadata_path STR --log_level STR\r\n```\r\n\r\nwhere <i>log_level</i> might be one of the following values: <i>DEBUG, INFO, WARNING, ERROR, CRITICAL</i>.\r\n\r\n\r\n### Linked tables generation\r\n\r\nTo generate one or more tables, you might provide metadata in yaml format. By providing information about the relationships \r\nbetween tables via metadata, it becomes possible to manage complex relationships across any number of tables. 
\r\nYou can also specify additional parameters needed for training and inference in the metadata file and in this case, \r\nthey will be ignored in the CLI call.\r\n\r\n<i>Note:</i> By using metadata file, you can also generate tables with absent relationships. \r\nIn this case, the tables will be generated independently.\r\n\r\nThe yaml metadata file should match the following template:\r\n\r\n    CUSTOMER:                                       # Table name. Required parameter\r\n        source: \"./files/customer.csv\"              # Supported formats include local files in CSV, Avro formats. Required parameter\r\n                 \r\n        train_settings:                             # Settings for training process. Optional parameter\r\n            epochs: 10                              # Number of epochs if different from the default in the command line options. Optional parameter\r\n            drop_null: False                        # Drop rows with NULL values. Optional parameter\r\n            row_limit: None                         # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter\r\n            batch_size: 32                          # If specified, the training is split into batches. This can save the RAM. Optional parameter\r\n            print_report: False                     # Turn on or turn off generation of the report. Optional parameter\r\n            column_types:\r\n                categorical:                        # Force listed columns to have categorical type (use dictionary of values). Optional parameter\r\n                    - gender\r\n                    - marital_status\r\n                 \r\n        infer_settings:                             # Settings for infer process. Optional parameter\r\n            size: 100                               # Size for generated data. 
Optional parameter\r\n            run_parallel: False                     # Turn on or turn off parallel training process. Optional parameter\r\n            print_report: False                     # Turn on or turn off generation of the report. Optional parameter\r\n            batch_size: None                        # If specified, the generation is split into batches. This can save the RAM. Optional parameter\r\n            random_seed: None                       # If specified, generates a reproducible result. Optional parameter\r\n        keys:                                       # Keys of the table. Optional parameter\r\n            PK_CUSTOMER_ID:                         # Name of a key. Only one PK per table.\r\n                type: \"PK\"                          # The key type. Supported: PK - primary key, FK - foreign key, TKN - token key\r\n                columns:                            # Array of column names\r\n                    - customer_id\r\n     \r\n            UQ1:                                    # Name of a key\r\n                type: \"UQ\"                          # One or many unique keys\r\n                columns:\r\n                    - e_mail\r\n     \r\n            FK1:                                    # One or many foreign keys\r\n                type: \"FK\"\r\n                columns:                            # Array of columns in the current table\r\n                    - e_mail\r\n                    - alias\r\n                references:\r\n                    table: \"PROFILE\"                # Name of the parent table\r\n                    columns:                        # Array of columns in the parent table\r\n                        - e_mail\r\n                        - alias\r\n       \r\n            FK2:\r\n                type: \"FK\"\r\n                columns:\r\n                    - address_id\r\n                references:\r\n                    table: \"ADDRESS\"\r\n                    columns:\r\n 
                        - address_id

    ORDER:                                          # Table name. Required parameter
        source: "./files/order.csv"                 # Supported formats include local files in CSV, Avro formats. Required parameter

        train_settings:                             # Settings for the training process. Optional parameter
            epochs: 10                              # Number of epochs if different from the default in the command line options. Optional parameter
            drop_null: False                        # Drop rows with NULL values. Optional parameter
            row_limit: None                         # Number of rows to train over. A number less than the original table length will randomly subset the specified number of rows. Optional parameter
            batch_size: 32                          # If specified, the training is split into batches. This can reduce RAM usage. Optional parameter
            print_report: False                     # Turn on or turn off generation of the report. Optional parameter
            column_types:
                categorical:                        # Force listed columns to have a categorical type (use a dictionary of values). Optional parameter
                    - gender
                    - marital_status

        infer_settings:                             # Settings for the inference process. Optional parameter
            size: 100                               # Size of the generated data. Optional parameter
            run_parallel: False                     # Turn on or turn off the parallel generation process. Optional parameter
            print_report: False                     # Turn on or turn off generation of the report. Optional parameter
            batch_size: None                        # If specified, the generation is split into batches. This can reduce RAM usage. Optional parameter
            random_seed: None                       # If specified, generates a reproducible result. Optional parameter
        keys:                                       # Keys of the table. Optional parameter
            pk_order_id:
                type: "PK"
                columns:
                    - order_id

            FK1:
                type: "FK"
                columns:
                    - customer_id
                references:
                    table: "CUSTOMER"
                    columns:
                        - customer_id

<i>You can find an example metadata file in [example-metadata/housing_metadata.yaml](example-metadata/housing_metadata.yaml)</i><br>

By providing the necessary information through a metadata file, you can initiate the training and inference processes using the following commands:

```bash
train --metadata_path=PATH_TO_YAML_METADATA_FILE
infer --metadata_path=PATH_TO_YAML_METADATA_FILE
```

Here is a quick example:

```bash
train --metadata_path="./example-metadata/housing_metadata.yaml"
infer --metadata_path="./example-metadata/housing_metadata.yaml"
```

If `--metadata_path` is present and the metadata contains the necessary parameters, other CLI parameters will be ignored.<br>

### Docker images

The train and inference components of <i>syngen</i> are available as public Docker images:

<https://hub.docker.com/r/tdspora/syngen-train>

<https://hub.docker.com/r/tdspora/syngen-infer>

To run the dockerized code (see the parameter descriptions in the *Training* and *Inference* sections) for one table, call:

```bash
docker pull tdspora/syngen-train:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-train \
  --table_name=TABLE_NAME \
  --source=./model_artifacts/YOUR_CSV_FILE.csv

docker pull tdspora/syngen-infer:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-infer \
  --table_name=TABLE_NAME
```

PATH_TO_LOCAL_FOLDER is the absolute path to the folder where your original CSV is stored.

You can add any arguments listed in the corresponding sections for the infer and training processes to the CLI call.

To run the dockerized code with a metadata file, simply call:

```bash
docker pull tdspora/syngen-train:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-train \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML

docker pull tdspora/syngen-infer:latest
docker run --rm \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-infer \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML
```

You can add any arguments listed in the corresponding sections for the infer and training processes to the CLI call; however, they will be overwritten by the corresponding arguments in the metadata file.

#### Logging level

Set the `LOGURU_LEVEL` environment variable to the desired logging level.
For example, to suppress debug messages, add `-e LOGURU_LEVEL=INFO` to the `docker run` command:

```bash
docker pull tdspora/syngen-train:latest
docker run --rm -e LOGURU_LEVEL=INFO \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-train \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML

docker pull tdspora/syngen-infer:latest
docker run --rm -e LOGURU_LEVEL=INFO \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen-infer \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML
```

## Contribution

We welcome contributions from the community to help us improve and maintain our public GitHub repository. We appreciate any feedback, bug reports, or feature requests, and we encourage developers to submit fixes or new features through issues and pull requests.

If you have found a bug or have a feature request, please submit an issue to our GitHub repository. Provide as much detail as possible, including the steps to reproduce the issue or a clear description of the feature request. Our team will review the issue and work with you to address any problems or discuss potential new features.

If you would like to contribute a fix or a new feature, please submit a pull request to our GitHub repository. Make sure your code follows our coding standards and best practices. Our team will review your pull request and work with you to ensure that it meets our standards and is ready for inclusion in our codebase.

We appreciate your contributions and thank you for your interest in helping us maintain and improve our public GitHub repository.
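Returning to the metadata file described above: before launching `train`/`infer` it can be useful to sanity-check the mapping's shape. The sketch below is illustrative and not part of syngen — `validate_metadata` is a hypothetical helper, and the accepted key types (`PK`, `FK`) simply mirror the example; any others shown are assumptions.

```python
# Hypothetical helper (not part of syngen): check that a metadata mapping,
# shaped like the YAML example above, has the required fields per table.

def validate_metadata(metadata: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for table, config in metadata.items():
        # 'source' is marked as a required parameter in the example above.
        if "source" not in config:
            problems.append(f"{table}: missing required 'source'")
        for key_name, key in config.get("keys", {}).items():
            if key.get("type") not in {"PK", "FK"}:
                problems.append(f"{table}.{key_name}: unknown key type")
            # A foreign key must say which table and columns it points to.
            if key.get("type") == "FK" and "references" not in key:
                problems.append(f"{table}.{key_name}: FK needs 'references'")
    return problems

# A dict mirroring the ORDER table from the YAML example above.
metadata = {
    "ORDER": {
        "source": "./files/order.csv",
        "train_settings": {"epochs": 10, "batch_size": 32},
        "keys": {
            "pk_order_id": {"type": "PK", "columns": ["order_id"]},
            "FK1": {
                "type": "FK",
                "columns": ["customer_id"],
                "references": {"table": "CUSTOMER", "columns": ["customer_id"]},
            },
        },
    },
}

print(validate_metadata(metadata))  # an empty list: the shape looks valid
```

In practice you would load the dict from the YAML file (e.g. with PyYAML's `yaml.safe_load`) rather than build it inline; the dict here just keeps the sketch self-contained.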
    "bugtrack_url": null,
    "license": "GPLv3 License",
    "summary": "The tool uncovers patterns, trends, and correlations hidden within your production datasets.",
    "version": "0.1.12",
    "project_urls": {
        "Homepage": "https://github.com/tdspora/syngen"
    },
    "split_keywords": [
        "data",
        "generation",
        "synthetic",
        "vae",
        "tabular"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "61e70a237c7abd9e2c02cef3241542178ea579eac6f8cb9f34b8b562ce9aa0a4",
                "md5": "d05ffe871371c46f1113a36a8ce015e7",
                "sha256": "8a21847d2298b8aedfb7fd815ea43ec3797b92cb3ad209cf7892cb0d92a8d593"
            },
            "downloads": -1,
            "filename": "syngen-0.1.12-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d05ffe871371c46f1113a36a8ce015e7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">3.7",
            "size": 1228382,
            "upload_time": "2023-06-01T12:18:18",
            "upload_time_iso_8601": "2023-06-01T12:18:18.328905Z",
            "url": "https://files.pythonhosted.org/packages/61/e7/0a237c7abd9e2c02cef3241542178ea579eac6f8cb9f34b8b562ce9aa0a4/syngen-0.1.12-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "99679cced42915b82289042e5754d593bf6ef9f11110a36fdc8b49c8232b04eb",
                "md5": "33f666ea0a370aff96020cf305a0bb1b",
                "sha256": "3614781c3a3b8d267ae1c22bb150af3f54f772df1c6e3325049af37375e821e4"
            },
            "downloads": -1,
            "filename": "syngen-0.1.12.tar.gz",
            "has_sig": false,
            "md5_digest": "33f666ea0a370aff96020cf305a0bb1b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">3.7",
            "size": 1223719,
            "upload_time": "2023-06-01T12:18:21",
            "upload_time_iso_8601": "2023-06-01T12:18:21.289974Z",
            "url": "https://files.pythonhosted.org/packages/99/67/9cced42915b82289042e5754d593bf6ef9f11110a36fdc8b49c8232b04eb/syngen-0.1.12.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-01 12:18:21",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "tdspora",
    "github_project": "syngen",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "syngen"
}
        