[![CI/CD](https://github.com/tdspora/syngen/actions/workflows/action-build-deploy.yml/badge.svg?branch=main)](https://github.com/tdspora/syngen/actions/workflows/action-build-deploy.yml)
# EPAM Syngen
EPAM Syngen is an unsupervised tabular data generation tool. It is useful for generating test data from a given table used as a template. Most data types, including floats, integers, datetimes, text, categorical, and binary, are supported. Linked tables, i.e., tables sharing a key, can also be generated using a simple statistical approach.
The source data may be in CSV, Avro, or Excel format; it must be stored locally and encoded in UTF-8.
The tool is based on the variational autoencoder (VAE) model. A Bayesian Gaussian Mixture model is used to further disentangle the latent space.
## Prerequisites
Python 3.10 or 3.11 is required to run the library. The library is tested on Linux and Windows operating systems.
You can download Python from [the official website](https://www.python.org/downloads/) and install manually, or you can install Python [from your terminal](https://docs.python-guide.org/starting/installation/). After the installation of Python, please, check whether [pip is installed](https://pip.pypa.io/en/stable/getting-started/).
## Getting started
Before the installation of the library, you have to [set up the virtual environment](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/).
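For example, a typical setup with the built-in `venv` module on Linux or macOS looks like this (the environment name is arbitrary):
```bash
python -m venv .venv          # create the virtual environment
source .venv/bin/activate     # activate it for the current shell
pip install --upgrade pip     # make sure pip itself is up to date
```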
You can install the library with CLI only:
```bash
pip install syngen
```
Otherwise, if you want to install the UI version with streamlit, run:
```bash
pip install syngen[ui]
```
*Note:* see details of the UI usage in the [corresponding section](#ui-web-interface)
The training and inference processes are separated into two CLI entry points. The training entry point receives the path to the original table or a metadata file, the table name, and the hyperparameters to use.<br>
To start training with default parameters, run:
```bash
train --source PATH_TO_ORIGINAL_CSV \
--table_name TABLE_NAME
```
This will train a model and save the model artifacts to disk.
To generate data with default parameters, simply call:
```bash
infer --table_name TABLE_NAME
```
<i>Please note that the name should match the one you used during training.</i><br>
This will create a csv file with the synthetic table in <i>./model_artifacts/tmp_store/TABLE_NAME/merged_infer_TABLE_NAME.csv</i>.<br>
Here is a quick example:
```bash
train --source ./examples/example-data/housing.csv --table_name Housing
infer --table_name Housing
```
As an example, you can use the <i>"Housing"</i> dataset in [examples/example-data/housing.csv](examples/example-data/housing.csv).
In this example, our real-world data is <a href="https://www.kaggle.com/datasets/camnugent/california-housing-prices" target="_blank">"Housing"</a> from Kaggle.
## Features
### Training
You can add flexibility to the training and inference processes using additional hyperparameters.<br>
For training of a single table, call:
```bash
train --source PATH_TO_ORIGINAL_CSV \
--table_name TABLE_NAME \
--epochs INT \
--row_limit INT \
--drop_null BOOL \
--reports STR \
--batch_size INT
```
*Note:* To specify multiple options for the *--reports* parameter, you need to provide the *--reports* parameter multiple times.
For example:
```bash
train --source PATH_TO_ORIGINAL_CSV \
--table_name TABLE_NAME \
--reports accuracy \
--reports sample
```
The accepted values for the parameter <i>"reports"</i>:
- <i>"none"</i> (default) - no reports will be generated
- <i>"accuracy"</i> - generates an accuracy report to measure the quality of synthetic data relative to the original dataset. This report is produced after the completion of the training process, during which a model learns to generate new data. The synthetic data generated for this report is of the same size as the original dataset to reach more accurate comparison.
- <i>"sample"</i> - generates a sample report (if original data is sampled, the comparison of distributions of original data and sampled data is provided in the report)
- <i>"metrics_only"</i> - outputs the metrics information only to standard output without generation of an accuracy report
- <i>"all"</i> - generates both accuracy and sample reports<br>
Default value is <i>"none"</i>.
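For instance, a training call that combines several of these options might look like this (the file path, table name, and values are illustrative):
```bash
train --source ./data/customers.csv \
  --table_name customers \
  --epochs 20 \
  --row_limit 20000 \
  --drop_null True \
  --batch_size 64 \
  --reports accuracy
```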
To train one or more tables using a metadata file, you can use the following command:
```bash
train --metadata_path PATH_TO_METADATA_YAML
```
Parameters that you can set for the training process:
- <i>source</i> – required parameter for training of single table, a path to the file that you want to use as a reference
- <i>table_name</i> – required parameter for training of single table, an arbitrary string to name the directories
- <i>epochs</i> – the number of training epochs. Since an early stopping mechanism is implemented, a larger number of epochs is generally better
- <i>row_limit</i> – a number of rows to train over. A number less than the original table length will randomly subset the specified number of rows
- <i>drop_null</i> – whether to drop rows with at least one missing value
- <i>batch_size</i> – if specified, the training is split into batches. This can reduce RAM usage
- <i>reports</i> - controls the generation of quality reports, might require significant time for big tables (>10000 rows)
- <i>metadata_path</i> – a path to the metadata file containing the metadata
- <i>column_types</i> - may include the section <i>categorical</i>, which lists the columns a user wants treated as categorical (see the metadata example after the requirements list below)
Requirements for the parameters of the training process:
* <i>source</i> - data type - string
* <i>table_name</i> - data type - string
* <i>epochs</i> - data type - integer, must be equal to or more than 1, default value is 10
* <i>row_limit</i> - data type - integer
* <i>drop_null</i> - data type - boolean, default value - False
* <i>batch_size</i> - data type - integer, must be equal to or more than 1, default value - 32
* <i>reports</i> - data type - if the value is passed through CLI - string, if the value is passed in the metadata file - string or list, accepted values: <i>"none"</i> (default) - no reports will be generated, <i>"all"</i> - generates both accuracy and sample reports, <i>"accuracy"</i> - generates an accuracy report, <i>"sample"</i> - generates a sample report, <i>"metrics_only"</i> - outputs the metrics information only to standard output without generation of a report. Default value is <i>"none"</i>. In the metadata file multiple values can be specified as a list of available options (<i>"accuracy"</i>, <i>"sample"</i>, <i>"metrics_only"</i>) to generate multiple types of reports simultaneously, e.g. [<i>"metrics_only"</i>, <i>"sample"</i>]
* <i>metadata_path</i> - data type - string
* <i>column_types</i> - data type - dictionary with the key <i>categorical</i> - the list of columns (data type - string)
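Since <i>column_types</i> can only be set via the metadata file, here is a minimal, illustrative metadata file using it; the path, table name, and column names are placeholders:
```bash
# Write a minimal metadata file and train with it
cat > ./metadata.yaml <<'EOF'
CUSTOMER:
  train_settings:
    source: "./files/customer.csv"
    column_types:
      categorical:
        - gender
        - marital_status
EOF
train --metadata_path ./metadata.yaml
```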
### Inference (generation)
You can customize the inference process for a single table by calling:
```bash
infer --size INT \
--table_name STR \
--run_parallel BOOL \
--batch_size INT \
--random_seed INT \
--reports STR
```
*Note:* To specify multiple options for the *--reports* parameter, you need to provide the *--reports* parameter multiple times.
For example:
```bash
infer --table_name TABLE_NAME \
--reports accuracy \
--reports metrics_only
```
The accepted values for the parameter <i>"reports"</i>:
- <i>"none"</i> (default) - no reports will be generated
- <i>"accuracy"</i> - generates an accuracy report that compares original and synthetic data patterns to verify the quality of the generated data
- <i>"metrics_only"</i> - outputs the metrics information only to standard output without generation of an accuracy report
- <i>"all"</i> - generates an accuracy report<br>
Default value is <i>"none"</i>.
To generate one or more tables using a metadata file, you can use the following command:
```bash
infer --metadata_path PATH_TO_METADATA
```
Parameters that you can set for the generation process:
- <i>size</i> - the desired number of rows to generate
- <i>table_name</i> – required parameter for inference of single table, the name of the table, same as in training
- <i>run_parallel</i> – whether to use multiprocessing (feasible for tables > 5000 rows)
- <i>batch_size</i> – if specified, the generation is split into batches. This can reduce RAM usage
- <i>random_seed</i> – if specified, generates a reproducible result
- <i>reports</i> - controls the generation of quality reports, might require significant time for big generated tables (>10000 rows)
- <i>metadata_path</i> – a path to metadata file
Requirements for the parameters of the generation process:
* <i>size</i> - data type - integer, must be equal to or more than 1, default value is 100
* <i>table_name</i> - data type - string
* <i>run_parallel</i> - data type - boolean, default value is False
* <i>batch_size</i> - data type - integer, must be equal to or more than 1
* <i>random_seed</i> - data type - integer, must be equal to or more than 0
* <i>reports</i> - data type - if the value is passed through CLI - string, if the value is passed in the metadata file - string or list, accepted values: <i>"none"</i> (default) - no reports will be generated, <i>"all"</i> - generates an accuracy report, <i>"accuracy"</i> - generates an accuracy report, <i>"metrics_only"</i> - outputs the metrics information only to standard output without generation of a report. Default value is <i>"none"</i>. In the metadata file multiple values can be specified as a list of available options (<i>"accuracy"</i>, <i>"metrics_only"</i>) to generate multiple types of reports simultaneously
* <i>metadata_path</i> - data type - string
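For instance, an inference call that sets several of these parameters might look like this (the table name and values are illustrative):
```bash
infer --table_name customers \
  --size 5000 \
  --run_parallel True \
  --batch_size 1000 \
  --random_seed 42 \
  --reports metrics_only
```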
The metadata can contain any of the arguments above for each table. If so, the duplicated arguments from the CLI
will be ignored.
*Note:* If you want to set the logging level, you can use the parameter <i>log_level</i> in the CLI call:
```bash
train --source STR --table_name STR --log_level STR
train --metadata_path STR --log_level STR
infer --size INT --table_name STR --log_level STR
infer --metadata_path STR --log_level STR
```
where <i>log_level</i> might be one of the following values: <i>TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL</i>.
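For example, a hypothetical training call with verbose logging:
```bash
train --source ./data/customers.csv --table_name customers --log_level DEBUG
```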
### Linked tables generation
To generate one or more tables, you may provide metadata in YAML format. By providing information about the relationships
between tables via metadata, it becomes possible to manage complex relationships across any number of tables.
You can also specify additional parameters needed for training and inference in the metadata file; in this case,
the corresponding CLI arguments will be ignored.
*Note:* By using a metadata file, you can also generate tables without relationships.
In this case, the tables will be generated independently.
The yaml metadata file should match the following template:
```yaml
global: # Global settings. Optional parameter. In this section you can specify training and inference settings which will be set for all tables
train_settings: # Settings for training process. Optional parameter
epochs: 10 # Number of epochs if different from the default in the command line options. Optional parameter
drop_null: False # Drop rows with NULL values. Optional parameter
row_limit: null # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter
batch_size: 32 # If specified, the training is split into batches. This can save the RAM. Optional parameter
reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: "none" (default) - no reports will be generated, "all" - generates both accuracy and sample reports, "accuracy" - generates an accuracy report, "sample" - generates a sample report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously, e.g. ["metrics_only", "sample"]. Might require significant time for big tables (>10000 rows).
infer_settings: # Settings for infer process. Optional parameter
size: 100 # Size for generated data. Optional parameter
run_parallel: False # Turn on or turn off parallel training process. Optional parameter
reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: "none" (default) - no reports will be generated, "all" - generates an accuracy report, "accuracy" - generates an accuracy report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously. Might require significant time for big generated tables (>10000 rows).
batch_size: null # If specified, the generation is split into batches. This can save the RAM. Optional parameter
random_seed: null # If specified, generates a reproducible result. Optional parameter
CUSTOMER: # Table name. Required parameter
train_settings: # Settings for training process. Required parameter
source: "./files/customer.csv" # The path to the original data. Supported formats include local files in '.csv', '.avro' formats. Required parameter
epochs: 10 # Number of epochs if different from the default in the command line options. Optional parameter
drop_null: False # Drop rows with NULL values. Optional parameter
row_limit: null # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter
batch_size: 32 # If specified, the training is split into batches. This can save the RAM. Optional parameter
reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: "none" (default) - no reports will be generated, "all" - generates both accuracy and sample reports, "accuracy" - generates an accuracy report, "sample" - generates a sample report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously, e.g. ["metrics_only", "sample"]. Might require significant time for big tables (>10000 rows).
column_types:
categorical: # Force listed columns to have categorical type (use dictionary of values). Optional parameter
- gender
- marital_status
format: # Settings for reading and writing data in '.csv', '.psv', '.tsv', '.txt', '.xls', '.xlsx' format. Optional parameter
sep: ',' # Delimiter to use. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats
quotechar: '"' # The character used to denote the start and end of a quoted item. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats
quoting: minimal # Control field quoting behavior per constants - ["all", "minimal", "non-numeric", "none"]. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats
escapechar: '"' # One-character string used to escape other characters. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats
encoding: null # A string representing the encoding to use in the output file. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats
header: infer # Row number(s) to use as the column names, and the start of the data. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats
skiprows: null # Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats
on_bad_lines: error # Specifies what to do upon encountering a bad line (a line with too many fields) - ["error", "warn", "skip"]. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats
engine: null # Parser engine to use - ["c", "python"]. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats
na_values: null # Additional strings to recognize as NA/NaN. The first value of the array will be used to replace NA/NaN values. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats
sheet_name: 0 # Name of the sheet in the Excel file. Optional parameter. Applicable for '.xls', '.xlsx' formats
infer_settings: # Settings for infer process. Optional parameter
destination: "./files/generated_data_customer.csv" # The path where the generated data will be stored. If the information about 'destination' isn't specified, by default the synthetic data will be stored locally in '.csv'. Supported formats include local files in '.csv', '.avro' formats. Optional parameter
size: 100 # Size for generated data. Optional parameter
run_parallel: False # Turn on or turn off parallel training process. Optional parameter
reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: "none" (default) - no reports will be generated, "all" - generates an accuracy report, "accuracy" - generates an accuracy report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously. Might require significant time for big generated tables (>10000 rows).
batch_size: null # If specified, the generation is split into batches. This can save the RAM. Optional parameter
random_seed: null # If specified, generates a reproducible result. Optional parameter
keys: # Keys of the table. Optional parameter
PK_CUSTOMER_ID: # Name of a key. Only one PK per table.
type: "PK" # The key type. Supported: PK - primary key, FK - foreign key, TKN - token key
columns: # Array of column names
- customer_id
UQ1: # Name of a key
type: "UQ" # One or many unique keys
columns:
- e_mail
FK1: # One or many foreign keys
type: "FK"
columns: # Array of columns in the current table
- e_mail
- alias
references:
table: "PROFILE" # Name of the parent table
columns: # Array of columns in the parent table
- e_mail
- alias
FK2:
type: "FK"
columns:
- address_id
references:
table: "ADDRESS"
columns:
- address_id
ORDER: # Table name. Required parameter
train_settings: # Settings for training process. Required parameter
source: "./files/order.csv" # The path to the original data. Supported formats include local files in 'csv', '.avro' formats. Required parameter
epochs: 10 # Number of epochs if different from the default in the command line options. Optional parameter
drop_null: False # Drop rows with NULL values. Optional parameter
row_limit: null # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter
batch_size: 32 # If specified, the training is split into batches. This can save the RAM. Optional parameter
reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: "none" (default) - no reports will be generated, "all" - generates both accuracy and sample reports, "accuracy" - generates an accuracy report, "sample" - generates a sample report, "metrics_only" - outputs the metrics information only to standard output without generation of a report, e.g. ["metrics_only", "sample"]. Might require significant time for big tables (>10000 rows).
column_types:
categorical: # Force listed columns to have categorical type (use dictionary of values). Optional parameter
- gender
- marital_status
infer_settings: # Settings for infer process. Optional parameter
destination: "./files/generated_data_order.csv" # The path where the generated data will be stored. If the information about 'destination' isn't specified, by default the synthetic data will be stored locally in '.csv'. Supported formats include local files in 'csv', '.avro' formats. Required parameter
size: 100 # Size for generated data. Optional parameter
run_parallel: False # Turn on or turn off parallel training process. Optional parameter
reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: "none" (default) - no reports will be generated, "all" - generates an accuracy report, "accuracy" - generates an accuracy report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously. Might require significant time for big generated tables (>10000 rows).
batch_size: null # If specified, the generation is split into batches. This can save the RAM. Optional parameter
random_seed: null # If specified, generates a reproducible result. Optional parameter
    format: # Settings for reading and writing data in '.csv' format. Optional parameter
sep: ',' # Delimiter to use. Optional parameter
quotechar: '"' # The character used to denote the start and end of a quoted item. Optional parameter
quoting: minimal # Control field quoting behavior per constants - ["all", "minimal", "non-numeric", "none"]. Optional parameter
escapechar: '"' # One-character string used to escape other characters. Optional parameter
encoding: null # A string representing the encoding to use in the output file. Optional parameter
header: infer # Row number(s) to use as the column names, and the start of the data. Optional parameter
skiprows: null # Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. Optional parameter
on_bad_lines: error # Specifies what to do upon encountering a bad line (a line with too many fields) - ["error", "warn", "skip"]. Optional parameter
engine: null # Parser engine to use - ["c", "python"]. Optional parameter
sheet_name: 0 # Name of the sheet in the Excel file. Optional parameter
keys: # Keys of the table. Optional parameter
pk_order_id:
type: "PK"
columns:
- order_id
FK1:
type: "FK"
columns:
- customer_id
references:
table: "CUSTOMER"
columns:
- customer_id
```
*Note:*
<ul>
<li>In the section <i>"global"</i> you can specify training and inference settings for all tables. If the same settings are specified for a specific table, they will override the global settings</li>
<li>If the information about <i>"destination"</i> isn't specified in <i>"infer_settings"</i>, by default the synthetic data will be stored locally in <i>".csv"</i> format</li>
</ul>
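To illustrate the override behavior described above (all names and paths are placeholders), a global <i>epochs: 10</i> combined with a table-level <i>epochs: 50</i> makes the CUSTOMER table train for 50 epochs, while any table without its own setting keeps the global value:
```bash
cat > ./override_metadata.yaml <<'EOF'
global:
  train_settings:
    epochs: 10                        # default for every table
CUSTOMER:
  train_settings:
    source: "./files/customer.csv"
    epochs: 50                        # overrides the global value for this table only
EOF
```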
<i>You can find an example metadata file in [examples/example-metadata/housing_metadata.yaml](examples/example-metadata/housing_metadata.yaml)</i><br>
By providing the necessary information through a metadata file, you can initiate training and inference processes using the following commands:
```bash
train --metadata_path=PATH_TO_YAML_METADATA_FILE
infer --metadata_path=PATH_TO_YAML_METADATA_FILE
```
Here is a quick example:
```bash
train --metadata_path="./examples/example-metadata/housing_metadata.yaml"
infer --metadata_path="./examples/example-metadata/housing_metadata.yaml"
```
If `--metadata_path` is present and the metadata contains the necessary parameters, other CLI parameters will be ignored.<br>
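For reference, a much smaller metadata file that still exercises the PK/FK linkage from the template above might look like this (table names, paths, and columns are placeholders):
```bash
cat > ./linked_metadata.yaml <<'EOF'
CUSTOMER:
  train_settings:
    source: "./files/customer.csv"
  keys:
    PK_CUSTOMER_ID:
      type: "PK"
      columns:
        - customer_id
ORDER:
  train_settings:
    source: "./files/order.csv"
  keys:
    FK_ORDER_CUSTOMER:                # generated values reference CUSTOMER.customer_id
      type: "FK"
      columns:
        - customer_id
      references:
        table: "CUSTOMER"
        columns:
          - customer_id
EOF
train --metadata_path ./linked_metadata.yaml
infer --metadata_path ./linked_metadata.yaml
```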
### Ways to set the value(s) in the section "reports" of the metadata file
The accepted values in the section <i>"reports"</i> in <i>"train_settings"</i>:
- <i>"none"</i> (default) - no reports will be generated
- <i>"accuracy"</i> - generates an accuracy report to measure the quality of synthetic data relative to the original dataset. This report is produced after the completion of the training process, during which a model learns to generate new data. The synthetic data generated for this report is of the same size as the original dataset to reach more accurate comparison.
- <i>"sample"</i> - generates a sample report (if original data is sampled, the comparison of distributions of original data and sampled data is provided in the report)
- <i>"metrics_only"</i> - outputs the metrics information only to standard output without generation of an accuracy report
- <i>"all"</i> - generates both accuracy and sample reports<br>
Default value is <i>"none"</i>.
Examples of how to set the value(s) in the section <i>"reports"</i> in <i>"train_settings"</i>:
```yaml
reports: none
reports: all
reports: accuracy
reports: metrics_only
reports: sample
reports:
- accuracy
- metrics_only
- sample
```
The accepted values for the parameter <i>"reports"</i> in <i>"infer_settings"</i>:
- <i>"none"</i> (default) - no reports will be generated
- <i>"accuracy"</i> - generates an accuracy report to verify the quality of the generated data
- <i>"metrics_only"</i> - outputs the metrics information only to standard output without generation of an accuracy report
- <i>"all"</i> - generates an accuracy report<br>
Default value is <i>"none"</i>.
Examples of how to set the value(s) in the section <i>"reports"</i> in <i>"infer_settings"</i>:
```yaml
reports: none
reports: all
reports: accuracy
reports: metrics_only
reports:
- accuracy
- metrics_only
```
### Docker images
The train and inference components of <i>syngen</i> are available as a public Docker image:
<https://hub.docker.com/r/tdspora/syngen>
To run the dockerized code for one table (see the parameter descriptions in the *Training* and *Inference* sections), call:
```bash
docker pull tdspora/syngen
docker run --rm \
--user $(id -u):$(id -g) \
-v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen \
--task=train \
--table_name=TABLE_NAME \
--source=./model_artifacts/YOUR_CSV_FILE.csv
docker run --rm \
--user $(id -u):$(id -g) \
-v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen \
--task=infer \
--table_name=TABLE_NAME
```
PATH_TO_LOCAL_FOLDER is an absolute path to the folder where your original CSV file is stored.
You can add to the CLI call any arguments listed in the corresponding training and inference sections, as shown below.
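For example, a dockerized training call with extra hyperparameters might look like this (the mounted path, table name, and values are illustrative):
```bash
docker run --rm \
  --user $(id -u):$(id -g) \
  -v /home/user/syngen-data:/src/model_artifacts tdspora/syngen \
  --task=train \
  --table_name=customers \
  --source=./model_artifacts/customers.csv \
  --epochs=20 \
  --reports=accuracy
```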
To run the dockerized code with a metadata file, simply call:
```bash
docker pull tdspora/syngen
docker run --rm \
--user $(id -u):$(id -g) \
-v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen \
--task=train \
--metadata_path=./model_artifacts/PATH_TO_METADATA_YAML
docker run --rm \
--user $(id -u):$(id -g) \
-v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen \
--task=infer \
--metadata_path=./model_artifacts/PATH_TO_METADATA_YAML
```
You can add any arguments listed in the corresponding training and inference sections to the CLI call; however, they will be
overridden by the corresponding arguments in the metadata file.
#### UI web interface
You can access the Streamlit UI web interface by running the following command after installing the library with the UI option:
```bash
pip install syngen[ui]
```
Then create a Python file and insert the code provided below into it:
```python
from syngen import streamlit_app
streamlit_app.start()
```
Run the Python file:
```bash
python your_file.py
```
You can also access the Streamlit UI web interface by launching the container with the following command:
```bash
docker pull tdspora/syngen
docker run -p 8501:8501 tdspora/syngen --webui
```
The UI will be available at <http://localhost:8501>.
#### MLflow monitoring
Set the `MLFLOW_TRACKING_URI` environment variable to the desired MLflow tracking server, for instance:
http://localhost:5000/. You can also set the `MLFLOW_ARTIFACTS_DESTINATION` environment variable to your preferred path
(including the cloud path), where the artifacts should be stored. Additionally, set the `MLFLOW_EXPERIMENT_NAME`
environment variable to the name you prefer for the experiment.
To get the system metrics, please set the `MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING` environment variable to `true`.
By default, the metrics are logged every 10 seconds, but the interval can be changed by setting the environment variable
`MLFLOW_SYSTEM_METRICS_SAMPLING_INTERVAL` (see the [MLflow documentation](https://mlflow.org/docs/latest/system-metrics/index.html) for a more detailed description).
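For a local (non-Docker) run, the variables can be exported in the shell before calling the CLI; the values below are illustrative:
```bash
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_ARTIFACTS_DESTINATION="s3://my-bucket/syngen-artifacts"  # any preferred path, including cloud storage
export MLFLOW_EXPERIMENT_NAME="syngen-experiments"
export MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true
export MLFLOW_SYSTEM_METRICS_SAMPLING_INTERVAL=30                      # seconds between system-metric samples
train --source ./data/customers.csv --table_name customers
```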
When using Docker, ensure the environment variables are set before running the container.
The provided environment variables let you track the training and inference processes and store
the artifacts in the desired location.
You can access the MLflow UI by navigating to the tracking URL in your browser. If you store artifacts in remote storage,
ensure that all necessary credentials are provided before using MLflow.
```bash
docker pull tdspora/syngen:latest
docker run --rm -it \
--user $(id -u):$(id -g) \
-e MLFLOW_TRACKING_URI='http://localhost:5000' \
-e MLFLOW_ARTIFACTS_DESTINATION=MLFLOW_ARTIFACTS_DESTINATION \
-e MLFLOW_EXPERIMENT_NAME=test_name \
-e MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true \
  -e MLFLOW_SYSTEM_METRICS_SAMPLING_INTERVAL=10 \
-v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen \
  --task=train \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML
docker run --rm -it \
--user $(id -u):$(id -g) \
-e MLFLOW_TRACKING_URI='http://localhost:5000' \
-e MLFLOW_ARTIFACTS_DESTINATION=MLFLOW_ARTIFACTS_DESTINATION \
-e MLFLOW_EXPERIMENT_NAME=test_name \
-e MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true \
  -e MLFLOW_SYSTEM_METRICS_SAMPLING_INTERVAL=10 \
-v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen \
  --task=infer \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML
```
## Syngen Installation Guide for macOS ARM (M1/M2) with Python 3.10 or 3.11
### Prerequisites
Before you begin, make sure you have the following installed:
- Python 3.10 or 3.11
- Homebrew (optional but recommended for managing dependencies)
### Installation Steps
1. **Upgrade pip**: Ensure you have the latest version of `pip`.
```sh
pip install --upgrade pip
```
2. **Install Setuptools, Wheel, and Cython**: These packages are necessary for building and installing other dependencies.
```sh
pip install setuptools wheel 'Cython<3'
```
3. **Install Fastavro**: Install a specific version of `fastavro` to avoid build issues.
```sh
pip install --no-build-isolation fastavro==1.5.1
```
4. **Install Syngen**: Now, you can install the Syngen package.
```sh
pip install syngen
```
5. **Install TensorFlow Metal**: This package leverages the GPU capabilities of M1/M2 chips for TensorFlow.
```sh
pip install tensorflow-metal
```
#### From source (development)
Download the repository from GitHub by cloning it or downloading a zip archive, then install it in editable mode:
```sh
git clone https://github.com/tdspora/syngen.git
cd syngen
pip install -e .
```
### Additional Information
- **Homebrew**: If you do not have Homebrew installed, you can install it by running:
```sh
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- **Python 3.10**: Ensure you have Python 3.10 installed. You can use pyenv to manage different Python versions:
```sh
brew install pyenv
pyenv install 3.10.0
pyenv global 3.10.0
```
OR
- **Python 3.11**: Ensure you have Python 3.11 installed. You can use pyenv to manage different Python versions:
```sh
brew install pyenv
pyenv install 3.11.0
pyenv global 3.11.0
```
### Verifying Installation
To verify the installation, run the following command to check if Syngen is installed correctly:
```sh
python -c "import syngen; print(syngen.__version__)"
```
If the command prints the version of Syngen without errors, the installation was successful.
### Troubleshooting
If you encounter any issues during installation, consider the following steps:
- Ensure all dependencies are up-to-date.
- Check for any compatibility issues with other installed packages.
- Consult the Syngen [documentation](https://github.com/tdspora/syngen) or raise an issue on GitHub.
## Contribution
We welcome contributions from the community to help us improve and maintain our public GitHub repository. We appreciate any feedback, bug reports, or feature requests, and we encourage developers to report problems via issues and to submit fixes or new features via pull requests.
If you have found a bug or have a feature request, please submit an issue to our GitHub repository. Please provide as much detail as possible, including steps to reproduce the issue or a clear description of the feature request. Our team will review the issue and work with you to address any problems or discuss any potential new features.
If you would like to contribute a fix or a new feature, please submit a pull request to our GitHub repository. Please make sure your code follows our coding standards and best practices. Our team will review your pull request and work with you to ensure that it meets our standards and is ready for inclusion in our codebase.
We appreciate your contributions, and thank you for your interest in helping us maintain and improve our public GitHub repository.
Raw data
{
"_id": null,
"home_page": "https://github.com/tdspora/syngen",
"name": "syngen-databricks",
"maintainer": "Hanna Imshenetska",
"docs_url": null,
"requires_python": "<3.12,>3.10",
"maintainer_email": null,
"keywords": "data, generation, synthetic, vae, tabular",
"author": "EPAM Systems, Inc.",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/1f/ec/3e7972d199b6bdbf49765bebc8132b913f2a02869b2cd15562a20d02bf23/syngen_databricks-0.10.1.tar.gz",
"platform": null,
"description": "[![CI/CD](https://github.com/tdspora/syngen/actions/workflows/action-build-deploy.yml/badge.svg?branch=main)](https://github.com/tdspora/syngen/actions/workflows/action-build-deploy.yml)\n\n# EPAM Syngen\n\nEPAM Syngen is an unsupervised tabular data generation tool. It is useful for generation of test data with a given table as a template. Most datatypes including floats, integers, datetime, text, categorical, binary are supported. The linked tables i.e., tables sharing a key can also be generated using the simple statistical approach.\nThe source of data might be in CSV, Avro and Excel format and should be located locally and be in UTF-8 encoding.\n\nThe tool is based on the variational autoencoder model (VAE). The Bayesian Gaussian Mixture model is used to further detangle the latent space.\n\n## Prerequisites\n\nPython 3.10 or 3.11 is required to run the library. The library is tested on Linux and Windows operating systems.\nYou can download Python from [the official website](https://www.python.org/downloads/) and install manually, or you can install Python [from your terminal](https://docs.python-guide.org/starting/installation/). After the installation of Python, please, check whether [pip is installed](https://pip.pypa.io/en/stable/getting-started/).\n\n## Getting started\n\nBefore the installation of the library, you have to [set up the virtual environment](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/).\n\nYou can install the library with CLI only:\n\n```bash\npip install syngen\n```\n\nOtherwise, if you want to install the UI version with streamlit, run:\n\n```bash\npip install syngen[ui]\n```\n\n*Note:* see details of the UI usage in the [corresponding section](#ui-web-interface)\n\n\nThe training and inference processes are separated with two CLI entry points. The training one receives paths to the original table, metadata json file or table name and used hyperparameters.<br>\n\nTo start training with defaults parameters run:\n\n```bash\ntrain --source PATH_TO_ORIGINAL_CSV \\\n --table_name TABLE_NAME\n```\n\nThis will train a model and save the model artifacts to disk.\n\nTo generate with defaults parameters data simply call:\n\n```bash\ninfer --table_name TABLE_NAME\n```\n\n<i>Please notice that the name should match the one you used in the training process.</i><br>\nThis will create a csv file with the synthetic table in <i>./model_artifacts/tmp_store/TABLE_NAME/merged_infer_TABLE_NAME.csv</i>.<br>\n\nHere is a quick example:\n\n```bash\ntrain --source ./examples/example-data/housing.csv \u2013-table_name Housing\ninfer --table_name Housing\n```\n\nAs the example you can use the dataset <i>\"Housing\"</i> in [examples/example-data/housing.csv](examples/example-data/housing.csv).\nIn this example, our real-world data is <a href=\"https://www.kaggle.com/datasets/camnugent/california-housing-prices\" target=\"_blank\">\"Housing\"</a> from Kaggle.\n\n## Features\n\n### Training\n\nYou can add flexibility to the training and inference processes using additional hyperparameters.<br>\nFor training of single table call:\n\n```bash\ntrain --source PATH_TO_ORIGINAL_CSV \\\n --table_name TABLE_NAME \\\n --epochs INT \\\n --row_limit INT \\\n --drop_null BOOL \\\n --reports STR \\\n --batch_size INT\n```\n\n*Note:* To specify multiple options for the *--reports* parameter, you need to provide the *--reports* parameter multiple times. 
\nFor example:\n```bash\ntrain --source PATH_TO_ORIGINAL_CSV \\\n --table_name TABLE_NAME \\\n --reports accuracy \\\n --reports sample\n```\nThe accepted values for the parameter <i>\"reports\"</i>:\n - <i>\"none\"</i> (default) - no reports will be generated\n - <i>\"accuracy\"</i> - generates an accuracy report to measure the quality of synthetic data relative to the original dataset. This report is produced after the completion of the training process, during which a model learns to generate new data. The synthetic data generated for this report is of the same size as the original dataset to reach more accurate comparison.\n - <i>\"sample\"</i> - generates a sample report (if original data is sampled, the comparison of distributions of original data and sampled data is provided in the report)\n - <i>\"metrics_only\"</i> - outputs the metrics information only to standard output without generation of an accuracy report\n - <i>\"all\"</i> - generates both accuracy and sample reports<br>\nDefault value is <i>\"none\"</i>.\n\nTo train one or more tables using a metadata file, you can use the following command:\n\n```bash\ntrain --metadata_path PATH_TO_METADATA_YAML\n```\n\nParameters that you can set up for training process:\n\n- <i>source</i> \u2013 required parameter for training of single table, a path to the file that you want to use as a reference\n- <i>table_name</i> \u2013 required parameter for training of single table, an arbitrary string to name the directories\n- <i>epochs</i> \u2013 a number of training epochs. Since the early stopping mechanism is implemented the bigger value of epochs is the better\n- <i>row_limit</i> \u2013 a number of rows to train over. A number less than the original table length will randomly subset the specified number of rows\n- <i>drop_null</i> \u2013 whether to drop rows with at least one missing value\n- <i>batch_size</i> \u2013 if specified, the training is split into batches. This can save the RAM\n- <i>reports</i> - controls the generation of quality reports, might require significant time for big tables (>10000 rows)\n- <i>metadata_path</i> \u2013 a path to the metadata file containing the metadata\n- <i>column_types</i> - might include the section <i>categorical</i> which contains the listed columns defined as categorical by a user\n\nRequirements for parameters of training process:\n* <i>source</i> - data type - string\n* <i>table_name</i> - data type - string\n* <i>epochs</i> - data type - integer, must be equal to or more than 1, default value is 10\n* <i>row_limit</i> - data type - integer\n* <i>drop_null</i> - data type - boolean, default value - False\n* <i>batch_size</i> - data type - integer, must be equal to or more than 1, default value - 32\n* <i>reports</i> - data type - if the value is passed through CLI - string, if the value is passed in the metadata file - string or list, accepted values: <i>\"none\"</i> (default) - no reports will be generated, <i>\"all\"</i> - generates both accuracy and sample reports, <i>\"accuracy\"</i> - generates an accuracy report, <i>\"sample\"</i> - generates a sample report, <i>\"metrics_only\"</i> - outputs the metrics information only to standard output without generation of a report. Default value is <i>\"none\"</i>. In the metadata file multiple values can be specified as a list of available options (<i>\"accuracy\"</i>, <i>\"sample\"</i>, <i>\"metrics_only\"</i>) to generate multiple types of reports simultaneously, e.g. 
[<i>\"metrics_only\"</i>, <i>\"sample\"</i>]\n* <i>metadata_path</i> - data type - string\n* <i>column_types</i> - data type - dictionary with the key <i>categorical</i> - the list of columns (data type - string)\n\n### Inference (generation)\n\nYou can customize the inference processes by calling for one table:\n\n```bash\ninfer --size INT \\\n --table_name STR \\\n --run_parallel BOOL \\\n --batch_size INT \\\n --random_seed INT \\\n --reports STR\n```\n\n*Note:* To specify multiple options for the *--reports* parameter, you need to provide the *--reports* parameter multiple times. \nFor example:\n```bash\ninfer --table_name TABLE_NAME \\\n --reports accuracy \\\n --reports metrics_only\n```\nThe accepted values for the parameter <i>\"reports\"</i>:\n - <i>\"none\"</i> (default) - no reports will be generated\n - <i>\"accuracy\"</i> - generates an accuracy report that compares original and synthetic data patterns to verify the quality of the generated data\n - <i>\"metrics_only\"</i> - outputs the metrics information only to standard output without generation of an accuracy report\n - <i>\"all\"</i> - generates an accuracy report<br>\nDefault value is <i>\"none\"</i>.\n\nTo generate one or more tables using a metadata file, you can use the following command:\n\n```bash\ninfer --metadata_path PATH_TO_METADATA\n```\n\nThe parameters which you can set up for generation process:\n\n- <i>size</i> - the desired number of rows to generate\n- <i>table_name</i> \u2013 required parameter for inference of single table, the name of the table, same as in training\n- <i>run_parallel</i> \u2013 whether to use multiprocessing (feasible for tables > 5000 rows)\n- <i>batch_size</i> \u2013 if specified, the generation is split into batches. This can save the RAM\n- <i>random_seed</i> \u2013 if specified, generates a reproducible result\n- <i>reports</i> - controls the generation of quality reports, might require significant time for big generated tables (>10000 rows)\n- <i>metadata_path</i> \u2013 a path to metadata file\n\nRequirements for parameters of generation process:\n* <i>size</i> - data type - integer, must be equal to or more than 1, default value is 100\n* <i>table_name</i> - data type - string\n* <i>run_parallel</i> - data type - boolean, default value is False\n* <i>batch_size</i> - data type - integer, must be equal to or more than 1\n* <i>random_seed</i> - data type - integer, must be equal to or more than 0\n* <i>reports</i> - data type - if the value is passed through CLI - string, if the value is passed in the metadata file - string or list, accepted values: <i>\"none\"</i> (default) - no reports will be generated, <i>\"all\"</i> - generates an accuracy report, <i>\"accuracy\"</i> - generates an accuracy report, <i>\"metrics_only\"</i> - outputs the metrics information only to standard output without generation of a report. Default value is <i>\"none\"</i>. In the metadata file multiple values can be specified as a list of available options (<i>\"accuracy\"</i>, <i>\"metrics_only\"</i>) to generate multiple types of reports simultaneously\n* <i>metadata_path</i> - data type - string\n\nThe metadata can contain any of the arguments above for each table. 
If so, the duplicated arguments from the CLI\nwill be ignored.\n\n*Note:* If you want to set the logging level, you can use the parameter <i>log_level</i> in the CLI call:\n\n```bash\ntrain --source STR --table_name STR --log_level STR\ntrain --metadata_path STR --log_level STR\ninfer --size INT --table_name STR --log_level STR\ninfer --metadata_path STR --log_level STR\n```\n\nwhere <i>log_level</i> might be one of the following values: <i>TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL</i>.\n\n### Linked tables generation\n\nTo generate one or more tables, you might provide metadata in yaml format. By providing information about the relationships\nbetween tables via metadata, it becomes possible to manage complex relationships across any number of tables.\nYou can also specify additional parameters needed for training and inference in the metadata file and in this case,\nthey will be ignored in the CLI call.\n\n*Note:* By using metadata file, you can also generate tables with absent relationships.\nIn this case, the tables will be generated independently.\n\nThe yaml metadata file should match the following template:\n\n```yaml\nglobal: # Global settings. Optional parameter. In this section you can specify training and inference settings which will be set for all tables\n train_settings: # Settings for training process. Optional parameter\n epochs: 10 # Number of epochs if different from the default in the command line options. Optional parameter\n drop_null: False # Drop rows with NULL values. Optional parameter\n row_limit: null # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter\n batch_size: 32 # If specified, the training is split into batches. This can save the RAM. Optional parameter\n reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: \"none\" (default) - no reports will be generated, \"all\" - generates both accuracy and sample reports, \"accuracy\" - generates an accuracy report, \"sample\" - generates a sample report, \"metrics_only\" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously, e.g. [\"metrics_only\", \"sample\"]. Might require significant time for big tables (>10000 rows).\n\n infer_settings: # Settings for infer process. Optional parameter\n size: 100 # Size for generated data. Optional parameter\n run_parallel: False # Turn on or turn off parallel training process. Optional parameter\n reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: \"none\" (default) - no reports will be generated, \"all\" - generates an accuracy report, \"accuracy\" - generates an accuracy report, \"metrics_only\" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously. Might require significant time for big generated tables (>10000 rows). \n batch_size: null # If specified, the generation is split into batches. This can save the RAM. Optional parameter\n random_seed: null # If specified, generates a reproducible result. Optional parameter\n\nCUSTOMER: # Table name. Required parameter\n train_settings: # Settings for training process. Required parameter\n source: \"./files/customer.csv\" # The path to the original data. 
Supported formats include local files in '.csv', '.avro' formats. Required parameter\n epochs: 10 # Number of epochs if different from the default in the command line options. Optional parameter\n drop_null: False # Drop rows with NULL values. Optional parameter\n row_limit: null # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter\n batch_size: 32 # If specified, the training is split into batches. This can save the RAM. Optional parameter\n reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: \"none\" (default) - no reports will be generated, \"all\" - generates both accuracy and sample reports, \"accuracy\" - generates an accuracy report, \"sample\" - generates a sample report, \"metrics_only\" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously, e.g. [\"metrics_only\", \"sample\"]. Might require significant time for big tables (>10000 rows). \n column_types:\n categorical: # Force listed columns to have categorical type (use dictionary of values). Optional parameter\n - gender\n - marital_status\n\n format: # Settings for reading and writing data in '.csv', '.psv', '.tsv', '.txt', '.xls', '.xlsx' format. Optional parameter\n sep: ',' # Delimiter to use. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats\n quotechar: '\"' # The character used to denote the start and end of a quoted item. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats\n quoting: minimal # Control field quoting behavior per constants - [\"all\", \"minimal\", \"non-numeric\", \"none\"]. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats\n escapechar: '\"' # One-character string used to escape other characters. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats\n encoding: null # A string representing the encoding to use in the output file. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats\n header: infer # Row number(s) to use as the column names, and the start of the data. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats\n skiprows: null # Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats\n on_bad_lines: error # Specifies what to do upon encountering a bad line (a line with too many fields) - [\"error\", \"warn\", \"skip\"]. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats\n engine: null # Parser engine to use - [\"c\", \"python\"]. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats\n na_values: null # Additional strings to recognize as NA/NaN. The first value of the array will be used to replace NA/NaN values. Optional parameter. Applicable for '.csv', '.psv', '.tsv', '.txt' formats\n sheet_name: 0 # Name of the sheet in the Excel file. Optional parameter. Applicable for '.xls', '.xlsx' formats\n infer_settings: # Settings for infer process. Optional parameter\n destination: \"./files/generated_data_customer.csv\" # The path where the generated data will be stored. If the information about 'destination' isn't specified, by default the synthetic data will be stored locally in '.csv'. Supported formats include local files in '.csv', '.avro' formats. 
Optional parameter\n size: 100 # Size for generated data. Optional parameter\n run_parallel: False # Turn on or turn off parallel training process. Optional parameter\n reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: \"none\" (default) - no reports will be generated, \"all\" - generates an accuracy report, \"accuracy\" - generates an accuracy report, \"metrics_only\" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously. Might require significant time for big generated tables (>10000 rows).\n batch_size: null # If specified, the generation is split into batches. This can save the RAM. Optional parameter\n random_seed: null # If specified, generates a reproducible result. Optional parameter\n\n keys: # Keys of the table. Optional parameter\n PK_CUSTOMER_ID: # Name of a key. Only one PK per table.\n type: \"PK\" # The key type. Supported: PK - primary key, FK - foreign key, TKN - token key\n columns: # Array of column names\n - customer_id\n\n UQ1: # Name of a key\n type: \"UQ\" # One or many unique keys\n columns:\n - e_mail\n\n FK1: # One or many foreign keys\n type: \"FK\"\n columns: # Array of columns in the current table\n - e_mail\n - alias\n references:\n table: \"PROFILE\" # Name of the parent table\n columns: # Array of columns in the parent table\n - e_mail\n - alias\n\n FK2:\n type: \"FK\"\n columns:\n - address_id\n references:\n table: \"ADDRESS\"\n columns:\n - address_id\n\n\nORDER: # Table name. Required parameter\n train_settings: # Settings for training process. Required parameter\n source: \"./files/order.csv\" # The path to the original data. Supported formats include local files in 'csv', '.avro' formats. Required parameter\n epochs: 10 # Number of epochs if different from the default in the command line options. Optional parameter\n drop_null: False # Drop rows with NULL values. Optional parameter\n row_limit: null # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter\n batch_size: 32 # If specified, the training is split into batches. This can save the RAM. Optional parameter\n reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: \"none\" (default) - no reports will be generated, \"all\" - generates both accuracy and sample reports, \"accuracy\" - generates an accuracy report, \"sample\" - generates a sample report, \"metrics_only\" - outputs the metrics information only to standard output without generation of a report, e.g. [\"metrics_only\", \"sample\"]. Might require significant time for big tables (>10000 rows).\n column_types:\n categorical: # Force listed columns to have categorical type (use dictionary of values). Optional parameter\n - gender\n - marital_status\n\n infer_settings: # Settings for infer process. Optional parameter\n destination: \"./files/generated_data_order.csv\" # The path where the generated data will be stored. If the information about 'destination' isn't specified, by default the synthetic data will be stored locally in '.csv'. Supported formats include local files in 'csv', '.avro' formats. Required parameter\n size: 100 # Size for generated data. Optional parameter\n run_parallel: False # Turn on or turn off parallel training process. Optional parameter\n reports: none # Controls the generation of quality reports. 
Optional parameter. Accepted values: \"none\" (default) - no reports will be generated, \"all\" - generates an accuracy report, \"accuracy\" - generates an accuracy report, \"metrics_only\" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously. Might require significant time for big generated tables (>10000 rows).\n batch_size: null # If specified, the generation is split into batches. This can save the RAM. Optional parameter\n random_seed: null # If specified, generates a reproducible result. Optional parameter\n format: # Settings for reading and writing data in 'csv' format. Optional parameter\n sep: ',' # Delimiter to use. Optional parameter\n quotechar: '\"' # The character used to denote the start and end of a quoted item. Optional parameter\n quoting: minimal # Control field quoting behavior per constants - [\"all\", \"minimal\", \"non-numeric\", \"none\"]. Optional parameter\n escapechar: '\"' # One-character string used to escape other characters. Optional parameter\n encoding: null # A string representing the encoding to use in the output file. Optional parameter\n header: infer # Row number(s) to use as the column names, and the start of the data. Optional parameter\n skiprows: null # Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. Optional parameter\n on_bad_lines: error # Specifies what to do upon encountering a bad line (a line with too many fields) - [\"error\", \"warn\", \"skip\"]. Optional parameter\n engine: null # Parser engine to use - [\"c\", \"python\"]. Optional parameter\n sheet_name: 0 # Name of the sheet in the Excel file. Optional parameter\n keys: # Keys of the table. Optional parameter\n pk_order_id:\n type: \"PK\"\n columns:\n - order_id\n\n FK1:\n type: \"FK\"\n columns:\n - customer_id\n references:\n table: \"CUSTOMER\"\n columns: \n - customer_id\n```\n\n*Note:*\n<ul>\n<li>In the section <i>\"global\"</i> you can specify training and inference settings for all tables. If the same settings are specified for a specific table, they will override the global settings</li>\n<li>If the information about <i>\"destination\"</i> isn't specified in <i>\"infer_settings\"</i>, by default the synthetic data will be stored locally in <i>\".csv\"</i> format</li>\n</ul>\n\n<i>You can find the example of metadata file in [examples/example-metadata/housing_metadata.yaml](examples/example-metadata/housing_metadata.yaml)</i><br>\n\nBy providing the necessary information through a metadata file, you can initiate training and inference processes using the following commands:\n\n```bash\ntrain --metadata_path=PATH_TO_YAML_METADATA_FILE\ninfer --metadata_path=PATH_TO_YAML_METADATA_FILE\n```\nHere is a quick example:\n\n```bash\ntrain --metadata_path=\"./examples/example-metadata/housing_metadata.yaml\"\ninfer --metadata_path=\"./examples/example-metadata/housing_metadata.yaml\"\n```\n\nIf `--metadata_path` is present and the metadata contains the necessary parameters, other CLI parameters will be ignored.<br>\n\n### Ways to set the value(s) in the section \"reports\" of the metadata file\n\nThe accepted values in the section <i>\"reports\"</i> in <i>\"train_settings\"</i>:\n - <i>\"none\"</i> (default) - no reports will be generated\n - <i>\"accuracy\"</i> - generates an accuracy report to measure the quality of synthetic data relative to the original dataset. 
To run the dockerized code by providing the metadata file, simply call:

```bash
docker pull tdspora/syngen
docker run --rm \
  --user $(id -u):$(id -g) \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen \
  --task=train \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML

docker run --rm \
  --user $(id -u):$(id -g) \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen \
  --task=infer \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML
```

You can add any arguments listed in the corresponding sections for the infer and training processes in the CLI call; however, they will be overwritten by the corresponding arguments in the metadata file.

#### UI web interface

You can access the streamlit UI web interface by running the following command after installing the library with the UI option:

```bash
pip install syngen[ui]
```

Then create a python file and insert the code provided below into it:

```python
from syngen import streamlit_app


streamlit_app.start()
```

Run the python file:

```bash
python your_file.py
```

You can also access the streamlit UI web interface by launching the container with the following command:

```bash
docker pull tdspora/syngen
docker run -p 8501:8501 tdspora/syngen --webui
```

The UI will be available at <http://localhost:8501>.

#### MLflow monitoring

Set the `MLFLOW_TRACKING_URI` environment variable to the desired MLflow tracking server, for instance:
http://localhost:5000/. You can also set the `MLFLOW_ARTIFACTS_DESTINATION` environment variable to your preferred path
(including a cloud path) where the artifacts should be stored. Additionally, set the `MLFLOW_EXPERIMENT_NAME`
environment variable to the name you prefer for the experiment.
To get the system metrics, set the `MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING` environment variable to `true`.
By default, the metrics are logged every 10 seconds, but the interval may be changed by setting the environment variable
`MLFLOW_SYSTEM_METRICS_SAMPLING_INTERVAL` (for a more detailed description, look [here](https://mlflow.org/docs/latest/system-metrics/index.html)).

When using Docker, ensure the environment variables are set before running the container.

The provided environment variables allow you to track the training and inference processes and store
the artifacts in the desired location.
You can access the MLflow UI by navigating to the provided URL in your browser. If you store artifacts in remote storage,
ensure that all necessary credentials are provided before using MLflow.
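As a minimal local (non-Docker) sketch, assuming an MLflow tracking server is already running at http://localhost:5000 (the experiment name below is an illustrative placeholder):

```bash
# Illustrative only: experiment name is a placeholder
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_EXPERIMENT_NAME="housing_experiment"
export MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true

train --metadata_path="./examples/example-metadata/housing_metadata.yaml"
infer --metadata_path="./examples/example-metadata/housing_metadata.yaml"
```

When running in Docker instead, pass the same variables with the `-e` flag, as in the example below: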
```bash
docker pull tdspora/syngen:latest
docker run --rm -it \
  --user $(id -u):$(id -g) \
  -e MLFLOW_TRACKING_URI='http://localhost:5000' \
  -e MLFLOW_ARTIFACTS_DESTINATION=MLFLOW_ARTIFACTS_DESTINATION \
  -e MLFLOW_EXPERIMENT_NAME=test_name \
  -e MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true \
  -e MLFLOW_SYSTEM_METRICS_SAMPLING_INTERVAL=10 \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen \
  --task=train \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML

docker run --rm -it \
  --user $(id -u):$(id -g) \
  -e MLFLOW_TRACKING_URI='http://localhost:5000' \
  -e MLFLOW_ARTIFACTS_DESTINATION=MLFLOW_ARTIFACTS_DESTINATION \
  -e MLFLOW_EXPERIMENT_NAME=test_name \
  -e MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true \
  -e MLFLOW_SYSTEM_METRICS_SAMPLING_INTERVAL=10 \
  -v PATH_TO_LOCAL_FOLDER:/src/model_artifacts tdspora/syngen \
  --task=infer \
  --metadata_path=./model_artifacts/PATH_TO_METADATA_YAML
```

## Syngen Installation Guide for MacOS ARM (M1/M2) with Python 3.10 or 3.11

### Prerequisites

Before you begin, make sure you have the following installed:

- Python 3.10 or 3.11
- Homebrew (optional but recommended for managing dependencies)

### Installation Steps

1. **Upgrade pip**: Ensure you have the latest version of `pip`.

   ```sh
   pip install --upgrade pip
   ```

2. **Install Setuptools, Wheel, and Cython**: These packages are necessary for building and installing other dependencies.

   ```sh
   pip install setuptools wheel 'Cython<3'
   ```

3. **Install Fastavro**: Install a specific version of `fastavro` to avoid build issues.

   ```sh
   pip install --no-build-isolation fastavro==1.5.1
   ```

4. **Install Syngen**: Now you can install the Syngen package.

   ```sh
   pip install syngen
   ```

5. **Install TensorFlow Metal**: This package leverages the GPU capabilities of M1/M2 chips for TensorFlow.

   ```sh
   pip install tensorflow-metal
   ```
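As an optional sanity check (not part of the official steps), you can verify that TensorFlow sees the Metal GPU:

```sh
# Should print a non-empty list, e.g. [PhysicalDevice(name='/physical_device:GPU:0', ...)]
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```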
#### From source (development)

Download the repository from GitHub by cloning it or as a zip file.
Then install it in editable mode:

```sh
pip install -e .
```

### Additional Information

- **Homebrew**: If you do not have Homebrew installed, you can install it by running:

  ```sh
  /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  ```

- **Python 3.10**: Ensure you have Python 3.10 installed. You can use pyenv to manage different Python versions:

  ```sh
  brew install pyenv
  pyenv install 3.10.0
  pyenv global 3.10.0
  ```
OR
- **Python 3.11**: Ensure you have Python 3.11 installed. You can use pyenv to manage different Python versions:

  ```sh
  brew install pyenv
  pyenv install 3.11.0
  pyenv global 3.11.0
  ```

### Verifying Installation

To verify the installation, run the following command to check if Syngen is installed correctly:

```sh
python -c "import syngen; print(syngen.__version__)"
```

If the command prints the version of Syngen without errors, the installation was successful.

### Troubleshooting

If you encounter any issues during installation, consider the following steps:

- Ensure all dependencies are up to date.
- Check for any compatibility issues with other installed packages.
- Consult the Syngen [documentation](https://github.com/tdspora/syngen) or raise an issue on GitHub.

## Contribution

We welcome contributions from the community to help us improve and maintain our public GitHub repository. We appreciate any feedback, bug reports, or feature requests, and we encourage developers to submit fixes or new features using issues.

If you have found a bug or have a feature request, please submit an issue to our GitHub repository. Please provide as much detail as possible, including steps to reproduce the issue or a clear description of the feature request. Our team will review the issue and work with you to address any problems or discuss any potential new features.

If you would like to contribute a fix or a new feature, please submit a pull request to our GitHub repository. Please make sure your code follows our coding standards and best practices. Our team will review your pull request and work with you to ensure that it meets our standards and is ready for inclusion in our codebase.

We appreciate your contributions, and thank you for your interest in helping us maintain and improve our public GitHub repository.
"bugtrack_url": null,
"license": "GPLv3 License",
"summary": "The tool uncovers patterns, trends, and correlations hidden within your production datasets.",
"version": "0.10.1",
"project_urls": {
"Homepage": "https://github.com/tdspora/syngen"
},
"split_keywords": [
"data",
" generation",
" synthetic",
" vae",
" tabular"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "684bf18dc0dcb20c6ca41b3382ccbf3db6ca0478012ab1e2be9617eaa59b5452",
"md5": "83c21253a6fd22e044079e8f125f8ba1",
"sha256": "ac36606c5fcddbeebcd0bb4fcf6abaf06473a3863ba0e127eeb7ea0a2aa2348c"
},
"downloads": -1,
"filename": "syngen_databricks-0.10.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "83c21253a6fd22e044079e8f125f8ba1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.12,>3.10",
"size": 1909293,
"upload_time": "2024-12-18T11:23:53",
"upload_time_iso_8601": "2024-12-18T11:23:53.258540Z",
"url": "https://files.pythonhosted.org/packages/68/4b/f18dc0dcb20c6ca41b3382ccbf3db6ca0478012ab1e2be9617eaa59b5452/syngen_databricks-0.10.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "1fec3e7972d199b6bdbf49765bebc8132b913f2a02869b2cd15562a20d02bf23",
"md5": "52cf8701642a86c750ccdf66486d4141",
"sha256": "56d9b6ef8010f626393486a928a62595f6476a4bef6eac71a9c217570f3e1d7a"
},
"downloads": -1,
"filename": "syngen_databricks-0.10.1.tar.gz",
"has_sig": false,
"md5_digest": "52cf8701642a86c750ccdf66486d4141",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.12,>3.10",
"size": 1890356,
"upload_time": "2024-12-18T11:23:56",
"upload_time_iso_8601": "2024-12-18T11:23:56.485890Z",
"url": "https://files.pythonhosted.org/packages/1f/ec/3e7972d199b6bdbf49765bebc8132b913f2a02869b2cd15562a20d02bf23/syngen_databricks-0.10.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-18 11:23:56",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "tdspora",
"github_project": "syngen",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "aiohttp",
"specs": [
[
">=",
"3.10.11"
]
]
},
{
"name": "attrs",
"specs": []
},
{
"name": "avro",
"specs": []
},
{
"name": "base32-crockford",
"specs": []
},
{
"name": "boto3",
"specs": []
},
{
"name": "category_encoders",
"specs": [
[
"==",
"2.6.3"
]
]
},
{
"name": "click",
"specs": []
},
{
"name": "Jinja2",
"specs": []
},
{
"name": "keras",
"specs": [
[
"==",
"2.15.*"
]
]
},
{
"name": "lazy",
"specs": [
[
"==",
"1.4"
]
]
},
{
"name": "loguru",
"specs": []
},
{
"name": "MarkupSafe",
"specs": [
[
"==",
"2.1.1"
]
]
},
{
"name": "marshmallow",
"specs": [
[
"==",
"3.19.*"
]
]
},
{
"name": "matplotlib",
"specs": [
[
"==",
"3.9.*"
]
]
},
{
"name": "mlflow",
"specs": [
[
"==",
"2.16.*"
]
]
},
{
"name": "numpy",
"specs": [
[
"==",
"1.26.*"
]
]
},
{
"name": "openpyxl",
"specs": []
},
{
"name": "pandas",
"specs": [
[
"==",
"2.2.*"
]
]
},
{
"name": "pandavro",
"specs": [
[
"==",
"1.8.*"
]
]
},
{
"name": "pathos",
"specs": [
[
"==",
"0.2.*"
]
]
},
{
"name": "pillow",
"specs": [
[
"==",
"10.3.*"
]
]
},
{
"name": "psutil",
"specs": []
},
{
"name": "py-ulid",
"specs": []
},
{
"name": "pytest",
"specs": []
},
{
"name": "pytest-reportportal",
"specs": []
},
{
"name": "python-slugify",
"specs": [
[
">=",
"7.0.0"
]
]
},
{
"name": "PyYAML",
"specs": [
[
"==",
"6.*"
]
]
},
{
"name": "reportportal-client",
"specs": []
},
{
"name": "scikit_learn",
"specs": [
[
"==",
"1.5.*"
]
]
},
{
"name": "scipy",
"specs": [
[
"==",
"1.14.*"
]
]
},
{
"name": "seaborn",
"specs": [
[
"==",
"0.13.*"
]
]
},
{
"name": "setuptools",
"specs": [
[
"==",
"74.1.*"
]
]
},
{
"name": "tensorflow",
"specs": [
[
"==",
"2.15.*"
]
]
},
{
"name": "tornado",
"specs": [
[
"==",
"6.4.*"
]
]
},
{
"name": "tqdm",
"specs": [
[
"==",
"4.66.3"
]
]
},
{
"name": "Werkzeug",
"specs": [
[
"==",
"3.1.2"
]
]
},
{
"name": "xlrd",
"specs": []
},
{
"name": "xlwt",
"specs": []
}
],
"lcname": "syngen-databricks"
}