# tpch-datagen
A utility to generate TPC-H data in parallel using [DuckDB](https://duckdb.org) and multi-processing.
[<img src="https://img.shields.io/badge/GitHub-gizmodata%2Ftpch--datagen-blue.svg?logo=Github">](https://github.com/gizmodata/tpch-datagen)
[![tpch-datagen-ci](https://github.com/gizmodata/tpch-datagen/actions/workflows/ci.yml/badge.svg)](https://github.com/gizmodata/tpch-datagen/actions/workflows/ci.yml)
[![Supported Python Versions](https://img.shields.io/pypi/pyversions/tpch-datagen)](https://pypi.org/project/tpch-datagen/)
[![PyPI version](https://badge.fury.io/py/tpch-datagen.svg)](https://badge.fury.io/py/tpch-datagen)
[![PyPI Downloads](https://img.shields.io/pypi/dm/tpch-datagen.svg)](https://pypi.org/project/tpch-datagen/)
# Why?
Generating TPC-H data can be time-consuming and resource-intensive. This project reduces that cost by generating the data in parallel, using DuckDB and multi-processing.
# Setup (to run locally)
## Install Python package
You can install `tpch-datagen` from PyPI or from source.
### Option 1 - from PyPI
```shell
# Create the virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
pip install tpch-datagen
```
### Option 2 - from source - for development
```shell
git clone https://github.com/gizmodata/tpch-datagen
cd tpch-datagen
# Create the virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel
# Install TPC-H Datagen - in editable mode with the dev extras
# (quoted so that shells such as zsh do not treat the brackets as a glob pattern)
pip install --editable ".[dev]"
```
### Note
For the following commands: if you are running from source in `--editable` mode (for development purposes), you will need to set the `PYTHONPATH` environment variable as follows:
```shell
export PYTHONPATH=$(pwd)/src
```
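If `PYTHONPATH` is already set in your shell, the export above replaces it. A minimal sketch of a variant that prepends `src` while keeping any existing entries (this assumes a POSIX-compatible shell and that you run it from the repository root):

```shell
# Prepend ./src to PYTHONPATH, keeping any value that is already set.
# "${PYTHONPATH:+:${PYTHONPATH}}" expands to ":<old value>" only when
# PYTHONPATH is non-empty, so no stray trailing colon is added.
export PYTHONPATH="$(pwd)/src${PYTHONPATH:+:${PYTHONPATH}}"

# Show the result so you can confirm the path looks right.
echo "${PYTHONPATH}"
```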
### Usage
Here are the options for the `tpch-datagen` command:
```shell
tpch-datagen --help
Usage: tpch-datagen [OPTIONS]

Options:
  --version / --no-version        Prints the TPC-H Datagen package version and
                                  exits. [required]
  --scale-factor INTEGER          The TPC-H Scale Factor to use for data
                                  generation.
  --data-directory TEXT           The target output data directory to put the
                                  files into [default: data; required]
  --work-directory TEXT           The work directory to use for data
                                  generation. [default: /tmp; required]
  --overwrite / --no-overwrite    Can we overwrite the target directory if it
                                  already exists... [default: no-overwrite;
                                  required]
  --num-chunks INTEGER            The number of chunks that will be generated
                                  - more chunks equals smaller memory
                                  requirements, but more files generated.
                                  [default: 10; required]
  --num-processes INTEGER         The maximum number of processes for the
                                  multi-processing pool to use for data
                                  generation. [default: 10; required]
  --duckdb-threads INTEGER        The number of DuckDB threads to use for data
                                  generation (within each job process).
                                  [default: 1; required]
  --per-thread-output / --no-per-thread-output
                                  Controls whether to write the output to a
                                  single file or multiple files (for each
                                  process). [default: per-thread-output;
                                  required]
  --compression-method [none|snappy|gzip|zstd]
                                  The compression method to use for the
                                  parquet files generated. [default: zstd;
                                  required]
  --file-size-bytes TEXT          The target file size for the parquet files
                                  generated. [default: 100m; required]
  --help                          Show this message and exit.
```
> [!NOTE]
> The default values shown above may differ on your machine, because some of them depend on the number of CPU cores available, among other things.
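As an illustration of how the options fit together, here is a minimal sketch of an invocation that sizes the process pool from the detected CPU count. The values (scale factor 1, 10 chunks, the `data` directory) and the `DRY_RUN` variable are assumptions made for this example, not recommendations from the project; only flags listed in the help output above are used.

```shell
# Illustrative values only (assumptions for this example).
SCALE_FACTOR=1
DATA_DIR="data"
NUM_CHUNKS=10

# Hypothetical safety switch: 1 prints the command, 0 actually runs it.
DRY_RUN="${DRY_RUN:-1}"

# Detect the number of online CPU cores; fall back to 1 if unavailable.
NUM_PROCESSES="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"

# Build the argument list from flags shown in the help output.
set -- \
  --scale-factor "${SCALE_FACTOR}" \
  --data-directory "${DATA_DIR}" \
  --num-chunks "${NUM_CHUNKS}" \
  --num-processes "${NUM_PROCESSES}" \
  --compression-method zstd

if [ "${DRY_RUN}" = "0" ] && command -v tpch-datagen >/dev/null 2>&1; then
  tpch-datagen "$@"
else
  echo "Would run: tpch-datagen $*"
fi
```

Run it with `DRY_RUN=0` once the printed command looks right. Per the help text, raising `--num-chunks` lowers memory requirements at the cost of producing more files.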
### Handy development commands
#### Version management
##### Bump the version of the application - (you must have installed from source with the [dev] extras)
```bash
bumpver update --patch
```
Raw data
{
"_id": null,
"home_page": null,
"name": "tpch-datagen",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "tpc, tpch, tpc-h, data, generator, datagen",
"author": null,
"author_email": "Philip Moore <prmoore77@hotmail.com>",
"download_url": "https://files.pythonhosted.org/packages/24/cc/221c472423113408ce0ebb86159b2d37488c3e6212168cadb0380370ccce/tpch_datagen-0.0.4.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Copyright 2024 Gizmo Data LLC Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.",
"summary": "A package which makes it easy to generate TPC-H data in parallel with DuckDB",
"version": "0.0.4",
"project_urls": {
"Homepage": "https://github.com/gizmodata/tpch-datagen"
},
"split_keywords": [
"tpc",
" tpch",
" tpc-h",
" data",
" generator",
" datagen"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "b2ec08a97d0b8407e44ca5651db195690d0b57eda29d3d33fa2a56463d2cfb87",
"md5": "e06eaa0c0be90732cd8c78f09ea0efd2",
"sha256": "621a9eee8eda07a4f76ba7bfaae47641a105b0fe25fe80dc140158af1938867b"
},
"downloads": -1,
"filename": "tpch_datagen-0.0.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e06eaa0c0be90732cd8c78f09ea0efd2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 7726,
"upload_time": "2024-10-08T21:11:33",
"upload_time_iso_8601": "2024-10-08T21:11:33.406785Z",
"url": "https://files.pythonhosted.org/packages/b2/ec/08a97d0b8407e44ca5651db195690d0b57eda29d3d33fa2a56463d2cfb87/tpch_datagen-0.0.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "24cc221c472423113408ce0ebb86159b2d37488c3e6212168cadb0380370ccce",
"md5": "6d2e1e7dce33bf7c53ffe55522da4ceb",
"sha256": "071e686610a3cfce4c8bc0542cbdc740b245e3cb73539b75fd8f8bd5db48fe18"
},
"downloads": -1,
"filename": "tpch_datagen-0.0.4.tar.gz",
"has_sig": false,
"md5_digest": "6d2e1e7dce33bf7c53ffe55522da4ceb",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 8168,
"upload_time": "2024-10-08T21:11:35",
"upload_time_iso_8601": "2024-10-08T21:11:35.363394Z",
"url": "https://files.pythonhosted.org/packages/24/cc/221c472423113408ce0ebb86159b2d37488c3e6212168cadb0380370ccce/tpch_datagen-0.0.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-08 21:11:35",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "gizmodata",
"github_project": "tpch-datagen",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "tpch-datagen"
}