tpch-datagen


Name: tpch-datagen
Version: 0.0.8
Summary: A package which makes it easy to generate TPC-H data in parallel with DuckDB
Upload time: 2025-10-08 19:32:24
Requires Python: >=3.10
Keywords: tpc, tpch, tpc-h, data, generator, datagen

# tpch-datagen - by [GizmoData](https://gizmodata.com)™
A utility to generate TPC-H data in parallel using [DuckDB](https://duckdb.org) and multi-processing.

[<img src="https://img.shields.io/badge/GitHub-gizmodata%2Ftpch--datagen-blue.svg?logo=Github">](https://github.com/gizmodata/tpch-datagen)
[![tpch-datagen-ci](https://github.com/gizmodata/tpch-datagen/actions/workflows/ci.yml/badge.svg)](https://github.com/gizmodata/tpch-datagen/actions/workflows/ci.yml)
[![Supported Python Versions](https://img.shields.io/pypi/pyversions/tpch-datagen)](https://pypi.org/project/tpch-datagen/)
[![PyPI version](https://badge.fury.io/py/tpch-datagen.svg)](https://badge.fury.io/py/tpch-datagen)
[![PyPI Downloads](https://img.shields.io/pypi/dm/tpch-datagen.svg)](https://pypi.org/project/tpch-datagen/)

# Why?
Generating TPC-H data can be time-consuming and resource-intensive. This project addresses that by generating the data in parallel using DuckDB and multi-processing.

# Setup (to run locally)

## Install Python package
You can install `tpch-datagen` from PyPI or from source.

### Option 1 - from PyPI
```shell
# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

pip install tpch-datagen
```

### Option 2 - from source - for development
```shell
git clone https://github.com/gizmodata/tpch-datagen

cd tpch-datagen

# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel

# Install TPC-H Datagen in editable mode with the dev extras
# (quoted so that shells such as zsh do not expand the brackets)
pip install --editable ".[dev]"
```

### Note
For the following commands: if you are running from source in `--editable` mode (for development purposes), you will need to set the `PYTHONPATH` environment variable as follows:
```shell
export PYTHONPATH=$(pwd)/src
```
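
Whichever install option you used, a quick check like the one below can confirm that the command is reachable before you try to generate data. It is only a sketch: the `INSTALL_STATUS` variable and the messages are illustrative, and the only `tpch-datagen` flag used is `--version`, which is listed under Usage.

```shell
# Check that the tpch-datagen console script is reachable in this shell.
if command -v tpch-datagen >/dev/null 2>&1; then
  INSTALL_STATUS="found"
  # --version prints the package version and exits.
  tpch-datagen --version
else
  INSTALL_STATUS="missing"
  echo "tpch-datagen not found: activate .venv or re-run pip install." >&2
fi

echo "tpch-datagen: ${INSTALL_STATUS}"
```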

### Usage
Here are the options for the `tpch-datagen` command:

```shell
tpch-datagen --help
Usage: tpch-datagen [OPTIONS]

Options:
  --version / --no-version        Prints the TPC-H Datagen package version and
                                  exits.  [required]
  --scale-factor INTEGER          The TPC-H Scale Factor to use for data
                                  generation.
  --data-directory TEXT           The target output data directory to put the
                                  files into  [default: data; required]
  --work-directory TEXT           The work directory to use for data
                                  generation.  [default: /tmp; required]
  --overwrite / --no-overwrite    Can we overwrite the target directory if it
                                  already exists...  [default: no-overwrite;
                                  required]
  --num-chunks INTEGER            The number of chunks that will be generated
                                  - more chunks equals smaller memory
                                  requirements, but more files generated.
                                  [default: 10; required]
  --num-processes INTEGER         The maximum number of processes for the
                                  multi-processing pool to use for data
                                  generation.  [default: 10; required]
  --duckdb-threads INTEGER        The number of DuckDB threads to use for data
                                  generation (within each job process).
                                  [default: 1; required]
  --per-thread-output / --no-per-thread-output
                                  Controls whether to write the output to a
                                  single file or multiple files (for each
                                  process).  [default: per-thread-output;
                                  required]
  --compression-method [none|snappy|gzip|zstd]
                                  The compression method to use for the
                                  parquet files generated.  [default: zstd;
                                  required]
  --file-size-bytes TEXT          The target file size for the parquet files
                                  generated.  [default: 100m; required]
  --parquet-version [v1|v2]       The version of Parquet to use for the
                                  parquet files generated.  [default: v2;
                                  required]
  --help                          Show this message and exit.
```

> [!NOTE]
> The default values shown above may differ on your machine, since some of them depend on factors such as the number of CPU cores available.
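
As a rough sketch of how these options fit together, the following run generates a small dataset and caps the process pool at the number of CPU cores. The scale factor, chunk count, and output path are arbitrary example values, and the sizing rule (no more processes than cores or chunks) is an assumption rather than project guidance. Only flags listed in the help output above are used.

```shell
# Example values only; pick ones that suit your machine and disk.
SCALE_FACTOR=1
NUM_CHUNKS=20
DATA_DIR="data/tpch/sf${SCALE_FACTOR}"   # hypothetical output path

# Detect the number of online CPU cores; fall back to 1 if unavailable.
CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
case "$CORES" in
  ''|*[!0-9]*) CORES=1 ;;
esac

# Assumption: more processes than cores or than chunks gives no benefit.
NUM_PROCESSES=$CORES
if [ "$NUM_PROCESSES" -gt "$NUM_CHUNKS" ]; then
  NUM_PROCESSES=$NUM_CHUNKS
fi

if command -v tpch-datagen >/dev/null 2>&1; then
  tpch-datagen \
    --scale-factor "$SCALE_FACTOR" \
    --data-directory "$DATA_DIR" \
    --work-directory /tmp \
    --num-chunks "$NUM_CHUNKS" \
    --num-processes "$NUM_PROCESSES" \
    --duckdb-threads 1 \
    --compression-method zstd
else
  echo "tpch-datagen is not on PATH; activate the virtual environment first." >&2
fi
```

`--overwrite` is left out on purpose, so the default `no-overwrite` behavior applies if the target directory already exists.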

### Handy development commands

#### Version management

##### Bump the version of the application (you must have installed from source with the `[dev]` extras)
```bash
bumpver update --patch
```

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "tpch-datagen",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "tpc, tpch, tpc-h, data, generator, datagen",
    "author": null,
    "author_email": "Philip Moore <philip@gizmodata.com>",
    "download_url": "https://files.pythonhosted.org/packages/fe/e1/c57cc6f8693710c71f6cfa4424cde2e01f3fe6aa10340de00a5f5019fe04/tpch_datagen-0.0.8.tar.gz",
    "platform": null,
    "description": "# tpch-datagen - by [GizmoData](https://gizmodata.com)\u2122\nA utility to generate TPC-H data in parallel using [DuckDB](https://duckdb.org) and multi-processing\n\n[<img src=\"https://img.shields.io/badge/GitHub-gizmodata%2Ftpch--datagen-blue.svg?logo=Github\">](https://github.com/gizmodata/tpch-datagen)\n[![tpch-datagen-ci](https://github.com/gizmodata/tpch-datagen/actions/workflows/ci.yml/badge.svg)](https://github.com/gizmodata/tpch-datagen/actions/workflows/ci.yml)\n[![Supported Python Versions](https://img.shields.io/pypi/pyversions/tpch-datagen)](https://pypi.org/project/tpch-datagen/)\n[![PyPI version](https://badge.fury.io/py/tpch-datagen.svg)](https://badge.fury.io/py/tpch-datagen)\n[![PyPI Downloads](https://img.shields.io/pypi/dm/tpch-datagen.svg)](https://pypi.org/project/tpch-datagen/)\n\n# Why?\nBecause generating TPC-H data can be time-consuming and resource-intensive.  This project provides a way to generate TPC-H data in parallel using DuckDB and multi-processing.\n\n# Setup (to run locally)\n\n## Install Python package\nYou can install `tpch-datagen` from PyPi or from source.\n\n### Option 1 - from PyPi\n```shell\n# Create the virtual environment\npython3 -m venv .venv\n\n# Activate the virtual environment\n. .venv/bin/activate\n\npip install tpch-datagen\n```\n\n### Option 2 - from source - for development\n```shell\ngit clone https://github.com/gizmodata/tpch-datagen\n\ncd tpch-datagen\n\n# Create the virtual environment\npython3 -m venv .venv\n\n# Activate the virtual environment\n. 
.venv/bin/activate\n\n# Upgrade pip, setuptools, and wheel\npip install --upgrade pip setuptools wheel\n\n# Install TPC-H Datagen - in editable mode with client and dev dependencies\npip install --editable .[dev]\n```\n\n### Note\nFor the following commands - if you running from source and using `--editable` mode (for development purposes) - you will need to set the PYTHONPATH environment variable as follows:\n```shell\nexport PYTHONPATH=$(pwd)/src\n```\n\n### Usage\nHere are the options for the `tpch-datagen` command:\n\n```shell\ntpch-datagen --help\nUsage: tpch-datagen [OPTIONS]\n\nOptions:\n  --version / --no-version        Prints the TPC-H Datagen package version and\n                                  exits.  [required]\n  --scale-factor INTEGER          The TPC-H Scale Factor to use for data\n                                  generation.\n  --data-directory TEXT           The target output data directory to put the\n                                  files into  [default: data; required]\n  --work-directory TEXT           The work directory to use for data\n                                  generation.  [default: /tmp; required]\n  --overwrite / --no-overwrite    Can we overwrite the target directory if it\n                                  already exists...  [default: no-overwrite;\n                                  required]\n  --num-chunks INTEGER            The number of chunks that will be generated\n                                  - more chunks equals smaller memory\n                                  requirements, but more files generated.\n                                  [default: 10; required]\n  --num-processes INTEGER         The maximum number of processes for the\n                                  multi-processing pool to use for data\n                                  generation.  
[default: 10; required]\n  --duckdb-threads INTEGER        The number of DuckDB threads to use for data\n                                  generation (within each job process).\n                                  [default: 1; required]\n  --per-thread-output / --no-per-thread-output\n                                  Controls whether to write the output to a\n                                  single file or multiple files (for each\n                                  process).  [default: per-thread-output;\n                                  required]\n  --compression-method [none|snappy|gzip|zstd]\n                                  The compression method to use for the\n                                  parquet files generated.  [default: zstd;\n                                  required]\n  --file-size-bytes TEXT          The target file size for the parquet files\n                                  generated.  [default: 100m; required]\n  --parquet-version [v1|v2]       The version of Parquet to use for the\n                                  parquet files generated.  [default: v2;\n                                  required]\n  --help                          Show this message and exit.\n```\n\n> [!NOTE]   \n> Default values may change depending on the number of CPU cores you have, etc.\n\n### Handy development commands\n\n#### Version management\n\n##### Bump the version of the application - (you must have installed from source with the [dev] extras)\n```bash\nbumpver update --patch\n```\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A package which makes it easy to generate TPC-H data in parallel with DuckDB",
    "version": "0.0.8",
    "project_urls": {
        "Homepage": "https://github.com/gizmodata/tpch-datagen"
    },
    "split_keywords": [
        "tpc",
        " tpch",
        " tpc-h",
        " data",
        " generator",
        " datagen"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "47f95bb85955bf086306d424d5390320af67c997e256192f6228c08e9b33e170",
                "md5": "ea40f7540c9372712af3d9a4ae4935da",
                "sha256": "37e2e023e878be96eb67f38f34de8e14f219412adb1612b70f693aa027816c98"
            },
            "downloads": -1,
            "filename": "tpch_datagen-0.0.8-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ea40f7540c9372712af3d9a4ae4935da",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 7558,
            "upload_time": "2025-10-08T19:32:23",
            "upload_time_iso_8601": "2025-10-08T19:32:23.084511Z",
            "url": "https://files.pythonhosted.org/packages/47/f9/5bb85955bf086306d424d5390320af67c997e256192f6228c08e9b33e170/tpch_datagen-0.0.8-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "fee1c57cc6f8693710c71f6cfa4424cde2e01f3fe6aa10340de00a5f5019fe04",
                "md5": "7a044be8e9e790ae7e383e98ebbbc4b1",
                "sha256": "9c2ab22fa6f65d355faff928a4edfaff9d5447a921ff292fb0bff2318438fb1b"
            },
            "downloads": -1,
            "filename": "tpch_datagen-0.0.8.tar.gz",
            "has_sig": false,
            "md5_digest": "7a044be8e9e790ae7e383e98ebbbc4b1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 7967,
            "upload_time": "2025-10-08T19:32:24",
            "upload_time_iso_8601": "2025-10-08T19:32:24.067687Z",
            "url": "https://files.pythonhosted.org/packages/fe/e1/c57cc6f8693710c71f6cfa4424cde2e01f3fe6aa10340de00a5f5019fe04/tpch_datagen-0.0.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-08 19:32:24",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "gizmodata",
    "github_project": "tpch-datagen",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "tpch-datagen"
}
        