tpch-datagen

| Field | Value |
|---|---|
| Name | tpch-datagen |
| Version | 0.0.4 |
| Summary | A package which makes it easy to generate TPC-H data in parallel with DuckDB |
| Upload time | 2024-10-08 21:11:35 |
| Requires Python | >=3.10 |
| License | Copyright 2024 Gizmo Data LLC. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. |
| Keywords | tpc, tpch, tpc-h, data, generator, datagen |

# tpch-datagen
A utility to generate TPC-H data in parallel using [DuckDB](https://duckdb.org) and multi-processing.

[<img src="https://img.shields.io/badge/GitHub-gizmodata%2Ftpch--datagen-blue.svg?logo=Github">](https://github.com/gizmodata/tpch-datagen)
[![tpch-datagen-ci](https://github.com/gizmodata/tpch-datagen/actions/workflows/ci.yml/badge.svg)](https://github.com/gizmodata/tpch-datagen/actions/workflows/ci.yml)
[![Supported Python Versions](https://img.shields.io/pypi/pyversions/tpch-datagen)](https://pypi.org/project/tpch-datagen/)
[![PyPI version](https://badge.fury.io/py/tpch-datagen.svg)](https://badge.fury.io/py/tpch-datagen)
[![PyPI Downloads](https://img.shields.io/pypi/dm/tpch-datagen.svg)](https://pypi.org/project/tpch-datagen/)

# Why?
Generating TPC-H data can be time-consuming and resource-intensive. This project speeds it up by generating the data in parallel using DuckDB and multi-processing.

# Setup (to run locally)

## Install Python package
You can install `tpch-datagen` from PyPI or from source.

### Option 1 - from PyPI
```shell
# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

pip install tpch-datagen
```

### Option 2 - from source - for development
```shell
git clone https://github.com/gizmodata/tpch-datagen

cd tpch-datagen

# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel

# Install TPC-H Datagen in editable mode with the dev extras
# (quoted so that shells such as zsh do not treat the brackets as a glob pattern)
pip install --editable ".[dev]"
```

### Note
For the following commands, if you are running from source in `--editable` mode (for development purposes), you will need to set the `PYTHONPATH` environment variable as follows:
```shell
export PYTHONPATH=$(pwd)/src
```

### Usage
Here are the options for the `tpch-datagen` command:

```shell
tpch-datagen --help
Usage: tpch-datagen [OPTIONS]

Options:
  --version / --no-version        Prints the TPC-H Datagen package version and
                                  exits.  [required]
  --scale-factor INTEGER          The TPC-H Scale Factor to use for data
                                  generation.
  --data-directory TEXT           The target output data directory to put the
                                  files into  [default: data; required]
  --work-directory TEXT           The work directory to use for data
                                  generation.  [default: /tmp; required]
  --overwrite / --no-overwrite    Can we overwrite the target directory if it
                                  already exists...  [default: no-overwrite;
                                  required]
  --num-chunks INTEGER            The number of chunks that will be generated
                                  - more chunks equals smaller memory
                                  requirements, but more files generated.
                                  [default: 10; required]
  --num-processes INTEGER         The maximum number of processes for the
                                  multi-processing pool to use for data
                                  generation.  [default: 10; required]
  --duckdb-threads INTEGER        The number of DuckDB threads to use for data
                                  generation (within each job process).
                                  [default: 1; required]
  --per-thread-output / --no-per-thread-output
                                  Controls whether to write the output to a
                                  single file or multiple files (for each
                                  process).  [default: per-thread-output;
                                  required]
  --compression-method [none|snappy|gzip|zstd]
                                  The compression method to use for the
                                  parquet files generated.  [default: zstd;
                                  required]
  --file-size-bytes TEXT          The target file size for the parquet files
                                  generated.  [default: 100m; required]
  --help                          Show this message and exit.
```
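
As a rough illustration of how the options above combine, the sketch below assembles one possible invocation. Every flag comes from the `--help` output, but the values are illustrative assumptions, not recommendations. `DRY_RUN` is a hypothetical helper variable for this sketch and is not a `tpch-datagen` option. By default the script only prints the command.

```shell
# Illustrative values only; every flag below appears in the --help output above.
set -- \
  --scale-factor 1 \
  --data-directory data \
  --work-directory /tmp \
  --num-chunks 10 \
  --num-processes 4 \
  --compression-method zstd \
  --file-size-bytes 100m \
  --overwrite

cmd="tpch-datagen $*"

# DRY_RUN is a hypothetical variable for this sketch, not part of tpch-datagen.
if [ "${DRY_RUN:-1}" = "1" ]; then
  # Preview the command without generating any data.
  echo "Would run: $cmd"
else
  tpch-datagen "$@"
fi
```

Set `DRY_RUN=0` to actually launch the run. Note that `--overwrite` allows an existing target directory to be overwritten, so check `--data-directory` first.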

> [!NOTE]   
> Default values may change depending on the number of CPU cores you have, etc.
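
Because defaults can differ between machines, one option is to pass the process-related values explicitly. The sketch below assumes that `getconf _NPROCESSORS_ONLN` reports the online core count (it does on typical Linux and macOS systems) and falls back to 1 otherwise. The variable names are hypothetical, and the script only prints the resulting command.

```shell
# Detect the number of online CPU cores; fall back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Pass the values explicitly instead of relying on machine-dependent defaults.
pinned_cmd="tpch-datagen --scale-factor 1 --num-processes $cores --duckdb-threads 1"
echo "$pinned_cmd"
```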

### Handy development commands

#### Version management

##### Bump the version of the application - (you must have installed from source with the [dev] extras)
```bash
bumpver update --patch
```

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "tpch-datagen",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "tpc, tpch, tpc-h, data, generator, datagen",
    "author": null,
    "author_email": "Philip Moore <prmoore77@hotmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/24/cc/221c472423113408ce0ebb86159b2d37488c3e6212168cadb0380370ccce/tpch_datagen-0.0.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Copyright 2024 Gizmo Data LLC  Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at  http://www.apache.org/licenses/LICENSE-2.0  Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.",
    "summary": "A package which makes it easy to generate TPC-H data in parallel with DuckDB",
    "version": "0.0.4",
    "project_urls": {
        "Homepage": "https://github.com/gizmodata/tpch-datagen"
    },
    "split_keywords": [
        "tpc",
        " tpch",
        " tpc-h",
        " data",
        " generator",
        " datagen"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b2ec08a97d0b8407e44ca5651db195690d0b57eda29d3d33fa2a56463d2cfb87",
                "md5": "e06eaa0c0be90732cd8c78f09ea0efd2",
                "sha256": "621a9eee8eda07a4f76ba7bfaae47641a105b0fe25fe80dc140158af1938867b"
            },
            "downloads": -1,
            "filename": "tpch_datagen-0.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e06eaa0c0be90732cd8c78f09ea0efd2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 7726,
            "upload_time": "2024-10-08T21:11:33",
            "upload_time_iso_8601": "2024-10-08T21:11:33.406785Z",
            "url": "https://files.pythonhosted.org/packages/b2/ec/08a97d0b8407e44ca5651db195690d0b57eda29d3d33fa2a56463d2cfb87/tpch_datagen-0.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "24cc221c472423113408ce0ebb86159b2d37488c3e6212168cadb0380370ccce",
                "md5": "6d2e1e7dce33bf7c53ffe55522da4ceb",
                "sha256": "071e686610a3cfce4c8bc0542cbdc740b245e3cb73539b75fd8f8bd5db48fe18"
            },
            "downloads": -1,
            "filename": "tpch_datagen-0.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "6d2e1e7dce33bf7c53ffe55522da4ceb",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 8168,
            "upload_time": "2024-10-08T21:11:35",
            "upload_time_iso_8601": "2024-10-08T21:11:35.363394Z",
            "url": "https://files.pythonhosted.org/packages/24/cc/221c472423113408ce0ebb86159b2d37488c3e6212168cadb0380370ccce/tpch_datagen-0.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-08 21:11:35",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "gizmodata",
    "github_project": "tpch-datagen",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "tpch-datagen"
}
        