# data product processor
The data product processor is a library for dynamically creating and executing Apache Spark Jobs based on a declarative description of a data product.
The declaration is based on YAML and covers input and output data stores as well as data structures. It can be augmented with custom, PySpark-based transformation logic.
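As an illustration of what such custom transformation logic can look like, the sketch below shows a plain PySpark function that filters rows and derives a column on a DataFrame. The function name, signature, column names, and the way it is wired into the YAML declaration are assumptions for illustration only; the authoritative contract is described in the data product specification and the transformation-logic tutorial linked below.

```python
# Illustrative sketch only: the actual hook signature expected by the
# data-product-processor is defined in the data product specification.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F


def transform(df: DataFrame) -> DataFrame:
    """Hypothetical transformation: keep completed orders and derive a net amount."""
    return (
        df.filter(F.col("status") == "COMPLETED")            # assumed input column
          .withColumn("net_amount",
                      F.col("gross_amount") - F.col("tax"))   # assumed input columns
    )
```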
## Installation
**Prerequisites**
- Python 3.x
- Apache Spark 3.x
**Install with pip**
```commandline
pip install data-product-processor
```
## Getting started
### Declare a basic data product
Please see the [Data product specification](docs/data-product-specification.md) for an overview of the files required to declare a data product.
### Process the data product
From the folder in which the previously created files are stored, run the data-product-processor as follows:
```commandline
data-product-processor \
--default_data_lake_bucket some-datalake-bucket \
--aws_profile some-profile \
--aws_region eu-central-1 \
--local
```
This command runs Apache Spark locally (due to the `--local` switch) and stores the output in an S3 bucket, authenticating with the AWS profile passed via `--aws_profile`.
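Before launching the job, it can help to confirm that the chosen profile can actually reach the target bucket. The snippet below is an optional sanity check using boto3, not part of the data-product-processor itself; the profile, region, and bucket names are the placeholder values from the example above.

```python
# Optional sanity check (not part of data-product-processor):
# verify that the AWS profile can access the target data lake bucket.
import boto3

session = boto3.Session(profile_name="some-profile", region_name="eu-central-1")
s3 = session.client("s3")
s3.head_bucket(Bucket="some-datalake-bucket")  # raises ClientError if unreachable
print("Bucket is reachable with this profile.")
```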
If you want to run the library from a folder other than the one containing the data product declaration, reference the latter through the additional argument `--product_path`.
```commandline
data-product-processor \
--product_path ../path-to-some-data-product \
--default_data_lake_bucket some-datalake-bucket \
--aws_profile some-profile \
--aws_region eu-central-1 \
--local
```
## CLI Arguments
```commandline
data-product-processor --help
--JOB_ID - the unique id of this Glue/EMR job
--JOB_RUN_ID - the unique id of this Glue job run
--JOB_NAME - the name of this Glue job
 --job-bookmark-option - set to job-bookmark-disable if you don't want bookmarking
 --TempDir - temporary results directory
--product_path - the data product definition folder
--aws_profile - the AWS profile to be used for connection
--aws_region - the AWS region to be used
 --local - run Spark locally (local development mode)
--jars - extra jars to be added to the Spark context
 --additional-python-modules - injected by Glue; currently not in use
--default_data_lake_bucket - a default bucket location (with s3a:// prefix)
```
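The upper-case arguments (`--JOB_ID`, `--JOB_RUN_ID`, `--JOB_NAME`, `--TempDir`, `--job-bookmark-option`) follow the convention AWS Glue uses when injecting job parameters, while the lower-case ones are specific to this tool. The sketch below is only a rough illustration of how such a mixed argument set could be parsed with `argparse`; it is not the parser used inside data-product-processor.

```python
# Illustrative only: a rough sketch of parsing the arguments listed above.
# The real data-product-processor defines its own argument handling internally.
import argparse

parser = argparse.ArgumentParser(description="data product processor (sketch)")
# Parameters injected by AWS Glue
parser.add_argument("--JOB_ID")
parser.add_argument("--JOB_RUN_ID")
parser.add_argument("--JOB_NAME")
parser.add_argument("--job-bookmark-option", dest="job_bookmark_option")
parser.add_argument("--TempDir", dest="temp_dir")
# Tool-specific parameters
parser.add_argument("--product_path", default=".")  # defaults to the current folder, as described above
parser.add_argument("--aws_profile")
parser.add_argument("--aws_region")
parser.add_argument("--local", action="store_true")
parser.add_argument("--jars")
parser.add_argument("--default_data_lake_bucket")

args, _ = parser.parse_known_args()
```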
## References
- [Data product specification](docs/data-product-specification.md)
- [Access management](docs/access-management.md)
## Tutorials
- [How to write and test custom transformation logic?](docs/how-to/transformation-logic.md)
- [How to reference custom Spark dependencies?](docs/how-to/custom-dependencies.md)
- [How to set up local development?](docs/how-to/local-development.md)