data-product-processor

Name: data-product-processor
Version: 1.0.3
Home page: https://github.com/aws-samples/dpac-data-product-processor
Summary: The data product processor (dpp) is a library for dynamically creating and executing Apache Spark Jobs based on a declarative description of a data product.
Upload time: 2023-01-24 22:25:02
Author: Amazon Web Services
License: Apache License 2.0
Requirements: boto3==1.18.34, botocore, wheel==0.38.1, pyyaml==5.4.1, pydantic, quinn, boto3-stubs==1.18.34, mypy-boto3-glue==1.18.34, jsonschema==3.0.2, pyspark==3.2.0, numpy==1.22.1
# data product processor

The data product processor is a library for dynamically creating and executing Apache Spark Jobs based on a declarative description of a data product.

The declaration is based on YAML and covers input and output data stores as well as data structures. It can be augmented with custom, PySpark-based transformation logic.
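
To make the declarative approach concrete, here is a minimal sketch that parses such a YAML declaration with PyYAML and applies a PySpark transformation of the kind the custom logic typically contains. The YAML keys (`id`, `inputs`, `outputs`, `logic`) and the `transform` function are illustrative assumptions, not the actual schema or plugin interface; the real file layout is defined in the [Data product specification](docs/data-product-specification.md).

```python
# A minimal sketch, assuming PyYAML and PySpark (both project dependencies).
# The YAML keys below are illustrative only -- the real schema is defined
# in docs/data-product-specification.md.
import yaml
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

declaration = yaml.safe_load("""
id: customers_curated            # hypothetical product id
inputs:
  - connection: raw_zone         # hypothetical input data store
    table: customers
outputs:
  - location: s3a://some-datalake-bucket/customers_curated
logic: transform                 # hypothetical custom transformation hook
""")

def transform(df: DataFrame) -> DataFrame:
    """Illustrative PySpark transformation such custom logic might apply."""
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

if __name__ == "__main__":
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    transform(df).show()
    print(declaration["id"])
```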

## Installation
**Prerequisites**  
- Python 3.x
- Apache Spark 3.x

**Install with pip**
```commandline
pip install data-product-processor
```

## Getting started
### Declare a basic data product
Please see [Data product specification](docs/data-product-specification.md) for an overview of the files required to declare a data product.

### Process the data product
From the folder in which the previously created files are stored, run the data-product-processor as follows:

```commandline
data-product-processor \
  --default_data_lake_bucket some-datalake-bucket \
  --aws_profile some-profile \
  --aws_region eu-central-1 \
  --local
```
This command runs Apache Spark locally (because of the `--local` switch) and stores the output in an S3 bucket, authenticating with the AWS profile given in the `--aws_profile` parameter.
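
If the referenced profile does not exist yet, it can be created with the AWS CLI, which prompts for the access keys and the default region (`some-profile` is the placeholder name used above):

```commandline
aws configure --profile some-profile
```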

If you want to run the library from a different folder than the data product declaration, reference the latter through the additional argument `--product_path`.
```commandline
data-product-processor \
  --product_path ../path-to-some-data-product \
  --default_data_lake_bucket some-datalake-bucket \
  --aws_profile some-profile \
  --aws_region eu-central-1 \
  --local
```

## CLI Arguments
```commandline
data-product-processor --help

  --JOB_ID - the unique id of this Glue/EMR job
  --JOB_RUN_ID - the unique id of this Glue job run
  --JOB_NAME - the name of this Glue job
  --job-bookmark-option - job-bookmark-disable if you don't want bookmarking
  --TempDir - temporary results directory
  --product_path - the data product definition folder
  --aws_profile - the AWS profile to be used for connection
  --aws_region - the AWS region to be used
  --local - local development
  --jars - extra jars to be added to the Spark context
  --additional-python-modules - injected by Glue; currently not in use
  --default_data_lake_bucket - a default bucket location (with s3a:// prefix)
```
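
When the job runs on AWS Glue rather than locally, Glue injects `JOB_ID`, `JOB_RUN_ID` and `JOB_NAME` itself, so only the custom arguments need to be supplied at job start. Below is a minimal sketch using boto3 (one of the library's dependencies); the job name and argument values are assumptions, and the Glue job wrapping the data-product-processor must already exist in your account:

```python
# Hedged sketch: starts an existing Glue job that wraps the data-product-processor.
# "dpp-some-product" is a hypothetical job name; Glue injects JOB_ID, JOB_RUN_ID
# and JOB_NAME itself, so only the custom arguments are passed here.
import boto3

glue = boto3.client("glue", region_name="eu-central-1")

response = glue.start_job_run(
    JobName="dpp-some-product",  # hypothetical; must exist in your account
    Arguments={
        "--product_path": "path-to-some-data-product",
        "--default_data_lake_bucket": "s3a://some-datalake-bucket",
    },
)
print(response["JobRunId"])
```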
## References
- [Data product specification](docs/data-product-specification.md)
- [Access management](docs/access-management.md)

## Tutorials
- [How to write and test custom transformation logic?](docs/how-to/transformation-logic.md)
- [How to reference custom Spark dependencies?](docs/how-to/custom-dependencies.md)
- [How to set up local development?](docs/how-to/local-development.md)

            
