# Demo PySpark Delta Table SCD2 implementation
[![Python package](https://github.com/spsoni/pyspark-delta-scd2/actions/workflows/python-package.yml/badge.svg)](https://github.com/spsoni/pyspark-delta-scd2/actions/workflows/python-package.yml)
[![CodeQL](https://github.com/spsoni/pyspark-delta-scd2/actions/workflows/codeql.yml/badge.svg)](https://github.com/spsoni/pyspark-delta-scd2/actions/workflows/codeql.yml)
This project uses `faker-pyspark` to generate a random schema and dataframes that mimic snapshots of a data table.
These snapshots are then processed with the SCD2 (slowly changing dimension, type 2) pattern and written to a Delta table as the destination.
Source of Inspiration for SCD2 pattern: https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/BDB-2547/glue/scd-deltalake-employee-etl-job.py
## Installation
Install with pip:
``` bash
pip install pyspark-delta-scd2 delta-spark faker-pyspark
```
Please note that this package does not enforce specific versions of delta-spark, PySpark, or faker-pyspark.
Enforced versions would conflict with the versions already provided by a target environment such as AWS Glue, so install the versions that match your environment.
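Because the versions are not pinned, it can help to check what the target environment already provides before installing anything. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


# Report the three dependencies this README mentions.
report = installed_versions(["pyspark", "delta-spark", "faker-pyspark"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this inside the target environment (for example, an AWS Glue job) shows which versions are already present, so that only the missing packages need to be installed.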
### Generate incremental updates to a dataframe and apply SCD2
``` python
>>> from pyspark_delta_scd2 import get_spark, PySparkDeltaScd2
>>> spark = get_spark()
>>> demo = PySparkDeltaScd2(spark=spark)
>>> # initial load
>>> df1 = demo.process()
>>> # incremental update
>>> df2 = demo.process()
>>> # df2 should have some deletes, updates and inserts
```
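To make the SCD2 idea concrete, here is a small pure-Python sketch of how two consecutive snapshots can be merged into a history that keeps every version of a row. It is an illustration only and is not this package's implementation, which works on Spark dataframes and a Delta table. The function `apply_scd2` and the bookkeeping columns `is_current`, `effective_from`, and `effective_to` are assumptions chosen for the example.

```python
from datetime import datetime
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical bookkeeping columns added to every history row.
SCD2_COLUMNS = {"is_current", "effective_from", "effective_to"}


def open_version(row: Row, now: datetime) -> Row:
    """Return a new current version of a snapshot row."""
    return {**row, "is_current": True, "effective_from": now, "effective_to": None}


def apply_scd2(history: List[Row], snapshot: List[Row], key: str, now: datetime) -> List[Row]:
    """Merge a full snapshot into an SCD2 history and return the new history."""
    incoming = {row[key]: row for row in snapshot}
    result: List[Row] = []
    current_keys = set()

    for record in history:
        if not record["is_current"]:
            result.append(record)  # closed versions are never modified
            continue
        current_keys.add(record[key])
        new = incoming.get(record[key])
        tracked = {c: v for c, v in record.items() if c not in SCD2_COLUMNS}
        if new is not None and new == tracked:
            result.append(record)  # unchanged row stays current
            continue
        # Update or delete: close the current version.
        result.append({**record, "is_current": False, "effective_to": now})
        if new is not None:
            result.append(open_version(new, now))  # update: open the new version

    # Insert: keys that have no current version in the history.
    for k, new in incoming.items():
        if k not in current_keys:
            result.append(open_version(new, now))
    return result


# Two snapshots: id 1 is updated, id 2 is deleted, id 3 is inserted.
t1, t2 = datetime(2023, 6, 1), datetime(2023, 6, 2)
history = apply_scd2([], [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], "id", t1)
history = apply_scd2(history, [{"id": 1, "name": "a2"}, {"id": 3, "name": "c"}], "id", t2)
for row in history:
    print(row)
```

In this sketch a delete only closes the current version, and an update closes it and opens a new one, so no earlier state is ever overwritten. A key that reappears after being deleted is treated as a fresh insert.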
Raw data
{
"_id": null,
"home_page": "https://github.com/spsoni/pyspark-delta-scd2",
"name": "pyspark-delta-scd2",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8.1,<4.0.0",
"maintainer_email": "",
"keywords": "Faker, PySpark,DeltaTable",
"author": "Sury Soni",
"author_email": "github@suryasoni.info",
"download_url": "https://files.pythonhosted.org/packages/c8/1f/df9589bbf6823c1e9d3e6a990f0958efe85e290a6fe96c4bf4d75a6b018b/pyspark_delta_scd2-0.4.1.tar.gz",
"platform": null,
"description": "# Demo PySpark Delta Table SCD2 implementation\n\n[![Python package](https://github.com/spsoni/pyspark-delta-scd2/actions/workflows/python-package.yml/badge.svg)](https://github.com/spsoni/pyspark-delta-scd2/actions/workflows/python-package.yml)\n[![CodeQL](https://github.com/spsoni/pyspark-delta-scd2/actions/workflows/codeql.yml/badge.svg)](https://github.com/spsoni/pyspark-delta-scd2/actions/workflows/codeql.yml)\n\nThis project utilizes `faker-pyspark` to generate random schema and dataframes to mimic data table snapshots.\n\nUsing these snapshots to process and apply SCD2 pattern into delta table as the destination. \n\nSource of Inspiration for SCD2 pattern: https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/BDB-2547/glue/scd-deltalake-employee-etl-job.py \n\n## Installation\n\nInstall with pip:\n\n``` bash\npip install pyspark-delta-scd2 delta-spark faker-pyspark\n\n```\n\nPlease note, this package do not enforce version of delta-spark, PySpark and faker-pyspark.\n\nWhen you want to use this example in AWS Glue environment, enforced versions conflict with the target environment.\n\n### Generate incremental updates to dataframe and apply scd2\n\n``` python\n>>> from pyspark_delta_scd2 import get_spark, PySparkDeltaScd2\n>>> spark = get_spark()\n>>> demo = PySparkDeltaScd2(spark=spark)\n>>> # initial load\n>>> df1 = demo.process()\n>>> # incremental update\n>>> df2 = demo.process()\n>>> # df2 should have some deletes, updates and inserts\n\n```\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "This project utilizes faker-pyspark to generate random schema and dataframes to mimic data table snapshots. Using these snapshots to process and apply SCD2 pattern into delta table as the destination.",
"version": "0.4.1",
"project_urls": {
"Homepage": "https://github.com/spsoni/pyspark-delta-scd2",
"Repository": "https://github.com/spsoni/pyspark-delta-scd2"
},
"split_keywords": [
"faker",
" pyspark",
"deltatable"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "3077679eb89278d089c7af244c012da2cb3cb4a5947c41d77a207a00ba5b6f92",
"md5": "5b78c8b39bff60050d004c8eb52cf9d0",
"sha256": "6d6ee1940e819793d7f6c60ef94e98fc23758d8df684715f41448378af98ed1f"
},
"downloads": -1,
"filename": "pyspark_delta_scd2-0.4.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5b78c8b39bff60050d004c8eb52cf9d0",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8.1,<4.0.0",
"size": 5775,
"upload_time": "2023-06-22T10:27:29",
"upload_time_iso_8601": "2023-06-22T10:27:29.949584Z",
"url": "https://files.pythonhosted.org/packages/30/77/679eb89278d089c7af244c012da2cb3cb4a5947c41d77a207a00ba5b6f92/pyspark_delta_scd2-0.4.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c81fdf9589bbf6823c1e9d3e6a990f0958efe85e290a6fe96c4bf4d75a6b018b",
"md5": "9788152593d247d03ef39cb20d79b99a",
"sha256": "34f4a616050b9e3ddc9117f22c5013c45a8a0a8014eb56694b233aee79cec91b"
},
"downloads": -1,
"filename": "pyspark_delta_scd2-0.4.1.tar.gz",
"has_sig": false,
"md5_digest": "9788152593d247d03ef39cb20d79b99a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8.1,<4.0.0",
"size": 4767,
"upload_time": "2023-06-22T10:27:31",
"upload_time_iso_8601": "2023-06-22T10:27:31.486604Z",
"url": "https://files.pythonhosted.org/packages/c8/1f/df9589bbf6823c1e9d3e6a990f0958efe85e290a6fe96c4bf4d75a6b018b/pyspark_delta_scd2-0.4.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-22 10:27:31",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "spsoni",
"github_project": "pyspark-delta-scd2",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pyspark-delta-scd2"
}