Name | lakehouse-engine JSON |
Version |
1.23.0
JSON |
| download |
home_page | None |
Summary | A configuration-driven Spark framework serving as the engine for several lakehouse algorithms and data flows. |
upload_time | 2024-10-28 14:58:58 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.11 |
license | Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright 2023 adidas AG Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. |
keywords |
framework
big-data
spark
databricks
data-quality
data-engineering
great-expectations
lakehouse
delta-lake
configuration-driver
|
VCS |
|
bugtrack_url |
|
requirements |
No requirements were recorded.
|
Travis-CI |
No Travis.
|
coveralls test coverage |
No coveralls.
|
<img align="right" src="assets/img/lakehouse_engine_logo_no_bg_160.png" alt="Lakehouse Engine Logo">
# Lakehouse Engine
A configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.
---
> ***Note:*** whenever you read Data Product or Data Product team, we want to refer to Teams and use cases, whose main focus is on
leveraging the power of data, on a particular topic, end-to-end (ingestion, consumption...) to achieve insights, supporting faster and better decisions,
which generate value for their businesses. These Teams should not be focusing on building reusable frameworks, but on re-using the existing frameworks to achieve their goals.
---
## Main Goals
The goal of the Lakehouse Engine is to bring some advantages, such as:
- offer cutting-edge, standard, governed and battle-tested foundations that several Data Product teams can benefit from;
- avoid that Data Product teams develop siloed solutions, reducing technical debts and high operating costs (redundant developments across teams);
- allow Data Product teams to focus mostly on data-related tasks, avoiding wasting time & resources on developing the same code for different use cases;
- benefit from the fact that many teams are reusing the same code, which increases the likelihood that common issues are surfaced and solved faster;
- decrease the dependency and learning curve to Spark and other technologies that the Lakehouse Engine abstracts;
- speed up repetitive tasks;
- reduced vendor lock-in.
---
> ***Note:*** even though you will see a focus on AWS and Databricks, this is just due to the lack of use cases for other technologies like GCP and Azure, but we are open for contribution.
---
## Key Features
⭐ **Data Loads:** perform data loads from diverse source types and apply transformations and data quality validations,
ensuring trustworthy data, before integrating it into distinct target types. Additionally, people can also define termination
actions like optimisations or notifications. [On the usage section](#load-data-usage-example) you will find an example using all the supported keywords for data loads.
---
> ***Note:*** The Lakehouse
Engine supports different types of sources and targets, such as, kafka, jdbc, dataframes, files (csv, parquet, json, delta...), sftp, sap bw, sap b4...
---
⭐ **Transformations:** configuration driven transformations without the need to write any spark code. Transformations can be applied by using the `transform_specs` in the Data Loads.
---
> ***Note:*** you can search all the available transformations, as well as checking implementation details and examples [here](https://adidas.github.io/lakehouse-engine-docs/lakehouse_engine/transformers.html).
---
⭐ **Data Quality Validations:** the Lakehouse Engine uses Great Expectations as a backend and abstracts any implementation
details by offering people the capability to specify what validations to apply on the data, solely using dict/json based configurations.
The Data Quality validations can be applied on:
- post-mortem (static) data, using the DQ Validator algorithm (`execute_dq_validation`)
- data in-motion, using the `dq_specs` keyword in the Data Loads, to add it as one more step while loading data.
[On the usage section](#load-data-usage-example) you will find an example using this type of Data Quality validations.
⭐ **Reconciliation:** useful algorithm to compare two source of data, by defining one version of the `truth` to compare
against the `current` version of the data. It can be particularly useful during migrations phases, two compare a few KPIs
and ensure the new version of a table (`current`), for example, delivers the same vision of the data as the old one (`truth`).
Find usage examples [here](lakehouse_engine_usage/reconciliator.html).
⭐ **Sensors:** an abstraction to otherwise complex spark code that can be executed in very small single-node clusters
to check if an upstream system or Data Product contains new data since the last execution. With this feature, people can
trigger jobs to run in more frequent intervals and if the upstream does not contain new data, then the rest of the job
exits without creating bigger clusters to execute more intensive data ETL (Extraction, Transformation, and Loading).
Find usage examples [here](lakehouse_engine_usage/sensor.html).
⭐ **Terminators:** this feature allow people to specify what to do as a last action, before finishing a Data Load.
Some examples of actions are: optimising target table, vacuum, compute stats, expose change data feed to external location
or even send e-mail notifications. Thus, it is specifically used in Data Loads, using the `terminate_specs` keyword.
[On the usage section](#load-data-usage-example) you will find an example using terminators.
⭐ **Table Manager:** function `manage_table`, offers a set of actions to manipulate tables/views in several ways, such as:
- compute table statistics;
- create/drop tables and views;
- delete/truncate/repair tables;
- vacuum delta tables or locations;
- optimize table;
- describe table;
- show table properties;
- execute sql.
⭐ **File Manager:** function `manage_files`, offers a set of actions to manipulate files in several ways, such as:
- delete Objects in S3;
- copy Objects in S3;
- restore Objects from S3 Glacier;
- check the status of a restore from S3 Glacier;
- request a restore of objects from S3 Glacier and wait for them to be copied to a destination.
⭐ **Notifications:** you can configure and send email notifications.
---
> ***Note:*** it can be used as an independent function (`send_notification`) or as a `terminator_spec`, using the function `notify`.
---
📖 In case you want to check further details you can check the documentation of the [Lakehouse Engine facade](lakehouse_engine/engine.html).
## Installation
As the Lakehouse Engine is built as wheel (look into our **build** and **deploy** make targets) you can install it as any other python package using **pip**.
```
pip install lakehouse-engine
```
Alternatively, you can also upload the wheel to any target of your like (e.g. S3) and perform a pip installation pointing to that target location.
---
> ***Note:*** The Lakehouse Engine is packaged with plugins or optional dependencies, which are not installed by default. The goal is
> to make its installation lighter and to avoid unnecessary dependencies. You can check all the optional dependencies in
> the [tool.setuptools.dynamic] section of the [pyproject.toml](pyproject.toml) file. They are currently: os, dq, azure and sftp. So,
> in case you want to make usage of the Data Quality features offered in the Lakehouse Engine, instead of running the previous command, you should run
> the command below, which will bring the core functionalities, plus DQ.
> ```
> pip install lakehouse-engine[dq]
> ```
> In case you are in an environment without pre-install spark and delta, you will also want to install the `os` optional dependencies, like so:
> ```
> pip install lakehouse-engine[os]
> ```
> And in case you want to install several optional dependencies, you can run a command like:
> ```
> pip install lakehouse-engine[dq,sftp]
> ```
> It is advisable for a Data Product to pin a specific version of the Lakehouse Engine (and have recurring upgrading activities)
> to avoid breaking changes in a new release.
> In case you don't want to be so conservative, you can pin to a major version, which usually shouldn't include changes that break backwards compatibility.
---
## How Data Products use the Lakehouse Engine Framework?
<img src="assets/img/lakehouse_dp_usage.drawio.png?raw=true" style="max-width: 800px; height: auto; "/>
The Lakehouse Engine is a configuration-first Data Engineering framework, using the concept of ACONs to configure algorithms.
An ACON, stands for Algorithm Configuration and is a JSON representation, as the [Load Data Usage Example](#load-data-usage-example) demonstrates.
Below you find described the main keywords you can use to configure and ACON for a Data Load.
---
> ***Note:*** the usage logic for the other [algorithms/features presented](#key-features) will always be similar, but using different keywords,
which you can search for in the examples and documentation provided in the [Key Features](#key-features) and [Community Support and Contributing](#community-support-and-contributing) sections.
---
- **Input specifications (input_specs):** specify how to read data. This is a **mandatory** keyword.
- **Transform specifications (transform_specs):** specify how to transform data.
- **Data quality specifications (dq_specs):** specify how to execute the data quality process.
- **Output specifications (output_specs):** specify how to write data to the target. This is a **mandatory** keyword.
- **Terminate specifications (terminate_specs):** specify what to do after writing into the target (e.g., optimising target table, vacuum, compute stats, expose change data feed to external location, etc).
- **Execution environment (exec_env):** custom Spark session configurations to be provided for your algorithm (configurations can also be provided from your job/cluster configuration, which we highly advise you to do instead of passing performance related configs here for example).
## Load Data Usage Example
You can use the Lakehouse Engine in a **pyspark script** or **notebook**.
Below you can find an example on how to execute a Data Load using the Lakehouse Engine, which is doing the following:
1. Read CSV files, from a specified location, in a streaming fashion and providing a specific schema and some additional
options for properly read the files (e.g. header, delimiter...);
2. Apply two transformations on the input data:
1. Add a new column having the Row ID;
2. Add a new column `extraction_date`, which extracts the date from the `lhe_extraction_filepath`, based on a regex.
3. Apply Data Quality validations and store the result of their execution in the table `your_database.order_events_dq_checks`:
1. Check if the column `omnihub_locale_code` is not having null values;
2. Check if the distinct value count for the column `product_division` is between 10 and 100;
3. Check if the max of the column `so_net_value` is between 10 and 1000;
4. Check if the length of the values in the column `omnihub_locale_code` is between 1 and 10;
5. Check if the mean of the values for the column `coupon_code` is between 15 and 20.
4. Write the output into the table `your_database.order_events_with_dq` in a delta format, partitioned by `order_date_header`
and applying a merge predicate condition, ensuring the data is only inserted into the table if it does not match the predicate
(meaning the data is not yet available in the table). Moreover, the `insert_only` flag is used to specify that there should not
be any updates or deletes in the target table, only inserts;
5. Optimize the Delta Table that we just wrote in (e.g. z-ordering);
6. Specify 3 custom Spark Session configurations.
---
> ⚠️ ***Note:*** `spec_id` is one of the main concepts to ensure you can chain the steps of the algorithm,
so, for example, you can specify the transformations (in `transform_specs`) of a DataFrame that was read in the `input_specs`.
---
```python
from lakehouse_engine.engine import load_data
acon = {
"input_specs": [
{
"spec_id": "orders_bronze",
"read_type": "streaming",
"data_format": "csv",
"schema_path": "s3://my-data-product-bucket/artefacts/metadata/bronze/schemas/orders.json",
"with_filepath": True,
"options": {
"badRecordsPath": "s3://my-data-product-bucket/badrecords/order_events_with_dq/",
"header": False,
"delimiter": "\u005E",
"dateFormat": "yyyyMMdd",
},
"location": "s3://my-data-product-bucket/bronze/orders/",
}
],
"transform_specs": [
{
"spec_id": "orders_bronze_with_extraction_date",
"input_id": "orders_bronze",
"transformers": [
{"function": "with_row_id"},
{
"function": "with_regex_value",
"args": {
"input_col": "lhe_extraction_filepath",
"output_col": "extraction_date",
"drop_input_col": True,
"regex": ".*WE_SO_SCL_(\\d+).csv",
},
},
],
}
],
"dq_specs": [
{
"spec_id": "check_orders_bronze_with_extraction_date",
"input_id": "orders_bronze_with_extraction_date",
"dq_type": "validator",
"result_sink_db_table": "your_database.order_events_dq_checks",
"fail_on_error": False,
"dq_functions": [
{
"dq_function": "expect_column_values_to_not_be_null",
"args": {
"column": "omnihub_locale_code"
}
},
{
"dq_function": "expect_column_unique_value_count_to_be_between",
"args": {
"column": "product_division",
"min_value": 10,
"max_value": 100
},
},
{
"dq_function": "expect_column_max_to_be_between",
"args": {
"column": "so_net_value",
"min_value": 10,
"max_value": 1000
}
},
{
"dq_function": "expect_column_value_lengths_to_be_between",
"args": {
"column": "omnihub_locale_code",
"min_value": 1,
"max_value": 10
},
},
{
"dq_function": "expect_column_mean_to_be_between",
"args": {
"column": "coupon_code",
"min_value": 15,
"max_value": 20
}
},
],
},
],
"output_specs": [
{
"spec_id": "orders_silver",
"input_id": "check_orders_bronze_with_extraction_date",
"data_format": "delta",
"write_type": "merge",
"partitions": ["order_date_header"],
"merge_opts": {
"merge_predicate": """
new.sales_order_header = current.sales_order_header
AND new.sales_order_schedule = current.sales_order_schedule
AND new.sales_order_item=current.sales_order_item
AND new.epoch_status=current.epoch_status
AND new.changed_on=current.changed_on
AND new.extraction_date=current.extraction_date
AND new.lhe_batch_id=current.lhe_batch_id
AND new.lhe_row_id=current.lhe_row_id
""",
"insert_only": True,
},
"db_table": "your_database.order_events_with_dq",
"options": {
"checkpointLocation": "s3://my-data-product-bucket/checkpoints/template_order_events_with_dq/"
},
}
],
"terminate_specs": [
{
"function": "optimize_dataset",
"args": {
"db_table": "your_database.order_events_with_dq"
}
}
],
"exec_env": {
"spark.databricks.delta.schema.autoMerge.enabled": True,
"spark.databricks.delta.optimizeWrite.enabled": True,
"spark.databricks.delta.autoCompact.enabled": True,
},
}
load_data(acon=acon)
```
---
> ***Note:*** Although it is possible to interact with the Lakehouse Engine functions directly from your python code,
instead of relying on creating an ACON dict and use the engine api, we do not ensure the stability across new
Lakehouse Engine releases when calling internal functions (not exposed in the facade) directly.
---
---
> ***Note:*** ACON structure might change across releases, please test your Data Product first before updating to a
new version of the Lakehouse Engine in your Production environment.
---
## Who maintains the Lakehouse Engine?
The Lakehouse Engine is under active development and production usage by the Adidas Lakehouse Foundations Engineering team.
## Community Support and Contributing
🤝 Do you want to contribute or need any support? Check out all the details in [CONTRIBUTING.md](https://github.com/adidas/lakehouse-engine/blob/master/CONTRIBUTING.md).
## License and Software Information
© adidas AG
adidas AG publishes this software and accompanied documentation (if any) subject to the terms of the [license](https://github.com/adidas/lakehouse-engine/blob/master/LICENSE.txt)
with the aim of helping the community with our tools and libraries which we think can be also useful for other people.
You will find a copy of the [license](https://github.com/adidas/lakehouse-engine/blob/master/LICENSE.txt) in the root folder of this package. All rights not explicitly granted
to you under the [license](https://github.com/adidas/lakehouse-engine/blob/master/LICENSE.txt) remain the sole and exclusive property of adidas AG.
---
> ***NOTICE:*** The software has been designed solely for the purposes described in this ReadMe file. The software is NOT designed,
tested or verified for productive use whatsoever, nor or for any use related to high risk environments, such as health care,
highly or fully autonomous driving, power plants, or other critical infrastructures or services.
---
If you want to contact adidas regarding the software, you can mail us at software.engineering@adidas.com.
For further information open the [adidas terms and conditions](https://github.com/adidas/adidas-contribution-guidelines/wiki/Terms-and-conditions) page.
Raw data
{
"_id": null,
"home_page": null,
"name": "lakehouse-engine",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": null,
"keywords": "framework, big-data, spark, databricks, data-quality, data-engineering, great-expectations, lakehouse, delta-lake, configuration-driver",
"author": null,
"author_email": "Adidas Lakehouse Foundations Team <software.engineering@adidas.com>",
"download_url": null,
"platform": null,
"description": "<img align=\"right\" src=\"assets/img/lakehouse_engine_logo_no_bg_160.png\" alt=\"Lakehouse Engine Logo\">\n\n# Lakehouse Engine\nA configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.\n\n---\n> ***Note:*** whenever you read Data Product or Data Product team, we want to refer to Teams and use cases, whose main focus is on \nleveraging the power of data, on a particular topic, end-to-end (ingestion, consumption...) to achieve insights, supporting faster and better decisions, \nwhich generate value for their businesses. These Teams should not be focusing on building reusable frameworks, but on re-using the existing frameworks to achieve their goals.\n\n---\n\n## Main Goals\nThe goal of the Lakehouse Engine is to bring some advantages, such as:\n- offer cutting-edge, standard, governed and battle-tested foundations that several Data Product teams can benefit from;\n- avoid that Data Product teams develop siloed solutions, reducing technical debts and high operating costs (redundant developments across teams);\n- allow Data Product teams to focus mostly on data-related tasks, avoiding wasting time & resources on developing the same code for different use cases;\n- benefit from the fact that many teams are reusing the same code, which increases the likelihood that common issues are surfaced and solved faster;\n- decrease the dependency and learning curve to Spark and other technologies that the Lakehouse Engine abstracts;\n- speed up repetitive tasks;\n- reduced vendor lock-in.\n\n---\n > ***Note:*** even though you will see a focus on AWS and Databricks, this is just due to the lack of use cases for other technologies like GCP and Azure, but we are open for contribution.\n\n---\n\n## Key Features\n\u2b50 **Data Loads:** perform data loads from diverse source types and apply transformations and data quality validations, \nensuring trustworthy data, before integrating it into distinct target types. Additionally, people can also define termination \nactions like optimisations or notifications. [On the usage section](#load-data-usage-example) you will find an example using all the supported keywords for data loads.\n\n---\n> ***Note:*** The Lakehouse \nEngine supports different types of sources and targets, such as, kafka, jdbc, dataframes, files (csv, parquet, json, delta...), sftp, sap bw, sap b4...\n\n---\n\n\u2b50 **Transformations:** configuration driven transformations without the need to write any spark code. Transformations can be applied by using the `transform_specs` in the Data Loads.\n\n---\n> ***Note:*** you can search all the available transformations, as well as checking implementation details and examples [here](https://adidas.github.io/lakehouse-engine-docs/lakehouse_engine/transformers.html).\n\n---\n\n\u2b50 **Data Quality Validations:** the Lakehouse Engine uses Great Expectations as a backend and abstracts any implementation\ndetails by offering people the capability to specify what validations to apply on the data, solely using dict/json based configurations.\nThe Data Quality validations can be applied on:\n- post-mortem (static) data, using the DQ Validator algorithm (`execute_dq_validation`)\n- data in-motion, using the `dq_specs` keyword in the Data Loads, to add it as one more step while loading data. \n[On the usage section](#load-data-usage-example) you will find an example using this type of Data Quality validations.\n\n\u2b50 **Reconciliation:** useful algorithm to compare two source of data, by defining one version of the `truth` to compare\nagainst the `current` version of the data. It can be particularly useful during migrations phases, two compare a few KPIs\nand ensure the new version of a table (`current`), for example, delivers the same vision of the data as the old one (`truth`).\nFind usage examples [here](lakehouse_engine_usage/reconciliator.html).\n\n\u2b50 **Sensors:** an abstraction to otherwise complex spark code that can be executed in very small single-node clusters\nto check if an upstream system or Data Product contains new data since the last execution. With this feature, people can\ntrigger jobs to run in more frequent intervals and if the upstream does not contain new data, then the rest of the job\nexits without creating bigger clusters to execute more intensive data ETL (Extraction, Transformation, and Loading).\nFind usage examples [here](lakehouse_engine_usage/sensor.html).\n\n\u2b50 **Terminators:** this feature allow people to specify what to do as a last action, before finishing a Data Load.\nSome examples of actions are: optimising target table, vacuum, compute stats, expose change data feed to external location\nor even send e-mail notifications. Thus, it is specifically used in Data Loads, using the `terminate_specs` keyword.\n[On the usage section](#load-data-usage-example) you will find an example using terminators.\n\n\u2b50 **Table Manager:** function `manage_table`, offers a set of actions to manipulate tables/views in several ways, such as:\n- compute table statistics;\n- create/drop tables and views;\n- delete/truncate/repair tables;\n- vacuum delta tables or locations;\n- optimize table;\n- describe table;\n- show table properties;\n- execute sql.\n\n\u2b50 **File Manager:** function `manage_files`, offers a set of actions to manipulate files in several ways, such as:\n- delete Objects in S3;\n- copy Objects in S3;\n- restore Objects from S3 Glacier;\n- check the status of a restore from S3 Glacier;\n- request a restore of objects from S3 Glacier and wait for them to be copied to a destination.\n\n\n\u2b50 **Notifications:** you can configure and send email notifications.\n\n---\n> ***Note:*** it can be used as an independent function (`send_notification`) or as a `terminator_spec`, using the function `notify`.\n\n---\n\n\ud83d\udcd6 In case you want to check further details you can check the documentation of the [Lakehouse Engine facade](lakehouse_engine/engine.html).\n\n## Installation\nAs the Lakehouse Engine is built as wheel (look into our **build** and **deploy** make targets) you can install it as any other python package using **pip**.\n\n```\npip install lakehouse-engine\n```\n\nAlternatively, you can also upload the wheel to any target of your like (e.g. S3) and perform a pip installation pointing to that target location.\n\n---\n> ***Note:*** The Lakehouse Engine is packaged with plugins or optional dependencies, which are not installed by default. The goal is\n> to make its installation lighter and to avoid unnecessary dependencies. You can check all the optional dependencies in\n> the [tool.setuptools.dynamic] section of the [pyproject.toml](pyproject.toml) file. They are currently: os, dq, azure and sftp. So,\n> in case you want to make usage of the Data Quality features offered in the Lakehouse Engine, instead of running the previous command, you should run\n> the command below, which will bring the core functionalities, plus DQ.\n> ```\n> pip install lakehouse-engine[dq]\n> ```\n> In case you are in an environment without pre-install spark and delta, you will also want to install the `os` optional dependencies, like so:\n> ```\n> pip install lakehouse-engine[os]\n> ```\n> And in case you want to install several optional dependencies, you can run a command like:\n> ```\n> pip install lakehouse-engine[dq,sftp]\n> ```\n> It is advisable for a Data Product to pin a specific version of the Lakehouse Engine (and have recurring upgrading activities)\n> to avoid breaking changes in a new release.\n> In case you don't want to be so conservative, you can pin to a major version, which usually shouldn't include changes that break backwards compatibility.\n\n---\n\n## How Data Products use the Lakehouse Engine Framework?\n<img src=\"assets/img/lakehouse_dp_usage.drawio.png?raw=true\" style=\"max-width: 800px; height: auto; \"/>\n\nThe Lakehouse Engine is a configuration-first Data Engineering framework, using the concept of ACONs to configure algorithms. \nAn ACON, stands for Algorithm Configuration and is a JSON representation, as the [Load Data Usage Example](#load-data-usage-example) demonstrates. \n\nBelow you find described the main keywords you can use to configure and ACON for a Data Load.\n\n---\n> ***Note:*** the usage logic for the other [algorithms/features presented](#key-features) will always be similar, but using different keywords, \nwhich you can search for in the examples and documentation provided in the [Key Features](#key-features) and [Community Support and Contributing](#community-support-and-contributing) sections.\n\n---\n\n- **Input specifications (input_specs):** specify how to read data. This is a **mandatory** keyword.\n- **Transform specifications (transform_specs):** specify how to transform data.\n- **Data quality specifications (dq_specs):** specify how to execute the data quality process.\n- **Output specifications (output_specs):** specify how to write data to the target. This is a **mandatory** keyword.\n- **Terminate specifications (terminate_specs):** specify what to do after writing into the target (e.g., optimising target table, vacuum, compute stats, expose change data feed to external location, etc).\n- **Execution environment (exec_env):** custom Spark session configurations to be provided for your algorithm (configurations can also be provided from your job/cluster configuration, which we highly advise you to do instead of passing performance related configs here for example).\n\n## Load Data Usage Example\n\nYou can use the Lakehouse Engine in a **pyspark script** or **notebook**.\nBelow you can find an example on how to execute a Data Load using the Lakehouse Engine, which is doing the following:\n1. Read CSV files, from a specified location, in a streaming fashion and providing a specific schema and some additional \noptions for properly read the files (e.g. header, delimiter...);\n2. Apply two transformations on the input data:\n 1. Add a new column having the Row ID;\n 2. Add a new column `extraction_date`, which extracts the date from the `lhe_extraction_filepath`, based on a regex.\n3. Apply Data Quality validations and store the result of their execution in the table `your_database.order_events_dq_checks`:\n 1. Check if the column `omnihub_locale_code` is not having null values;\n 2. Check if the distinct value count for the column `product_division` is between 10 and 100;\n 3. Check if the max of the column `so_net_value` is between 10 and 1000;\n 4. Check if the length of the values in the column `omnihub_locale_code` is between 1 and 10;\n 5. Check if the mean of the values for the column `coupon_code` is between 15 and 20.\n4. Write the output into the table `your_database.order_events_with_dq` in a delta format, partitioned by `order_date_header`\nand applying a merge predicate condition, ensuring the data is only inserted into the table if it does not match the predicate\n(meaning the data is not yet available in the table). Moreover, the `insert_only` flag is used to specify that there should not \nbe any updates or deletes in the target table, only inserts;\n5. Optimize the Delta Table that we just wrote in (e.g. z-ordering);\n6. Specify 3 custom Spark Session configurations.\n\n---\n> \u26a0\ufe0f ***Note:*** `spec_id` is one of the main concepts to ensure you can chain the steps of the algorithm,\nso, for example, you can specify the transformations (in `transform_specs`) of a DataFrame that was read in the `input_specs`.\n\n---\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n \"input_specs\": [\n {\n \"spec_id\": \"orders_bronze\",\n \"read_type\": \"streaming\",\n \"data_format\": \"csv\",\n \"schema_path\": \"s3://my-data-product-bucket/artefacts/metadata/bronze/schemas/orders.json\",\n \"with_filepath\": True,\n \"options\": {\n \"badRecordsPath\": \"s3://my-data-product-bucket/badrecords/order_events_with_dq/\",\n \"header\": False,\n \"delimiter\": \"\\u005E\",\n \"dateFormat\": \"yyyyMMdd\",\n },\n \"location\": \"s3://my-data-product-bucket/bronze/orders/\",\n }\n ],\n \"transform_specs\": [\n {\n \"spec_id\": \"orders_bronze_with_extraction_date\",\n \"input_id\": \"orders_bronze\",\n \"transformers\": [\n {\"function\": \"with_row_id\"},\n {\n \"function\": \"with_regex_value\",\n \"args\": {\n \"input_col\": \"lhe_extraction_filepath\",\n \"output_col\": \"extraction_date\",\n \"drop_input_col\": True,\n \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\",\n },\n },\n ],\n }\n ],\n \"dq_specs\": [\n {\n \"spec_id\": \"check_orders_bronze_with_extraction_date\",\n \"input_id\": \"orders_bronze_with_extraction_date\",\n \"dq_type\": \"validator\",\n \"result_sink_db_table\": \"your_database.order_events_dq_checks\",\n \"fail_on_error\": False,\n \"dq_functions\": [\n {\n \"dq_function\": \"expect_column_values_to_not_be_null\", \n \"args\": {\n \"column\": \"omnihub_locale_code\"\n }\n },\n {\n \"dq_function\": \"expect_column_unique_value_count_to_be_between\",\n \"args\": {\n \"column\": \"product_division\", \n \"min_value\": 10,\n \"max_value\": 100\n },\n },\n {\n \"dq_function\": \"expect_column_max_to_be_between\", \n \"args\": {\n \"column\": \"so_net_value\", \n \"min_value\": 10, \n \"max_value\": 1000\n }\n },\n {\n \"dq_function\": \"expect_column_value_lengths_to_be_between\",\n \"args\": {\n \"column\": \"omnihub_locale_code\", \n \"min_value\": 1, \n \"max_value\": 10\n },\n },\n {\n \"dq_function\": \"expect_column_mean_to_be_between\", \n \"args\": {\n \"column\": \"coupon_code\", \n \"min_value\": 15, \n \"max_value\": 20\n }\n },\n ],\n },\n ],\n \"output_specs\": [\n {\n \"spec_id\": \"orders_silver\",\n \"input_id\": \"check_orders_bronze_with_extraction_date\",\n \"data_format\": \"delta\",\n \"write_type\": \"merge\",\n \"partitions\": [\"order_date_header\"],\n \"merge_opts\": {\n \"merge_predicate\": \"\"\"\n new.sales_order_header = current.sales_order_header\n AND new.sales_order_schedule = current.sales_order_schedule\n AND new.sales_order_item=current.sales_order_item\n AND new.epoch_status=current.epoch_status\n AND new.changed_on=current.changed_on\n AND new.extraction_date=current.extraction_date\n AND new.lhe_batch_id=current.lhe_batch_id\n AND new.lhe_row_id=current.lhe_row_id\n \"\"\",\n \"insert_only\": True,\n },\n \"db_table\": \"your_database.order_events_with_dq\",\n \"options\": {\n \"checkpointLocation\": \"s3://my-data-product-bucket/checkpoints/template_order_events_with_dq/\"\n },\n }\n ],\n \"terminate_specs\": [\n {\n \"function\": \"optimize_dataset\",\n \"args\": {\n \"db_table\": \"your_database.order_events_with_dq\"\n }\n }\n ],\n \"exec_env\": {\n \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n \"spark.databricks.delta.optimizeWrite.enabled\": True,\n \"spark.databricks.delta.autoCompact.enabled\": True,\n },\n}\n\nload_data(acon=acon)\n```\n\n---\n> ***Note:*** Although it is possible to interact with the Lakehouse Engine functions directly from your python code, \ninstead of relying on creating an ACON dict and use the engine api, we do not ensure the stability across new \nLakehouse Engine releases when calling internal functions (not exposed in the facade) directly.\n\n---\n\n---\n> ***Note:*** ACON structure might change across releases, please test your Data Product first before updating to a \nnew version of the Lakehouse Engine in your Production environment.\n\n---\n\n## Who maintains the Lakehouse Engine?\nThe Lakehouse Engine is under active development and production usage by the Adidas Lakehouse Foundations Engineering team. \n\n## Community Support and Contributing\n\n\ud83e\udd1d Do you want to contribute or need any support? Check out all the details in [CONTRIBUTING.md](https://github.com/adidas/lakehouse-engine/blob/master/CONTRIBUTING.md).\n\n## License and Software Information\n\n\u00a9 adidas AG\n\nadidas AG publishes this software and accompanied documentation (if any) subject to the terms of the [license](https://github.com/adidas/lakehouse-engine/blob/master/LICENSE.txt)\nwith the aim of helping the community with our tools and libraries which we think can be also useful for other people.\nYou will find a copy of the [license](https://github.com/adidas/lakehouse-engine/blob/master/LICENSE.txt) in the root folder of this package. All rights not explicitly granted\nto you under the [license](https://github.com/adidas/lakehouse-engine/blob/master/LICENSE.txt) remain the sole and exclusive property of adidas AG.\n\n---\n> ***NOTICE:*** The software has been designed solely for the purposes described in this ReadMe file. The software is NOT designed,\ntested or verified for productive use whatsoever, nor or for any use related to high risk environments, such as health care,\nhighly or fully autonomous driving, power plants, or other critical infrastructures or services.\n\n---\n\nIf you want to contact adidas regarding the software, you can mail us at software.engineering@adidas.com.\n\nFor further information open the [adidas terms and conditions](https://github.com/adidas/adidas-contribution-guidelines/wiki/Terms-and-conditions) page.\n",
"bugtrack_url": null,
"license": "Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. \"License\" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. \"Licensor\" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. \"Legal Entity\" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, \"control\" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. \"You\" (or \"Your\") shall mean an individual or Legal Entity exercising permissions granted by this License. \"Source\" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. \"Object\" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. \"Work\" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). \"Derivative Works\" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. \"Contribution\" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, \"submitted\" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as \"Not a Contribution.\" \"Contributor\" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a \"NOTICE\" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets \"[]\" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same \"printed page\" as the copyright notice for easier identification within third-party archives. Copyright 2023 adidas AG Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ",
"summary": "A configuration-driven Spark framework serving as the engine for several lakehouse algorithms and data flows.",
"version": "1.23.0",
"project_urls": {
"Documentation": "https://adidas.github.io/lakehouse-engine-docs/index.html",
"Issues": "https://github.com/adidas/lakehouse-engine/issues",
"Releases": "https://github.com/adidas/lakehouse-engine/releases",
"Repository": "https://github.com/adidas/lakehouse-engine"
},
"split_keywords": [
"framework",
" big-data",
" spark",
" databricks",
" data-quality",
" data-engineering",
" great-expectations",
" lakehouse",
" delta-lake",
" configuration-driver"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "26079ceff628a2b77743e66936199d59063d63f38e211bd794f9ed08e9ad638f",
"md5": "c97fe066252772e59cf4a0b8841c9ea4",
"sha256": "b2337ea67af4026f3b2cd77750b1ad7c5d1b4f60faef695606bb812435c97e9e"
},
"downloads": -1,
"filename": "lakehouse_engine-1.23.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c97fe066252772e59cf4a0b8841c9ea4",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11",
"size": 207608,
"upload_time": "2024-10-28T14:58:58",
"upload_time_iso_8601": "2024-10-28T14:58:58.801559Z",
"url": "https://files.pythonhosted.org/packages/26/07/9ceff628a2b77743e66936199d59063d63f38e211bd794f9ed08e9ad638f/lakehouse_engine-1.23.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-28 14:58:58",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "adidas",
"github_project": "lakehouse-engine",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "lakehouse-engine"
}