# dbxconfig
Configuration framework for Databricks pipelines.
Define configuration and table dependencies in YAML, then retrieve the table mappings as a config model.

Define your tables:
```yaml
landing:
  read:
    landing_dbx_patterns:
      customer_details_1: null
      customer_details_2: null

raw:
  delta_lake:
    raw_dbx_patterns:
      customers:
        ids: id
        depends_on:
          - landing.landing_dbx_patterns.customer_details_1
          - landing.landing_dbx_patterns.customer_details_2
        warning_thresholds:
          invalid_ratio: 0.1
          invalid_rows: 0
          max_rows: 100
          min_rows: 5
        exception_thresholds:
          invalid_ratio: 0.2
          invalid_rows: 2
          max_rows: 1000
          min_rows: 0
        custom_properties:
          process_group: 1

base:
  delta_lake:
    # delta table properties can be set at stage level or table level
    delta_properties:
      delta.appendOnly: true
      delta.autoOptimize.autoCompact: true
      delta.autoOptimize.optimizeWrite: true
      delta.enableChangeDataFeed: false
    base_dbx_patterns:
      customer_details_1:
        ids: id
        depends_on:
          - raw.raw_dbx_patterns.customers
        # delta table properties can be set at stage level or table level
        # table level properties will override stage level properties
        delta_properties:
          delta.enableChangeDataFeed: true
      customer_details_2:
        ids: id
        depends_on:
          - raw.raw_dbx_patterns.customers
```
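The `depends_on` entries across stages form a dependency graph. As a rough illustration of what those declarations buy you (this sketch is not part of the dbxconfig API), the declared edges alone are enough to compute a valid load order with the Python standard library:

```python
# Minimal sketch: derive a load order from the depends_on declarations
# above. Illustrative only; dbxconfig resolves dependencies itself.
from graphlib import TopologicalSorter

# Dependency graph transcribed from tables.yaml: table -> its upstream tables.
dependencies = {
    "raw.raw_dbx_patterns.customers": [
        "landing.landing_dbx_patterns.customer_details_1",
        "landing.landing_dbx_patterns.customer_details_2",
    ],
    "base.base_dbx_patterns.customer_details_1": ["raw.raw_dbx_patterns.customers"],
    "base.base_dbx_patterns.customer_details_2": ["raw.raw_dbx_patterns.customers"],
}

# static_order() yields every table after all of its dependencies.
for table in TopologicalSorter(dependencies).static_order():
    print(table)
```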
Define your load configuration:
```yaml
tables: ./tables.yaml

landing:
  read:
    trigger: customerdetailscomplete-{{filename_date_format}}*.flg
    trigger_type: file
    database: landing_dbx_patterns
    table: "{{table}}"
    container: datalake
    root: "/mnt/{{container}}/data/landing/dbx_patterns/{{table}}/{{path_date_format}}"
    filename: "{{table}}-{{filename_date_format}}*.csv"
    filename_date_format: "%Y%m%d"
    path_date_format: "%Y%m%d"
    format: cloudFiles
    spark_schema: ../Schema/{{table.lower()}}.yaml
    options:
      # autoloader
      cloudFiles.format: csv
      cloudFiles.schemaLocation: /mnt/{{container}}/checkpoint/{{checkpoint}}
      cloudFiles.useIncrementalListing: auto
      # schema
      inferSchema: false
      enforceSchema: true
      columnNameOfCorruptRecord: _corrupt_record
      # csv
      header: false
      mode: PERMISSIVE
      encoding: windows-1252
      delimiter: ","
      escape: '"'
      nullValue: ""
      quote: '"'
      emptyValue: ""

raw:
  delta_lake:
    # delta table properties can be set at stage level or table level
    delta_properties:
      delta.appendOnly: true
      delta.autoOptimize.autoCompact: true
      delta.autoOptimize.optimizeWrite: true
      delta.enableChangeDataFeed: false
    database: raw_dbx_patterns
    table: "{{table}}"
    container: datalake
    root: /mnt/{{container}}/data/raw
    path: "{{database}}/{{table}}"
    options:
      checkpointLocation: /mnt/{{container}}/checkpoint/{{database}}_{{table}}
      mergeSchema: true
```
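The `{{...}}` placeholders are rendered at load time from the table and timeslice context. dbxconfig performs this substitution internally; as a minimal sketch of the idea only, assuming Jinja-style semantics, a path such as `root` expands roughly like this:

```python
# Illustration only: dbxconfig renders {{...}} placeholders itself; this
# sketch just demonstrates the substitution idea with a Jinja template.
from jinja2 import Template

root = Template("/mnt/{{container}}/data/landing/dbx_patterns/{{table}}/{{path_date_format}}")

# With a concrete timeslice, path_date_format resolves to a dated path segment.
print(root.render(container="datalake", table="customer_details_1", path_date_format="20230504"))
# -> /mnt/datalake/data/landing/dbx_patterns/customer_details_1/20230504
```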
Import the config objects into your pipeline:
```python
from dbxconfig import Config, Timeslice, StageType

# build the path to the configuration files
pattern = "auto_load_schema"
config_path = "../Config"

# create a timeslice object for slice loading; use * for all time
# (supports hours, minutes, seconds and sub-second)
timeslice = Timeslice(day="*", month="*", year="*")

# parse the configuration and create the config object
config = Config(config_path=config_path, pattern=pattern)

# get the configuration for a table mapping to load
table_mapping = config.get_table_mapping(
    timeslice=timeslice,
    stage=StageType.raw,
    table="customers"
)

print(table_mapping)
```
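What you do with the mapping is up to the pipeline. The hedged sketch below shows one plausible way to feed it into a Databricks Autoloader read; the attribute names on `table_mapping` (`source`, `options`, `path`) are illustrative assumptions, not the documented dbxconfig model:

```python
# Hedged sketch: attribute names on table_mapping are assumptions, not
# the documented dbxconfig model. `spark` is the Databricks session.
source = table_mapping.source  # hypothetical: the rendered landing read definition

df = (
    spark.readStream.format("cloudFiles")  # Databricks Autoloader
    .options(**source.options)             # cloudFiles.* and csv options from the yaml
    .load(source.path)                     # the rendered root/filename path
)
```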
## Development Setup
```
pip install -r requirements.txt
```
## Unit Tests
To run the unit tests with a coverage report:
```
pip install -e .
pytest test/unit --junitxml=junit/test-results.xml --cov=dbxconfig --cov-report=xml --cov-report=html
```
## Build
```
python setup.py sdist bdist_wheel
```
## Publish
```
twine upload dist/*
```