dbxconfig


Name: dbxconfig
Version: 5.0.7
Home page: https://dbxconfig.readthedocs.io/en/latest/
GitHub: https://github.com/semanticinsight/dbxconfig
Summary: Databricks Configuration Framework
Author: Shaun Ryan
License: MIT
Upload time: 2023-05-04 11:06:07
# dbxconfig

Configuration framework for Databricks pipelines.
Define your configuration and table dependencies in YAML config, then retrieve the table mapping configuration model:

Define your tables:

```yaml
landing:
  read:
    landing_dbx_patterns:
      customer_details_1: null
      customer_details_2: null

raw:
  delta_lake:
    raw_dbx_patterns:
      customers:
        ids: id
        depends_on:
          - landing.landing_dbx_patterns.customer_details_1
          - landing.landing_dbx_patterns.customer_details_2
        warning_thresholds:
          invalid_ratio: 0.1
          invalid_rows: 0
          max_rows: 100
          min_rows: 5
        exception_thresholds:
          invalid_ratio: 0.2
          invalid_rows: 2
          max_rows: 1000
          min_rows: 0
        custom_properties:
          process_group: 1

base:
  delta_lake:
    # delta table properties can be set at stage level or table level
    delta_properties:
      delta.appendOnly: true
      delta.autoOptimize.autoCompact: true    
      delta.autoOptimize.optimizeWrite: true  
      delta.enableChangeDataFeed: false
    base_dbx_patterns:
      customer_details_1:
        ids: id
        depends_on:
          - raw.raw_dbx_patterns.customers
        # delta table properties can be set at stage level or table level
        # table level properties will override stage level properties
        delta_properties:
            delta.enableChangeDataFeed: true
      customer_details_2:
        ids: id
        depends_on:
          - raw.raw_dbx_patterns.customers
```
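The override behaviour noted in the comments above is worth spelling out: table-level `delta_properties` take precedence over the stage-level defaults. As a rough illustration only (not dbxconfig's actual internals), the merge behaves like a dictionary update:

```python
# Illustration only -- how table-level delta_properties can override the
# stage-level defaults shown above; dbxconfig's internal merge may differ.
stage_properties = {
    "delta.appendOnly": True,
    "delta.autoOptimize.autoCompact": True,
    "delta.autoOptimize.optimizeWrite": True,
    "delta.enableChangeDataFeed": False,
}

# customer_details_1 only overrides a single property at table level.
table_properties = {
    "delta.enableChangeDataFeed": True,
}

# Later keys win, so table-level values replace the stage-level defaults.
effective_properties = {**stage_properties, **table_properties}
print(effective_properties["delta.enableChangeDataFeed"])  # True
```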

Define your load configuration:

```yaml
tables: ./tables.yaml

landing:
  read:
    trigger: customerdetailscomplete-{{filename_date_format}}*.flg
    trigger_type: file
    database: landing_dbx_patterns
    table: "{{table}}"
    container: datalake
    root: "/mnt/{{container}}/data/landing/dbx_patterns/{{table}}/{{path_date_format}}"
    filename: "{{table}}-{{filename_date_format}}*.csv"
    filename_date_format: "%Y%m%d"
    path_date_format: "%Y%m%d"
    format: cloudFiles
    spark_schema: ../Schema/{{table.lower()}}.yaml
    options:
      # autoloader
      cloudFiles.format: csv
      cloudFiles.schemaLocation:  /mnt/{{container}}/checkpoint/{{checkpoint}}
      cloudFiles.useIncrementalListing: auto
      # schema
      inferSchema: false
      enforceSchema: true
      columnNameOfCorruptRecord: _corrupt_record
      # csv
      header: false
      mode: PERMISSIVE
      encoding: windows-1252
      delimiter: ","
      escape: '"'
      nullValue: ""
      quote: '"'
      emptyValue: ""
    

raw:
  delta_lake:
    # delta table properties can be set at stage level or table level
    delta_properties:
      delta.appendOnly: true
      delta.autoOptimize.autoCompact: true    
      delta.autoOptimize.optimizeWrite: true  
      delta.enableChangeDataFeed: false
    database: raw_dbx_patterns
    table: "{{table}}"
    container: datalake
    root: /mnt/{{container}}/data/raw
    path: "{{database}}/{{table}}"
    options:
      checkpointLocation: /mnt/{{container}}/checkpoint/{{database}}_{{table}}
      mergeSchema: true
```
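The `{{...}}` tokens in the paths above are substituted from the configuration and the timeslice at load time. The following is a hand-rolled sketch of what the landing paths might resolve to for one table and one date; the real substitution is handled by dbxconfig, and the concrete values below are assumptions for illustration:

```python
# Illustration only -- resolving the templated landing paths by hand.
# dbxconfig performs this substitution from the config model and Timeslice;
# the concrete values here are assumptions for the example.
values = {
    "container": "datalake",
    "table": "customer_details_1",
    "path_date_format": "20230504",      # %Y%m%d rendered for the slice
    "filename_date_format": "20230504",  # %Y%m%d rendered for the slice
}

root = "/mnt/{container}/data/landing/dbx_patterns/{table}/{path_date_format}".format(**values)
filename = "{table}-{filename_date_format}*.csv".format(**values)

print(root)      # /mnt/datalake/data/landing/dbx_patterns/customer_details_1/20230504
print(filename)  # customer_details_1-20230504*.csv
```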

Import the config objects into your pipeline:

```python
from dbxconfig import Config, Timeslice, StageType

# build path to configuration file
pattern = "auto_load_schema"
config_path = "../Config"

# create a timeslice object for slice loading. Use * for all time (supports hrs, mins, seconds and sub-second).
timeslice = Timeslice(day="*", month="*", year="*")

# parse and create the config object
config = Config(config_path=config_path, pattern=pattern)

# get the configuration for a table mapping to load.
table_mapping = config.get_table_mapping(
    timeslice=timeslice, 
    stage=StageType.raw, 
    table="customers"
)

print(table_mapping)
```
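The printed `table_mapping` carries the fully resolved source and destination configuration for the table. As a hedged sketch (the attribute names below are hypothetical, not dbxconfig's documented API; inspect the printed mapping for the real field names), the resolved landing options could feed a Spark Structured Streaming Autoloader read on Databricks:

```python
# Sketch only -- the attributes used on table_mapping (source, options, path)
# are hypothetical; check the rendered mapping for the actual field names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided as `spark` on Databricks

source = table_mapping.source  # hypothetical accessor for the landing read config

df = (
    spark.readStream
    .format("cloudFiles")        # matches `format: cloudFiles` in the config
    .options(**source.options)   # autoloader / csv options from the yaml
    .load(source.path)           # resolved /mnt/... path for the timeslice
)
```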

## Development Setup

```
pip install -r requirements.txt
```

## Unit Tests

To run the unit tests with a coverage report:

```
pip install -e .
pytest test/unit --junitxml=junit/test-results.xml --cov=dbxconfig --cov-report=xml --cov-report=html
```

## Build

```
python setup.py sdist bdist_wheel
```

## Publish


```
twine upload dist/*
```

            
