# dbt-bytehouse

Name: dbt-bytehouse
Version: 1.7.2
Home page: https://github.com/bytehouse-cloud/bytehouse-dbt
Summary: The ByteHouse plugin for dbt (data build tool)
Upload time: 2024-04-26 06:39:57
Author: Rafsan Mazumder
Requires Python: >=3.7
License: MIT
Keywords: bytehouse, dbt, connector
            # Introduction
`dbt` (data build tool) is an open-source tool that enables data analysts and engineers to transform
data in their warehouses simply by writing SELECT statements. `dbt` performs the T (Transform) of ETL and
allows companies to write transformations as queries and orchestrate them in a more efficient way.
ByteHouse dbt connector is a plugin enabling users to build their data warehouse ecosystem with dbt 
and ByteHouse. 
# Table of Contents
- [Introduction](#introduction)
- [Requirements](#requirements)
- [Creating ByteHouse Account](#creating-bytehouse-account)
- [Installation](#installation)
- [dbt Project Setup](#dbt-project-setup)
  * [dbt_project.yml](#dbt-projectyml)
  * [profiles.yml](#profilesyml)
  * [Connection & Authentication Configurations](#connection---authentication-configurations)
    + [Host Address & API Key Configuration](#host-address---api-key-configuration)
  * [Project Initialization](#project-initialization)
  * [Test Warehouse Connection](#test-warehouse-connection)
- [Dataset Ingestion](#dataset-ingestion)
- [dbt Models](#dbt-models)
- [schema.yml](#schemayml)
- [Materialization types of Models](#materialization-types-of-models)
  * [View Materializations](#view-materializations)
  * [Table Materializations](#table-materializations)
  * [Incremental Materializations](#incremental-materializations)
    + [How it works](#how-it-works)
- [Direct Write to CNCH Unique Key Table](#direct-write-to-cnch-unique-key-table)
  * [Requirements](#requirements-1)
  * [Example](#example)
- [Project Documentation](#project-documentation)
- [Local Development](#local-development)
- [Original Author](#original-author)
- [License](#license)
# Requirements
Make sure you have `dbt` & `python` installed on your machine. If not, you can follow this guide: https://docs.getdbt.com/docs/get-started/installation
- dbt v1.7.0 or greater
- python v3.7 or greater
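You can check both from your shell:
```commandline
dbt --version
python --version
```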
# Creating ByteHouse Account
You need to create a ByteHouse account in order to use the bytehouse-dbt connector. You can simply create a free account
following the process described in our official website documentation: https://docs.bytehouse.cloud/en/docs/quick-start <br/>

You can also create a ByteHouse account through Volcano Engine by ByteDance: 
https://www.volcengine.com/product/bytehouse-cloud 
# Installation
Create a new directory where we will instantiate a `Python` virtual environment.
```commandline
mkdir dbt_bytehouse_demo
cd dbt_bytehouse_demo

python -m venv venv
source venv/bin/activate
```
The latest release version can be installed via pip:
```commandline
pip install dbt-bytehouse
```
The current development version can be installed from GitHub:
```commandline
pip install git+https://github.com/bytehouse-cloud/bytehouse-dbt@master#egg=dbt-bytehouse
```
Check whether the installation was successful by verifying that `bytehouse` is listed under Plugins:
```commandline
dbt --version
```
![Version](./examples/1_version.png)
# dbt Project Setup
## dbt_project.yml
Every `dbt` project needs a `dbt_project.yml` file — this is how `dbt` knows a directory is a `dbt` project. The `dbt_project.yml`
file holds the context of your project and tells `dbt` how to build your data sets. Some common configurations
for `dbt_project.yml` are:
<table>
    <tr>
        <td>YAML key</td>
        <td>Value</td>
    </tr>
    <tr>
        <td>name</td>
        <td>Your project’s name in snake case</td>
    </tr>
    <tr>
        <td>version</td>
        <td>Version of your project</td>
    </tr>
    <tr>
        <td>profile</td>
        <td>The profile dbt uses to connect to ByteHouse</td>
    </tr>
    <tr>
        <td>model-paths</td>
        <td>Directories where your model and source files live</td>
    </tr>
    <tr>
        <td>seed-paths</td>
        <td>Directories where your seed files live</td>
    </tr>
    <tr>
        <td>test-paths</td>
        <td>Directories where your test files live</td>
    </tr>
    <tr>
        <td>snapshot-paths</td>
        <td>Directories where your snapshots live</td>
    </tr>
    <tr>
        <td>docs-paths</td>
        <td>Directories where your docs blocks live</td>
    </tr>
</table>
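Putting these keys together, a minimal `dbt_project.yml` might look like the following sketch (the project and profile names are placeholders matching the demo in this guide):
```yaml
name: 'dbt_bytehouse_demo'
version: '1.0.0'
profile: 'bytehouse_profile'

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
snapshot-paths: ["snapshots"]
```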

## profiles.yml
When you invoke `dbt` from the command line, `dbt` parses your `dbt_project.yml` and obtains the profile name. `dbt` then 
checks your `profiles.yml` file for a profile with the same name. A profile contains all the details/credentials 
required to connect to ByteHouse. `dbt` will search the current working directory for the `profiles.yml` file and will
default to the `~/.dbt/` directory if not found.
```yaml
<profile-name>:
  target: <target-name>
  outputs:
    <target-name>:
      type: bytehouse
      schema: <database-name>
      user: bytehouse
      password: <bytehouse-api-key>
      driver: native
      #optional fields
      host: <hostname>
      port: <port>
      region: <region-name>
      warehouse: <warehouse-name>
      retries: 1
      secure: True
      connect_timeout: 10
      send_receive_timeout: 300
      custom_settings: <empty>
```
<table>
    <tr>
        <td>YAML key</td>
        <td>Value</td>
    </tr>
    <tr>
        <td>&lt;profile-name&gt;</td>
        <td>Name of the profile. Has to be the same name as the profile indicated in your dbt_project.yml file</td>
    </tr>
    <tr>
        <td>target</td>
        <td>Default target your dbt project will use. It must be one of the targets you define in your profile</td>
    </tr>
    <tr>
        <td>type</td>
        <td>Must be set to bytehouse</td>
    </tr>
    <tr>
        <td>schema</td>
        <td>Database name</td>
    </tr>
    <tr>
        <td>user</td>
        <td>Must be set to bytehouse</td>
    </tr>
    <tr>
        <td>password</td>
        <td>ByteHouse API Token</td>
    </tr>
    <tr>
        <td>driver</td>
        <td>Must be set to native</td>
    </tr>
    <tr>
        <td>host</td>
        <td>[Optional] The host name of the connection</td>
    </tr>
    <tr>
        <td>port</td>
        <td>[Optional] The port number of the host server</td>
    </tr>
    <tr>
        <td>warehouse</td>
        <td>[Optional] The name of the virtual warehouse that you want to use for this session</td>
    </tr>
    <tr>
        <td>retries</td>
        <td>[Optional] Number of times to retry the initial connection attempt if the error appears to be recoverable</td>
    </tr>
    <tr>
        <td>secure</td>
        <td>[Optional] Whether the connection is secured by TLS. It is suggested to set this to True</td>
    </tr>
    <tr>
        <td>connect_timeout</td>
        <td>[Optional] Connection timeout in seconds. Default is 10 seconds</td>
    </tr>
    <tr>
        <td>send_receive_timeout</td>
        <td>[Optional] Timeout for receiving data from or sending data to ByteHouse. Default is 5 minutes (300 seconds)</td>
    </tr>
    <tr>
        <td>custom_settings</td>
        <td>[Optional] A mapping of ByteHouse-specific user settings to use with the connection</td>
    </tr>
</table>

## Connection & Authentication Configurations
### Host Address & API Key Configuration
Required parameters: `host`, `port`, `user`, `password`
```yaml
bytehouse_profile:
  target: dev
  outputs:
    dev:
      type: bytehouse
      driver: native

      # database
      schema: $DATABASE_NAME

      # target server address
      host: $HOST_ADDRESS  
      port: $PORT_NUMBER

      # account credentials
      user: bytehouse
      password: $API_KEY

      # additional settings
      secure: True
```
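Note that `dbt` does not expand shell-style `$VARIABLES` in `profiles.yml`; the values above are placeholders to be filled in by hand. If you prefer to keep credentials out of the file, dbt's built-in `env_var()` function can read them from the environment, as in this sketch (the variable names are arbitrary):
```yaml
bytehouse_profile:
  target: dev
  outputs:
    dev:
      type: bytehouse
      driver: native
      schema: "{{ env_var('BYTEHOUSE_DATABASE') }}"
      host: "{{ env_var('BYTEHOUSE_HOST') }}"
      port: "{{ env_var('BYTEHOUSE_PORT') | as_number }}"
      user: bytehouse
      password: "{{ env_var('BYTEHOUSE_API_KEY') }}"
      secure: True
```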
## Project Initialization
The `dbt init` command will prompt for a project name & a database adapter, where you have to select bytehouse. This will create
a new folder with your project name, sample files & a `dbt_project.yml` config file, enough to get you started with dbt.
```commandline
dbt init
```
![Init](./examples/2_profiles.png)
Update your `profiles.yml` with the required authentication & target server credentials.
```commandline
nano ~/.dbt/profiles.yml
```
As `dbt` has created a new folder with the same name as your project name, change your current 
directory to the project folder. 
```commandline
cd dbt_bytehouse_demo
```
Update your `dbt_project.yml` file so that its profile name matches the one in `~/.dbt/profiles.yml`.
```commandline
nano dbt_project.yml
```
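For example, if you kept the profile name `bytehouse_profile` from the sample above, the relevant line in `dbt_project.yml` would be:
```yaml
profile: 'bytehouse_profile'
```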
![Project](./examples/3_project.png)
## Test Warehouse Connection
Use the `dbt debug` command to verify the required dependencies & the warehouse connection. On success, it will show
"All checks passed!"
```commandline
dbt debug
```
![Debug](./examples/4_debug.png)
# Dataset Ingestion
To showcase different `dbt` functionalities, we will ingest a small IMDB movie dataset with the following schema. 
The DDL & insertion queries can be found at https://github.com/bytehouse-cloud/bytehouse-dbt/examples/data_loading.sql. 
You can use a ByteHouse SQL worksheet to create the schema & insert the dataset. 
![Schema](./examples/5_schema.png)
To verify that the dataset preparation was successful, we will execute the following query, which summarizes each actor 
along with their total number of movie appearances. 
```sql
SELECT id,
       any(actor_name) as name,
       uniqExact(movie_id) as num_movies,
       avg(rating) as avg_rating,
       max(created_at) as updated_at
FROM (
  SELECT actors.id as id,
         concat(actors.first_name, ' ', actors.last_name) as actor_name,
         movies.id as movie_id,
         movies.rating as rating,
         created_at
  FROM imdb.actors
       JOIN imdb.roles ON roles.actor_id = actors.id
       LEFT OUTER JOIN imdb.movies ON movies.id = roles.movie_id
)
GROUP BY id
```
The result set should look like this:
![Schema Confirm](./examples/6_schema_confirm.png)
# dbt Models 
In a `dbt` project, a model is a SQL file located inside the `models/` directory that contains a `SELECT` statement 
describing a transformation. The name of the model file becomes the name of the table/view after
`dbt` execution. When we execute the `dbt run` command, `dbt` builds this model directly in ByteHouse by wrapping
it in a create table / view materialization. The materialization type of your model determines the actual SQL
that `dbt` uses to create the model in the warehouse.
# schema.yml
The `schema.yml` file defines our tables & columns by referring to them through alias names. These sources can later be
referenced in different models/macros via the `source()` function.
Create a `schema.yml` file under the `models/` directory:
```commandline
touch models/schema.yml
```
Define our sources like this:
```yaml
version: 2

sources:
- name: imdb
  tables:
  - name: actors
  - name: roles
  - name: movies
```
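With the sources declared, models/macros can reference them via `source()`. For example, this query compiles to a query against `imdb.actors`:
```sql
-- {{ source('imdb', 'actors') }} resolves to the underlying table imdb.actors
SELECT count(*) FROM {{ source('imdb', 'actors') }}
```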
# Materialization types of Models
## View Materializations
In case of view materialization, a model is rebuilt as a view on each run via a `CREATE VIEW AS` statement in 
ByteHouse. View materializations don't store the actual data, so they are slower to query than table materializations.
Let's create `models/actors_insight.sql` as a view materialization. 
```commandline
touch models/actors_insight.sql
```
```sql
{{ config(materialized='view') }}

SELECT id,
       any(actor_name) as name,
       uniqExact(movie_id) as num_movies,
       avg(rating) as avg_rating,
       max(created_at) as updated_at
FROM (
  SELECT actors.id as id,
         concat(actors.first_name, ' ', actors.last_name) as actor_name,
         movies.id as movie_id,
         movies.rating as rating,
         created_at
  FROM {{ source('imdb', 'actors') }}
       JOIN {{ source('imdb', 'roles') }} ON roles.actor_id = actors.id
       LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON movies.id = roles.movie_id
)
GROUP BY id
```
Let's execute the `dbt run` command to build this model in ByteHouse:
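```commandline
dbt run
```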
![Confirm](./examples/7_actor_insight_confirm.png)
Querying this view, we can replicate the results of our earlier query with a simpler syntax:
```sql
SELECT * FROM imdb.actors_insight ORDER BY num_movies DESC;
```
![Confirm2](./examples/8_materialized_view_run_confirm.png)
## Table Materializations
In case of table materialization, your model is rebuilt as a table on each `dbt run` 
via a `CREATE TABLE AS` statement. 
```commandline
touch models/actors_insight_table.sql
```
We can reuse the SQL from our previous view materialization, changing only the config line for a table materialization:
```sql
{{ config(order_by='(updated_at, id, name)', materialized='table') }}
```
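To build only this model, the same file-path selection syntax used later in this guide works here as well:
```commandline
dbt run -m models/actors_insight_table.sql
```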
We can verify that both the view & table materializations produce the same results:
```sql
SELECT * FROM imdb.actors_insight_table ORDER BY num_movies DESC;
```
## Incremental Materializations
For our previous table materialization, `dbt` will construct a table every time to materialize the model. For larger
or complex transformations, this would be redundant and costly in terms of computing power. Incremental 
materializations solve this problem.<br/>
The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, 
`dbt` transforms only the rows in your source data that you tell `dbt` to filter for, inserting them into the target 
table which is the table that has already been built.<br/>
To tell `dbt` which rows it should transform on an incremental run, wrap valid SQL that filters for these rows 
in the `is_incremental()` macro. Your `is_incremental()` code will check for rows created or modified since the 
last time `dbt` ran this model.
```commandline
touch models/actors_insight_incremental.sql
```
```sql
{{ config(order_by='(updated_at, id, name)', materialized='incremental') }}

SELECT id,
       any(actor_name) as name,
       uniqExact(movie_id) as num_movies,
       avg(rating) as avg_rating,
       max(created_at) as updated_at
FROM (
  SELECT actors.id as id,
         concat(actors.first_name, ' ', actors.last_name) as actor_name,
         movies.id as movie_id,
         movies.rating as rating,
         created_at
  FROM {{ source('imdb', 'actors') }}
       JOIN {{ source('imdb', 'roles') }} ON roles.actor_id = actors.id
       LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON movies.id = roles.movie_id
)
GROUP BY id

{% if is_incremental() %}

-- this filter will only be applied on an incremental run;
-- HAVING (rather than WHERE) is used because the filter follows GROUP BY
HAVING id > (select max(id) from {{ this }}) or updated_at > (select max(updated_at) from {{ this }})

{% endif %}
```
We can verify that the view, table & incremental materializations all produce the same results:
```sql
SELECT * FROM imdb.actors_insight_table ORDER BY num_movies DESC;
```
```commandline
dbt run -m models/actors_insight_incremental.sql
```
Let's insert a few more rows to demonstrate the power of incremental materializations. After this, the actor 
with the most appearances should be 'Chris Pratt'. 
```sql
INSERT INTO imdb.movies VALUES (9, 'Passengers', 2016, 7);
INSERT INTO imdb.movies VALUES (10, 'The Tomorrow War', 2021, 6.5);

INSERT INTO imdb.roles (actor_id, movie_id, role_name) VALUES(4, 9, 'Jim Preston');
INSERT INTO imdb.roles (actor_id, movie_id, role_name) VALUES(4, 10, 'Dan Forester');
```
```commandline
dbt run -m models/actors_insight_incremental.sql
```
```sql
SELECT * FROM imdb.actors_insight_incremental ORDER BY num_movies DESC;
```
![Confirm3](./examples/9_incremental_confirm.png)
### How it works
1. `dbt` first creates a temporary table named `actors_insight_incremental_tmp` & inserts all rows that 
pass our `is_incremental()` filter.
2. A new table `actors_insight_incremental_new` is created & rows from the old table `actors_insight_incremental` 
are ingested into it. `dbt` ensures that the `unique_key` constraint (if one is declared in the config) is maintained 
by excluding rows whose `unique_key` already appears in the temporary table. 
3. The rows from the temporary table are then ingested into the new table.
4. The previous table (`actors_insight_incremental`) & the new table (`actors_insight_incremental_new`) are exchanged, 
as sketched below. 
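The following ClickHouse-style SQL is an illustrative sketch of that sequence for a generic incremental model `my_model` (assuming a source table `src` and `id` declared as the `unique_key`; the adapter generates the real statements):
```sql
-- 1. temporary table holding only the rows that pass the is_incremental() filter
CREATE TABLE my_model_tmp AS
  SELECT * FROM src WHERE updated_at > (SELECT max(updated_at) FROM my_model);

-- 2. new empty table with the old structure, loaded with existing rows that do
--    not share a unique_key (here: id) with the temporary table
CREATE TABLE my_model_new AS my_model;
INSERT INTO my_model_new
  SELECT * FROM my_model
  WHERE id NOT IN (SELECT id FROM my_model_tmp);

-- 3. ingest the rows from the temporary table into the new table
INSERT INTO my_model_new SELECT * FROM my_model_tmp;

-- 4. exchange the previous table and the new table
EXCHANGE TABLES my_model AND my_model_new;
```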

# Direct Write to CNCH Unique Key Table
## Requirements
1. The `dbt-bytehouse` connector version must be at least `1.7.1`.
2. Set the config `cnch_unique_key` instead of `unique_key`.
3. Set the config `incremental_strategy` to `append`.
4. Append a SQL `SETTINGS` expression with the virtual warehouse id at the end of the dbt model, for example:
`SETTINGS virtual_warehouse='vw-id', virtual_warehouse_write='vw-id'`
## Example
```sql
{{
    config(
        materialized='incremental',
        cnch_unique_key='id',
        incremental_strategy='append'
    )
}}

SELECT
    id,
    transaction_date,
    user_id,
    amount,
    status
FROM
    dbt_unique.transactions

{% if is_incremental() %}

  -- compare against the rows already materialized in the target table
  WHERE transaction_date >= (SELECT max(transaction_date) FROM {{ this }})

{% endif %}

SETTINGS virtual_warehouse='vw-id', virtual_warehouse_write='vw-id'
```
# Project Documentation
`dbt` provides a way to generate documentation for your dbt project and render it as a website. 
Create `models/actors_insight_incremental.yml` to generate documentation for our models. 
```yaml
version: 2

models:
  - name: actors_insight_incremental
    description: Actor appearance summary based on roles
    columns: 
      - name: id
        description: The id of the actor
      - name: name
        description: The name of the actor
      - name: num_movies
        description: The number of movies the actor has appeared in
      - name: avg_rating
        description: Average rating
      - name: updated_at
        description: Latest update time
```
Use `dbt docs generate` to generate the documentation for your models & `dbt docs serve` to serve the documentation in 
your local browser on port 8000.
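Both commands are run from the project directory:
```commandline
dbt docs generate
dbt docs serve
```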
![Doc](./examples/10_docs.png)
# Local Development
Update the `tests/integration/conftest.py` file to include your connection credentials. To run the tests locally, follow 
these steps:
```commandline
pip install -r dev_requirements.txt
python -m pytest
```
# Original Author
ByteHouse wants to thank ClickHouse for the original contribution to this connector.

# License
This project is distributed under the terms of the Apache License (Version 2.0).

            
