# Introduction
`dbt` (data build tool) is an open-source tool that enables data analysts and engineers to transform
data in their warehouses simply by writing select statements. `dbt` performs the T (Transform) of ETL and
allows companies to write transformations as queries and orchestrate them more efficiently.
The ByteHouse dbt connector is a plugin that enables users to build their data warehouse ecosystem with dbt
and ByteHouse.
# Table of Contents
- [Introduction](#introduction)
- [Requirements](#requirements)
- [Creating ByteHouse Account](#creating-bytehouse-account)
- [Installation](#installation)
- [dbt Project Setup](#dbt-project-setup)
* [dbt_project.yml](#dbt-projectyml)
* [profiles.yml](#profilesyml)
* [Connection & Authentication Configurations](#connection---authentication-configurations)
+ [Host Address & API Key Configuration](#host-address---api-key-configuration)
* [Project Initialization](#project-initialization)
* [Test Warehouse Connection](#test-warehouse-connection)
- [Dataset Ingestion](#dataset-ingestion)
- [dbt Models](#dbt-models)
- [schema.yml](#schemayml)
- [Materialization types of Models](#materialization-types-of-models)
* [View Materializations](#view-materializations)
* [Table Materializations](#table-materializations)
* [Incremental Materializations](#incremental-materializations)
+ [How it works](#how-it-works)
- [Direct Write to CNCH Unique Key Table](#direct-write-to-cnch-unique-key-table)
- [Project Documentation](#project-documentation)
- [Local Development](#local-development)
- [Original Author](#original-author)
- [License](#license)
# Requirements
Make sure you have `dbt` & `python` installed on your machine. If not, you can follow this guide: https://docs.getdbt.com/docs/get-started/installation
- dbt v1.7.0 or greater
- python v3.7 or greater
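You can confirm both requirements from your terminal:
```commandline
python --version
dbt --version
```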
# Creating ByteHouse Account
You need a ByteHouse account in order to use the bytehouse-dbt connector. You can simply create a free account
following the process described in our official documentation: https://docs.bytehouse.cloud/en/docs/quick-start <br/>
You can also create a ByteHouse account through Volcano Engine by ByteDance:
https://www.volcengine.com/product/bytehouse-cloud
# Installation
Create a new directory in which we will instantiate a `Python` virtual environment.
```commandline
mkdir dbt_bytehouse_demo
cd dbt_bytehouse_demo
python -m venv venv
source venv/bin/activate
```
The latest release version can be installed from PyPI:
```commandline
pip install dbt-bytehouse
```
The current development version can be installed from GitHub:
```commandline
pip install git+https://github.com/bytehouse-cloud/bytehouse-dbt@master#egg=dbt-bytehouse
```
Check whether the installation succeeded by verifying that bytehouse is listed under Plugins in the output of:
```commandline
dbt --version
```
![Version](./examples/1_version.png)
# dbt Project Setup
## dbt_project.yml
Every `dbt` project needs a `dbt_project.yml` file; this is how `dbt` knows a directory is a `dbt` project. The `dbt_project.yml`
file holds the context of your project and tells `dbt` how to build your data sets. Some common configurations
for `dbt_project.yml` are listed below, followed by a minimal example:
<table>
<tr>
<td>YAML key</td>
<td>Value</td>
</tr>
<tr>
<td>name</td>
<td>Your project’s name in snake case</td>
</tr>
<tr>
<td>version</td>
<td>Version of your project</td>
</tr>
<tr>
<td>profile</td>
<td>The profile dbt uses to connect to ByteHouse</td>
</tr>
 <tr>
  <td>model-paths</td>
  <td>Directories where your model and source files live</td>
 </tr>
 <tr>
  <td>seed-paths</td>
  <td>Directories where your seed files live</td>
 </tr>
 <tr>
  <td>test-paths</td>
  <td>Directories where your test files live</td>
 </tr>
 <tr>
  <td>snapshot-paths</td>
  <td>Directories where your snapshots live</td>
 </tr>
 <tr>
  <td>docs-paths</td>
  <td>Directories where your docs blocks live</td>
 </tr>
</table>
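As an illustration, a minimal `dbt_project.yml` for the demo project in this guide might look like this (the profile name is a placeholder; it must match your `profiles.yml`):
```yaml
name: 'dbt_bytehouse_demo'
version: '1.0.0'
profile: 'bytehouse_profile'

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
snapshot-paths: ["snapshots"]
docs-paths: ["docs"]
```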
## profiles.yml
When you invoke `dbt` from the command line, `dbt` parses your `dbt_project.yml` and obtains the profile name. `dbt` then
checks your `profiles.yml` file for a profile with the same name. A profile contains all the details/credentials
required to connect to ByteHouse. `dbt` will search the current working directory for the `profiles.yml` file and will
default to the `~/.dbt/` directory if it is not found.
```yaml
<profile-name>:
  target: <target-name>
  outputs:
    <target-name>:
      type: bytehouse
      schema: <database-name>
      user: bytehouse
      password: <bytehouse-api-key>
      driver: native

      # optional fields
      host: <hostname>
      port: <port>
      region: <region-name>
      warehouse: <warehouse-name>
      retries: 1
      secure: True
      connect_timeout: 10
      send_receive_timeout: 300
      custom_settings: <empty>
```
<table>
<tr>
<td>YAML key</td>
<td>Value</td>
</tr>
<tr>
  <td>&lt;profile-name&gt;</td>
<td>Name of the profile. Has to be the same name as the profile indicated in your dbt_project.yml file</td>
</tr>
<tr>
<td>target</td>
<td>Default target your dbt project will use. It must be one of the targets you define in your profile</td>
</tr>
<tr>
<td>type</td>
<td>Must be set to bytehouse</td>
</tr>
<tr>
<td>schema</td>
<td>Database name</td>
</tr>
<tr>
<td>user</td>
<td>Must be set to bytehouse</td>
</tr>
<tr>
<td>password</td>
<td>ByteHouse API Token</td>
</tr>
<tr>
<td>driver</td>
<td>Must be set to native</td>
</tr>
<tr>
<td>host</td>
<td>[Optional] The host name of the connection</td>
</tr>
<tr>
  <td>port</td>
<td>[Optional] The port number of the host server</td>
</tr>
<tr>
<td>warehouse</td>
<td>[Optional] The name of the virtual warehouse that you want to use for this session</td>
</tr>
<tr>
<td>retries</td>
<td>[Optional] Number of times to retry the initial connection attempt if the error appears to be recoverable</td>
</tr>
<tr>
<td>secure</td>
  <td>[Optional] Whether the connection is secured by TLS. It is recommended to set this to True</td>
</tr>
<tr>
<td>connect_timeout</td>
<td>[Optional] Connection timeout in seconds. Default is 10 seconds</td>
</tr>
<tr>
<td>send_receive_timeout</td>
<td>[Optional] Timeout for receiving data from or sending data to ByteHouse. Default is 5 minutes (300 seconds)</td>
</tr>
<tr>
<td>custom_settings</td>
<td>[Optional] A mapping of ByteHouse specific user settings to use with the connection</td>
</tr>
</table>
## Connection & Authentication Configurations
### Host Address & API Key Configuration
Required parameters: `host` `port` `user` `password`
```yaml
bytehouse_profile:
  target: dev
  outputs:
    dev:
      type: bytehouse
      driver: native

      # database
      schema: $DATABASE_NAME

      # target server address
      host: $HOST_ADDRESS
      port: $PORT_NUMBER

      # account credentials
      user: bytehouse
      password: $API_KEY

      # additional settings
      secure: True
```
## Project Initialization
`dbt init` command will prompt for project name & database adapters, where you have to select bytehouse. This will create
a new folder with your project name, sample files & `dbt_project.yml` config file, enough to get you started with dbt.
```commandline
dbt init
```
![Init](./examples/2_profiles.png)
Update your `profiles.yml` with the required authentication & target server credentials.
```commandline
nano ~/.dbt/profiles.yml
```
Since `dbt` has created a new folder with the same name as your project, change your current
directory to the project folder.
```commandline
cd dbt_bytehouse_demo
```
Update your `dbt_project.yml` file so that its profile name matches the one in `~/.dbt/profiles.yml`.
```commandline
nano dbt_project.yml
```
![Project](./examples/3_project.png)
## Test Warehouse Connection
Use the `dbt debug` command to verify the required dependencies & the warehouse connection. On success, it will show
"All checks passed!"
```commandline
dbt debug
```
![Debug](./examples/4_debug.png)
# Dataset Ingestion
To showcase different `dbt` functionalities, we will ingest a small IMDB movie dataset with the following schema.
The DDL & insertion queries can be found at https://github.com/bytehouse-cloud/bytehouse-dbt/examples/data_loading.sql.
You can use a ByteHouse SQL worksheet to create the schema & insert the dataset.
![Schema](./examples/5_schema.png)
To verify that dataset preparation was successful, execute the following query, which summarizes each actor along
with their total number of movie appearances.
```
SELECT id,
       any(actor_name) as name,
       uniqExact(movie_id) as num_movies,
       avg(rating) as avg_rating,
       max(created_at) as updated_at
FROM (
    SELECT actors.id as id,
           concat(actors.first_name, ' ', actors.last_name) as actor_name,
           movies.id as movie_id,
           movies.rating as rating,
           created_at
    FROM imdb.actors
    JOIN imdb.roles ON roles.actor_id = actors.id
    LEFT OUTER JOIN imdb.movies ON movies.id = roles.movie_id
)
GROUP BY id
```
The result set should look like this:
![Schema Confirm](./examples/6_schema_confirm.png)
# dbt Models
In a `dbt` project, a model is a SQL file inside the `models/` directory that contains a `SELECT` statement
describing a transformation. The name of the model file determines the name of the table or view created after
`dbt` execution. When we execute the `dbt run` command, `dbt` builds this model directly in ByteHouse by wrapping
it in a create table / view materialization. The materialization type of your model determines the actual SQL
that `dbt` uses to create the model in the warehouse.
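For instance, a hypothetical `models/my_first_model.sql` containing nothing but a configured `SELECT` is already a complete model:
```
{{ config(materialized='view') }}

SELECT 1 AS id, 'hello dbt' AS greeting
```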
# schema.yml
The `schema.yml` file defines our tables & columns by alias name. These sources can later be
used in different models/macros via the `source()` function.
Create a `schema.yml` file under the `models/` directory:
```commandline
touch models/schema.yml
```
Define our source schema like this:
```yaml
version: 2

sources:
  - name: imdb
    tables:
      - name: actors
      - name: roles
      - name: movies
```
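Any model can now refer to these tables through `source()`; for example, the following compiles to `imdb.actors`:
```
SELECT * FROM {{ source('imdb', 'actors') }}
```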
# Materialization types of Models
## View Materializations
With view materialization, a model is rebuilt as a view on each run via a `CREATE VIEW AS` statement in
ByteHouse. View materializations don't store the actual data, so they are slower to query than table materializations.
Let's create `models/actors_insight.sql` as a view materialization.
```commandline
touch models/actors_insight.sql
```
```
{{ config(materialized='view') }}

SELECT id,
       any(actor_name) as name,
       uniqExact(movie_id) as num_movies,
       avg(rating) as avg_rating,
       max(created_at) as updated_at
FROM (
    SELECT actors.id as id,
           concat(actors.first_name, ' ', actors.last_name) as actor_name,
           movies.id as movie_id,
           movies.rating as rating,
           created_at
    FROM {{ source('imdb', 'actors') }}
    JOIN {{ source('imdb', 'roles') }} ON roles.actor_id = actors.id
    LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON movies.id = roles.movie_id
)
GROUP BY id
```
Let's execute the `dbt run` command to build this model in ByteHouse.
![Confirm](./examples/7_actor_insight_confirm.png)
Querying this view, we can replicate the results of our earlier query with a simpler syntax:
```
SELECT * FROM imdb.actors_insight ORDER BY num_movies DESC;
```
![Confirm2](./examples/8_materialized_view_run_confirm.png)
## Table Materializations
With table materialization, your model is rebuilt as a table on each `dbt run`
via a `CREATE TABLE AS` statement.
```commandline
touch models/actors_insight_table.sql
```
We can reuse our previous view materialization SQL, changing only the config line to request a table materialization (the complete file is shown after the snippet).
```
{{ config(order_by='(updated_at, id, name)', materialized='table') }}
```
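So the complete `models/actors_insight_table.sql` is the same query as the view model with the updated config line:
```
{{ config(order_by='(updated_at, id, name)', materialized='table') }}

SELECT id,
       any(actor_name) as name,
       uniqExact(movie_id) as num_movies,
       avg(rating) as avg_rating,
       max(created_at) as updated_at
FROM (
    SELECT actors.id as id,
           concat(actors.first_name, ' ', actors.last_name) as actor_name,
           movies.id as movie_id,
           movies.rating as rating,
           created_at
    FROM {{ source('imdb', 'actors') }}
    JOIN {{ source('imdb', 'roles') }} ON roles.actor_id = actors.id
    LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON movies.id = roles.movie_id
)
GROUP BY id
```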
We can verify that both view & table materializations generate the same response.
```
SELECT * FROM imdb.actors_insight_table ORDER BY num_movies DESC;
```
## Incremental Materializations
For our previous table materialization, `dbt` constructs the table from scratch every time it materializes the model. For larger
datasets or more complex transformations, this is redundant and costly in terms of computing power. Incremental
materializations solve this problem.<br/>
The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs,
`dbt` transforms only the rows in your source data that you tell `dbt` to filter for, inserting them into the target
table which is the table that has already been built.<br/>
To tell `dbt` which rows it should transform on an incremental run, wrap valid SQL that filters for these rows
in the `is_incremental()` macro. Your `is_incremental()` code will check for rows created or modified since the
last time `dbt` ran this model.
```commandline
touch models/actors_insight_incremental.sql
```
```
{{ config(order_by='(updated_at, id, name)', materialized='incremental') }}

SELECT id,
       any(actor_name) as name,
       uniqExact(movie_id) as num_movies,
       avg(rating) as avg_rating,
       max(created_at) as updated_at
FROM (
    SELECT actors.id as id,
           concat(actors.first_name, ' ', actors.last_name) as actor_name,
           movies.id as movie_id,
           movies.rating as rating,
           created_at
    FROM {{ source('imdb', 'actors') }}
    JOIN {{ source('imdb', 'roles') }} ON roles.actor_id = actors.id
    LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON movies.id = roles.movie_id
)
GROUP BY id

{% if is_incremental() %}

-- this filter will only be applied on an incremental run
WHERE id > (SELECT max(id) FROM {{ this }}) OR updated_at > (SELECT max(updated_at) FROM {{ this }})

{% endif %}
```
We can verify that the view, table & incremental materializations all generate the same response.
```
SELECT * FROM imdb.actors_insight_table ORDER BY num_movies DESC;
```
```commandline
dbt run -m models/actors_insight_incremental.sql
```
Let's insert a few more rows to demonstrate the power of incremental materializations. Afterwards, the actor
with the most appearances should be 'Chris Pratt'.
```
INSERT INTO imdb.movies VALUES (9, 'Passengers', 2016, 7);
INSERT INTO imdb.movies VALUES (10, 'The Tomorrow War', 2021, 6.5);
INSERT INTO imdb.roles (actor_id, movie_id, role_name) VALUES(4, 9, 'Jim Preston');
INSERT INTO imdb.roles (actor_id, movie_id, role_name) VALUES(4, 10, 'Dan Forester');
```
```commandline
dbt run -m models/actors_insight_incremental.sql
```
```
SELECT * FROM imdb.actors_insight_incremental ORDER BY num_movies DESC;
```
![Confirm3](./examples/9_incremental_confirm.png)
### How it works
1. `dbt` first creates a temporary table named `actors_insight_incremental_tmp` & inserts all rows that
pass our `is_incremental()` filter.
2. A new table `actors_insight_incremental_new` is created & rows from the old table `actors_insight_incremental`
are ingested into it. `dbt` makes sure that the `unique_key` constraint (if one is declared in the config) is maintained
by skipping rows whose `unique_key` already appears in the temporary table.
3. The rows from the temporary table are then ingested into the new table.
4. The previous table (`actors_insight_incremental`) & the new table (`actors_insight_incremental_new`) are exchanged,
as sketched below.
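Under those steps, the adapter issues SQL roughly like the following. This is a simplified sketch: the temporary-table DDL, the placeholder for the model query, and the exact exchange syntax are illustrative, not the literal statements the connector emits.
```
-- 1. Stage the rows that pass the is_incremental() filter.
--    <model query> stands for the SELECT ... GROUP BY id statement above,
--    including the WHERE clause from the is_incremental() block.
CREATE TABLE imdb.actors_insight_incremental_tmp AS <model query>;

-- 2. Create a new table & copy over existing rows, skipping rows whose
--    unique_key (here assumed to be id) already exists in the staged data.
CREATE TABLE imdb.actors_insight_incremental_new AS imdb.actors_insight_incremental;
INSERT INTO imdb.actors_insight_incremental_new
SELECT * FROM imdb.actors_insight_incremental
WHERE id NOT IN (SELECT id FROM imdb.actors_insight_incremental_tmp);

-- 3. Append the staged rows.
INSERT INTO imdb.actors_insight_incremental_new
SELECT * FROM imdb.actors_insight_incremental_tmp;

-- 4. Swap the new table in place of the old one.
EXCHANGE TABLES imdb.actors_insight_incremental AND imdb.actors_insight_incremental_new;
```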
# Direct Write to CNCH Unique Key Table
## Requirements
1. The `dbt-bytehouse` connector version must be at least `1.7.1`.
2. Set the config `cnch_unique_key` instead of `unique_key`.
3. Set the config `incremental_strategy` to `append`.
4. Append the virtual warehouse id at the end of the dbt model via a SETTINGS expression, for example:
`SETTINGS virtual_warehouse='vw-id', virtual_warehouse_write='vw-id'`
## Example
```
{{
    config(
        materialized='incremental',
        cnch_unique_key='id',
        incremental_strategy='append'
    )
}}

SELECT
    id,
    transaction_date,
    user_id,
    amount,
    status
FROM
    dbt_unique.transactions

{% if is_incremental() %}
WHERE transaction_date >= (SELECT max(transaction_date) FROM dbt_unique.transactions)
{% endif %}

SETTINGS virtual_warehouse='vw-id', virtual_warehouse_write='vw-id'
```
# Project Documentation
`dbt` provides a way to generate documentation for your dbt project and render it as a website.
Create `models/actors_insight_incremental.yml` to generate documentation for our models.
```yaml
version: 2

models:
  - name: actors_insight_incremental
    description: Actor appearance summary based on roles
    columns:
      - name: id
        description: The id of the actor
      - name: name
        description: The name of the actor
      - name: num_movies
        description: The number of movies the actor has appeared in
      - name: avg_rating
        description: Average rating
      - name: updated_at
        description: Latest update time
```
Use `dbt docs generate` to generate the documentation for your models & `dbt docs serve` to serve the documentation in
your local browser on port 8000:
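```commandline
dbt docs generate
dbt docs serve
```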
![Doc](./examples/10_docs.png)
# Local Development
Update the `tests/integration/conftest.py` file to include your connection credentials. To run tests locally, follow
these steps:
```commandline
pip install -r dev_requirements.txt
python -m pytest
```
# Original Author
ByteHouse wants to thank ClickHouse for the original contribution to this connector.
# License
This project is distributed under the terms of the Apache License (Version 2.0).