dbt-clickhouse


Namedbt-clickhouse JSON
Version 1.7.6 PyPI version JSON
download
home_pagehttps://github.com/ClickHouse/dbt-clickhouse
SummaryThe Clickhouse plugin for dbt (data build tool)
upload_time2024-04-13 01:05:21
maintainerNone
docs_urlNone
authorClickHouse Inc.
requires_python>=3.8
licenseMIT
keywords
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            <p align="center">
  <img src="https://raw.githubusercontent.com/ClickHouse/dbt-clickhouse/master/etc/chdbt.png" alt="clickhouse dbt logo" width="300"/>
</p>

# dbt-clickhouse

This plugin ports [dbt](https://getdbt.com) functionality to [Clickhouse](https://clickhouse.tech/).

The plugin uses syntax that requires ClickHouse version 22.1 or newer. We do not test older versions of Clickhouse.  We also do not currently test
Replicated tables or the related `ON CLUSTER` functionality.

## Installation

Use your favorite Python package manager to install the app from PyPI, e.g.

```bash
pip install dbt-clickhouse
```

## Supported features

- [x] Table materialization
- [x] View materialization
- [x] Incremental materialization
- [x] Materialized View materializations (uses the `TO` form of MATERIALIZED VIEW, experimental)
- [x] Seeds
- [x] Sources
- [x] Docs generate
- [x] Tests
- [x] Snapshots
- [x] Most dbt-utils macros (now included in dbt-core)  
- [x] Ephemeral materialization
- [x] Distributed table materialization (experimental)
- [x] Distributed incremental materialization (experimental)
- [x] Contracts

# Usage Notes

## SET Statement Warning
In many environments, using the SET statement to persist a ClickHouse setting across all DBT queries is not reliable
and can cause unexpected failures.  This is particularly true when using HTTP connections through a load balancer that
distributes queries across multiple nodes (such as ClickHouse cloud), although in some circumstances this can also
happen with native ClickHouse connections.  Accordingly, we recommend configuring any required ClickHouse settings in the
"custom_settings" property of the DBT profile as a best practice, instead of relying on a prehook "SET" statement as
has been occasionally suggested.

## Database

The dbt model relation identifier `database.schema.table` is not compatible with Clickhouse because Clickhouse does not support a `schema`.
So we use a simplified approach `schema.table`, where `schema` is the Clickhouse database. Using the `default` database is not recommended.

## Example Profile

Default values are in brackets:

```
your_profile_name:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: [default] # ClickHouse database for dbt models

      # optional
      driver: [http] # http or native.  If not set this will be autodetermined based on port setting
      host: [localhost] 
      port: [8123]  # If not set, defaults to 8123, 8443, 9000, 9440 depending on the secure and driver settings 
      user: [default] # User for all database operations
      password: [<empty string>] # Password for the user
      cluster: [<empty string>] If set, certain DDL/table operations will be executed with the `ON CLUSTER` clause using this cluster. Distributed materializations require this setting to work. See the following ClickHouse Cluster section for more details.
      verify: [True] # Validate TLS certificate if using TLS/SSL
      secure: [False] # Use TLS (native protocol) or HTTPS (http protocol)
      retries: [1] # Number of times to retry a "retriable" database exception (such as a 503 'Service Unavailable' error)
      compression: [<empty string>] Use gzip compression if truthy (http), or compression type for a native connection
      connect_timeout: [10] # Timeout in seconds to establish a connection to ClickHouse
      send_receive_timeout: [300] # Timeout in seconds to receive data from the ClickHouse server
      cluster_mode: [False] # Use specific settings designed to improve operation on Replicated databases (recommended for ClickHouse Cloud)
      use_lw_deletes: [False] Use the strategy `delete+insert` as the default incremental strategy.
      check_exchange: [True] # Validate that clickhouse support the atomic EXCHANGE TABLES command.  (Not needed for most ClickHouse versions)
      local_suffix [_local] # Table suffix of local tables on shards for distributed materializations.
      allow_automatic_deduplication [False] # Enable ClickHouse automatic deduplication for Replicated tables
      custom_settings: [{}] # A dictionary/mapping of custom ClickHouse settings for the connection - default is empty.
      
      # Native (clickhouse-driver) connection settings
      sync_request_timeout: [5] Timeout for server ping
      compress_block_size: [1048576] Compression block size if compression is enabled
      
```

## Model Configuration

| Option                 | Description                                                                                                                                                                                                                                            | Default if any |
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|
| engine                 | The table engine (type of table) to use when creating tables                                                                                                                                                                                           | `MergeTree()`  |
| order_by               | A tuple of column names or arbitrary expressions. This allows you to create a small sparse index that helps find data faster.                                                                                                                          | `tuple()`      |
| partition_by           | A partition is a logical combination of records in a table by a specified criterion. The partition key can be any expression from the table columns.                                                                                                   |                |
| sharding_key           | Sharding key determines the destination server when inserting into distributed engine table.  The sharding key can be random or as an output of a hash function                                                                                        | `rand()`)      |
| primary_key            | Like order_by, a ClickHouse primary key expression.  If not specified, ClickHouse will use the order by expression as the primary key                                                                                                                  |                |
| unique_key             | A tuple of column names that uniquely identify rows.  Used with incremental models for updates.                                                                                                                                                        |                |
| inserts_only           | If set to True for an incremental model, incremental updates will be inserted directly to the target table without creating intermediate table. It has been deprecated in favor of the `append` incremental `strategy`, which operates in the same way |                |
| incremental_strategy   | Incremental model update strategy of `delete+insert` or `append`.  See the following Incremental Model Strategies                                                                                                                                      | `default`      |
| incremental_predicates | Additional conditions to be applied to the incremental materialization (only applied to `delete+insert` strategy                                                                                                                                       |                |
| settings               | A map/dictionary of "TABLE" settings to be used to DDL statements like 'CREATE TABLE' with this model                                                                                                                                                  |                |
| query_settings         | A map/dictionary of ClickHouse user level settings to be used with `INSERT` or `DELETE` statements in conjunction with this model                                                                                                                      |                |
| ttl                    | A TTL expression to be used with the table.  The TTL expression is a string that can be used to specify the TTL for the table.                                                                                                                         |                |

## ClickHouse Cluster 

The `cluster` setting in profile enables dbt-clickhouse to run against a ClickHouse cluster.

### Effective Scope


if `cluster` is set in profile, `on_cluster_clause` now will return cluster info for:
- Database creation
- View materialization
- Distributed materializations
- Models with Replicated engines

table and incremental materializations with non-replicated engine will not be affected by `cluster` setting (model would be created on the connected node only).

### Compatibility


If a model has been created without a `cluster` setting, dbt-clickhouse will detect the situation and run all DDL/DML without `on cluster` clause for this model.


## A Note on Model Settings

ClickHouse has several types/levels of "settings".  In the model configuration above, two types of these are configurable.  `settings` means the `SETTINGS`
clause used in `CREATE TABLE/VIEW` types of DDL statements, so this is generally settings that are specific to the specific ClickHouse table engine.  The new
`query_settings` is use to add a `SETTINGS` clause to the `INSERT` and `DELETE` queries used for model materialization (including incremental materializations).
There are hundreds of ClickHouse settings, and it's not always clear which is a "table" setting and which is a "user" setting (although the latter are generally
available in the `system.settings` table.)  In general the defaults are recommended, and any use of these properties should be carefully researched and tested.


## Known Limitations

* Ephemeral models/CTEs don't work if placed before the "INSERT INTO" in a ClickHouse insert statement, see https://github.com/ClickHouse/ClickHouse/issues/30323.  This
should not affect most models, but care should be taken where an ephemeral model is placed in model definitions and other SQL statements.

## Incremental Model Strategies

As of version 1.3.2, dbt-clickhouse supports three incremental model strategies.

### The Default (Legacy) Strategy  

Historically ClickHouse has had only limited support for updates and deletes, in the form of asynchronous "mutations."  To emulate expected dbt behavior,
dbt-clickhouse by default creates a new temporary table containing all unaffected (not deleted, not changed) "old" records, plus any new or updated records,
and then swaps or exchanges this temporary table with the existing incremental model relation.  This is the only strategy that preserves the original relation if something
goes wrong before the operation completes; however, since it involves a full copy of the original table, it can be quite expensive and slow to execute.

### The Delete+Insert Strategy

ClickHouse added "lightweight deletes" as an experimental feature in version 22.8.  Lightweight deletes are significantly faster than ALTER TABLE ... DELETE
operations, because they don't require rewriting ClickHouse data parts.  The incremental strategy `delete+insert` utilizes lightweight deletes to implement
incremental materializations that perform significantly better than the "legacy" strategy.  However, there are important caveats to using this strategy:
- Lightweight deletes must be enabled on your ClickHouse server using the setting `allow_experimental_lightweight_delete=1` or you 
must set `use_lw_deletes=true` in your profile (which will enable that setting for your dbt sessions)
- Lightweight deletes are now production ready, but there may be performance and other problems on ClickHouse versions earlier than 23.3.
- This strategy operates directly on the affected table/relation (with creating any intermediate or temporary tables), so if there is an issue during the operation, the
data in the incremental model is likely to be in an invalid state
- When using lightweight deletes, dbt-clickhouse enabled the setting `allow_nondeterministic_mutations`.  In some very rare cases using non-deterministic incremental_predicates
this could result in a race condition for the updated/deleted items (and related log messages in the ClickHouse logs).  To ensure consistent results the
incremental predicates should only include sub-queries on data that will not be modified during the incremental materialization.

### The Append Strategy

This strategy replaces the `inserts_only` setting in previous versions of dbt-clickhouse.  This approach simply appends new rows to the existing relation.
As a result duplicate rows are not eliminated, and there is no temporary or intermediate table.  It is the fastest approach if duplicates are either permitted
in the data or excluded by the incremental query WHERE clause/filter.

## Additional ClickHouse Macros

### Model Materialization Utility Macros

The following macros are included to facilitate creating ClickHouse specific tables and views:
- `engine_clause` -- Uses the `engine` model configuration property to assign a ClickHouse table engine.  dbt-clickhouse uses the `MergeTree` engine by default.
- `partition_cols` -- Uses the `partition_by` model configuration property to assign a ClickHouse partition key.  No partition key is assigned by default.
- `order_cols` -- Uses the `order_by` model configuration to assign a ClickHouse order by/sorting key.  If not specified ClickHouse will use an empty tuple() and the table will be unsorted
- `primary_key_clause` -- Uses the `primary_key` model configuration property to assign a ClickHouse primary key.  By default, primary key is set and ClickHouse will use the order by clause as the primary key. 
- `on_cluster_clause` -- Uses the `cluster` profile property to add an `ON CLUSTER` clause to certain dbt-operations: distributed materializations, views creation, database creation.
- `ttl_config` -- Uses the `ttl` model configuration property to assign a ClickHouse table TTL expression.  No TTL is assigned by default.

### s3Source Helper Macro

The `s3source` macro simplifies the process of selecting ClickHouse data directly from S3 using the ClickHouse S3 table function.  It works by
populating the S3 table function parameters from a named configuration dictionary (the name of the dictionary must end in `s3`).  The macro
first looks for the dictionary in the profile `vars`, and then in the model configuration.  The dictionary can contain any of the following
keys used to populate the parameters of the S3 table function:

| Argument Name         | Description                                                                                                                                                                                  |
|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| bucket                | The bucket base url, such as `https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi`. `https://` is assumed if no protocol is provided.                                         |
| path                  | The S3 path to use for the table query, such as `/trips_4.gz`.  S3 wildcards are supported.                                                                                                  |
| fmt                   | The expected ClickHouse input format (such as `TSV` or `CSVWithNames`) of the referenced S3 objects.                                                                                         |
| structure             | The column structure of the data in bucket, as a list of name/datatype pairs, such as `['id UInt32', 'date DateTime', 'value String']`  If not provided ClickHouse will infer the structure. |
| aws_access_key_id     | The S3 access key id.                                                                                                                                                                        |
| aws_secret_access_key | The S3 secret key.                                                                                                                                                                           |
| compression           | The compression method used with the S3 objects.  If not provided ClickHouse will attempt to determine compression based on the file name.                                                   |

See the [S3 test file](https://github.com/ClickHouse/dbt-clickhouse/blob/main/tests/integration/adapter/clickhouse/test_clickhouse_s3.py) for examples of how to use this macro.

# Contracts and Constraints

Only exact column type contracts are supported.  For example, a contract with a UInt32 column type will fail if the model returns a UInt64 or other integer type.
ClickHouse also support _only_ `CHECK` constraints on the entire table/model.  Primary key, foreign key, unique, and column level CHECK constraints are not supported.
(See ClickHouse documentation on primary/order by keys.)

# Materialized Views (Experimental)
A `materialized_view` materialization should be a `SELECT` from an existing (source) table.  The adapter will create a target table with the model name
and a ClickHouse MATERIALIZED VIEW with the name `<model_name>_mv`.  Unlike PostgreSQL, a ClickHouse materialized view is not "static" (and has
no corresponding REFRESH operation).  Instead, it acts as an "insert trigger", and will insert new rows into the target table using the defined `SELECT`
"transformation" in the view definition on rows inserted into the source table.  See the [test file]
(https://github.com/ClickHouse/dbt-clickhouse/blob/main/tests/integration/adapter/materialized_view/test_materialized_view.py)  for an introductory example
of how to use this functionality.

# Dictionary materializations (experimental)
See the tests in https://github.com/ClickHouse/dbt-clickhouse/blob/main/tests/integration/adapter/dictionary/test_dictionary.py for examples of how to
implement materializations for ClickHouse dictionaries 

# Distributed materializations

Notes:

- dbt-clickhouse queries now automatically include the setting `insert_distributed_sync = 1` in order to ensure that downstream incremental
materialization operations execute correctly.  This could cause some distributed table inserts to run more slowly than expected.

## Distributed table materialization

Distributed table created with following steps:
1. Creates temp view with sql query to get right structure
2. Create empty local tables based on view 
3. Create distributed table based on local tables. 
4. Data inserts into distributed table, so it is distributed across shards without duplicating.

### Distributed table model example
```sql
{{
    config(
        materialized='distributed_table',
        order_by='id, created_at',
        sharding_key='cityHash64(id)',
        engine='ReplacingMergeTree'
    )
}}

select id, created_at, item from {{ source('db', 'table') }}
```

### Generated migrations

```sql
CREATE TABLE db.table_local on cluster cluster
(
    `id` UInt64,
    `created_at` DateTime,
    `item` String
)
ENGINE = ReplacingMergeTree
ORDER BY (id, created_at)
SETTINGS index_granularity = 8192;


CREATE TABLE db.table on cluster cluster
(
    `id` UInt64,
    `created_at` DateTime,
    `item` String
)
ENGINE = Distributed('cluster', 'db', 'table_local', cityHash64(id));
```

## Distributed incremental materialization

Incremental model based on the same idea as distributed table, the main difficulty is to process all incremental strategies correctly.

1. _The Append Strategy_ just insert data into distributed table.
2. _The Delete+Insert_ Strategy creates distributed temp table to work with all data on every shard.
3. _The Default (Legacy) Strategy_ creates distributed temp and intermediate tables for the same reason.

Only shard tables are replacing, because distributed table does not keep data. 
The distributed table reloads only when the full_refresh mode is enabled or the table structure may have changed.

### Distributed incremental model example
```sql
{{
    config(
        materialized='distributed_incremental',
        engine='MergeTree',
        incremental_strategy='append',
        unique_key='id,created_at'
    )
}}

select id, created_at, item from {{ source('db', 'table') }}
```

### Generated migrations

```sql
CREATE TABLE db.table_local on cluster cluster
(
    `id` UInt64,
    `created_at` DateTime,
    `item` String
)
ENGINE = MergeTree
SETTINGS index_granularity = 8192;


CREATE TABLE db.table on cluster cluster
(
    `id` UInt64,
    `created_at` DateTime,
    `item` String
)
ENGINE = Distributed('cluster', 'db', 'table_local', cityHash64(id));
```

# Running Tests

This adapter passes all of dbt basic tests as presented in dbt's official docs: https://docs.getdbt.com/docs/contributing/testing-a-new-adapter#testing-your-adapter.
Use `pytest tests` to run tests.

You can customize the test environment via environment variables. We recommend doing so with the pytest `pytest-dotenv` plugin combined with root level `test.env`
configuration file (this file should not be checked into git).  The following environment variables are recognized:

1. DBT_CH_TEST_HOST - Default=`localhost`
2. DBT_CH_TEST_USER - your ClickHouse username. Default=`default`
3. DBT_CH_TEST_PASSWORD - your ClickHouse password. Default=''
4. DBT_CH_TEST_PORT - ClickHouse client port. Default=8123 (The default is automatically changed to the correct port if DBT_CH_TEST_USE_DOCKER is enabled)
6. DBT_CH_TEST_DB_ENGINE - Database engine used to create schemas.  Defaults to '' (server default)
7. DBT_CH_TEST_USE_DOCKER - Set to True to run clickhouse-server docker image (see tests/docker-compose.yml).  Requires docker-compose. Default=False
8. DBT_CH_TEST_CH_VERSION - ClickHouse docker image to use.  Defaults to `latest`
9. DBT_CH_TEST_INCLUDE_S3 - Include S3 tests.  Default=False since these are currently dependent on a specific ClickHouse S3 bucket/test dataset
10. DBT_CH_TEST_CLUSTER_MODE - Use the profile value
11. DBT_CH_TEST_CLUSTER - ClickHouse cluster name, if DBT_CH_TEST_USE_DOCKER set to true, only `test_replica` and `test_shard` is valid (see tests/test_config.xml for cluster settings)


## Original Author
ClickHouse wants to thank @[silentsokolov](https://github.com/silentsokolov) for creating this connector and for their valuable contributions.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ClickHouse/dbt-clickhouse",
    "name": "dbt-clickhouse",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "ClickHouse Inc.",
    "author_email": "guy@clickhouse.com",
    "download_url": "https://files.pythonhosted.org/packages/95/51/f8ffc3a83de0e47208deb161005370a79ceba57d5b7651aac7e2b0dc436d/dbt-clickhouse-1.7.6.tar.gz",
    "platform": "any",
    "description": "<p align=\"center\">\n  <img src=\"https://raw.githubusercontent.com/ClickHouse/dbt-clickhouse/master/etc/chdbt.png\" alt=\"clickhouse dbt logo\" width=\"300\"/>\n</p>\n\n# dbt-clickhouse\n\nThis plugin ports [dbt](https://getdbt.com) functionality to [Clickhouse](https://clickhouse.tech/).\n\nThe plugin uses syntax that requires ClickHouse version 22.1 or newer. We do not test older versions of Clickhouse.  We also do not currently test\nReplicated tables or the related `ON CLUSTER` functionality.\n\n## Installation\n\nUse your favorite Python package manager to install the app from PyPI, e.g.\n\n```bash\npip install dbt-clickhouse\n```\n\n## Supported features\n\n- [x] Table materialization\n- [x] View materialization\n- [x] Incremental materialization\n- [x] Materialized View materializations (uses the `TO` form of MATERIALIZED VIEW, experimental)\n- [x] Seeds\n- [x] Sources\n- [x] Docs generate\n- [x] Tests\n- [x] Snapshots\n- [x] Most dbt-utils macros (now included in dbt-core)  \n- [x] Ephemeral materialization\n- [x] Distributed table materialization (experimental)\n- [x] Distributed incremental materialization (experimental)\n- [x] Contracts\n\n# Usage Notes\n\n## SET Statement Warning\nIn many environments, using the SET statement to persist a ClickHouse setting across all DBT queries is not reliable\nand can cause unexpected failures.  This is particularly true when using HTTP connections through a load balancer that\ndistributes queries across multiple nodes (such as ClickHouse cloud), although in some circumstances this can also\nhappen with native ClickHouse connections.  Accordingly, we recommend configuring any required ClickHouse settings in the\n\"custom_settings\" property of the DBT profile as a best practice, instead of relying on a prehook \"SET\" statement as\nhas been occasionally suggested.\n\n## Database\n\nThe dbt model relation identifier `database.schema.table` is not compatible with Clickhouse because Clickhouse does not support a `schema`.\nSo we use a simplified approach `schema.table`, where `schema` is the Clickhouse database. Using the `default` database is not recommended.\n\n## Example Profile\n\nDefault values are in brackets:\n\n```\nyour_profile_name:\n  target: dev\n  outputs:\n    dev:\n      type: clickhouse\n      schema: [default] # ClickHouse database for dbt models\n\n      # optional\n      driver: [http] # http or native.  If not set this will be autodetermined based on port setting\n      host: [localhost] \n      port: [8123]  # If not set, defaults to 8123, 8443, 9000, 9440 depending on the secure and driver settings \n      user: [default] # User for all database operations\n      password: [<empty string>] # Password for the user\n      cluster: [<empty string>] If set, certain DDL/table operations will be executed with the `ON CLUSTER` clause using this cluster. Distributed materializations require this setting to work. See the following ClickHouse Cluster section for more details.\n      verify: [True] # Validate TLS certificate if using TLS/SSL\n      secure: [False] # Use TLS (native protocol) or HTTPS (http protocol)\n      retries: [1] # Number of times to retry a \"retriable\" database exception (such as a 503 'Service Unavailable' error)\n      compression: [<empty string>] Use gzip compression if truthy (http), or compression type for a native connection\n      connect_timeout: [10] # Timeout in seconds to establish a connection to ClickHouse\n      send_receive_timeout: [300] # Timeout in seconds to receive data from the ClickHouse server\n      cluster_mode: [False] # Use specific settings designed to improve operation on Replicated databases (recommended for ClickHouse Cloud)\n      use_lw_deletes: [False] Use the strategy `delete+insert` as the default incremental strategy.\n      check_exchange: [True] # Validate that clickhouse support the atomic EXCHANGE TABLES command.  (Not needed for most ClickHouse versions)\n      local_suffix [_local] # Table suffix of local tables on shards for distributed materializations.\n      allow_automatic_deduplication [False] # Enable ClickHouse automatic deduplication for Replicated tables\n      custom_settings: [{}] # A dictionary/mapping of custom ClickHouse settings for the connection - default is empty.\n      \n      # Native (clickhouse-driver) connection settings\n      sync_request_timeout: [5] Timeout for server ping\n      compress_block_size: [1048576] Compression block size if compression is enabled\n      \n```\n\n## Model Configuration\n\n| Option                 | Description                                                                                                                                                                                                                                            | Default if any |\n|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|\n| engine                 | The table engine (type of table) to use when creating tables                                                                                                                                                                                           | `MergeTree()`  |\n| order_by               | A tuple of column names or arbitrary expressions. This allows you to create a small sparse index that helps find data faster.                                                                                                                          | `tuple()`      |\n| partition_by           | A partition is a logical combination of records in a table by a specified criterion. The partition key can be any expression from the table columns.                                                                                                   |                |\n| sharding_key           | Sharding key determines the destination server when inserting into distributed engine table.  The sharding key can be random or as an output of a hash function                                                                                        | `rand()`)      |\n| primary_key            | Like order_by, a ClickHouse primary key expression.  If not specified, ClickHouse will use the order by expression as the primary key                                                                                                                  |                |\n| unique_key             | A tuple of column names that uniquely identify rows.  Used with incremental models for updates.                                                                                                                                                        |                |\n| inserts_only           | If set to True for an incremental model, incremental updates will be inserted directly to the target table without creating intermediate table. It has been deprecated in favor of the `append` incremental `strategy`, which operates in the same way |                |\n| incremental_strategy   | Incremental model update strategy of `delete+insert` or `append`.  See the following Incremental Model Strategies                                                                                                                                      | `default`      |\n| incremental_predicates | Additional conditions to be applied to the incremental materialization (only applied to `delete+insert` strategy                                                                                                                                       |                |\n| settings               | A map/dictionary of \"TABLE\" settings to be used to DDL statements like 'CREATE TABLE' with this model                                                                                                                                                  |                |\n| query_settings         | A map/dictionary of ClickHouse user level settings to be used with `INSERT` or `DELETE` statements in conjunction with this model                                                                                                                      |                |\n| ttl                    | A TTL expression to be used with the table.  The TTL expression is a string that can be used to specify the TTL for the table.                                                                                                                         |                |\n\n## ClickHouse Cluster \n\nThe `cluster` setting in profile enables dbt-clickhouse to run against a ClickHouse cluster.\n\n### Effective Scope\n\n\nif `cluster` is set in profile, `on_cluster_clause` now will return cluster info for:\n- Database creation\n- View materialization\n- Distributed materializations\n- Models with Replicated engines\n\ntable and incremental materializations with non-replicated engine will not be affected by `cluster` setting (model would be created on the connected node only).\n\n### Compatibility\n\n\nIf a model has been created without a `cluster` setting, dbt-clickhouse will detect the situation and run all DDL/DML without `on cluster` clause for this model.\n\n\n## A Note on Model Settings\n\nClickHouse has several types/levels of \"settings\".  In the model configuration above, two types of these are configurable.  `settings` means the `SETTINGS`\nclause used in `CREATE TABLE/VIEW` types of DDL statements, so this is generally settings that are specific to the specific ClickHouse table engine.  The new\n`query_settings` is use to add a `SETTINGS` clause to the `INSERT` and `DELETE` queries used for model materialization (including incremental materializations).\nThere are hundreds of ClickHouse settings, and it's not always clear which is a \"table\" setting and which is a \"user\" setting (although the latter are generally\navailable in the `system.settings` table.)  In general the defaults are recommended, and any use of these properties should be carefully researched and tested.\n\n\n## Known Limitations\n\n* Ephemeral models/CTEs don't work if placed before the \"INSERT INTO\" in a ClickHouse insert statement, see https://github.com/ClickHouse/ClickHouse/issues/30323.  This\nshould not affect most models, but care should be taken where an ephemeral model is placed in model definitions and other SQL statements.\n\n## Incremental Model Strategies\n\nAs of version 1.3.2, dbt-clickhouse supports three incremental model strategies.\n\n### The Default (Legacy) Strategy  \n\nHistorically ClickHouse has had only limited support for updates and deletes, in the form of asynchronous \"mutations.\"  To emulate expected dbt behavior,\ndbt-clickhouse by default creates a new temporary table containing all unaffected (not deleted, not changed) \"old\" records, plus any new or updated records,\nand then swaps or exchanges this temporary table with the existing incremental model relation.  This is the only strategy that preserves the original relation if something\ngoes wrong before the operation completes; however, since it involves a full copy of the original table, it can be quite expensive and slow to execute.\n\n### The Delete+Insert Strategy\n\nClickHouse added \"lightweight deletes\" as an experimental feature in version 22.8.  Lightweight deletes are significantly faster than ALTER TABLE ... DELETE\noperations, because they don't require rewriting ClickHouse data parts.  The incremental strategy `delete+insert` utilizes lightweight deletes to implement\nincremental materializations that perform significantly better than the \"legacy\" strategy.  However, there are important caveats to using this strategy:\n- Lightweight deletes must be enabled on your ClickHouse server using the setting `allow_experimental_lightweight_delete=1` or you \nmust set `use_lw_deletes=true` in your profile (which will enable that setting for your dbt sessions)\n- Lightweight deletes are now production ready, but there may be performance and other problems on ClickHouse versions earlier than 23.3.\n- This strategy operates directly on the affected table/relation (with creating any intermediate or temporary tables), so if there is an issue during the operation, the\ndata in the incremental model is likely to be in an invalid state\n- When using lightweight deletes, dbt-clickhouse enabled the setting `allow_nondeterministic_mutations`.  In some very rare cases using non-deterministic incremental_predicates\nthis could result in a race condition for the updated/deleted items (and related log messages in the ClickHouse logs).  To ensure consistent results the\nincremental predicates should only include sub-queries on data that will not be modified during the incremental materialization.\n\n### The Append Strategy\n\nThis strategy replaces the `inserts_only` setting in previous versions of dbt-clickhouse.  This approach simply appends new rows to the existing relation.\nAs a result duplicate rows are not eliminated, and there is no temporary or intermediate table.  It is the fastest approach if duplicates are either permitted\nin the data or excluded by the incremental query WHERE clause/filter.\n\n## Additional ClickHouse Macros\n\n### Model Materialization Utility Macros\n\nThe following macros are included to facilitate creating ClickHouse specific tables and views:\n- `engine_clause` -- Uses the `engine` model configuration property to assign a ClickHouse table engine.  dbt-clickhouse uses the `MergeTree` engine by default.\n- `partition_cols` -- Uses the `partition_by` model configuration property to assign a ClickHouse partition key.  No partition key is assigned by default.\n- `order_cols` -- Uses the `order_by` model configuration to assign a ClickHouse order by/sorting key.  If not specified ClickHouse will use an empty tuple() and the table will be unsorted\n- `primary_key_clause` -- Uses the `primary_key` model configuration property to assign a ClickHouse primary key.  By default, primary key is set and ClickHouse will use the order by clause as the primary key. \n- `on_cluster_clause` -- Uses the `cluster` profile property to add an `ON CLUSTER` clause to certain dbt-operations: distributed materializations, views creation, database creation.\n- `ttl_config` -- Uses the `ttl` model configuration property to assign a ClickHouse table TTL expression.  No TTL is assigned by default.\n\n### s3Source Helper Macro\n\nThe `s3source` macro simplifies the process of selecting ClickHouse data directly from S3 using the ClickHouse S3 table function.  It works by\npopulating the S3 table function parameters from a named configuration dictionary (the name of the dictionary must end in `s3`).  The macro\nfirst looks for the dictionary in the profile `vars`, and then in the model configuration.  The dictionary can contain any of the following\nkeys used to populate the parameters of the S3 table function:\n\n| Argument Name         | Description                                                                                                                                                                                  |\n|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| bucket                | The bucket base url, such as `https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi`. `https://` is assumed if no protocol is provided.                                         |\n| path                  | The S3 path to use for the table query, such as `/trips_4.gz`.  S3 wildcards are supported.                                                                                                  |\n| fmt                   | The expected ClickHouse input format (such as `TSV` or `CSVWithNames`) of the referenced S3 objects.                                                                                         |\n| structure             | The column structure of the data in bucket, as a list of name/datatype pairs, such as `['id UInt32', 'date DateTime', 'value String']`  If not provided ClickHouse will infer the structure. |\n| aws_access_key_id     | The S3 access key id.                                                                                                                                                                        |\n| aws_secret_access_key | The S3 secret key.                                                                                                                                                                           |\n| compression           | The compression method used with the S3 objects.  If not provided ClickHouse will attempt to determine compression based on the file name.                                                   |\n\nSee the [S3 test file](https://github.com/ClickHouse/dbt-clickhouse/blob/main/tests/integration/adapter/clickhouse/test_clickhouse_s3.py) for examples of how to use this macro.\n\n# Contracts and Constraints\n\nOnly exact column type contracts are supported.  For example, a contract with a UInt32 column type will fail if the model returns a UInt64 or other integer type.\nClickHouse also support _only_ `CHECK` constraints on the entire table/model.  Primary key, foreign key, unique, and column level CHECK constraints are not supported.\n(See ClickHouse documentation on primary/order by keys.)\n\n# Materialized Views (Experimental)\nA `materialized_view` materialization should be a `SELECT` from an existing (source) table.  The adapter will create a target table with the model name\nand a ClickHouse MATERIALIZED VIEW with the name `<model_name>_mv`.  Unlike PostgreSQL, a ClickHouse materialized view is not \"static\" (and has\nno corresponding REFRESH operation).  Instead, it acts as an \"insert trigger\", and will insert new rows into the target table using the defined `SELECT`\n\"transformation\" in the view definition on rows inserted into the source table.  See the [test file]\n(https://github.com/ClickHouse/dbt-clickhouse/blob/main/tests/integration/adapter/materialized_view/test_materialized_view.py)  for an introductory example\nof how to use this functionality.\n\n# Dictionary materializations (experimental)\nSee the tests in https://github.com/ClickHouse/dbt-clickhouse/blob/main/tests/integration/adapter/dictionary/test_dictionary.py for examples of how to\nimplement materializations for ClickHouse dictionaries \n\n# Distributed materializations\n\nNotes:\n\n- dbt-clickhouse queries now automatically include the setting `insert_distributed_sync = 1` in order to ensure that downstream incremental\nmaterialization operations execute correctly.  This could cause some distributed table inserts to run more slowly than expected.\n\n## Distributed table materialization\n\nDistributed table created with following steps:\n1. Creates temp view with sql query to get right structure\n2. Create empty local tables based on view \n3. Create distributed table based on local tables. \n4. Data inserts into distributed table, so it is distributed across shards without duplicating.\n\n### Distributed table model example\n```sql\n{{\n    config(\n        materialized='distributed_table',\n        order_by='id, created_at',\n        sharding_key='cityHash64(id)',\n        engine='ReplacingMergeTree'\n    )\n}}\n\nselect id, created_at, item from {{ source('db', 'table') }}\n```\n\n### Generated migrations\n\n```sql\nCREATE TABLE db.table_local on cluster cluster\n(\n    `id` UInt64,\n    `created_at` DateTime,\n    `item` String\n)\nENGINE = ReplacingMergeTree\nORDER BY (id, created_at)\nSETTINGS index_granularity = 8192;\n\n\nCREATE TABLE db.table on cluster cluster\n(\n    `id` UInt64,\n    `created_at` DateTime,\n    `item` String\n)\nENGINE = Distributed('cluster', 'db', 'table_local', cityHash64(id));\n```\n\n## Distributed incremental materialization\n\nIncremental model based on the same idea as distributed table, the main difficulty is to process all incremental strategies correctly.\n\n1. _The Append Strategy_ just insert data into distributed table.\n2. _The Delete+Insert_ Strategy creates distributed temp table to work with all data on every shard.\n3. _The Default (Legacy) Strategy_ creates distributed temp and intermediate tables for the same reason.\n\nOnly shard tables are replacing, because distributed table does not keep data. \nThe distributed table reloads only when the full_refresh mode is enabled or the table structure may have changed.\n\n### Distributed incremental model example\n```sql\n{{\n    config(\n        materialized='distributed_incremental',\n        engine='MergeTree',\n        incremental_strategy='append',\n        unique_key='id,created_at'\n    )\n}}\n\nselect id, created_at, item from {{ source('db', 'table') }}\n```\n\n### Generated migrations\n\n```sql\nCREATE TABLE db.table_local on cluster cluster\n(\n    `id` UInt64,\n    `created_at` DateTime,\n    `item` String\n)\nENGINE = MergeTree\nSETTINGS index_granularity = 8192;\n\n\nCREATE TABLE db.table on cluster cluster\n(\n    `id` UInt64,\n    `created_at` DateTime,\n    `item` String\n)\nENGINE = Distributed('cluster', 'db', 'table_local', cityHash64(id));\n```\n\n# Running Tests\n\nThis adapter passes all of dbt basic tests as presented in dbt's official docs: https://docs.getdbt.com/docs/contributing/testing-a-new-adapter#testing-your-adapter.\nUse `pytest tests` to run tests.\n\nYou can customize the test environment via environment variables. We recommend doing so with the pytest `pytest-dotenv` plugin combined with root level `test.env`\nconfiguration file (this file should not be checked into git).  The following environment variables are recognized:\n\n1. DBT_CH_TEST_HOST - Default=`localhost`\n2. DBT_CH_TEST_USER - your ClickHouse username. Default=`default`\n3. DBT_CH_TEST_PASSWORD - your ClickHouse password. Default=''\n4. DBT_CH_TEST_PORT - ClickHouse client port. Default=8123 (The default is automatically changed to the correct port if DBT_CH_TEST_USE_DOCKER is enabled)\n6. DBT_CH_TEST_DB_ENGINE - Database engine used to create schemas.  Defaults to '' (server default)\n7. DBT_CH_TEST_USE_DOCKER - Set to True to run clickhouse-server docker image (see tests/docker-compose.yml).  Requires docker-compose. Default=False\n8. DBT_CH_TEST_CH_VERSION - ClickHouse docker image to use.  Defaults to `latest`\n9. DBT_CH_TEST_INCLUDE_S3 - Include S3 tests.  Default=False since these are currently dependent on a specific ClickHouse S3 bucket/test dataset\n10. DBT_CH_TEST_CLUSTER_MODE - Use the profile value\n11. DBT_CH_TEST_CLUSTER - ClickHouse cluster name, if DBT_CH_TEST_USE_DOCKER set to true, only `test_replica` and `test_shard` is valid (see tests/test_config.xml for cluster settings)\n\n\n## Original Author\nClickHouse wants to thank @[silentsokolov](https://github.com/silentsokolov) for creating this connector and for their valuable contributions.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "The Clickhouse plugin for dbt (data build tool)",
    "version": "1.7.6",
    "project_urls": {
        "Homepage": "https://github.com/ClickHouse/dbt-clickhouse"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "48efcfbedb7c13a06b4797051cae63c98c71d8d8a1480d47faf0f06b305a485a",
                "md5": "7cd3edcf235883cffda2fbe370d76b1a",
                "sha256": "3fa83670db860e8c6b29c7ec623f8537182e03ac6216eafa7ccfd6908e53157d"
            },
            "downloads": -1,
            "filename": "dbt_clickhouse-1.7.6-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7cd3edcf235883cffda2fbe370d76b1a",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=3.8",
            "size": 59901,
            "upload_time": "2024-04-13T01:05:18",
            "upload_time_iso_8601": "2024-04-13T01:05:18.768197Z",
            "url": "https://files.pythonhosted.org/packages/48/ef/cfbedb7c13a06b4797051cae63c98c71d8d8a1480d47faf0f06b305a485a/dbt_clickhouse-1.7.6-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9551f8ffc3a83de0e47208deb161005370a79ceba57d5b7651aac7e2b0dc436d",
                "md5": "bf3f0cf8311f0a0814f1a0fce28376e0",
                "sha256": "28c6923eaf6144860ed208bf6409e167a198aa63522abcdbb82fbdbeabf1da10"
            },
            "downloads": -1,
            "filename": "dbt-clickhouse-1.7.6.tar.gz",
            "has_sig": false,
            "md5_digest": "bf3f0cf8311f0a0814f1a0fce28376e0",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 52414,
            "upload_time": "2024-04-13T01:05:21",
            "upload_time_iso_8601": "2024-04-13T01:05:21.403059Z",
            "url": "https://files.pythonhosted.org/packages/95/51/f8ffc3a83de0e47208deb161005370a79ceba57d5b7651aac7e2b0dc436d/dbt-clickhouse-1.7.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-13 01:05:21",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ClickHouse",
    "github_project": "dbt-clickhouse",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "dbt-clickhouse"
}
        
Elapsed time: 0.22332s