Support Open Source ClickHouse Deployments #1496
base: devel
Signed-off-by: Marcel Coetzee <[email protected]>
✅ Deploy Preview for dlt-hub-docs ready!
…onfiguration setting~ Signed-off-by: Marcel Coetzee <[email protected]>
Code itself looks good to me. Just two comments regarding test coverage and the use of sentinel tables.
@@ -104,7 +111,8 @@ Data is loaded into ClickHouse using the most efficient method depending on the

 `Clickhouse` does not support multiple datasets in one database, dlt relies on datasets to exist for multiple reasons.
 To make `clickhouse` work with `dlt`, tables generated by `dlt` in your `clickhouse` database will have their name prefixed with the dataset name separated by
-the configurable `dataset_table_separator`. Additionally, a special sentinel table that does not contain any data will be created, so dlt knows which virtual datasets already exist in a
+the configurable `dataset_table_separator`.
+Additionally, a special sentinel table that doesn't contain any data will be created, so dlt knows which virtual datasets already exist in a
Perhaps out of scope for this PR, but have you considered using pipeline state instead of a sentinel table?
That is a really good suggestion!
I've dropped the sentinel table workaround for more canonical usage of dlt:
dlt/dlt/destinations/impl/clickhouse/clickhouse.py
Lines 310 to 345 in 409487c
    def has_dataset(self) -> bool:
        try:
            container = Container()
            state_ctx = container[LoadPackageStateInjectableContext]
            datasets = cast(List[str], state_ctx.state.get("datasets", []))
            return self.sql_client.dataset_name in datasets
        except KeyError:
            return False

    def create_dataset(self) -> None:
        with contextlib.suppress(KeyError):
            container = Container()
            state_ctx = container[LoadPackageStateInjectableContext]
            datasets = cast(List[str], state_ctx.state.get("datasets", []))
            if self.sql_client.dataset_name not in datasets:
                datasets.append(self.sql_client.dataset_name)
                state: TLoadPackageState = state_ctx.state
                state["datasets"] = datasets  # type: ignore[typeddict-unknown-key]
                commit_load_package_state()

    def drop_dataset(self) -> None:
        with contextlib.suppress(KeyError):
            container = Container()
            state_ctx = container[LoadPackageStateInjectableContext]
            datasets = cast(List[str], state_ctx.state.get("datasets", []))
            if self.sql_client.dataset_name in datasets:
                datasets.remove(self.sql_client.dataset_name)
                state: TLoadPackageState = state_ctx.state
                state["datasets"] = datasets  # type: ignore[typeddict-unknown-key]
                commit_load_package_state()

        to_drop_results = self.sql_client._list_tables()
        for table in to_drop_results:
            self.sql_client.execute_sql(
                f"""DROP TABLE {self.sql_client.catalog_name()}.{self.capabilities.escape_identifier(table)} SYNC"""
            )
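For readers skimming the thread, the idea behind the snippet above can be sketched without any dlt internals. This is only an illustration, assuming a plain dictionary as a stand-in for the load package state; the `DatasetRegistry` class and its constructor are hypothetical and are not part of dlt.

```python
from typing import Any, Dict, List


class DatasetRegistry:
    """Tracks virtual dataset names in a state mapping.

    Hypothetical stand-in: `state` plays the role of the load package state.
    """

    def __init__(self, state: Dict[str, Any], dataset_name: str) -> None:
        self.state = state
        self.dataset_name = dataset_name

    def has_dataset(self) -> bool:
        # A dataset "exists" if its name is recorded in the state.
        return self.dataset_name in self.state.get("datasets", [])

    def create_dataset(self) -> None:
        datasets: List[str] = list(self.state.get("datasets", []))
        if self.dataset_name not in datasets:
            datasets.append(self.dataset_name)
            self.state["datasets"] = datasets

    def drop_dataset(self) -> None:
        datasets: List[str] = list(self.state.get("datasets", []))
        if self.dataset_name in datasets:
            datasets.remove(self.dataset_name)
            self.state["datasets"] = datasets


state: Dict[str, Any] = {}
registry = DatasetRegistry(state, "analytics")
registry.create_dataset()
registry.create_dataset()  # second call is a no-op
```

Compared with a sentinel table, this keeps the bookkeeping out of the ClickHouse database itself; the real implementation additionally commits the state and drops the prefixed tables.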
else:
    # Non annotated resource needs to default to detected installation
    # type, i.e. cloud or self-managed.
    # CI runs on CH cloud, so will be `SharedMergeTree`.
Should we run CI both for ClickHouse cloud and ClickHouse local, like Postgres?
Seems like we don't have proper test coverage without local (e.g. for the Log
table engines).
- Added ClickHouse OSS as a local destination test in `.github/workflows/test_destination_clickhouse.yml`, so CI also runs local tests.
- Removed the log engine family for now. I'm not even sure why I brought it into this ticket. If users ask for it, it will require more work than what can be done here.
# Conflicts:
#	dlt/destinations/impl/clickhouse/clickhouse.py
#	dlt/destinations/impl/clickhouse/sql_client.py
#	tests/load/clickhouse/test_clickhouse_adapter.py
Description
This PR improves support for self-managed, open source ClickHouse deployments while maintaining compatibility with ClickHouse Cloud deployments.
- `table_engine_type` in `clickhouse_adapter`. Valid types are `merge_tree`, `replicated_merge_tree`, `shared_merge_tree`, `stripe_log`, `tiny_log`
- `MergeTree` if no engine is specified, which now works for both Cloud and self-managed deployments
- `clickhouse_connect` sessions.
Related Issues
Additional Context
Note that this change does not include support for specifying replication, ZooKeeper, or shard details for the `ReplicatedMergeTree` engine. Users requiring those customizations can continue to specify the full engine definition in their configuration.
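As a rough illustration of the engine selection described in this PR, the sketch below resolves a `table_engine_type` hint to a ClickHouse engine name and falls back to a default based on the detected installation type. The function name `resolve_table_engine`, the `is_cloud` flag, and the mapping are assumptions made for this example, not dlt's actual code; the log engines are left out because the review thread says they were removed.

```python
from typing import Optional

# Assumed mapping from adapter hint values to ClickHouse engine names.
ENGINE_NAMES = {
    "merge_tree": "MergeTree",
    "replicated_merge_tree": "ReplicatedMergeTree",
    "shared_merge_tree": "SharedMergeTree",
}


def resolve_table_engine(hint: Optional[str], is_cloud: bool) -> str:
    """Return the engine name for a hint, or a deployment-based default.

    Hypothetical helper: `is_cloud` stands for the detected installation type.
    """
    if hint is None:
        # Non-annotated resources default to the detected installation type.
        return "SharedMergeTree" if is_cloud else "MergeTree"
    if hint not in ENGINE_NAMES:
        raise ValueError(f"Unsupported table engine type: {hint!r}")
    return ENGINE_NAMES[hint]


default_cloud = resolve_table_engine(None, is_cloud=True)
default_local = resolve_table_engine(None, is_cloud=False)
explicit = resolve_table_engine("replicated_merge_tree", is_cloud=False)
```

An explicit hint always wins over the deployment default in this sketch, which matches the intent that users who need a specific engine annotate the resource themselves.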