Support Open Source ClickHouse Deployments #1496

Pipboyguy · 2024-06-19T20:13:58Z

Description

This PR improves support for self-managed, open source ClickHouse deployments while maintaining compatibility with ClickHouse Cloud deployments.

- Allow explicitly specifying the desired engine via `table_engine_type` in `clickhouse_adapter`. Valid types are `merge_tree`, `replicated_merge_tree`, `shared_merge_tree`, `stripe_log`, and `tiny_log`.
- Default to `MergeTree` if no engine is specified, which now works for both Cloud and self-managed deployments.
- Update tests to check for the appropriate engine based on the annotation.
- ClickHouse Cloud sets `date_time_input_format` to `best_effort`, while open source deployments do not. This caused some tests to fail on open source deployments, so we now override the setting for `clickhouse_connect` sessions.
- Clarify in the docs some networking details needed to make dlt work with open source deployments.
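
The `date_time_input_format` override mentioned in the list above could look roughly like the following. This is a minimal sketch and not the code in this PR: `build_session_settings` and `DEFAULT_SESSION_SETTINGS` are hypothetical names, and letting caller-supplied settings take precedence over the default is an assumption.

```python
from typing import Dict, Optional

# Setting named in the PR description; ClickHouse Cloud already uses this value.
DEFAULT_SESSION_SETTINGS: Dict[str, str] = {"date_time_input_format": "best_effort"}


def build_session_settings(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return session settings with best-effort date parsing enabled by default.

    Hypothetical helper: caller overrides win over the default (an assumption).
    """
    settings = dict(DEFAULT_SESSION_SETTINGS)  # copy so the default is never mutated
    settings.update(overrides or {})
    return settings


settings = build_session_settings({"max_execution_time": "60"})
print(settings)
```

With `clickhouse_connect` installed, a dict like this can be passed as the `settings` argument of `clickhouse_connect.get_client(...)`, so every query in the session uses the same parsing mode on Cloud and on self-managed servers.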

Related Issues

Additional Context

Note that this change does not include support for specifying replication, ZooKeeper or shard details for the ReplicatedMergeTree engine. Users requiring those customizations can continue to specify the full engine definition in their configuration.
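
To illustrate how a `table_engine_type` hint might be turned into an engine definition, here is a minimal sketch. It is not the adapter's implementation: `ENGINE_NAMES` and `engine_clause` are hypothetical, only the MergeTree family is covered, and no replication, ZooKeeper, or shard arguments are rendered, in line with the limitation described above.

```python
from typing import Optional

# Hypothetical mapping from `table_engine_type` values to ClickHouse engine names.
ENGINE_NAMES = {
    "merge_tree": "MergeTree",
    "replicated_merge_tree": "ReplicatedMergeTree",
    "shared_merge_tree": "SharedMergeTree",
}


def engine_clause(table_engine_type: Optional[str] = None) -> str:
    """Render an ENGINE clause, defaulting to MergeTree when no type is given."""
    key = table_engine_type or "merge_tree"
    if key not in ENGINE_NAMES:
        valid = ", ".join(sorted(ENGINE_NAMES))
        raise ValueError(f"Unknown table_engine_type {key!r}; expected one of: {valid}")
    # No engine arguments are emitted; a full definition would have to be
    # supplied by the user in their own configuration.
    return f"ENGINE = {ENGINE_NAMES[key]}"


print(engine_clause())
print(engine_clause("replicated_merge_tree"))
```

Rejecting unknown values early keeps a typo in the hint from reaching the server as an invalid `CREATE TABLE` statement.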

Signed-off-by: Marcel Coetzee <[email protected]>

netlify · 2024-06-19T20:14:14Z

✅ Deploy Preview for dlt-hub-docs ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | `409487c` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/dlt-hub-docs/deploys/667f36efbe836c0008f48dea |
| 😎 Deploy Preview | https://deploy-preview-1496--dlt-hub-docs.netlify.app |


jorritsandbrink

Code itself looks good to me. Just two comments regarding test coverage and the use of sentinel tables.

jorritsandbrink · 2024-06-24T20:12:26Z

docs/website/docs/dlt-ecosystem/destinations/clickhouse.md

@@ -104,7 +111,8 @@ Data is loaded into ClickHouse using the most efficient method depending on the

 `Clickhouse` does not support multiple datasets in one database, dlt relies on datasets to exist for multiple reasons.
 To make `clickhouse` work with `dlt`, tables generated by `dlt` in your `clickhouse` database will have their name prefixed with the dataset name separated by
-the configurable `dataset_table_separator`. Additionally, a special sentinel table that does not contain any data will be created, so dlt knows which virtual datasets already exist in a
+the configurable `dataset_table_separator`.
+Additionally, a special sentinel table that doesn't contain any data will be created, so dlt knows which virtual datasets already exist in a


Perhaps out of scope for this PR, but have you considered using pipeline state instead of a sentinel table?

Pipboyguy · 2024-06-28

That is a really good suggestion!

I've dropped the sentinel table workaround for more canonical usage of dlt:

dlt/dlt/destinations/impl/clickhouse/clickhouse.py

Lines 310 to 345 in 409487c

    def has_dataset(self) -> bool:
        try:
            container = Container()
            state_ctx = container[LoadPackageStateInjectableContext]
            datasets = cast(List[str], state_ctx.state.get("datasets", []))
            return self.sql_client.dataset_name in datasets
        except KeyError:
            return False

    def create_dataset(self) -> None:
        with contextlib.suppress(KeyError):
            container = Container()
            state_ctx = container[LoadPackageStateInjectableContext]
            datasets = cast(List[str], state_ctx.state.get("datasets", []))
            if self.sql_client.dataset_name not in datasets:
                datasets.append(self.sql_client.dataset_name)
                state: TLoadPackageState = state_ctx.state
                state["datasets"] = datasets  # type: ignore[typeddict-unknown-key]
                commit_load_package_state()

    def drop_dataset(self) -> None:
        with contextlib.suppress(KeyError):
            container = Container()
            state_ctx = container[LoadPackageStateInjectableContext]
            datasets = cast(List[str], state_ctx.state.get("datasets", []))
            if self.sql_client.dataset_name in datasets:
                datasets.remove(self.sql_client.dataset_name)
                state: TLoadPackageState = state_ctx.state
                state["datasets"] = datasets  # type: ignore[typeddict-unknown-key]
                commit_load_package_state()

        to_drop_results = self.sql_client._list_tables()
        for table in to_drop_results:
            self.sql_client.execute_sql(
                f"""DROP TABLE {self.sql_client.catalog_name()}.{self.capabilities.escape_identifier(table)} SYNC"""
            )
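
To see the bookkeeping in isolation, the same idea can be sketched with a plain dict standing in for the load package state. This is an illustration under stated assumptions, not dlt code: `DatasetRegistry` is a hypothetical class, the `___` separator is an assumed value for `dataset_table_separator`, and `drop_dataset` only returns the table names it would drop instead of executing SQL.

```python
from typing import Any, Dict, List


class DatasetRegistry:
    """Tracks virtual datasets in a state dict instead of a sentinel table."""

    def __init__(self, state: Dict[str, Any], separator: str = "___") -> None:
        self.state = state  # stand-in for the load package state
        self.separator = separator  # assumed dataset/table separator

    def has_dataset(self, name: str) -> bool:
        return name in self.state.get("datasets", [])

    def create_dataset(self, name: str) -> None:
        datasets: List[str] = self.state.setdefault("datasets", [])
        if name not in datasets:
            datasets.append(name)

    def drop_dataset(self, name: str, all_tables: List[str]) -> List[str]:
        """Forget the dataset and return the prefixed tables that would be dropped."""
        datasets = self.state.get("datasets", [])
        if name in datasets:
            datasets.remove(name)
        prefix = f"{name}{self.separator}"
        return [table for table in all_tables if table.startswith(prefix)]


state: Dict[str, Any] = {}
registry = DatasetRegistry(state)
registry.create_dataset("analytics")
print(registry.has_dataset("analytics"))
print(registry.drop_dataset("analytics", ["analytics___events", "other___events"]))
```

Because the dataset list lives in state, checking whether a dataset exists needs no query against ClickHouse, which is the main difference from the sentinel table approach.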

jorritsandbrink · 2024-06-24T20:16:22Z

tests/load/clickhouse/test_clickhouse_adapter.py

+                else:
+                    # Non annotated resource needs to default to detected installation
+                    # type, i.e. cloud or self-managed.
+                    # CI runs on CH cloud, so will be `SharedMergeTree`.


Should we run CI both for ClickHouse cloud and ClickHouse local, like Postgres?

Seems like we don't have proper test coverage without local (e.g. for the Log table engines).

Pipboyguy · 2024-06-26

Added ClickHouse OSS as a local destination test in `.github/workflows/test_destination_clickhouse.yml`, so CI also runs local tests.

Removed the log engine family for now. I'm not even sure why I brought it into this ticket. If users ask for it, it will require more work than what can be done here.

Default to for both on-prem and cloud

56b3818

Signed-off-by: Marcel Coetzee <[email protected]>

Pipboyguy linked an issue Jun 19, 2024 that may be closed by this pull request

ClickHouse MergeTree Support #1387

Open

Pipboyguy self-assigned this Jun 19, 2024

Pipboyguy added the enhancement and tech-debt labels Jun 19, 2024

Pipboyguy added 2 commits June 19, 2024 22:55

Add documentation for new engine family types

e4e1010

Signed-off-by: Marcel Coetzee <[email protected]>

Typo

d998ac9

Signed-off-by: Marcel Coetzee <[email protected]>

Pipboyguy requested review from rudolfix and sh-rp and removed request for rudolfix June 20, 2024 12:35

Merge branch 'refs/heads/devel' into 1387-clickhouse-mergetree-support

825982b

Pipboyguy changed the title ~~Add support for MergeTree engine in ClickHouse destination~~ Support Open Source ClickHouse Deployments Jun 20, 2024

Pipboyguy added 2 commits June 20, 2024 22:11

Minor doc changes

4912173

Signed-off-by: Marcel Coetzee <[email protected]>

Fix local clickhouse deployment timestamp parsing issue with simple c…

e1ab71d

…onfiguration setting~ Signed-off-by: Marcel Coetzee <[email protected]>

jorritsandbrink reviewed Jun 24, 2024

Pipboyguy added 3 commits June 25, 2024 14:23

Merge branch 'devel' into 1387-clickhouse-mergetree-support

db603a5

Extend support for local deployment time types

8f80391

Signed-off-by: Marcel Coetzee <[email protected]>

Adapt test to check whether CH OSS or cloud

544925a

Signed-off-by: Marcel Coetzee <[email protected]>

rudolfix added the sprint label Jun 26, 2024

Pipboyguy added 9 commits June 26, 2024 20:44

Defend against CH OSS unsupported dbapi datetime parsing

b2a3596

Signed-off-by: Marcel Coetzee <[email protected]>

Fix typo

e2a5c4a

Signed-off-by: Marcel Coetzee <[email protected]>

Add ClickHouse to local destination tests

9ef12c1

Signed-off-by: Marcel Coetzee <[email protected]>

Update ClickHouse test workflow and remove engine types

2388cd6

Signed-off-by: Marcel Coetzee <[email protected]>

Use Python 3.10.x for ClickHouse destination tests

b0e3751

Signed-off-by: Marcel Coetzee <[email protected]>

Merge branch 'devel' into 1387-clickhouse-mergetree-support

d207686

# Conflicts: # dlt/destinations/impl/clickhouse/clickhouse.py # dlt/destinations/impl/clickhouse/sql_client.py # tests/load/clickhouse/test_clickhouse_adapter.py

Add ClickHouse MergeTree support and refactor code

f077346

Signed-off-by: Marcel Coetzee <[email protected]>

Update ClickHouse Docker setup and test workflow

6e44a71

Signed-off-by: Marcel Coetzee <[email protected]>

Refactor ClickHouse tests to cover both OSS and Cloud versions

61bae45

Signed-off-by: Marcel Coetzee <[email protected]>

Pipboyguy added 4 commits June 27, 2024 19:54

Disable SSL for ClickHouse OSS tests

6d50c51

Signed-off-by: Marcel Coetzee <[email protected]>

Merge branch 'devel' into 1387-clickhouse-mergetree-support

fc46384

Use state instead of sentinel tables

e1ac1ce

Signed-off-by: Marcel Coetzee <[email protected]>

Remove mention of sentinel table for ClickHouse datasets

409487c

Signed-off-by: Marcel Coetzee <[email protected]>

Pipboyguy requested a review from jorritsandbrink June 28, 2024 22:46
