Support Open Source ClickHouse Deployments #1496

Open
wants to merge 22 commits into
base: devel

Collaborator

@Pipboyguy Pipboyguy commented Jun 19, 2024

Description

This improves support for self-managed, open source ClickHouse deployments while maintaining compatibility with ClickHouse Cloud deployments.

  • Allow explicitly specifying the desired engine via table_engine_type in clickhouse_adapter. Valid types are merge_tree, replicated_merge_tree, shared_merge_tree, stripe_log and tiny_log.
  • Default to MergeTree if no engine is specified, which now works for both Cloud and self-managed deployments.
  • Update tests to check for the appropriate engine based on the annotation.
  • ClickHouse Cloud has "date_time_input_format" set to "best_effort", while open source deployments do not, which caused some tests to fail on open source deployments. We now override the setting for clickhouse_connect sessions.
  • Clarify some networking details in the docs for making dlt work with open source deployments.
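
As a rough illustration of the first two bullets, the sketch below resolves a table_engine_type hint to a ClickHouse engine name and falls back to MergeTree when no hint is given. The ENGINE_BY_HINT mapping and the resolve_engine helper are hypothetical names made up for this example and are not the adapter's actual internals; only the five hint values and the MergeTree default come from the description above.

```python
from typing import Dict, Optional

# Hypothetical lookup table (not dlt's real internals): the hint values listed
# in the PR description mapped to the corresponding ClickHouse engine names.
ENGINE_BY_HINT: Dict[str, str] = {
    "merge_tree": "MergeTree",
    "replicated_merge_tree": "ReplicatedMergeTree",
    "shared_merge_tree": "SharedMergeTree",
    "stripe_log": "StripeLog",
    "tiny_log": "TinyLog",
}

# The description says MergeTree is used when no engine is specified.
DEFAULT_HINT = "merge_tree"


def resolve_engine(table_engine_type: Optional[str] = None) -> str:
    """Return the engine name for a hint, defaulting to MergeTree."""
    hint = DEFAULT_HINT if table_engine_type is None else table_engine_type
    try:
        return ENGINE_BY_HINT[hint]
    except KeyError:
        valid = ", ".join(sorted(ENGINE_BY_HINT))
        raise ValueError(
            f"Unknown table_engine_type {hint!r}; expected one of: {valid}"
        ) from None
```

Rejecting unknown hints early, rather than passing them through to the CREATE TABLE statement, is a design choice of this sketch; the PR itself may validate differently.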

Related Issues

Additional Context

Note that this change does not include support for specifying replication, ZooKeeper or shard details for the ReplicatedMergeTree engine. Users requiring those customizations can continue to specify the full engine definition in their configuration.

@Pipboyguy Pipboyguy linked an issue Jun 19, 2024 that may be closed by this pull request

netlify bot commented Jun 19, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 409487c
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/667f36efbe836c0008f48dea
😎 Deploy Preview https://deploy-preview-1496--dlt-hub-docs.netlify.app

@Pipboyguy Pipboyguy self-assigned this Jun 19, 2024
@Pipboyguy Pipboyguy added enhancement New feature or request tech-debt Leftovers from previous sprint that should be fixed over time labels Jun 19, 2024
Signed-off-by: Marcel Coetzee <[email protected]>
@Pipboyguy Pipboyguy requested review from rudolfix and sh-rp and removed request for rudolfix June 20, 2024 12:35
@Pipboyguy Pipboyguy changed the title Add support for MergeTree engine in ClickHouse destination Support Open Source ClickHouse Deployments Jun 20, 2024
Signed-off-by: Marcel Coetzee <[email protected]>
Collaborator

@jorritsandbrink jorritsandbrink left a comment


Code itself looks good to me. Just two comments regarding test coverage and the use of sentinel tables.

@@ -104,7 +111,8 @@ Data is loaded into ClickHouse using the most efficient method depending on the

`Clickhouse` does not support multiple datasets in one database, dlt relies on datasets to exist for multiple reasons.
To make `clickhouse` work with `dlt`, tables generated by `dlt` in your `clickhouse` database will have their name prefixed with the dataset name separated by
-the configurable `dataset_table_separator`. Additionally, a special sentinel table that does not contain any data will be created, so dlt knows which virtual datasets already exist in a
+the configurable `dataset_table_separator`.
+Additionally, a special sentinel table that doesn't contain any data will be created, so dlt knows which virtual datasets already exist in a
Collaborator


Perhaps out of scope for this PR, but have you considered using pipeline state instead of a sentinel table?

Collaborator Author


That is a really good suggestion!

I've dropped the sentinel table workaround for more canonical usage of dlt:

def has_dataset(self) -> bool:
    try:
        container = Container()
        state_ctx = container[LoadPackageStateInjectableContext]
        datasets = cast(List[str], state_ctx.state.get("datasets", []))
        return self.sql_client.dataset_name in datasets
    except KeyError:
        return False

def create_dataset(self) -> None:
    with contextlib.suppress(KeyError):
        container = Container()
        state_ctx = container[LoadPackageStateInjectableContext]
        datasets = cast(List[str], state_ctx.state.get("datasets", []))
        if self.sql_client.dataset_name not in datasets:
            datasets.append(self.sql_client.dataset_name)
            state: TLoadPackageState = state_ctx.state
            state["datasets"] = datasets  # type: ignore[typeddict-unknown-key]
            commit_load_package_state()

def drop_dataset(self) -> None:
    with contextlib.suppress(KeyError):
        container = Container()
        state_ctx = container[LoadPackageStateInjectableContext]
        datasets = cast(List[str], state_ctx.state.get("datasets", []))
        if self.sql_client.dataset_name in datasets:
            datasets.remove(self.sql_client.dataset_name)
            state: TLoadPackageState = state_ctx.state
            state["datasets"] = datasets  # type: ignore[typeddict-unknown-key]
            commit_load_package_state()
    to_drop_results = self.sql_client._list_tables()
    for table in to_drop_results:
        self.sql_client.execute_sql(
            f"""DROP TABLE {self.sql_client.catalog_name()}.{self.capabilities.escape_identifier(table)} SYNC"""
        )
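
For readers without a dlt checkout at hand, the bookkeeping in the methods above can be sketched against a plain dictionary. This is only a simplified stand-in under stated assumptions: DatasetRegistry is a hypothetical class for illustration, the dict plays the role of the load package state, and the container lookup, the commit step and the actual DROP TABLE statements are left out.

```python
from typing import Any, Dict, List


class DatasetRegistry:
    """Hypothetical stand-in: tracks virtual datasets in a state dict."""

    def __init__(self, state: Dict[str, Any]) -> None:
        # In the PR this state comes from the load package; here it is a dict.
        self.state = state

    def _datasets(self) -> List[str]:
        # Work on a copy so the state is only changed by explicit assignment.
        return list(self.state.get("datasets", []))

    def has_dataset(self, name: str) -> bool:
        return name in self._datasets()

    def create_dataset(self, name: str) -> None:
        datasets = self._datasets()
        if name not in datasets:
            datasets.append(name)
            self.state["datasets"] = datasets

    def drop_dataset(self, name: str) -> None:
        datasets = self._datasets()
        if name in datasets:
            datasets.remove(name)
            self.state["datasets"] = datasets


state: Dict[str, Any] = {}
registry = DatasetRegistry(state)
registry.create_dataset("my_dataset")
registry.create_dataset("my_dataset")  # second call is a no-op
```

Both create_dataset and drop_dataset are idempotent in this sketch, which mirrors the membership checks in the methods above.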

else:
    # Non annotated resource needs to default to detected installation
    # type, i.e. cloud or self-managed.
    # CI runs on CH cloud, so will be `SharedMergeTree`.
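
The rule described in this comment could be sketched roughly as follows. The expected_engine helper and its is_cloud flag are hypothetical names for illustration, not the test suite's actual code; the sketch assumes an unannotated table reports SharedMergeTree on ClickHouse Cloud (as the comment states) and MergeTree on a self-managed deployment (the default named in the PR description).

```python
from typing import Optional


def expected_engine(annotated_engine: Optional[str], is_cloud: bool) -> str:
    """Return the engine a test should expect for a resource.

    Hypothetical helper: an explicit annotation wins; otherwise the
    expectation follows the detected installation type.
    """
    if annotated_engine is not None:
        return annotated_engine
    # Assumption: Cloud reports SharedMergeTree, self-managed reports MergeTree.
    return "SharedMergeTree" if is_cloud else "MergeTree"
```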
Collaborator


Should we run CI both for ClickHouse cloud and ClickHouse local, like Postgres?

Seems like we don't have proper test coverage without local (e.g. for the Log table engines).

Collaborator Author

@Pipboyguy Pipboyguy Jun 26, 2024


  • Added ClickHouse OSS as a local destination test in .github/workflows/test_destination_clickhouse.yml, so CI also runs local tests.
  • Removed the log engine family for now. I'm not even sure why I brought it into this ticket. If users ask for it, it will require more work than what can be done here.

@rudolfix rudolfix added the sprint Marks group of tasks with core team focus at this moment label Jun 26, 2024
Signed-off-by: Marcel Coetzee <[email protected]>
# Conflicts:
#	dlt/destinations/impl/clickhouse/clickhouse.py
#	dlt/destinations/impl/clickhouse/sql_client.py
#	tests/load/clickhouse/test_clickhouse_adapter.py
Labels
enhancement New feature or request sprint Marks group of tasks with core team focus at this moment tech-debt Leftovers from previous sprint that should be fixed over time
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Open Source ClickHouse fails on Timestamp with Timezone
ClickHouse MergeTree Support
3 participants