
allows naming conventions to be changed #998

Merged
merged 114 commits into devel from rfix/allows-naming-conventions on Jun 26, 2024
Conversation

rudolfix (Collaborator) commented Feb 25, 2024

Description

This PR allows any naming convention to be used (e.g. UPPER CASE or CamelCase). Until now, any naming convention that changed the casing of columns would fail. Destinations also had problems when reflecting such columns from the information schema.

Fixes the following issues:
#1470
#964
#1074 (by offering alternative naming convention)
#1085
#860 (most probably, fixture order was messed up)
#1471 (partially, without internal source)

read the commit log for details!
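For context, a naming convention is essentially a function that maps raw source identifiers to destination identifiers. A minimal sketch of a snake_case-style normalizer (a deliberately simplified, hypothetical version, not dlt's actual implementation):

```python
import re

def to_snake_case(identifier: str) -> str:
    # Hypothetical simplification of a snake_case naming convention:
    # break CamelCase boundaries, replace non-alphanumerics, lower-case.
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", identifier)
    s = re.sub(r"[^a-zA-Z0-9_]+", "_", s)
    return s.lower().strip("_")

print(to_snake_case("CamelCaseColumn"))  # camel_case_column
print(to_snake_case("UPPER CASE"))       # upper_case
```

A convention like this is lossy: distinct source names can collapse to the same destination name, which is why casing-changing conventions need explicit support in the engine.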

naming conventions:

  1. Make sure the dlt engine is agnostic to the naming convention used on the schema, table and column level (done, not well tested)
  2. Make sure that schema settings (preferred types, normalizer settings, default hints) are also updated in (1)
  3. Expose and document methods to change default hints, preferred settings and other compiled properties; update documentation
  4. Make sure that all destinations work with forced lower and upper case naming conventions (not done)
  5. Destination capabilities are extended to define the following:
  • naming convention (the default snake case is dropped; the convention set in the Schema is adopted by default - it is snake case)
  • a flag for whether the destination is case sensitive
  • the case folding function used by the destination to get insensitive identifiers (str, lower, upper)
  • all destinations are configured in case insensitive mode by default
  6. We use INFORMATION SCHEMA to read all columns for all tables at once; with many tables the speed increase should be significant
  7. sql_cs and sql_ci naming conventions to create case sensitive and case insensitive names compatible with SQL
  8. duck_case and direct will be unified and will allow any characters in identifiers
  9. qdrant now works with the in-memory and local engine
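The capability extensions in (5) can be sketched as a small dataclass. The field names below loosely echo the description above but are illustrative assumptions, not dlt's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class DestinationCaps:
    # Hypothetical reduction of the capabilities described above; the real
    # dlt capabilities class has many more fields and different names.
    naming_convention: Optional[str] = None     # None: adopt the Schema's convention
    case_sensitive: bool = False                # case insensitive by default
    casefold_identifier: Callable[[str], str] = str  # str, str.lower or str.upper

# e.g. a destination that folds insensitive identifiers to upper case
# versus one that folds to lower case:
upper_caps = DestinationCaps(casefold_identifier=str.upper)
lower_caps = DestinationCaps(casefold_identifier=str.lower)

print(upper_caps.casefold_identifier("OrderId"))  # ORDERID
print(lower_caps.casefold_identifier("OrderId"))  # orderid
```

The `str` default is the identity fold: a case sensitive destination keeps identifiers exactly as produced by the naming convention.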

todo:

  • sql_cs and sql_ci are not yet implemented (but that's trivial)
  • Fix a bug in Qdrant: state and recent schemas are incorrectly retrieved (somehow the sort order was ignored during implementation) (90% fixed)
  • Also fix bugs in the Weaviate tests
  • Documentation is still missing

reference:
(4) an overview of case sensitivity and preferred naming convention per destination: https://www.linkedin.com/posts/toby-mao_sql-activity-7170104665412423680-17gv/

netlify bot commented Feb 25, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 5f4cb4c
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/667c77b57ffd340008cbec5d

@rudolfix rudolfix self-assigned this Mar 4, 2024
@rudolfix rudolfix added the community label (This issue came from slack community workspace) Mar 4, 2024
sh-rp (Collaborator) commented Jun 26, 2024

Other Notes:

  • Do we check whether two distinct table names are normalized to the same name? I think I saw a warning for this in one of the commits, but can't find it now.
  • Do we need to be concerned about migration for existing tables where the dlt internal tables and columns are not normalized yet? From what I saw, the default naming will work exactly the same way as before for all the data tables (snake_case), so that should be fine.
  • Just now someone on the tech support channel asked to change the staging dataset naming convention. Not entirely sure whether this is conceptually part of this PR, but it's not the first time someone has asked about it and we should maybe make it configurable. Since it is only really done in one place, it should be super easy too.
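The first bullet above (detecting distinct source names that normalize to the same identifier) can be sketched as a simple grouping pass. `find_collisions` is a hypothetical helper written for illustration, not part of dlt:

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

def find_collisions(names: Iterable[str],
                    normalize: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group raw identifiers by their normalized form and keep only
    the groups where two or more distinct raw names collide."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        buckets[normalize(name)].append(name)
    return {norm: raw for norm, raw in buckets.items() if len(raw) > 1}

# With a lower-casing convention, "UserID" and "userid" collide:
print(find_collisions(["UserID", "userid", "order_id"], str.lower))
# {'userid': ['UserID', 'userid']}
```

Emitting a warning (or failing) when this map is non-empty would surface the collisions the commenter is asking about.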


@pytest.mark.parametrize(
"destination_config",
destinations_configs(default_sql_configs=True, subset=["postgres", "snowflake"]),
Collaborator commented:

We should probably filter the tests by loader_file_format in caps and not explicitly in the test header; then all destinations that support csv will automatically be added to this test.
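The suggestion could look like the sketch below: derive the parametrized destination list from a capability lookup instead of hard-coding it. The capability field mirrors dlt's `supported_loader_file_formats`, but the values and the helper are illustrative assumptions:

```python
from typing import List

# Hypothetical per-destination capabilities; in dlt these would come from
# each destination's capabilities object, not a hand-written dict.
CAPS = {
    "postgres": ["insert_values", "csv"],
    "snowflake": ["jsonl", "parquet", "csv"],
    "duckdb": ["insert_values", "parquet"],
}

def destinations_supporting(file_format: str) -> List[str]:
    # Select destinations by capability instead of listing them per test.
    return [name for name, formats in CAPS.items() if file_format in formats]

# The test header would then become:
#   @pytest.mark.parametrize("destination", destinations_supporting("csv"))
print(destinations_supporting("csv"))  # ['postgres', 'snowflake']
```

New destinations that add csv support would then join the test automatically, which is exactly the maintenance win the comment describes.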

Collaborator (Author) commented:

90% of explicit config can be inferred from caps and config interfaces

def test_load_csv(
destination_config: DestinationTestConfiguration, item_type: TestDataItemFormat
) -> None:
os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = "True"
Collaborator commented:

Note to self: if I set up those helper functions correctly, we can also easily inspect gzipped files without this setting. ibis / dataframes will make this stuff obsolete anyway because we can use them in the tests in the future.
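The helper hinted at here could look like the following sketch (hypothetical, not the actual test utility): it sniffs the two-byte gzip magic number, so tests can read writer output whether or not compression was disabled via DATA_WRITER__DISABLE_COMPRESSION:

```python
import gzip

def read_maybe_gzipped(path: str, encoding: str = "utf-8") -> str:
    # A gzip stream always starts with the magic bytes 0x1f 0x8b;
    # sniff them to decide how to open the file.
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":
        with gzip.open(path, "rt", encoding=encoding) as f:
            return f.read()
    with open(path, "r", encoding=encoding) as f:
        return f.read()
```

With a helper like this in place, the env-var override in the test above becomes unnecessary.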

Collaborator (Author) commented:

@sh-rp I did it already for read_text in fsclient. Will push changes today.

sh-rp
sh-rp previously approved these changes Jun 26, 2024
@rudolfix rudolfix merged commit b76f8f4 into devel Jun 26, 2024
50 checks passed
@rudolfix rudolfix deleted the rfix/allows-naming-conventions branch June 26, 2024 21:08
Labels
ci full: run the full load tests on pr
community: This issue came from slack community workspace
Projects
Status: Done

2 participants