
allows naming conventions to be changed #998

Merged
merged 114 commits into devel from rfix/allows-naming-conventions on Jun 26, 2024
Conversation

rudolfix (Collaborator) commented Feb 25, 2024

Description

This PR allows any naming convention to be used (e.g. UPPER CASE or CamelCase). Until now, any naming convention that changed the casing of columns would fail. Destinations also had problems when reflecting such columns from the information schema.

Fixes the following issues:
#1470
#964
#1074 (by offering alternative naming convention)
#1085
#860 (most probably, fixture order was messed up)
#1471 (partially, without internal source)

read the commit log for details!
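For context, a naming convention is essentially a function that maps raw source identifiers to destination identifiers. A minimal sketch of a snake_case-style normalizer (a deliberately simplified, hypothetical version, not dlt's actual implementation):

```python
import re

def to_snake_case(identifier: str) -> str:
    # Hypothetical simplification of a snake_case naming convention:
    # break CamelCase boundaries, replace non-alphanumerics, lower-case.
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", identifier)
    s = re.sub(r"[^a-zA-Z0-9_]+", "_", s)
    return s.lower().strip("_")

print(to_snake_case("CamelCaseColumn"))  # camel_case_column
print(to_snake_case("UPPER CASE"))       # upper_case
```

A convention like this is lossy: distinct source names can collapse to the same destination name, which is why casing-changing conventions need explicit support in the engine.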

naming conventions:

  1. Make sure the dlt engine is agnostic to the naming convention used on the schema, table and column level (done, not well tested)
  2. Make sure that schema settings (preferred types, normalizer settings, default hints) are also updated in (1)
  3. Expose and document methods to change default hints, preferred settings and other compiled properties; update documentation
  4. Make sure that all destinations work with forced lower and upper case naming conventions (not done)
  5. Destination capabilities are extended to define the following:
  • naming convention (the default snake case is dropped; the convention set in the Schema is adopted by default - it is snake case)
  • a flag for whether the destination is case sensitive
  • the case folding function used by the destination to get insensitive identifiers (str, lower, upper)
  • all destinations are configured in case insensitive mode by default
  6. We use INFORMATION SCHEMA to read all columns for all tables at once; with many tables the speed increase should be significant
  7. sql_cs and sql_ci naming conventions to create case sensitive and case insensitive names compatible with SQL
  8. duck_case and direct will be unified and will allow any characters in identifiers
  9. qdrant now works with the in-memory and local engine
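The capability extensions in (5) can be sketched as a small dataclass. The field names below loosely echo the description above but are illustrative assumptions, not dlt's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class DestinationCaps:
    # Hypothetical reduction of the capabilities described above; the real
    # dlt capabilities class has many more fields and different names.
    naming_convention: Optional[str] = None     # None: adopt the Schema's convention
    case_sensitive: bool = False                # case insensitive by default
    casefold_identifier: Callable[[str], str] = str  # str, str.lower or str.upper

# e.g. a destination that folds insensitive identifiers to upper case
# versus one that folds to lower case:
upper_caps = DestinationCaps(casefold_identifier=str.upper)
lower_caps = DestinationCaps(casefold_identifier=str.lower)

print(upper_caps.casefold_identifier("OrderId"))  # ORDERID
print(lower_caps.casefold_identifier("OrderId"))  # orderid
```

The `str` default is the identity fold: a case sensitive destination keeps identifiers exactly as produced by the naming convention.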

todo:

  • sql_cs and sql_ci are not yet implemented (but that's trivial)
  • Fix a bug in Qdrant: state and recent schemas are incorrectly retrieved (somehow the sort order was ignored during implementation) (90% fixed)
  • Also fix bugs in the Weaviate tests
  • Documentation is still missing

reference:
(4) an overview of case sensitivity and preferred naming convention per destination: https://www.linkedin.com/posts/toby-mao_sql-activity-7170104665412423680-17gv/

netlify bot commented Feb 25, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 5f4cb4c
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/667c77b57ffd340008cbec5d

@rudolfix rudolfix self-assigned this Mar 4, 2024
@rudolfix rudolfix added the community label (This issue came from slack community workspace) Mar 4, 2024
sh-rp (Collaborator) commented Jun 26, 2024

Other Notes:

  • Do we check whether two distinct table names are normalized to the same name? I think I saw a warning for this in one of the commits, but can't find it now.
  • Do we need to be concerned about migration for existing tables where the dlt internal tables and columns are not normalized yet? From what I saw, the default naming will work exactly the same way as before for all the data tables (snake_case), so that should be fine.
  • Just now someone on the tech support channel asked to change the staging dataset naming convention. Not entirely sure whether this is conceptually part of this PR, but it's not the first time someone has asked about it and we should maybe make it configurable. Since it is only really done in one place, it should be super easy too.
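The first bullet above (detecting distinct source names that normalize to the same identifier) can be sketched as a simple grouping pass. `find_collisions` is a hypothetical helper written for illustration, not part of dlt:

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

def find_collisions(names: Iterable[str],
                    normalize: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group raw identifiers by their normalized form and keep only
    the groups where two or more distinct raw names collide."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        buckets[normalize(name)].append(name)
    return {norm: raw for norm, raw in buckets.items() if len(raw) > 1}

# With a lower-casing convention, "UserID" and "userid" collide:
print(find_collisions(["UserID", "userid", "order_id"], str.lower))
# {'userid': ['UserID', 'userid']}
```

Emitting a warning (or failing) when this map is non-empty would surface the collisions the commenter is asking about.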


@pytest.mark.parametrize(
"destination_config",
destinations_configs(default_sql_configs=True, subset=["postgres", "snowflake"]),
Collaborator commented:

We should probably filter the tests by loader_file_format in caps and not explicitly in the test header; then all destinations that support csv will automatically be added to this test.
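The suggestion could look like the sketch below: derive the parametrized destination list from a capability lookup instead of hard-coding it. The capability field mirrors dlt's `supported_loader_file_formats`, but the values and the helper are illustrative assumptions:

```python
from typing import List

# Hypothetical per-destination capabilities; in dlt these would come from
# each destination's capabilities object, not a hand-written dict.
CAPS = {
    "postgres": ["insert_values", "csv"],
    "snowflake": ["jsonl", "parquet", "csv"],
    "duckdb": ["insert_values", "parquet"],
}

def destinations_supporting(file_format: str) -> List[str]:
    # Select destinations by capability instead of listing them per test.
    return [name for name, formats in CAPS.items() if file_format in formats]

# The test header would then become:
#   @pytest.mark.parametrize("destination", destinations_supporting("csv"))
print(destinations_supporting("csv"))  # ['postgres', 'snowflake']
```

New destinations that add csv support would then join the test automatically, which is exactly the maintenance win the comment describes.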

Collaborator (Author) commented:

90% of explicit config can be inferred from caps and config interfaces

def test_load_csv(
destination_config: DestinationTestConfiguration, item_type: TestDataItemFormat
) -> None:
os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = "True"
Collaborator commented:

Note to self: if I set up those helper functions correctly, we can also easily inspect gzipped files without this setting. ibis / dataframes will make this stuff obsolete anyway because we can use them in the tests in the future.
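The helper hinted at here could look like the following sketch (hypothetical, not the actual test utility): it sniffs the two-byte gzip magic number, so tests can read writer output whether or not compression was disabled via DATA_WRITER__DISABLE_COMPRESSION:

```python
import gzip

def read_maybe_gzipped(path: str, encoding: str = "utf-8") -> str:
    # A gzip stream always starts with the magic bytes 0x1f 0x8b;
    # sniff them to decide how to open the file.
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":
        with gzip.open(path, "rt", encoding=encoding) as f:
            return f.read()
    with open(path, "r", encoding=encoding) as f:
        return f.read()
```

With a helper like this in place, the env-var override in the test above becomes unnecessary.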

Collaborator (Author) commented:

@sh-rp I did it already for read_text in fsclient. Will push changes today.

sh-rp
sh-rp previously approved these changes Jun 26, 2024
@rudolfix rudolfix merged commit b76f8f4 into devel Jun 26, 2024
50 checks passed
@rudolfix rudolfix deleted the rfix/allows-naming-conventions branch June 26, 2024 21:08
Labels
ci full: run the full load tests on pr
community: This issue came from slack community workspace
Projects
Status: Done

2 participants