Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: data diffing with SQLMesh does NOT work correctly if primary keys match with duplicates #2810

Open
sungchun12 opened this issue Jun 24, 2024 · 5 comments · May be fixed by #2840
Open
Labels
Bug Something isn't working

Comments

@sungchun12
Copy link
Contributor

image

The above is an example tablediff using the default sqlmesh projected generated with the CLI and using duckdb. The expected result should error out and say that the primary key (grain) is not unique and not null and prompt the user to update the grain or fix the primary key issue.

Steps to reproduce:

  • sqlmesh init duckdb
  • sqlmesh plan and apply the changes
  • make changes to full_model below
  • sqlmesh plan dev and apply the changes
  • sqlmesh table_diff prod:dev sqlmesh_example.full_model --show-sample
MODEL (
  name sqlmesh_example.full_model,
  kind FULL,
  cron '@daily',
  grain item_id,
  audits (assert_positive_order_ids),
);

SELECT
  item_id,
  count(distinct id) AS num_orders,
FROM
    sqlmesh_example.incremental_model
GROUP BY item_id
union all
SELECT
  item_id,
  count(distinct id) AS num_orders,
FROM
    sqlmesh_example.incremental_model
GROUP BY item_id
@sungchun12 sungchun12 added the Bug Something isn't working label Jun 24, 2024
@tobymao
Copy link
Contributor

tobymao commented Jun 24, 2024

this is user error. grain is not validated and up to the user to ensure interegrity through an audit.

@sungchun12
Copy link
Contributor Author

That's a non-obvious thing for a user trying this for the first time or the 50th time. We should reduce that friction with a clear message that for accurate results, the grain should be tested. Example below.

(venv) ➜  sqlmesh-demos git:(migration-demo) ✗ sqlmesh table_diff prod:dev demo.full_model --show-sample
grain should have unique and not_null audits for accurate results

Schema Diff Between 'PROD' and 'DEV' environments for model 'demo.full_model':
└── Schemas match


Row Counts:
└──  FULL MATCH: 6 rows (100.0%)

COMMON ROWS column comparison stats:
            pct_match
num_orders      100.0


COMMON ROWS sample data differences:
  All joined rows match

@tobymao
Copy link
Contributor

tobymao commented Jun 24, 2024

maybe a count distinct could be shoved into the data diff query, if you or perhaps @Themiscodes has time you could make a pr for this.

basically i'm looking for a way to put in some kind of uniqueness check into the singular data diff query, it could be an optional flag that is injected by default but turned off due to overhead

@sungchun12
Copy link
Contributor Author

I'll leave it to @Themiscodes

What about checking for nulls? Those get missed too and mislead the user that there's no data diff.

image

@Themiscodes
Copy link
Contributor

Sure I'll look into this @sungchun12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants