Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise/warn on incomplete columns in normalize #1504

Open
wants to merge 6 commits into
base: devel
Choose a base branch
from

Conversation

steinitzu
Copy link
Collaborator

Description

Turns the "unbound column" warning into an exception for not-null columns and move it to normalize

Related Issues

Additional Context

Copy link

netlify bot commented Jun 21, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 7f36f97
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/667de26731a47600086202fd

@sh-rp
Copy link
Collaborator

sh-rp commented Jun 24, 2024

I'm wondering if we should not do this in the extraction step already. All columns that are non-nullable (and merge and primary keys should be that) should raise if not populated. Extraction spends time on I/O mostly and not on python code as the normalizer, so the check would not make a big difference in performance in my opinion.

@steinitzu
Copy link
Collaborator Author

I'm wondering if we should not do this in the extraction step already. All columns that are non-nullable (and merge and primary keys should be that) should raise if not populated. Extraction spends time on I/O mostly and not on python code as the normalizer, so the check would not make a big difference in performance in my opinion.

I agree it would be much better to fail early if possible. Ideally we could tell right after the first data item is extracted.
But I wasn't sure if we can always tell whether the column is populated in extract. The "seen data" marker for the table is only set in normalize so I was going by that. But I'll give it a try.

@@ -989,3 +989,24 @@ def r():
with pytest.raises(PipelineStepFailed) as pip_ex:
p.run(r())
assert isinstance(pip_ex.value.__context__, SchemaException)


@pytest.mark.parametrize(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you write a test (or check if one exists) to see what happens when we do a merge on merge keys but some rows have null in the merge key? It's not super important right now, but if it would be interesting to know what happens :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a test so I added one. This was raising an exception already through schema.coerce_row in normalize

Copy link
Collaborator

@sh-rp sh-rp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, small requests

@sh-rp sh-rp added sprint Marks group of tasks with core team focus at this moment labels Jun 26, 2024
@steinitzu steinitzu force-pushed the fix/error-missing-merge-key branch 2 times, most recently from d16217f to a855f32 Compare June 26, 2024 23:49
@steinitzu
Copy link
Collaborator Author

I'm wondering if we should not do this in the extraction step already. All columns that are non-nullable (and merge and primary keys should be that) should raise if not populated. Extraction spends time on I/O mostly and not on python code as the normalizer, so the check would not make a big difference in performance in my opinion.

Was looking into if this was possible also, but I don't think so without moving a lot of normalize logic into extract. I wasn't sure how much schema inferrence is done in extract, seems there is none.

@steinitzu steinitzu force-pushed the fix/error-missing-merge-key branch from 1db75e1 to 0d0afa5 Compare June 27, 2024 18:47
@steinitzu steinitzu marked this pull request as ready for review June 27, 2024 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sprint Marks group of tasks with core team focus at this moment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong Merge Key Not Throwing Error
2 participants