Improve the table diff sample output in CLI #2645
Comments
Yes 100% @izeigerman Show sample is useless beyond a certain width. Even if you pipe it to a file, the terminal width based wrapping will still happen. Either have column selection with a sequence of glob patterns using Also, the current console printer wrapping is annoying in that there is no workaround. Maybe a |
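To make the column-selection idea concrete, here is a minimal sketch using Python's standard `fnmatch` module. The function name `select_columns` and the example column names are hypothetical, and the choice of `fnmatch` is an assumption for illustration; nothing here is an existing CLI option.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_columns(columns: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Keep columns matching any glob pattern, preserving the table's column order.

    Hypothetical helper: not part of any existing CLI.
    """
    patterns = list(patterns)  # allow generators to be reused for every column
    return [c for c in columns if any(fnmatchcase(c, p) for p in patterns)]


if __name__ == "__main__":
    cols = ["id", "customer_id", "amount", "created_at", "updated_at"]
    # Only columns ending in "_id" plus the exact column "amount" are kept.
    print(select_columns(cols, ["*_id", "amount"]))
```

Filtering before printing keeps the sample narrow enough to read, no matter how the console wraps.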
Is there a reason we don't show data diffs side by side like a git diff? We do it in the UI but not for the CLI. I imagine a lot of people want something similar in the CLI. Example library that does this: https://github.com/paulfitz/daff |
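As a rough illustration of a git-diff-like layout, the sketch below stacks the source and target versions of each changed row on separate `-` and `+` lines and prints only the columns that differ. It is not how daff or the current CLI works. The function `render_stacked_diff` and the dict-of-rows input shape are assumptions made for the example.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def _fmt(row: Row, cols: Iterable[str]) -> str:
    """Render the chosen columns of a row as 'col=value' pairs."""
    return ", ".join(f"{c}={row.get(c)!r}" for c in cols)


def render_stacked_diff(source: Dict[object, Row], target: Dict[object, Row]) -> List[str]:
    """Return diff lines keyed by primary key (hypothetical helper).

    '-' lines hold source values and '+' lines hold target values.
    Keys are assumed to be mutually sortable.
    """
    lines: List[str] = []
    for key in sorted(set(source) | set(target)):
        old, new = source.get(key), target.get(key)
        if old == new:
            continue  # unchanged rows are omitted to keep the output short
        if old is None:
            lines.append(f"+ {key}: {_fmt(new, new)}")  # row only in target
        elif new is None:
            lines.append(f"- {key}: {_fmt(old, old)}")  # row only in source
        else:
            changed = sorted(c for c in set(old) | set(new) if old.get(c) != new.get(c))
            lines.append(f"- {key}: {_fmt(old, changed)}")
            lines.append(f"+ {key}: {_fmt(new, changed)}")
    return lines


if __name__ == "__main__":
    src = {1: {"name": "a", "qty": 1}, 2: {"name": "b", "qty": 2}}
    tgt = {1: {"name": "a", "qty": 1}, 2: {"name": "b", "qty": 5}, 3: {"name": "c", "qty": 7}}
    print("\n".join(render_stacked_diff(src, tgt)))
```

Because each side gets its own line, the width grows with the number of changed columns rather than with twice the table width.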
Oh yeah daff looks sick actually @sungchun12 -- it seems absolutely perfect to be honest 👀 🤔 |
Hello there, I made a data-diff tool for PySpark, and in the process I also made a generic library to create interactive HTML reports. They are both open source, and the data-diff-viewer does not need Spark (only duckdb-wasm to embed the diff report inside the HTML). I also started making a similar data-diff based on ibis instead of PySpark, but it's not ready yet. I would be happy to discuss this if you want. |
@izeigerman you okay with me taking this on? I have lessons learned fresh on my mind from working on I have a couple improvements that are worth considering such as displaying row counts for:
on top of the work you did here: #2644 |
Just adding to the conversation, but it'd be awesome if the CLI tool could "incrementally" perform the diff with greater and greater degrees of strictness. It would make this TDD curmudgeon very very happy. For everything Pandas gets wrong, this is one thing they kinda get right, although their error messages are a bit too uninformative at certain steps. I think my ideal workflow/priority order would go something like column names -> column types -> row count -> primary ID match -> column value match |
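The priority order above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Table` shape (column name mapped to a type name and a list of values), the function `staged_diff`, and the `fail_fast` switch are all hypothetical and not an existing API. `fail_fast` defaults to off, so every stage is reported unless the caller opts into stopping early.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory table: column name -> (type name, list of values).
Table = Dict[str, Tuple[str, List[object]]]
StageResult = Tuple[str, bool, str]


def staged_diff(source: Table, target: Table, key: str, fail_fast: bool = False) -> List[StageResult]:
    """Run checks from loosest to strictest and return (stage, passed, detail) tuples."""
    results: List[StageResult] = []

    def record(stage: str, ok: bool, detail: str) -> bool:
        results.append((stage, ok, detail))
        return fail_fast and not ok  # True means "stop here"

    # Stage 1: column names.
    src_cols, tgt_cols = set(source), set(target)
    if record("column names", src_cols == tgt_cols, f"only on one side: {sorted(src_cols ^ tgt_cols)}"):
        return results
    shared = sorted(src_cols & tgt_cols)

    # Stage 2: column types, compared on shared columns only.
    bad_types = [c for c in shared if source[c][0] != target[c][0]]
    if record("column types", not bad_types, f"mismatched: {bad_types}"):
        return results

    # Later stages need the key column on both sides.
    if key not in shared:
        return results
    src_keys, tgt_keys = source[key][1], target[key][1]

    # Stage 3: row count.
    if record("row count", len(src_keys) == len(tgt_keys), f"{len(src_keys)} vs {len(tgt_keys)}"):
        return results

    # Stage 4: primary ID match (keys assumed unique and sortable).
    unmatched = sorted(set(src_keys) ^ set(tgt_keys))
    if record("primary ID match", not unmatched, f"unmatched keys: {unmatched}"):
        return results

    # Stage 5: column values, compared on rows whose key exists on both sides.
    src_pos = {k: i for i, k in enumerate(src_keys)}
    tgt_pos = {k: i for i, k in enumerate(tgt_keys)}
    changed = [
        (k, c)
        for k in sorted(set(src_keys) & set(tgt_keys))
        for c in shared
        if source[c][1][src_pos[k]] != target[c][1][tgt_pos[k]]
    ]
    record("column value match", not changed, f"changed cells: {changed}")
    return results


if __name__ == "__main__":
    src = {"id": ("int", [1, 2, 3]), "name": ("text", ["a", "b", "c"])}
    tgt = {"id": ("int", [1, 2, 4]), "name": ("text", ["a", "x", "d"])}
    for stage, ok, detail in staged_diff(src, tgt, key="id"):
        print(f"{'PASS' if ok else 'DIFF'}  {stage}: {detail}")
```

Stages are labelled as differences rather than failures, since a diff is not necessarily a problem.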
@schlich When you say "incrementally", do you want data diffing to have more surgical options that display only column name changes and stop there if some criterion fails, OR are you suggesting an execution/display order? I believe you mean execution/display order, but let me know otherwise! |
well, a little bit of both maybe? i'm also referencing pytest's -x flag that stops at the first failure even if you have many. But it's also kind of just a natural progression of "accuracy" as your transformations develop |
@schlich I disagree with stopping at "failure" because that's an opinion. There are situations where many diffs or few diffs can be a good thing. We're aligned on natural progression though. I'll have to think through if we vastly change the format for that display order because I've been playing with it more and learned some of my suggestions are already covered but with different UX. |
Currently the printed output is too wide and unreadable even for tables with a handful of columns, since we include columns from both the source and the target table as part of the same row.
Here are some ideas on how to make the output more digestible: