Improve the table diff sample output in CLI #2645

izeigerman · 2024-05-21T17:55:47Z

Currently the printed output is too wide and unreadable even for tables with a handful of columns, since we include columns from both the source and the target table as part of the same row.

Here are some ideas on how to make the output more digestible:

Print individual column pairs. This way the width will be bounded to 2 columns that are being compared + join keys.
Let the user select which columns should be included in the sample.
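
A minimal sketch of the first idea, assuming the joined sample is available as a list of dicts in which each compared column appears twice, under hypothetical `s__` (source) and `t__` (target) prefixes. The function name and the prefixes are illustrative assumptions, not the existing implementation:

```python
def column_pair_views(rows, join_keys, columns,
                      source_prefix="s__", target_prefix="t__"):
    """Yield (column, narrow_rows) for each column that has mismatches.

    Each narrow row holds only the join keys plus the source/target
    values of one column, so the printed width stays bounded.
    """
    for col in columns:
        s_name, t_name = source_prefix + col, target_prefix + col
        narrow = [
            {**{k: row[k] for k in join_keys},
             s_name: row[s_name],
             t_name: row[t_name]}
            for row in rows
            if row[s_name] != row[t_name]  # keep only differing rows
        ]
        if narrow:
            yield col, narrow


if __name__ == "__main__":
    sample = [
        {"id": 1, "s__name": "a", "t__name": "a", "s__qty": 1, "t__qty": 2},
        {"id": 2, "s__name": "b", "t__name": "c", "s__qty": 3, "t__qty": 3},
    ]
    for column, narrow_rows in column_pair_views(sample, ["id"], ["name", "qty"]):
        print(column)
        for narrow_row in narrow_rows:
            print("  ", narrow_row)
```

Printing one small table per column trades vertical space for a bounded width, which is what makes the sample readable in a terminal.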

z3z1ma · 2024-05-21T18:24:20Z

Yes 100% @izeigerman

Show sample is useless beyond a certain width. Even if you pipe it to a file, the terminal-width-based wrapping will still happen. Either have column selection with a sequence of glob patterns using fnmatch, or alternatively consider a --tall flag for the sample which does option A.
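
A rough sketch of the glob-based column selection, using the standard library's `fnmatch`. The `select_columns` helper and the example column names are hypothetical; how the patterns would be passed on the command line is left open:

```python
from fnmatch import fnmatchcase


def select_columns(columns, patterns):
    """Return the columns matching any glob pattern, in original order.

    An empty pattern sequence means "no filter" and keeps every column.
    """
    if not patterns:
        return list(columns)
    return [
        col for col in columns
        if any(fnmatchcase(col, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    cols = ["id", "order_total", "order_date", "customer_name"]
    print(select_columns(cols, ["order_*"]))
```

`fnmatchcase` is used instead of `fnmatch` so that matching is case-sensitive on every platform.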

Also, the current console printer wrapping is annoying in that there is no workaround. Maybe a --plain flag would be useful for dumping to a file; you could use print directly to implement it.

sungchun12 · 2024-05-21T19:40:41Z

Is there a reason we don't show data diffs side by side like a git diff? We do it in the UI but not for the CLI. I imagine a lot of people want something similar in the CLI.

Example library that does this: https://github.com/paulfitz/daff

z3z1ma · 2024-05-21T20:31:50Z

Oh yeah daff looks sick actually @sungchun12 -- it seems absolutely perfect to be honest 👀 🤔

z3z1ma · 2024-05-21T20:32:45Z

git diff/patch is a good way to look at the problem
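
To illustrate the git diff framing, here is a small sketch that renders two row samples as a unified diff with the standard library's `difflib`. It assumes both samples are lists of dicts already sorted by the join key; `rows_as_unified_diff` is a hypothetical helper, not something daff or the CLI provides:

```python
import difflib


def rows_as_unified_diff(source_rows, target_rows, columns):
    """Return unified-diff lines comparing two pre-sorted row samples."""
    def render(rows):
        # One text line per row, restricted to the requested columns.
        return [", ".join(f"{c}={row[c]!r}" for c in columns) for row in rows]

    return list(
        difflib.unified_diff(
            render(source_rows),
            render(target_rows),
            fromfile="source",
            tofile="target",
            lineterm="",
        )
    )


if __name__ == "__main__":
    src = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    tgt = [{"id": 1, "name": "a"}, {"id": 2, "name": "c"}]
    print("\n".join(rows_as_unified_diff(src, tgt, ["id", "name"])))
```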

FurcyPin · 2024-05-30T14:48:37Z

Hello there,

I made a data-diff tool for pyspark, and in the process I also made a generic library to create interactive html reports. They are both open source, and the data-diff-viewer does not need Spark (only duckdb-wasm to embed the diff report inside the html). I also started making a similar data-diff based on ibis instead of PySpark, but it's not ready yet.

I would be happy to discuss this if you want.

sungchun12 · 2024-06-04T17:52:45Z

@izeigerman you okay with me taking this on?

I have lessons learned fresh in my mind from working on data-diff before it was sunset that I want to use up before those memories fade.

I have a couple of improvements worth considering (demo: https://www.loom.com/share/b2a421a011854545aafe9f6186f163fc), such as displaying row counts for:

unchanged
removed
different
added

on top of the work you did here: #2644
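
As a sketch of how those four row counts could be derived, assuming each side of the diff has been loaded into a dict keyed by the join key. The `classify_rows` name and the in-memory representation are assumptions for illustration; a real implementation would more likely compute this in SQL:

```python
def classify_rows(source, target):
    """Count rows as unchanged, removed, different, or added.

    `source` and `target` map a join key to that row's values.
    """
    counts = {"unchanged": 0, "removed": 0, "different": 0, "added": 0}
    for key in source.keys() | target.keys():
        if key not in target:
            counts["removed"] += 1    # only in source
        elif key not in source:
            counts["added"] += 1      # only in target
        elif source[key] == target[key]:
            counts["unchanged"] += 1  # same key, same values
        else:
            counts["different"] += 1  # same key, values differ
    return counts


if __name__ == "__main__":
    src = {1: ("a",), 2: ("b",), 3: ("c",)}
    tgt = {2: ("b",), 3: ("x",), 4: ("d",)}
    print(classify_rows(src, tgt))
```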

schlich · 2024-06-06T01:44:11Z

Just adding to the conversation, but it'd be awesome if the CLI tool could "incrementally" perform the diff with greater and greater degrees of strictness. It would make this TDD curmudgeon very very happy.

For everything Pandas gets wrong, this is one thing they kinda get right, although their error messages are a bit too uninformative at certain steps.

I think my ideal workflow/priority order would go something like column names -> column types -> row count -> primary ID match -> column value match
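
A rough sketch of that priority order, assuming schemas are dicts of column name to type and rows are lists of dicts. Every name here (`staged_diff`, the stage labels, the inputs) is hypothetical and only meant to make the ordering concrete:

```python
def staged_diff(source_schema, target_schema, source_rows, target_rows, key):
    """Run checks from loosest to strictest; return (stage, passed) pairs."""
    results = []

    # 1. Column names
    results.append(("column names", set(source_schema) == set(target_schema)))

    # 2. Column types, compared on the columns both sides share
    shared = set(source_schema) & set(target_schema)
    results.append((
        "column types",
        all(source_schema[c] == target_schema[c] for c in shared),
    ))

    # 3. Row count
    results.append(("row count", len(source_rows) == len(target_rows)))

    # 4. Primary ID match
    s_by_key = {row[key]: row for row in source_rows}
    t_by_key = {row[key]: row for row in target_rows}
    results.append(("primary ID match", s_by_key.keys() == t_by_key.keys()))

    # 5. Column value match, on shared keys and shared columns
    common = s_by_key.keys() & t_by_key.keys()
    results.append((
        "column value match",
        all(s_by_key[k].get(c) == t_by_key[k].get(c)
            for k in common for c in shared),
    ))
    return results


if __name__ == "__main__":
    schema = {"id": "int", "name": "text"}
    src = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    tgt = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}]
    for stage, passed in staged_diff(schema, schema, src, tgt, "id"):
        print(f"{stage}: {'ok' if passed else 'MISMATCH'}")
```

This version always runs every stage and reports each result, leaving it to the caller to decide what to do with a mismatch.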

sungchun12 · 2024-06-06T17:12:02Z

@schlich When you say "incrementally", do you want data diffing to have more surgical options, such as displaying only column name changes and stopping there if it fails some kind of criteria, OR are you suggesting an execution/display order?

I believe you mean execution/display order, but let me know otherwise!

schlich · 2024-06-06T23:56:05Z

well, a little bit of both maybe? i'm also referencing pytest's -x flag that stops at one failure if you have many. But it's also kind of just a natural progression of "accuracy" as your transformations develop

sungchun12 · 2024-06-07T00:13:57Z

@schlich I disagree with stopping at "failure" because that's an opinion. There are situations where many diffs or few diffs can be a good thing. We're aligned on natural progression though. I'll have to think through if we vastly change the format for that display order because I've been playing with it more and learned some of my suggestions are already covered but with different UX.

izeigerman added the "Improvement" (Improves existing functionality) label May 21, 2024

izeigerman mentioned this issue May 21, 2024

Feat: Improve the CLI output of the table diff command #2644

Merged

sungchun12 self-assigned this Jun 6, 2024
