Improve the table diff sample output in CLI #2645

Open
izeigerman opened this issue May 21, 2024 · 10 comments
Labels
Improvement Improves existing functionality

Comments

@izeigerman
Member

Currently the printed output is too wide and unreadable even for tables with a handful of columns, since we include columns from both the source and the target table as part of the same row.

Here are some ideas on how to make the output more digestible:

  • Print individual column pairs. This bounds the width to the two columns being compared plus the join keys.
  • Let the user select which columns should be included in the sample.
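A rough sketch of the first idea, in plain Python so it stays independent of any real implementation. It assumes the joined sample is a list of dicts where source and target values carry the prefixes `s__` and `t__`; those prefixes, the function name `column_pair_samples`, and the sample data are all made up for illustration.

```python
from typing import Any, Dict, Iterator, List, Sequence, Tuple

Row = Dict[str, Any]


def column_pair_samples(
    rows: Sequence[Row],
    join_keys: Sequence[str],
    columns: Sequence[str],
    source_prefix: str = "s__",  # hypothetical naming convention
    target_prefix: str = "t__",  # hypothetical naming convention
) -> Iterator[Tuple[str, List[Row]]]:
    """Yield (column, rows) where each row holds only the join keys
    plus the source/target pair for that one column."""
    for column in columns:
        s_col, t_col = source_prefix + column, target_prefix + column
        narrow = [
            {**{k: row[k] for k in join_keys}, s_col: row[s_col], t_col: row[t_col]}
            for row in rows
            if row[s_col] != row[t_col]  # keep only rows where this column differs
        ]
        if narrow:  # skip columns with no mismatches in the sample
            yield column, narrow


# Made-up joined sample: one wide row per join key.
joined_sample = [
    {"id": 1, "s__price": 10, "t__price": 12, "s__name": "a", "t__name": "a"},
    {"id": 2, "s__price": 5, "t__price": 5, "s__name": "b", "t__name": "c"},
]

for column, narrow_rows in column_pair_samples(joined_sample, ["id"], ["price", "name"]):
    print(f"-- {column} --")
    for row in narrow_rows:
        print(row)
```

Each printed block is at most `len(join_keys) + 2` columns wide, regardless of how many columns the tables have.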
@izeigerman izeigerman added the Improvement Improves existing functionality label May 21, 2024
@z3z1ma
Contributor

z3z1ma commented May 21, 2024

Yes 100% @izeigerman

Show sample is useless beyond a certain width. Even if you pipe it to a file, the terminal-width-based wrapping will still happen. Either have column selection with a sequence of glob patterns using fnmatch, or alternatively consider a --tall flag for the sample which does your first option (one column pair at a time).
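For the glob-pattern selection, something along these lines might be enough. This is only a sketch: `select_columns` and the column names are hypothetical, and it uses `fnmatch.fnmatchcase` on the assumption that matching should be case-sensitive on every platform.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence


def select_columns(columns: Sequence[str], patterns: Sequence[str]) -> List[str]:
    """Keep columns matching at least one glob pattern, in their original order.

    An empty pattern list means "no filtering".
    """
    if not patterns:
        return list(columns)
    return [c for c in columns if any(fnmatchcase(c, p) for p in patterns)]


# Made-up column names for illustration.
all_columns = ["id", "customer_id", "order_total", "order_date", "notes"]
print(select_columns(all_columns, ["order_*", "*_id"]))
# → ['customer_id', 'order_total', 'order_date']
```

Note that `id` is not selected by `*_id` because the pattern requires a literal underscore; join keys would probably need to be included unconditionally.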

Also, the current console printer wrapping is annoying in that there is no workaround. Maybe a --plain flag would be useful for dumping to a file; it could be implemented by calling print directly.

@sungchun12
Contributor

sungchun12 commented May 21, 2024

Is there a reason we don't show data diffs side by side like a git diff? We do it in the UI but not for the CLI. I imagine a lot of people want something similar in the CLI.

Example library that does this: https://github.com/paulfitz/daff
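To make the git-diff idea concrete, here is a minimal sketch that renders one line per added, removed, or changed row. It is not daff's actual output format and not an existing CLI feature; the `+`, `-`, `~` markers, the `old->new` cell notation, and the name `render_row_diff` are assumptions for illustration only.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def render_row_diff(
    source: Sequence[Row], target: Sequence[Row], key: str, columns: Sequence[str]
) -> List[str]:
    """Render a patch-like view: '-' removed, '+' added, '~' changed rows."""
    source_by_key = {row[key]: row for row in source}
    target_by_key = {row[key]: row for row in target}
    lines: List[str] = []
    # Assumes key values are mutually comparable so they can be sorted.
    for k in sorted(source_by_key.keys() | target_by_key.keys()):
        s_row, t_row = source_by_key.get(k), target_by_key.get(k)
        if t_row is None:
            marker, cells = "-", [str(s_row[c]) for c in columns]
        elif s_row is None:
            marker, cells = "+", [str(t_row[c]) for c in columns]
        elif any(s_row[c] != t_row[c] for c in columns):
            marker = "~"
            cells = [
                str(s_row[c]) if s_row[c] == t_row[c] else f"{s_row[c]}->{t_row[c]}"
                for c in columns
            ]
        else:
            continue  # unchanged rows are omitted, as in a patch
        lines.append(f"{marker} {k} | " + " | ".join(cells))
    return lines


# Made-up tables for illustration.
source_rows = [
    {"id": 1, "name": "a", "price": 10},
    {"id": 2, "name": "b", "price": 5},
    {"id": 3, "name": "c", "price": 7},
]
target_rows = [
    {"id": 1, "name": "a", "price": 12},
    {"id": 3, "name": "c", "price": 7},
    {"id": 4, "name": "d", "price": 1},
]

for line in render_row_diff(source_rows, target_rows, "id", ["name", "price"]):
    print(line)
# ~ 1 | a | 10->12
# - 2 | b | 5
# + 4 | d | 1
```

Because each changed cell is shown once as `old->new`, the output is about as wide as one table rather than two.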


@z3z1ma
Contributor

z3z1ma commented May 21, 2024

Oh yeah daff looks sick actually @sungchun12 -- it seems absolutely perfect to be honest 👀 🤔

@z3z1ma
Contributor

z3z1ma commented May 21, 2024

git diff/patch is a good way to look at the problem

@FurcyPin

FurcyPin commented May 30, 2024

Hello there,

I made a data-diff tool for PySpark, and in the process I also made a generic library to create interactive HTML reports. They are both open source, and the data-diff-viewer does not need Spark (only duckdb-wasm to embed the diff report inside the HTML). I also started making a similar data-diff based on ibis instead of PySpark, but it's not ready yet.

I would be happy to discuss this if you want.

@sungchun12
Contributor

@izeigerman you okay with me taking this on?

I have lessons learned fresh on my mind from working on data-diff before it was sunset that I want to use up before those memories fade.

I have a couple of improvements that are worth considering, such as displaying row counts for:

  • unchanged
  • removed
  • different
  • added

on top of the work you did here: #2644

demo: https://www.loom.com/share/b2a421a011854545aafe9f6186f163fc
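The four row-count buckets above could be computed roughly like this. It is a sketch in plain Python over lists of dicts, not how the tool does it today; in practice this would likely be a SQL aggregation over the joined tables. `count_row_statuses` and the sample data are hypothetical.

```python
from typing import Any, Dict, Sequence

Row = Dict[str, Any]


def count_row_statuses(
    source: Sequence[Row], target: Sequence[Row], key: str
) -> Dict[str, int]:
    """Classify rows by key into unchanged / removed / different / added."""
    source_by_key = {row[key]: row for row in source}
    target_by_key = {row[key]: row for row in target}
    counts = {"unchanged": 0, "removed": 0, "different": 0, "added": 0}
    for k, s_row in source_by_key.items():
        t_row = target_by_key.get(k)
        if t_row is None:
            counts["removed"] += 1  # key exists only in the source
        elif s_row == t_row:
            counts["unchanged"] += 1  # same key, identical values
        else:
            counts["different"] += 1  # same key, at least one value differs
    # Keys that exist only in the target.
    counts["added"] = sum(1 for k in target_by_key if k not in source_by_key)
    return counts


# Made-up tables for illustration.
source_rows = [
    {"id": 1, "price": 10},
    {"id": 2, "price": 5},
    {"id": 3, "price": 7},
]
target_rows = [
    {"id": 1, "price": 12},
    {"id": 3, "price": 7},
    {"id": 4, "price": 1},
    {"id": 5, "price": 2},
]

print(count_row_statuses(source_rows, target_rows, "id"))
# → {'unchanged': 1, 'removed': 1, 'different': 1, 'added': 2}
```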


@schlich
Contributor

schlich commented Jun 6, 2024

Just adding to the conversation, but it'd be awesome if the CLI tool could "incrementally" perform the diff with greater and greater degrees of strictness. It would make this TDD curmudgeon very very happy.

For everything Pandas gets wrong, this is one thing they kinda get right, although their error messages are a bit too uninformative at certain steps.

I think my ideal workflow/priority order would go something like column names -> column types -> row count -> primary ID match -> column value match
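As a sketch of that ordering, the checks could run from loosest to strictest and report each stage separately. Everything here is hypothetical (the `staged_diff` name, schemas as column-to-type dicts, rows as dicts); it only illustrates the progression, not an existing feature.

```python
from typing import Any, Dict, List, Sequence, Tuple

Schema = Dict[str, str]  # column name -> type name
Row = Dict[str, Any]


def staged_diff(
    source_schema: Schema,
    target_schema: Schema,
    source_rows: Sequence[Row],
    target_rows: Sequence[Row],
    key: str,
) -> List[Tuple[str, bool]]:
    """Return (stage, passed) pairs ordered from loosest to strictest check."""
    shared_cols = [c for c in source_schema if c in target_schema]
    source_by_key = {row[key]: row for row in source_rows}
    target_by_key = {row[key]: row for row in target_rows}
    shared_keys = source_by_key.keys() & target_by_key.keys()
    return [
        ("column names", set(source_schema) == set(target_schema)),
        ("column types", all(source_schema[c] == target_schema[c] for c in shared_cols)),
        ("row count", len(source_rows) == len(target_rows)),
        ("primary ID match", source_by_key.keys() == target_by_key.keys()),
        (
            "column value match",
            all(
                source_by_key[k].get(c) == target_by_key[k].get(c)
                for k in shared_keys
                for c in shared_cols
            ),
        ),
    ]


# Made-up schemas and rows for illustration.
source_schema = {"id": "int", "price": "int", "name": "text"}
target_schema = {"id": "int", "price": "float", "name": "text"}
source_rows = [{"id": 1, "price": 10, "name": "a"}, {"id": 2, "price": 5, "name": "b"}]
target_rows = [{"id": 1, "price": 10.0, "name": "a"}, {"id": 2, "price": 6.0, "name": "b"}]

for stage, passed in staged_diff(source_schema, target_schema, source_rows, target_rows, "id"):
    print(f"{stage}: {'ok' if passed else 'DIFF'}")
```

Reporting every stage rather than raising on the first mismatch leaves it to the caller to decide whether a difference counts as a failure.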

@sungchun12 sungchun12 self-assigned this Jun 6, 2024
@sungchun12
Contributor

sungchun12 commented Jun 6, 2024

@schlich When you say "incrementally", do you want data diffing to have more surgical options, for example only displaying column name changes and stopping there if it fails some kind of criteria, OR are you suggesting an execution/display order?

I believe you mean execution/display order, but let me know otherwise!

@schlich
Contributor

schlich commented Jun 6, 2024

well, a little bit of both maybe? i'm also referencing pytest's -x flag that stops at the first failure even if you have many. But it's also kind of just a natural progression of "accuracy" as your transformations develop

@sungchun12
Contributor

@schlich I disagree with stopping at "failure" because what counts as a failure is an opinion. There are situations where many diffs or few diffs can be a good thing. We're aligned on natural progression though. I'll have to think through whether we vastly change the format for that display order, because I've been playing with it more and learned that some of my suggestions are already covered but with a different UX.
