Update knn shapely score computation #1142
- Make the _knn_shapley_score function easily testable.
- Adjust hard-coded scores in tests to new transformation.
- Add new test class for property based test that asserts that the raw scores are never negative.
Resolves cleanlab#1127
An unrelated test case fails due to a recent fastText release, which has yet to be announced. No information can be found in the repository, as it was archived on Mar 19, 2024.
Screenshot of the new distribution of scores on a real dataset? Can't really review this without that information.
This is a temporary change to allow CI to run without installing newly released numpy 2.0.0.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1142      +/-   ##
==========================================
- Coverage   96.11%   96.05%   -0.07%
==========================================
  Files          80       80
  Lines        6100     6076      -24
  Branches     1079     1003      -76
==========================================
- Hits         5863     5836      -27
- Misses        140      142       +2
- Partials       97       98       +1
The current failing link-check is unrelated to this PR.
…d data slices/clusters" This reverts commit 916a986.
LGTM thanks for clarifications, sorry for confusion on my part!
Just had one suggested wording change
Co-authored-by: Jonas Mueller <[email protected]>
Summary
The original KNN-Shapley score implementation is unlikely to find examples with net-negative contributions to model training.
This PR revisits the implementation of the raw score computation, but places the x -> 0.5(x + 1) transformation in data_shapley_knn() (the public function), so that the final score lies in the range [0, 1].
Here's the reference from the relevant paper: https://arxiv.org/pdf/1908.08619
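To make the split between raw scores and the final transformation concrete, here is a minimal sketch, assuming the single-test-point recursion for unweighted KNN classifiers as given in the referenced paper. The names knn_shapley_raw and to_unit_interval are hypothetical and are not the functions in this PR; the real implementation may differ in details such as averaging over many test points.

```python
def knn_shapley_raw(sorted_labels, y_test, k):
    """Raw KNN-Shapley values for ONE test point (hypothetical helper).

    sorted_labels: training labels ordered from nearest to farthest
                   neighbor of the test point.
    y_test:        label of the test point.
    k:             number of neighbors used by the KNN classifier.
    """
    n = len(sorted_labels)
    # 1.0 where the training label agrees with the test label, else 0.0.
    match = [1.0 if y == y_test else 0.0 for y in sorted_labels]

    scores = [0.0] * n
    # Base case: the farthest training point.
    scores[-1] = match[-1] / n
    # Recurse from the second-farthest point towards the nearest one.
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-indexed position in the sorted order
        scores[i] = scores[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank
    return scores


def to_unit_interval(raw_scores):
    """Apply x -> 0.5 * (x + 1); maps raw values in [-1, 1] into [0, 1]."""
    return [0.5 * (x + 1.0) for x in raw_scores]


if __name__ == "__main__":
    # Toy example (made-up labels): the middle neighbor disagrees with the test label.
    raw = knn_shapley_raw(sorted_labels=[0, 1, 0], y_test=0, k=1)
    print(raw)
    print(to_unit_interval(raw))
```

In this toy example the middle point, whose label disagrees with the test label, receives a negative raw value. Applying the transformation only at the end keeps that sign information available in the raw scores while the returned score still lies in [0, 1].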
Testing
Running the data valuation check through Datalab gives this strip plot of the different scores across the classes (it's recommended to view the enlarged image on a wider screen, due to the number of classes not fitting well on the strip plot).
This image was generated by replacing the cell that calls Datalab.find_issues() in the text tutorial (see section 4) with a new code snippet.
Similarly, the visualization in the Misc Datalab workflow tutorial has been turned into a strip plot.
Links to Relevant Issues or Conversations