Update knn shapely score computation #1142

Merged
merged 11 commits into from
Jun 19, 2024

Conversation

elisno
Member

@elisno elisno commented Jun 12, 2024

Summary

The original KNN-Shapley score implementation is unlikely to find examples with net-negative contributions to model training.
This PR revises the raw score computation and applies the x -> 0.5(x + 1) transformation in data_shapley_knn() (the public function),
so that the final score lies in the range [0, 1].
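A minimal sketch of that transformation, assuming the raw scores lie in [-1, 1]. The helper name `to_unit_interval` is hypothetical and only for illustration; it is not part of cleanlab.

```python
def to_unit_interval(raw_scores):
    """Map raw scores from [-1, 1] to [0, 1] via x -> 0.5 * (x + 1).

    Hypothetical helper for illustration only; not cleanlab's API.
    """
    return [0.5 * (x + 1.0) for x in raw_scores]


# A raw score of 0 (no net contribution) lands at 0.5, so
# net-negative contributions end up below 0.5 after the transformation.
transformed = to_unit_interval([-1.0, -0.5, 0.0, 1.0])
print(transformed)
```

Under this assumption, a transformed score below 0.5 corresponds to a negative raw score.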

Here's the reference from the relevant paper: https://arxiv.org/pdf/1908.08619
image
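For context, here is a rough sketch of the per-test-point recursion for unweighted KNN classification, as commonly stated from the referenced paper. It is an illustration under that assumption, not the code in this PR; the function name and the `matches` input format are hypothetical.

```python
def knn_shapley_single_test_point(matches, k):
    """Raw KNN-Shapley values of the training points for ONE test point.

    matches[i] is 1 if the i-th nearest training point (sorted by increasing
    distance to the test point) shares the test point's label, else 0.
    Hypothetical helper; not cleanlab's implementation.
    """
    n = len(matches)
    values = [0.0] * n
    # Base case: the farthest training point.
    values[n - 1] = matches[n - 1] / n
    # Recurse from the second-farthest point down to the nearest one.
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-indexed rank used in the recursion
        values[i] = (
            values[i + 1]
            + (matches[i] - matches[i + 1]) / k * min(k, rank) / rank
        )
    return values


# The second-nearest neighbor has a conflicting label and gets a negative value.
values = knn_shapley_single_test_point([1, 0, 1, 1], k=2)
print(values)
```

In this sketch the values sum to the KNN utility of the full training set for that test point (here 1 of the 2 nearest neighbors matches, so 0.5), and a point with a conflicting label can receive a negative value.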

Testing

  • This PR adds a new test class with a property-based test asserting that the raw scores lie in the expected range.
  • It adds several test cases checking that clear label/feature errors added to a dataset are captured by this score:
    • The scores and issue flags when there's a label error.
    • The same for a single outlier in feature space.
    • The scores and issue flags when there's one exact duplicate (with and without conflicting labels).
    • Whether sorting the dataset by data valuation score and removing a given percentage of it beats random removals in terms of model accuracy.
      • One parametrized test for a binary dataset.
      • Another parametrized test for a multi-class dataset.
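The range check can be sketched as follows. This is a simplified stand-in: it uses random sampling from the standard library rather than a property-based testing library, and it runs against a hypothetical `raw_knn_shapley` helper (a per-test-point recursion assumed from the referenced paper), not against cleanlab's functions.

```python
import random


def raw_knn_shapley(matches, k):
    """Hypothetical raw per-test-point scores; not cleanlab's implementation.

    matches[i] is 1 if the i-th nearest training point has the test label.
    """
    n = len(matches)
    values = [0.0] * n
    values[n - 1] = matches[n - 1] / n
    for i in range(n - 2, -1, -1):
        rank = i + 1
        values[i] = values[i + 1] + (matches[i] - matches[i + 1]) / k * min(k, rank) / rank
    return values


def check_raw_scores_in_range(num_trials=200, seed=0):
    """Randomized check that raw scores stay in [-1, 1] and sum to the utility."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        n = rng.randint(1, 30)
        k = rng.randint(1, n)
        matches = [rng.randint(0, 1) for _ in range(n)]
        values = raw_knn_shapley(matches, k)
        assert all(-1.0 <= v <= 1.0 for v in values)
        # The values should add up to the fraction of matching labels
        # among the k nearest neighbors.
        assert abs(sum(values) - sum(matches[:k]) / k) < 1e-9
    return True


print(check_raw_scores_in_range())
# A mislabeled nearest neighbor (k=1) receives a negative raw score.
print(raw_knn_shapley([0, 1, 1, 1], k=1)[0])
```

The last line mirrors the label-error test case in spirit: the point whose label conflicts with its neighborhood is the one with the lowest score.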

Running the data valuation check through Datalab gives this strip plot of the different scores across the classes (it's recommended to view the enlarged image on a wider screen, due to the number of classes not fitting well on the strip plot).

image

This image was generated by replacing the cell that calls Datalab.find_issues() in the text tutorial (see section 4) with the following code snippet:

lab = Datalab(data_dict, label_name="labels")
lab.find_issues(features=text_embeddings, issue_types={"data_valuation": {"k": 10}})

# Collect the identified issues
data_valuation_issues = lab.get_issues("data_valuation")

# Display the data valuation issues
display(data_valuation_issues)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Join the scores with the given labels for the strip plot
plot_data = data_valuation_issues.join(pd.DataFrame({"given_label": data_dict["labels"]}))

# Normalize the data for proportional count plot
plot_data['count'] = plot_data.groupby('given_label')['data_valuation_score'].transform('count')
plot_data['proportion'] = plot_data.groupby('given_label')['data_valuation_score'].transform(lambda x: x / x.count())


plt.figure(figsize=(21, 6))
sns.stripplot(
    data=plot_data,
    x="given_label",
    y="data_valuation_score",
    jitter=3.5,
    hue="given_label",
    dodge=True,
    alpha=0.5,
)
plt.xticks(rotation=45)
plt.xlim(-1.0, plot_data["given_label"].nunique())  # upper limit: number of classes
plt.show()

Similarly, the visualization in the Misc Datalab workflow tutorial has been turned into a strip plot.

image

Links to Relevant Issues or Conversations

- Make the _knn_shapley_score function easily testable.
- Adjust hard-coded scores in tests to new transformation.
- Add new test class for property based test that asserts that the raw scores are never negative.

Resolves cleanlab#1127
@elisno
Member Author

elisno commented Jun 12, 2024

An unrelated test case fails due to a recent fastText release, which has yet to be announced. No information can be found in the repository, as it was archived on Mar 19, 2024.

@jwmueller
Member

Screenshot of the new distribution of scores on a real dataset? Can't really review this without that information.

@elisno elisno changed the title Update knn shapely score transformation Update knn shapely score computation Jun 18, 2024

codecov bot commented Jun 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.05%. Comparing base (bc3451c) to head (157aaad).
Report is 6 commits behind head on master.

Current head 157aaad differs from pull request most recent head 334b318

Please upload reports for the commit 334b318 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1142      +/-   ##
==========================================
- Coverage   96.11%   96.05%   -0.07%     
==========================================
  Files          80       80              
  Lines        6100     6076      -24     
  Branches     1079     1003      -76     
==========================================
- Hits         5863     5836      -27     
- Misses        140      142       +2     
- Partials       97       98       +1     


@elisno elisno requested a review from jwmueller June 18, 2024 15:36
@elisno
Member Author

elisno commented Jun 18, 2024

The current failing link-check is unrelated to this PR.

@elisno elisno requested a review from jwmueller June 19, 2024 05:08
Member

@jwmueller jwmueller left a comment

LGTM thanks for clarifications, sorry for confusion on my part!
Just had one suggested wording change

@elisno elisno merged commit 18dfb0d into cleanlab:master Jun 19, 2024
19 checks passed