Update knn shapely score computation #1142

elisno · 2024-06-12T22:19:00Z

Summary

The original KNN-Shapley score implementation is unlikely to find examples with net-negative contributions to model training.
This PR revises the raw score computation and applies the x -> 0.5(x + 1) transformation in data_shapley_knn() (the public function),
so that the final score lies in the range [0, 1].

Here's the reference from the relevant paper: https://arxiv.org/pdf/1908.08619
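
To make the score range concrete, here is a minimal sketch of the exact KNN-Shapley recursion for a single test point, as stated in the paper linked above for an unweighted KNN classifier, followed by the x -> 0.5(x + 1) rescaling. The names `knn_shapley_single_test_point` and `to_unit_interval` are hypothetical and this is not the code in cleanlab; it only illustrates how a raw value can be negative and where it lands after the transformation.

```python
import numpy as np


def knn_shapley_single_test_point(sorted_train_labels, test_label, k):
    """Raw KNN-Shapley values of training points for one test point.

    Assumption: `sorted_train_labels` is ordered from the nearest to the
    farthest neighbor of the test point.
    """
    labels = np.asarray(sorted_train_labels)
    n = len(labels)
    match = (labels == test_label).astype(float)

    values = np.zeros(n)
    # Base case: the farthest training point.
    values[n - 1] = match[n - 1] / n
    # Walk from the second-farthest point back to the nearest one.
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-based rank of point i in the distance ordering
        values[i] = values[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank
    return values


def to_unit_interval(raw_scores):
    """Map raw scores from [-1, 1] to [0, 1] via x -> 0.5 * (x + 1)."""
    return 0.5 * (np.asarray(raw_scores, dtype=float) + 1.0)


# Three neighbors, nearest first; the middle one disagrees with the test label.
raw = knn_shapley_single_test_point([1, 0, 1], test_label=1, k=1)
scaled = to_unit_interval(raw)
print(raw)     # the middle entry is negative
print(scaled)  # every entry lies in [0, 1]
```

In this toy case the raw values sum to 1, which is the utility of the 1-NN classifier on that test point, and the mislabeled neighbor receives a negative raw value that the rescaling maps below 0.5.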

Testing

This PR adds a new test class with a property-based test asserting that the raw scores lie in the appropriate range.
It also adds several test cases checking that clear label/feature errors are captured by this score:
- Test the scores and issue flags when there's a label error.
- Same for one outlier in feature space.
- How the scores/issue flags behave when there's one exact duplicate (with and without conflicting labels).
- How sorting the dataset by the data valuation score and removing some percentages of the dataset beats random removals, in terms of model accuracy.
  - One parametrized test for a binary dataset.
  - Another parametrized test for a multi-class dataset.
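
The removal comparison in the last bullet can be sketched roughly as follows. This is an illustration only, not the test code in this PR: `drop_lowest_scored` and `drop_random` are hypothetical helpers, and training and evaluating a model on each reduced dataset is left to the caller.

```python
import numpy as np


def drop_lowest_scored(features, labels, scores, fraction):
    """Remove the `fraction` of examples with the lowest data valuation scores."""
    n_drop = int(fraction * len(scores))
    order = np.argsort(scores, kind="stable")  # ascending: lowest scores first
    keep = np.sort(order[n_drop:])             # preserve the original row order
    return features[keep], labels[keep]


def drop_random(features, labels, fraction, seed=0):
    """Baseline: remove the same number of examples uniformly at random."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    n_drop = int(fraction * n)
    keep = np.sort(rng.choice(n, size=n - n_drop, replace=False))
    return features[keep], labels[keep]


# Toy data: five examples, two of which have clearly lower scores.
features = np.arange(10).reshape(5, 2)
labels = np.array([0, 1, 0, 1, 0])
scores = np.array([0.9, 0.1, 0.5, 0.2, 0.8])

kept_features, kept_labels = drop_lowest_scored(features, labels, scores, fraction=0.4)
rand_features, rand_labels = drop_random(features, labels, fraction=0.4)
```

A test along these lines would then fit the same model on both reduced datasets and compare held-out accuracy across several removal fractions.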

Running the data valuation check through Datalab gives this strip plot of the different scores across the classes (it's recommended to view the enlarged image on a wider screen, due to the number of classes not fitting well on the strip plot).

This image was generated by adding the following code snippet to the text tutorial, replacing the cell that calls Datalab.find_issues() (see section 4):

lab = Datalab(data_dict, label_name="labels")
lab.find_issues(features=text_embeddings, issue_types={"data_valuation": {"k": 10}})

# Collect the identified issues
data_valuation_issues = lab.get_issues("data_valuation")

# Display the data valuation issues
display(data_valuation_issues)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Prepare the data for the strip plot: attach the given labels to the issues
plot_data = data_valuation_issues.join(pd.DataFrame({"given_label": data_dict["labels"]}))

# Normalize the data for proportional count plot
plot_data['count'] = plot_data.groupby('given_label')['data_valuation_score'].transform('count')
plot_data['proportion'] = plot_data.groupby('given_label')['data_valuation_score'].transform(lambda x: x / x.count())


plt.figure(figsize=(21, 6))
sns.stripplot(
    data=plot_data,
    x="given_label",
    y="data_valuation_score",
    jitter=3.5,
    hue="given_label",
    dodge=True,
    alpha=0.5,
)
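plt.xticks(rotation=45)
# Number of classes on the x-axis (recomputed here in case the tutorial has not defined it)
num_classes = plot_data["given_label"].nunique()
plt.xlim(-1.0, num_classes)
plt.show()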

Similarly, the visualization in the Misc Datalab workflow tutorial has been turned into a strip plot.

Links to Relevant Issues or Conversations

Resolves update knn shapely score transformation #1127

elisno · 2024-06-12T22:19:18Z

An unrelated test case fails due to a recent fastText release that has not been announced. No information can be found in the repository, as it was archived on Mar 19, 2024.

jwmueller · 2024-06-12T22:32:13Z

Screenshot of the new distribution of scores on a real dataset? Can't really review this without that information.

codecov · 2024-06-18T14:36:37Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.05%. Comparing base (bc3451c) to head (157aaad).
Report is 6 commits behind head on master.

❗ Current head 157aaad differs from pull request most recent head 334b318

Please upload reports for the commit 334b318 to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1142      +/-   ##
==========================================
- Coverage   96.11%   96.05%   -0.07%     
==========================================
  Files          80       80              
  Lines        6100     6076      -24     
  Branches     1079     1003      -76     
==========================================
- Hits         5863     5836      -27     
- Misses        140      142       +2     
- Partials       97       98       +1

elisno · 2024-06-18T17:52:52Z

The current failing link-check is unrelated to this PR.

jwmueller

LGTM thanks for clarifications, sorry for confusion on my part!
Just had one suggested wording change

update knn shapely score transformation

721614b

- Make the _knn_shapley_score function easily testable.
- Adjust hard-coded scores in tests to new transformation.
- Add new test class for property based test that asserts that the raw scores are never negative.

Resolves cleanlab#1127

elisno mentioned this pull request Jun 12, 2024

CI fails due to an unnannounced fastText release #1143

Closed

jwmueller reviewed Jun 12, 2024

tests/test_data_valuation.py

Merge branch 'master' into elisno/issue1127

beb4d7d

elisno changed the title from "Update knn shapely score transformation" to "Update knn shapely score computation" Jun 18, 2024

elisno added 3 commits June 18, 2024 04:44

update knn shapely score computation

c1c90fb

Fixes cleanlab#1127

update numpy dependency to version compatible with 1.22.

4d409ee

This is a temporary change to allow CI to run without installing newly released numpy 2.0.0.

update visualization in workflows.ipynb

c9994f3

elisno commented Jun 18, 2024

docs/source/tutorials/datalab/workflows.ipynb

elisno commented Jun 18, 2024

pyproject.toml

elisno added 2 commits June 18, 2024 05:55

revert docstring changes in data valuation modules.

3ffac29

update parameter range for test

157aaad

elisno requested a review from jwmueller June 18, 2024 15:36

docs: refer to workflows.ipynb in faq.ipynb about pre-computed data slices/clusters

916a986

jwmueller reviewed Jun 19, 2024

cleanlab/data_valuation.py

elisno added 2 commits June 19, 2024 04:09

Revert "docs: refer to workflows.ipynb in faq.ipynb about pre-computed data slices/clusters"

80604c8

This reverts commit 916a986.

update tesxt on visualization in workflows notebook

c46d5c0

elisno requested a review from jwmueller June 19, 2024 05:08

jwmueller reviewed Jun 19, 2024

docs/source/tutorials/datalab/workflows.ipynb (outdated)

jwmueller approved these changes Jun 19, 2024

Update docs/source/tutorials/datalab/workflows.ipynb

334b318

Co-authored-by: Jonas Mueller <[email protected]>

elisno merged commit 18dfb0d into cleanlab:master Jun 19, 2024
19 checks passed
