
adding an spurious_correlation as new issue type #872

Open

wants to merge 26 commits into base: master

Changes from 1 commit

Commits (26)
d9773fc  adding an spurious_correlation as new issue type (01PrathamS, Oct 13, 2023)
f6af4d4  new issue type spurious_correlation changes added (01PrathamS, Oct 17, 2023)
8312f7e  make necessary changes to spurious_correlation function and add unit … (01PrathamS, Oct 25, 2023)
48bf33c  Revert "make necessary changes to spurious_correlation function and a… (01PrathamS, Oct 25, 2023)
178c079  Merge branch 'cleanlab:master' into spurious_correlations (01PrathamS, Oct 25, 2023)
3a7a3fb  apply black formatter (elisno, Oct 31, 2023)
2660c33  define helper function that defines the score (elisno, Oct 31, 2023)
33444fd  Turn SpuriousCorrelations class into a dataclass (elisno, Oct 31, 2023)
4f3d9fa  Add a test file for spurious correlation functionality (elisno, Oct 31, 2023)
fcabb2c  update unit tests for calculate_correlations method (elisno, Nov 1, 2023)
81d4fcf  add assertion that the properties of interests should be in the dataf… (elisno, Nov 1, 2023)
ad0374a  add docstrings (elisno, Nov 1, 2023)
3907bb9  add docs page for internal class for handling spurious correlations c… (elisno, Nov 1, 2023)
b3b06c0  add a working Datalab example (elisno, Nov 1, 2023)
0fb83b1  refactor Datalab._spurious_correlations (elisno, Nov 1, 2023)
5d9698b  apply black formatter (elisno, Nov 1, 2023)
a185ecc  rename score column name (elisno, Nov 1, 2023)
e6cf509  remove line that suppresses warnings (elisno, Nov 1, 2023)
0ea9c9a  cast feature array to numpy array, add more type-hints (elisno, Nov 1, 2023)
861aa93  add type-hints for return values (elisno, Nov 1, 2023)
fc42d7e  format docstring (elisno, Nov 1, 2023)
fb5aa35  adjust condition for checking property 2 of the scoring function (elisno, Nov 1, 2023)
f35b3c1  Merge branch 'cleanlab:master' into spurious_correlations (01PrathamS, Nov 12, 2023)
2ae6b7b  fix baseline accuracy (elisno, Nov 13, 2023)
409c5b7  update expected scores based on fixed baseline accuracy (elisno, Nov 13, 2023)
1ef0397  hello (01PrathamS, Dec 3, 2023)
cleanlab/datalab/datalab.py (13 additions, 0 deletions)

@@ -303,6 +303,19 @@
f"\nAudit complete. {self.data_issues.issue_summary['num_issues'].sum()} issues found in the dataset."
)

def _spurious_correlations(self) -> pd.DataFrame:
jwmueller (Member):

Please include an end-to-end unit test of this function. You should actually create a toy dataset that suffers from a spurious correlation (say, 10 tiny images at varying levels of darkness, with the label related to how dark they are), and then verify that this code detects this spurious correlation. Likewise, the same unit test should verify that the other spurious correlation scores (those unrelated to dark/light) do NOT give low scores for this same dataset.

For now you can just add the new unit test at the bottom of this file:
https://github.com/cleanlab/cleanlab/blob/master/tests/datalab/test_datalab.py

01PrathamS (Contributor, Author):

Thank you for the suggestion to include an end-to-end unit test, @jwmueller. I'd like to make sure I create a comprehensive test that effectively verifies the detection of spurious correlations. However, I'm not entirely sure how to set up such a test, especially with a toy dataset. Could you please provide an example test or point me to any resources that might help me create this unit test?

jwmueller (Member):

You can generally follow the structure of any of the existing unit tests. I wouldn't worry too much about the precise code structure you use; we can help you refactor the code properly. Instead, I would focus on ensuring the test runs quickly (the toy dataset is small enough) while still testing the key logic, namely that this code is actually able to detect an image property that is highly correlated with the labels, and that it does not return false positives for image properties that have no relationship with the labels.

An example you could follow is: test_find_issues_with_pred_probs

Just change the dataset being used, add a final line lab._spurious_correlations() near the end of the test, and then check its results.
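For illustration, a rough sketch of such a test is below. It exercises the new scoring method directly on a hand-built issues DataFrame rather than a full Datalab run; the SimpleNamespace stand-in for the Datalab object, the toy numbers, and the thresholds are assumptions for this sketch only (an end-to-end version would instead build a small Datalab on toy images and call lab._spurious_correlations() as described above).

    import numpy as np
    import pandas as pd
    from types import SimpleNamespace

    # Assumes this PR's module imports resolve (see the review notes on this file below).
    from cleanlab.datalab.internal.spurious_correlations import SpuriousCorrelations

    def test_detects_spurious_correlation_with_darkness():
        # 10 toy examples: dark_score tracks the label almost perfectly (a spurious
        # correlation), while light_score is unrelated noise.
        labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
        dark_score = np.array([0.90, 0.85, 0.80, 0.95, 0.88, 0.10, 0.15, 0.20, 0.05, 0.12])
        light_score = np.array([0.50, 0.10, 0.90, 0.30, 0.70, 0.20, 0.80, 0.40, 0.60, 0.55])
        issues = pd.DataFrame({"dark_score": dark_score, "light_score": light_score})

        # Stand-in exposing only the attributes SpuriousCorrelations reads.
        fake_lab = SimpleNamespace(issues=issues, labels=labels)
        correlations = SpuriousCorrelations(fake_lab)

        # Baseline accuracy: fraction of the majority class (0.5 here).
        baseline_accuracy = np.bincount(labels).max() / len(labels)
        dark = correlations.calculate_spurious_correlation("dark_score", baseline_accuracy)
        light = correlations.calculate_spurious_correlation("light_score", baseline_accuracy)

        assert dark < 0.5    # the correlated property should get a low score
        assert light > dark  # the unrelated property should not be flagged as strongly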

01PrathamS (Contributor, Author):

Thank you for the guidance, @jwmueller. I appreciate your clear explanation of what the unit test should achieve. I understand the high-level structure and the need to ensure it runs efficiently with a small toy dataset. However, I'm currently facing a bit of a roadblock when it comes to translating this into code. I looked at the example code and the other tests as well, but couldn't figure out how to get it done in code.

Member:

Which part is confusing to code specifically?

We can provide you some skeleton code or further pointers for that part, if you can write out your remaining specific questions.

01PrathamS (Contributor, Author):

I created this dataset:

    light_score = [0.11, 0.43, 0.96, 0.28, 0.23, 0.21, 0.63, 0.40, 0.19, 0.93]
    dark_score = [0.98, 0.57, 0.28, 0.97, 0.91, 0.95, 0.57, 0.60, 0.87, 0.34]
    label = [0, 1, 2, 0, 0, 0, 1, 1, 0, 2]
    issues = pd.DataFrame({'dark_score': dark_score,
                           'light_score': light_score,
                           'labels': label})

    issue_summary = pd.DataFrame({'issue_type': ['dark', 'light'],
                                  'num_issues': [10, 0]})

and it gives me this result:

      image_property  label_prediction_error
    0           dark                      0.3
    1          light                      0.3

But when I tested it on the MNIST dataset, using https://docs.cleanlab.ai/master/tutorials/image.html as a reference, it gives this output:

      image_property  label_prediction_error
    0          outlier                0.836867
    1   near_duplicate                0.843817
    2  low_information                0.743633
    3             dark                0.855317

I made a mistake by accidentally deleting the 'spurious_correlations' branch from my local machine. To rectify this, I have created a new branch named 'spurious_correlations_' and submitted a new pull request. I apologize for any inconvenience; this is my first time doing this, and I will be more careful in the future.

Member:

I'd prefer not to work in a new PR, given I have left a lot of feedback on this one.

You should be able to get the branch back on your local machine by doing:

git checkout --track origin/spurious_correlations
git pull

(with git here pointed at your own fork). It would be good practice for you to get the branch back on your local machine and resume work on the original PR if you can.

"""
Use this after finding issues to which examples suffer from which types of issues.

NOTE:

Returns:
--------
A DataFrame where each row corresponds to image_property ('dark', 'grayscale')
and overall datascore for that image_property
"""
return self.SpuriousCorrelations.spurious_correlation

def report(
self,
*,
cleanlab/datalab/internal/spurious_correlations.py (37 additions, 0 deletions)

@@ -0,0 +1,37 @@
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB
from statistics import mode
import warnings
warnings.filterwarnings('ignore')

Member:

Delete this line; we should not be suppressing warnings in our codebase.

If there are warnings being printed, please explain why


from datalab import DataLab

class SpuriousCorrelations:

jwmueller (Member), Oct 13, 2023:

Let's define this class in a different (new) file.

My suggestion is:

cleanlab/datalab/internal/spurious_correlation.py


def __init__(self, data: DataLab) -> None:
self.data = data
self.issues = data.issues
self.labels = data.labels

def spurious_correlations(self) -> pd.DataFrame:
baseline_accuracy = np.bincount(self.labels).argmax() / len(self.labels)
image_properties = ["near_duplicate_score", "blurry_score", "light_score", "low_information_score", "dark_score", "grayscale_score", "odd_aspect_ratio_score", "odd_size_score"]
Member:

Before you loop over these, you should restrict to only the subset of these that is present in self.issues.

In some cases, not all of these properties will have been previously computed, in which case we should just not compute the spurious correlation for those properties that were not computed already
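A minimal sketch of that filtering, assuming self.issues is the pandas DataFrame of previously computed issue columns (illustrative, not part of this diff):

    # Keep only the property scores that Datalab actually computed for this dataset.
    image_properties = [p for p in image_properties if p in self.issues.columns]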

property_scores = {}

for property_of_interest in image_properties:
S = self.calculate_spurious_correlation(property_of_interest, baseline_accuracy)
property_scores[f'{property_of_interest}'] = S
data_score = pd.DataFrame(list(property_scores.items()), columns=['image_property', 'Overall_score'])
Member:

Suggested change:
- data_score = pd.DataFrame(list(property_scores.items()), columns=['image_property', 'Overall_score'])
+ data_score = pd.DataFrame(list(property_scores.items()), columns=['image_property', 'label_prediction_error'])

return data_score

def calculate_spurious_correlation(self, property_of_interest, baseline_accuracy):
jwmueller (Member), Oct 13, 2023:

please add mypy typing information for all arguments, as well as the return type

Member:

You'll need to get the type check that runs in our CI:
https://github.com/cleanlab/cleanlab/actions/runs/6512420947/job/17690017396?pr=872

to pass eventually, by adding the appropriate typing information everywhere

X = self.issues[property_of_interest].values.reshape(-1, 1)
y = self.labels
classifier = GaussianNB()
cv_accuracies = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
Member:

let's make cv_folds = 5 an optional argument of calculate_spurious_correlation() and set cv = cv_folds here.
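A sketch of that change in the same suggested-change style, also folding in the type hints requested above (illustrative, not the merged code):

    - def calculate_spurious_correlation(self, property_of_interest, baseline_accuracy):
    + def calculate_spurious_correlation(self, property_of_interest: str, baseline_accuracy: float, cv_folds: int = 5) -> float:
      ...
    - cv_accuracies = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
    + cv_accuracies = cross_val_score(classifier, X, y, cv=cv_folds, scoring='accuracy')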

mean_accuracy = np.mean(cv_accuracies)
eps = 1e-8
S = min(1, (1 - mean_accuracy) / (1 - baseline_accuracy + eps))
return S
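To make the scoring concrete, here is a small worked example of this formula (the numbers are made up for illustration):

    # Suppose the majority class covers 40% of the labels (baseline) and the
    # cross-validated classifier predicting labels from this one property reaches 90% accuracy.
    baseline_accuracy = 0.4
    mean_accuracy = 0.9
    eps = 1e-8
    S = min(1, (1 - mean_accuracy) / (1 - baseline_accuracy + eps))
    # S ≈ 0.167: a low score flags that this property alone predicts the labels far
    # better than the baseline, i.e. a likely spurious correlation. An uninformative
    # property (mean_accuracy ≈ baseline_accuracy) would score close to 1.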
