adding a spurious_correlation as a new issue type #872
base: master
@@ -0,0 +1,37 @@

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB
from statistics import mode
import warnings
warnings.filterwarnings('ignore')
```
**Review comment (on `warnings.filterwarnings('ignore')`):** Delete this line; we should not be suppressing warnings in our codebase. If there are warnings being printed, please explain why.
```python
from datalab import DataLab


class SpuriousCorrelations:
```
**Review comment (on `class SpuriousCorrelations`):** Let's define this class in a different (new) file. My suggestion is `cleanlab/datalab/internal/spurious_correlation.py`.
```python
    def __init__(self, data: DataLab) -> None:
        self.data = data
        self.issues = data.issues
        self.labels = data.labels

    def spurious_correlations(self) -> pd.DataFrame:
        # Baseline: accuracy of always predicting the majority class.
        # Note this needs .max() (the majority-class count), not .argmax()
        # (the majority-class index), to yield a frequency.
        baseline_accuracy = np.bincount(self.labels).max() / len(self.labels)
        image_properties = [
            "near_duplicate_score", "blurry_score", "light_score",
            "low_information_score", "dark_score", "grayscale_score",
            "odd_aspect_ratio_score", "odd_size_score",
        ]
```
**Review comment (on the `image_properties` list):** Before you loop over these, you should restrict to only the subset that is actually present among the precomputed issue scores. In some cases, not all of these properties will have been previously computed, in which case we should simply not compute the spurious correlation for those properties (see the sketch below).
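A minimal sketch of that filtering (an editor's illustration, not code from this PR), assuming the property scores live as columns of `self.issues`:

```python
# Keep only the properties whose score columns were actually computed,
# rather than failing with a KeyError on a missing column.
available_properties = [
    prop for prop in image_properties if prop in self.issues.columns
]
```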
```python
        property_scores = {}
        for property_of_interest in image_properties:
            S = self.calculate_spurious_correlation(property_of_interest, baseline_accuracy)
            property_scores[property_of_interest] = S
        data_score = pd.DataFrame(
            list(property_scores.items()), columns=['image_property', 'Overall_score']
        )
```
```python
        return data_score

    def calculate_spurious_correlation(self, property_of_interest, baseline_accuracy):
```
**Review comment (on `calculate_spurious_correlation`):** Please add mypy typing information for all arguments, as well as the return type.

**Review comment:** You'll need to get the type check that runs in our CI to pass eventually, by adding the appropriate typing information everywhere.
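One plausible set of annotations (an assumption inferred from how the method is used, not a signature confirmed by the reviewers):

```python
def calculate_spurious_correlation(
    self,
    property_of_interest: str,   # name of an issue-score column
    baseline_accuracy: float,    # majority-class accuracy computed above
) -> float:
    ...
```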
```python
        X = self.issues[property_of_interest].values.reshape(-1, 1)
        y = self.labels
        classifier = GaussianNB()
        cv_accuracies = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
```
**Review comment (on the `cross_val_score` call):** let's make … *(the remainder of this comment was not captured)*
```python
        mean_accuracy = np.mean(cv_accuracies)
        eps = 1e-8
        S = min(1, (1 - mean_accuracy) / (1 - baseline_accuracy + eps))
        return S
```
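For orientation, a hypothetical usage of the class above, assuming `lab` is a `Datalab` instance whose `issues` DataFrame already contains the image-property score columns:

```python
sc = SpuriousCorrelations(lab)
scores = sc.spurious_correlations()
# A low Overall_score means the property predicts the labels much better than
# the majority-class baseline, i.e. a likely spurious correlation.
print(scores.sort_values("Overall_score"))
```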
**Review comment (jwmueller):** Please include an end-to-end unit test of this function. You should actually create a toy dataset that suffers from a spurious correlation (say, 10 tiny images at varying levels of darkness, with the label related to how dark they are), and then verify that this code detects that spurious correlation. The same unit test should likewise verify that the other spurious correlation scores (those unrelated to dark/light) do NOT give low scores for this dataset. For now you can just add the new unit test at the bottom of:
https://github.com/cleanlab/cleanlab/blob/master/tests/datalab/test_datalab.py
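For illustration, one minimal way to realize this request (an editor's sketch, not code from this PR: `FakeLab` and all values are invented, and only enough of the Datalab interface is stubbed for `SpuriousCorrelations` to run):

```python
import numpy as np
import pandas as pd


def test_spurious_correlations_on_toy_dataset():
    # Ten toy examples: dark_score separates the two classes almost perfectly,
    # while blurry_score has no relationship with the labels.
    labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    issues = pd.DataFrame({
        "dark_score": [0.90, 0.92, 0.88, 0.91, 0.89, 0.10, 0.12, 0.08, 0.11, 0.09],
        "blurry_score": [0.2, 0.8, 0.2, 0.8, 0.2, 0.8, 0.2, 0.8, 0.2, 0.8],
    })

    class FakeLab:
        """Stub exposing just the attributes SpuriousCorrelations reads."""

    lab = FakeLab()
    lab.issues = issues
    lab.labels = labels

    sc = SpuriousCorrelations(lab)
    baseline = 0.5  # both classes are equally frequent
    dark_s = sc.calculate_spurious_correlation("dark_score", baseline)
    blurry_s = sc.calculate_spurious_correlation("blurry_score", baseline)

    # A highly predictive property should get a score near 0 (flagged),
    # while an uninformative property should score clearly higher.
    assert dark_s < 0.5
    assert dark_s < blurry_s
```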
**Author reply:** Thank you for the suggestion to include an end-to-end unit test, @jwmueller. I'd like to make sure I create a comprehensive test that effectively verifies the detection of spurious correlations. However, I'm not entirely sure how to set up such a test, especially with a toy dataset. Could you provide an example test, or point me to any resources that might help in creating this unit test?
**Review comment (jwmueller):** You can generally follow the structure of any of the existing unit tests. I wouldn't worry too much about the precise code structure you use; we can help you refactor the code properly. Instead, I would focus on ensuring the test runs quickly (the toy dataset is small enough) but still tests the key logic: namely, that this code is actually able to detect an image property that is highly correlated with the labels, and that it does not return false positives for image properties that have no relationship with the labels. An example you could follow is `test_find_issues_with_pred_probs`; just change the dataset being used, add a final line `lab._spurious_correlations()` near the end of the test, and then check its results (a rough skeleton of this is sketched below).
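A rough skeleton of that adaptation (an editor's sketch: `toy_image_dataset` is a hypothetical fixture, and the column names are taken from the diff above):

```python
from cleanlab import Datalab


def test_find_issues_detects_spurious_correlation(toy_image_dataset):
    # toy_image_dataset: a tiny image dataset whose labels track image darkness.
    lab = Datalab(data=toy_image_dataset, label_name="label")
    lab.find_issues()
    correlations = lab._spurious_correlations()
    dark_row = correlations[correlations["image_property"] == "dark_score"]
    # Darkness should be flagged as spuriously correlated with the labels.
    assert dark_row["Overall_score"].item() < 0.5
```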
**Author reply:** Thank you for the guidance, @jwmueller. I appreciate your clear explanation of what the unit test should achieve, and I understand the high-level structure and the need for it to run efficiently on a small toy dataset. However, I'm currently hitting a roadblock translating this into code. I looked at the example and at other test code as well, but couldn't figure out how to get it done.
**Review comment (jwmueller):** Which part is confusing to code, specifically? We can provide some skeleton code or further pointers for that part if you write out your remaining specific questions.
**Author reply:** I created a dataset:

```python
light_score = [0.11, 0.43, 0.96, 0.28, 0.23, 0.21, 0.63, 0.40, 0.19, 0.93]
dark_score = [0.98, 0.57, 0.28, 0.97, 0.91, 0.95, 0.57, 0.60, 0.87, 0.34]
label = [0, 1, 2, 0, 0, 0, 1, 1, 0, 2]
issues = pd.DataFrame({'dark_score': dark_score,
                       'light_score': light_score,
                       'labels': label})
issue_summary = pd.DataFrame({'issue_type': ['dark', 'light'],
                              'num_issues': [10, 0]})
```

and it gives me this result:

```
  image_property  label_prediction_error
0           dark                     0.3
1          light                     0.3
```

But when I tested it on the MNIST dataset, using https://docs.cleanlab.ai/master/tutorials/image.html as a reference, it gives this output:

```
   image_property  label_prediction_error
0          outlier               0.836867
1   near_duplicate               0.843817
2  low_information               0.743633
3             dark               0.855317
```

I also made a mistake by accidentally deleting the 'spurious_correlations' branch from my local machine. To rectify this, I created a new branch named 'spurious_correlations_' and submitted a new pull request. I apologize for any inconvenience; this is my first time doing this, and I will be more careful in the future.
**Review comment (jwmueller):** I'd prefer not to work in a new PR, given that I have left a lot of feedback on this one. You should be able to get the branch back on your local machine (with git here pointed at your own fork; see the sketch below). It should be good practice for you to restore the branch on your local machine and resume work on the original PR if you can.
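A standard way to restore a deleted local branch that still exists on the fork remote (an assumption; the branch name is taken from the discussion above, and the reviewer's exact command was not preserved):

```bash
# Fetch the fork's branches, then recreate the local branch from the remote copy.
git fetch origin
git checkout -b spurious_correlations origin/spurious_correlations
```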