
Added new CLOS train test split tutorial notebook #1071

Open
mturk24 wants to merge 34 commits into master from add-train-test-clos-tutorial

Conversation

mturk24
Contributor

@mturk24 mturk24 commented Mar 28, 2024

Summary

Added a new tutorial that shows how to improve ML performance using train-test splits on your data with CLOS.

There is currently an issue preventing me from fully building the docs to see how quickly (and whether successfully) the new tutorial builds.

Also modified the index files needed to include this in the main sidebar of the CLOS tutorials. This tutorial replaces the tabular Datalab tutorial as well.

Latest update: the bug in the tutorial has been fixed and the index files have been updated appropriately. The latest commits show fixes/improvements to the tutorial, and the data in S3 has been updated.


codecov bot commented Mar 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.33%. Comparing base (e0b7615) to head (69295dd).
Report is 52 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1071      +/-   ##
==========================================
- Coverage   96.20%   94.33%   -1.87%     
==========================================
  Files          76       80       +4     
  Lines        6005     6076      +71     
  Branches     1070     1003      -67     
==========================================
- Hits         5777     5732      -45     
- Misses        135      260     +125     
+ Partials       93       84       -9     


… iid issues and filtered training data based on exact duplicates between training and test sets
…revious version following the model eval on clean training + test data. Fixed section on using Datalab on training data to clean the data
…up notebook and added more on hyperparameter optimization section. This section still needs to be improved.
… and cleaned up some of the code, put data used into s3 bucket
…ar before DCAI workflow tutorial, and renamed it to improving_ml_performance, also removed datalab tabular tutorial since this tutorial is replacing that one
@mturk24 mturk24 requested review from jwmueller and elisno April 4, 2024 01:52
@mturk24 mturk24 changed the title Added WIP new CLOS train test split tutorial notebook Added new CLOS train test split tutorial notebook Apr 4, 2024
@mturk24 mturk24 requested a review from sanjanag April 4, 2024 01:56
@mturk24
Contributor Author

mturk24 commented Apr 4, 2024

Also adding @sanjanag as reviewer (since she was very helpful/involved in this)

@jwmueller jwmueller removed the request for review from elisno April 4, 2024 05:09
@mturk24
Contributor Author

mturk24 commented Apr 4, 2024

I was able to get a workaround for this issue using this approach, so I was able to build the docs successfully. I'm not sure what the expected build runtime is, but I'm going to try comparing build times with and without the new notebook more thoroughly.

@jwmueller
Member

Can you resolve the merge conflicts? Thanks!

"source": [
"# Define training index cutoff and find the exact duplicate indices to reference\n",
"train_idx_cutoff = len(preprocessed_train_data) - 1\n",
"exact_duplicates_indices = exact_duplicates.index\n",
Member

I think this needs to be subsetted to the set of exact_duplicates where at least one of the datapoints is from the test set.

@jwmueller jwmueller requested review from jwmueller and removed request for sanjanag April 16, 2024 03:50
Member

@jwmueller jwmueller left a comment

I think there may be a bug early on in this tutorial, so I will stop reviewing until you've had a look and pinged me about it (it seems like all subsequent results are affected if this step changes).

Specifically, this is what we want to do: drop from the training set the extra duplicated copies of test data points found in that training set.

But I think your code is simply dropping extra copies of any exact duplicate of a training datapoint, regardless of whether the set of exact duplicates contains only training data (and no test data).
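For reference, here is a minimal pandas-only sketch of the intended behavior (hypothetical function and variable names, not the notebook's actual code): a training row is dropped only when its feature values also appear in the test set, while duplicates confined to the training set are kept.

```python
import pandas as pd

def drop_train_copies_of_test_rows(train_df: pd.DataFrame,
                                   test_df: pd.DataFrame,
                                   feature_cols: list) -> pd.DataFrame:
    """Drop training rows whose feature values exactly match some test row.

    Duplicate groups contained entirely within the training set are left alone.
    """
    test_unique = test_df[feature_cols].drop_duplicates()
    merged = train_df.merge(test_unique, on=feature_cols, how="left", indicator=True)
    is_copy_of_test_row = (merged["_merge"] == "both").to_numpy()
    return train_df.loc[~is_copy_of_test_row]
```

The notebook itself presumably works from the duplicate indices that Datalab reports rather than a raw merge; the point is only that the test-set overlap should gate which rows get dropped.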

…s from training data that are exact duplicat with test set, updated seed usage to be proper, and fixed unit tests accordingly
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 2bfafe9 to 43dfe63 Compare April 30, 2024 21:33
@mturk24 mturk24 closed this Apr 30, 2024
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 43dfe63 to 13442e2 Compare April 30, 2024 21:39
@mturk24 mturk24 reopened this Apr 30, 2024
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 43dfe63 to 83d4209 Compare April 30, 2024 21:45
…torial added between datalab and cleanlearning
@mturk24 mturk24 requested a review from jwmueller April 30, 2024 21:50
docs/source/tutorials/improving_ml_performance.ipynb (outdated review thread, resolved)
"Here's how we recommend handling noisy training and test data (this tutorial walks through these steps):\n",
"\n",
"1. Preprocess your training and test data. Use cleanlab to check for issues in the merged dataset like train/test leakage or drift.\n",
"2. Fit your ML model to your noisy training data and get its predictions/embeddings for your test data. Use these model outputs with cleanlab to detect issues in your **test** data.\n",
Contributor

I found these numbered points really unclear (as a new person!).

For example, if in step 1, we preprocess training and test data...what does preprocessing do? I'd think that it would clean the data, but then in step 2 we're fitting a model to the "noisy training data". Does preprocessing data not mean that it is no longer noisy?

Member

Added commit here that hopefully clarifies: 4a32bd9

By preprocess we mean: do the non-cleanlab data prep work you'd normally do in your ML project, e.g. normalize continuous values, encode categoricals, handle missing values, etc.
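As a concrete illustration of that kind of preprocessing (entirely hypothetical column names, and independent of cleanlab), a scikit-learn sketch might look like:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# train_df / test_df: pandas DataFrames assumed to already be loaded.
numeric_cols = ["age", "balance"]        # hypothetical continuous features
categorical_cols = ["region", "plan"]    # hypothetical categorical features

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

# Fit the transforms on the training data only, then apply them unchanged to the test data.
X_train = preprocessor.fit_transform(train_df[numeric_cols + categorical_cols])
X_test = preprocessor.transform(test_df[numeric_cols + categorical_cols])
```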

Member

The reason we include this statement is that the cleanlab package is only intended for data that is in an ML-suitable format. cleanlab depends on you first training an initial ML model, so your data should be properly formatted for that.
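For concreteness, a hedged sketch of steps 1-2 of the quoted workflow (illustrative names; the tutorial's actual cells differ). It assumes preprocessed DataFrames train_df/test_df with a "label" column, plus matching dense feature arrays X_train/X_test and labels y_train:

```python
import numpy as np
import pandas as pd
from cleanlab import Datalab
from sklearn.ensemble import HistGradientBoostingClassifier

# Step 1: audit the merged train+test data for cross-split issues such as
# exact/near duplicates (train/test leakage) or drift between the splits.
merged = pd.concat([train_df.assign(split="train"), test_df.assign(split="test")],
                   ignore_index=True)
lab_merged = Datalab(data=merged, label_name="label")
lab_merged.find_issues(features=np.vstack([X_train, X_test]))
lab_merged.report()

# Step 2: fit an initial model on the (still noisy) training data, then use its
# test-set predicted probabilities to audit the test data itself.
model = HistGradientBoostingClassifier().fit(X_train, y_train)
pred_probs_test = model.predict_proba(X_test)
lab_test = Datalab(data=test_df, label_name="label")
lab_test.find_issues(pred_probs=pred_probs_test, features=X_test)
lab_test.report()
```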

"\n",
"1. Preprocess your training and test data. Use cleanlab to check for issues in the merged dataset like train/test leakage or drift.\n",
"2. Fit your ML model to your noisy training data and get its predictions/embeddings for your test data. Use these model outputs with cleanlab to detect issues in your **test** data.\n",
"3. Manually review/correct cleanlab-detected issues in your test data. To avoid bias, **we caution against automated correction of test data**. Test data changes should be individually verified to ensure they will lead to more accurate model evaluation. We also caution against comparing the performance of different ML models across different versions of your test data; performance comparions between models should be based on the same test data.\n",
Contributor

"To avoid bias, we caution against automated correction of test data" -> Why does this avoid bias? And what is automated correction? Is it not very good and that's why we need to manually check? This is all confusing to a non cleanlab studio user.

How about something like "While Cleanlab will detect issues and can automatically correct them, we encourage you to spot-check test data to understand why incorrect data may be present"

"Test data changes" is a confusing phrase bc test is both a noun and a verb.

Member

See my comment below about the context behind this tutorial. One major purpose of it was that we saw cleanlab users blindly and automatically replacing all labels in test data that cleanlab flagged with their ML model's predicted labels. Doing this is dangerous, and we are specifically trying to discourage it in this tutorial.

I tried to make some edits here; see if you still find it confusing:
cf8408c
8b146f7
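To make the "manually review rather than auto-correct" point concrete, here is a hedged sketch (it assumes a Datalab audit of the test data named lab_test, as in the earlier sketch, and uses the column names Datalab's label-issue summary is expected to expose):

```python
# Pull the label issues cleanlab flagged in the test data and review them by hand,
# rather than automatically overwriting given labels with model predictions.
label_issues = lab_test.get_issues("label")
flagged = label_issues.query("is_label_issue").sort_values("label_score")  # most suspect first

for idx in flagged.index[:20]:   # inspect a manageable batch manually
    print(test_df.loc[idx])
    print("given:", label_issues.loc[idx, "given_label"],
          "| suggested:", label_issues.loc[idx, "predicted_label"])
```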

docs/source/tutorials/improving_ml_performance.ipynb (outdated review thread, resolved)
"id": "045f5e46-8985-4a7c-bc6f-9f7be509b787",
"metadata": {},
"source": [
"## 6. Compute out-of-sample predicted probabilities for training data"
Contributor

How is this different from step 4?

I start to lose track of what we're doing and why at this point in the tutorial

Member

This is for train data; step 4 was for test data.
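For context, "out-of-sample" here typically means cross-validated predictions, so every training example is scored by a model that never saw it during fitting. A hedged sketch (model choice and names are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Each row of pred_probs_train comes from the CV fold in which that example was held out.
clf = LogisticRegression(max_iter=1000)
pred_probs_train = cross_val_predict(clf, X_train, y_train, cv=5, method="predict_proba")
```

Step 4's test-set probabilities, by contrast, come from a single model fit on the full training data and applied to the held-out test features, which is why the two sections exist separately.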

Member

@mturk24 You made another deviation from the spec in Sec 4, which may contribute to Nelson's confusion. See my comment there

"id": "96e8e3fe-b15f-41e0-87dd-0efb786f2920",
"metadata": {},
"source": [
"We have made some basic edits to improve test performance, so now we will parameterize each one of these edits (eg. what fraction of each issue to delete) to automatically find the best combination of edits to achieve optimal test performance. \n",
Contributor

What basic edits did we make? I'll bet you $10 that at this point the average reader could not answer this question - we've done so much in this notebook!

"\n",
"We will use a basic hyperparameter-tuning approach to optimize over these edit-variants + model re-training on the edited datasets with our objective being test performance.\n",
"\n",
"In a real-world setting, this would ideally be done on cleaned validation data instead of test data, but we are simplifying the approach for this tutorial.\n",
Contributor

"cleaned validation data" - first time seeing this term. Confusing in comparison to the tutorial's first sentence of "In typical Machine Learning projects, we split our dataset into training data for fitting models and test (or validation) data"

I'd hazard the average reader couldn't explain back to you what is meant by this statement...what is a cleaned validation set vs cleaned

Member

Agreed that the terminology is confusing; what's your suggestion?
Here are our definitions:

cleaning = fixing Datalab-detected issues (e.g. relabeling data, dropping some data)

The point of this sentence is just to clarify that we are being suboptimal ML practitioners here, because we are going to start tuning some decisions against the test data. This callout is primarily to show we are not ML noobs and know this is not strictly kosher.

Contributor

Great, OK. I have two suggestions:

  1. Let's use {train/test/validation} terminology, which I think is well understood in the ML community: you fit a model (with feature selection) on the training set, evaluate it against the test set, and perform hyperparameter tuning on the validation set. The main point of my comment was that we start this tutorial saying that validation set == test set, but then here we revert back to the validation set having a meaning that is distinct from the test set.
  2. I like "cleaning". But then data that is in a state of not being cleaned should be "dirty" or "unclean", not "noisy". It gets a bit confusing when we're telling the reader: OK, we start with messy data, then we apply preprocessing steps, then we take the noisy data and find errors in it, which we can decide to auto-apply, and then we're done with cleaning... Basically, we should try our best (it's hard, I know!) to define adjectives like you do in your comment above. "Noisy" was one that I felt was unnecessarily confusing. Let's use "dirty" since it also ties in with branding, and let's not use "noisy" since that can mean many things (e.g. just a low signal-to-noise ratio, or too many columns).

Contributor

Oh, and I used the suggestion feature - thanks for recommending it @mturk24!

#1071 (comment)

"id": "cc306eff-f3b7-4098-9f7e-3d17d1d0016a",
"metadata": {},
"source": [
"`cleanlab` was able to successfully help us improve ML performance in this tutorial! We saw how `cleanlab` helped us find and manually fix different data issue types in our test data to clean it. \n",
Contributor

THIS IS AWESOME!

These paragraphs are my favorite. Super clear messaging.

I think the whole post could be improved by having a graph of the metrics - they're sort of print()'d throughout the post. It could then make a great header image.
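For what it's worth, the suggested graph could be as simple as a bar chart of test accuracy after each stage; the numbers below are placeholders, not the tutorial's actual results:

```python
import matplotlib.pyplot as plt

stages = ["noisy train,\nnoisy test", "cleaned test", "cleaned train\n+ test", "tuned edits"]
test_accuracy = [0.80, 0.82, 0.85, 0.87]   # placeholder values for illustration only

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(stages, test_accuracy)
ax.set_ylabel("Test accuracy")
ax.set_title("Accuracy after each data-cleaning stage (placeholder data)")
fig.tight_layout()
fig.savefig("accuracy_by_stage.png", dpi=150)
```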

"\n",
"We then were able to further improve model accuracy by optimizing for the exact amount of each issue type to drop from our data using hyperparameter optimization. \n",
"\n",
"To reiterate, here are the 2 main takeaways:\n",
Contributor

These aren't my main takeaways from reading this! My main takeaway is that we should use cleanlab to optimize the cleaning of a dataset.

Belaboring my point here, but instead of saying "DON'T DO X", can you find a way to make it positive, or at the very least explain why this is a bad idea?

I'm super happy to help - maybe a quick post like "The folly of algorithmically cleaning test data", although that's a kind of confusing message, because I think folks will assume part of the point of cleanlab is to automatically clean data.

Member

@nelsonauner Appreciate the detailed review!
I think you might be missing some context, so I'll provide it just in case you want to revise any suggestions.

Some users want to use cleanlab to improve their ML model by improving their data. However, the proper way to do this is highly subtle, hence we made this complex tutorial about it. The goal of the tutorial isn't to encourage users to use cleanlab to improve their ML model; it's to help users who already want to do that do it right.

We see some of these users curate (e.g. fix label issues in) their training set, then train an ML model and evaluate it on test data. But then they see no improvement in test-set accuracy. Why? Because they have actually introduced distribution shift by altering their training data. If the test data still has issues, those must also be fixed for a faithful model evaluation.

Other users realize this and curate their test data too. But some of them auto-fix their test data, which is dangerous! Cleanlab is imperfect (it is based on ML, which is inevitably imperfect). Test data should NOT be blindly auto-fixed; that risks making the model evaluation wrong.

Instead, the recommended approach is the multi-step workflow outlined at the top of the tutorial, where less algorithmic/automated correction is applied to test data than to training data (focus your manual effort/reviewing budget on curating test data rather than train).

Contributor

Thanks for the additional context.

I feel that your explanation in the above comment is incredibly useful and deserves to be somewhere in the tutorial, ideally next to the text where we tell them not to do it. But it's not a hill I'd die on!

@nelsonauner
Contributor

nelsonauner commented Jun 19, 2024

Overall feedback:

  1. Try to reduce the number of unique/special terms that the reader has to grok: "noisy", "preprocess", "validation set", etc. Everyone has their own idea of what these terms mean, so we should carefully define them and keep them to a minimum.
  ~~2. People on the internet have short attention spans. Be very explicit on _what_ we are doing and _why_. The fewer sentences/words to explain a point succinctly, the better.~~

I take this feedback back :) I realize this is a tutorial, not a blog post.

mturk24 and others added 3 commits June 20, 2024 10:20
…at were outdated with newest version of cleanlab package, and some wording in markdown