
Added new CLOS train test split tutorial notebook #1071

Open
mturk24 wants to merge 34 commits into master from add-train-test-clos-tutorial

Conversation

mturk24
Contributor

@mturk24 mturk24 commented Mar 28, 2024

Summary

Added a new tutorial that shows how to improve ML performance using train-test splits on your data with CLOS.

There is currently an issue preventing me from fully building the docs to see how quickly (and whether successfully) the new tutorial builds.

Also modified the index files needed to include this in the main sidebar of the CLOS tutorials. This tutorial replaces the tabular Datalab tutorial as well.

Latest update: the bug in the tutorial has been fixed and the index files have been updated appropriately. The latest commits show fixes/improvements to the tutorial, and the data in S3 has been updated.


codecov bot commented Mar 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.33%. Comparing base (e0b7615) to head (69295dd).
Report is 52 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1071      +/-   ##
==========================================
- Coverage   96.20%   94.33%   -1.87%     
==========================================
  Files          76       80       +4     
  Lines        6005     6076      +71     
  Branches     1070     1003      -67     
==========================================
- Hits         5777     5732      -45     
- Misses        135      260     +125     
+ Partials       93       84       -9     


… iid issues and filtered training data based on exact duplicates between training and test sets
…revious version following the model eval on clean training + test data. Fixed section on using Datalab on training data to clean the data
…up notebook and added more on hyperparameter optimization section. This section still needs to be improved.
… and cleaned up some of the code, put data used into s3 bucket
…ar before DCAI workflow tutorial, and renamed it to improving_ml_performance, also removed datalab tabular tutorial since this tutorial is replacing that one
@mturk24 mturk24 requested review from jwmueller and elisno April 4, 2024 01:52
@mturk24 mturk24 changed the title Added WIP new CLOS train test split tutorial notebook Added new CLOS train test split tutorial notebook Apr 4, 2024
@mturk24 mturk24 requested a review from sanjanag April 4, 2024 01:56
@mturk24
Contributor Author

mturk24 commented Apr 4, 2024

Also adding @sanjanag as reviewer (since she was very helpful/involved in this)

@jwmueller jwmueller removed the request for review from elisno April 4, 2024 05:09
@mturk24
Contributor Author

mturk24 commented Apr 4, 2024

I was able to get a workaround for this issue using this approach, so I was able to build the docs successfully. I'm not sure what the expected build runtime is, but I'm going to try comparing build times with and without the new notebook more thoroughly.

@jwmueller
Member

Can you resolve the merge conflicts? Thanks!

"source": [
"# Define training index cutoff and find the exact duplicate indices to reference\n",
"train_idx_cutoff = len(preprocessed_train_data) - 1\n",
"exact_duplicates_indices = exact_duplicates.index\n",
Member

I think this needs to be subsetted to the set of exact_duplicates where at least one of the datapoints is from the test set.

@jwmueller jwmueller requested review from jwmueller and removed request for sanjanag April 16, 2024 03:50
Member

@jwmueller jwmueller left a comment

I think there may be a bug early on in this tutorial, so I will stop reviewing until you've had a look and pinged me about it (it seems like all subsequent results are affected if this step changes).

Specifically, this is what we want to do: drop from the training set the extra duplicated copies of test data points found in that training set.

But I think your code is simply dropping extra copies of any exact duplicate of a training datapoint, regardless of whether the set of exact duplicates contains only training data (and no test data).
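For reference, here is a minimal pandas-only sketch of the intended behavior (hypothetical function and variable names, not the notebook's actual code): a training row is dropped only when its feature values also appear in the test set, while duplicates confined to the training set are kept.

```python
import pandas as pd

def drop_train_copies_of_test_rows(train_df: pd.DataFrame,
                                   test_df: pd.DataFrame,
                                   feature_cols: list) -> pd.DataFrame:
    """Drop training rows whose feature values exactly match some test row.

    Duplicate groups contained entirely within the training set are left alone.
    """
    test_unique = test_df[feature_cols].drop_duplicates()
    merged = train_df.merge(test_unique, on=feature_cols, how="left", indicator=True)
    is_copy_of_test_row = (merged["_merge"] == "both").to_numpy()
    return train_df.loc[~is_copy_of_test_row]
```

The notebook itself presumably works from the duplicate indices that Datalab reports rather than a raw merge; the point is only that the test-set overlap should gate which rows get dropped.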

…s from training data that are exact duplicat with test set, updated seed usage to be proper, and fixed unit tests accordingly
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 2bfafe9 to 43dfe63 Compare April 30, 2024 21:33
@mturk24 mturk24 closed this Apr 30, 2024
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 43dfe63 to 13442e2 Compare April 30, 2024 21:39
@mturk24 mturk24 reopened this Apr 30, 2024
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 43dfe63 to 83d4209 Compare April 30, 2024 21:45
…torial added between datalab and cleanlearning
@mturk24 mturk24 requested a review from jwmueller April 30, 2024 21:50
docs/source/tutorials/improving_ml_performance.ipynb (outdated review thread, resolved)
"Here's how we recommend handling noisy training and test data (this tutorial walks through these steps):\n",
"\n",
"1. Preprocess your training and test data. Use cleanlab to check for issues in the merged dataset like train/test leakage or drift.\n",
"2. Fit your ML model to your noisy training data and get its predictions/embeddings for your test data. Use these model outputs with cleanlab to detect issues in your **test** data.\n",
Contributor

I found these numbered points really unclear (as a new person!).

For example, if in step 1, we preprocess training and test data...what does preprocessing do? I'd think that it would clean the data, but then in step 2 we're fitting a model to the "noisy training data". Does preprocessing data not mean that it is no longer noisy?

Member

Added commit here that hopefully clarifies: 4a32bd9

By preprocess we mean: do the non-cleanlab data prep work you'd normally do in your ML project, e.g. normalize continuous values, encode categoricals, handle missing values, etc.
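As a concrete illustration of that kind of preprocessing (entirely hypothetical column names, and independent of cleanlab), a scikit-learn sketch might look like:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# train_df / test_df: pandas DataFrames assumed to already be loaded.
numeric_cols = ["age", "balance"]        # hypothetical continuous features
categorical_cols = ["region", "plan"]    # hypothetical categorical features

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

# Fit the transforms on the training data only, then apply them unchanged to the test data.
X_train = preprocessor.fit_transform(train_df[numeric_cols + categorical_cols])
X_test = preprocessor.transform(test_df[numeric_cols + categorical_cols])
```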

Member

The reason we include this statement is that the cleanlab package is only intended for data that is in an ML-suitable format. cleanlab depends on you first training an initial ML model, so your data should be properly formatted for that.
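For concreteness, a hedged sketch of steps 1-2 of the quoted workflow (illustrative names; the tutorial's actual cells differ). It assumes preprocessed DataFrames train_df/test_df with a "label" column, plus matching dense feature arrays X_train/X_test and labels y_train:

```python
import numpy as np
import pandas as pd
from cleanlab import Datalab
from sklearn.ensemble import HistGradientBoostingClassifier

# Step 1: audit the merged train+test data for cross-split issues such as
# exact/near duplicates (train/test leakage) or drift between the splits.
merged = pd.concat([train_df.assign(split="train"), test_df.assign(split="test")],
                   ignore_index=True)
lab_merged = Datalab(data=merged, label_name="label")
lab_merged.find_issues(features=np.vstack([X_train, X_test]))
lab_merged.report()

# Step 2: fit an initial model on the (still noisy) training data, then use its
# test-set predicted probabilities to audit the test data itself.
model = HistGradientBoostingClassifier().fit(X_train, y_train)
pred_probs_test = model.predict_proba(X_test)
lab_test = Datalab(data=test_df, label_name="label")
lab_test.find_issues(pred_probs=pred_probs_test, features=X_test)
lab_test.report()
```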

"\n",
"1. Preprocess your training and test data. Use cleanlab to check for issues in the merged dataset like train/test leakage or drift.\n",
"2. Fit your ML model to your noisy training data and get its predictions/embeddings for your test data. Use these model outputs with cleanlab to detect issues in your **test** data.\n",
"3. Manually review/correct cleanlab-detected issues in your test data. To avoid bias, **we caution against automated correction of test data**. Test data changes should be individually verified to ensure they will lead to more accurate model evaluation. We also caution against comparing the performance of different ML models across different versions of your test data; performance comparions between models should be based on the same test data.\n",
Contributor

"To avoid bias, we caution against automated correction of test data" -> Why does this avoid bias? And what is automated correction? Is it not very good and that's why we need to manually check? This is all confusing to a non cleanlab studio user.

How about something like "While Cleanlab will detect issues and can automatically correct them, we encourage you to spot-check test data to understand why incorrect data may be present"

"Test data changes" is a confusing phrase bc test is both a noun and a verb.

Member

See my comment below about the context behind this tutorial. One major purpose of it was that we saw cleanlab users blindly and automatically replacing all labels in test data that cleanlab flagged with their ML model's predicted labels. Doing this is dangerous, and we are specifically trying to discourage it in this tutorial.

I tried to make some edits here; see if you still find it confusing:
cf8408c
8b146f7
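To make the "manually review rather than auto-correct" point concrete, here is a hedged sketch (it assumes a Datalab audit of the test data named lab_test, as in the earlier sketch, and uses the column names Datalab's label-issue summary is expected to expose):

```python
# Pull the label issues cleanlab flagged in the test data and review them by hand,
# rather than automatically overwriting given labels with model predictions.
label_issues = lab_test.get_issues("label")
flagged = label_issues.query("is_label_issue").sort_values("label_score")  # most suspect first

for idx in flagged.index[:20]:   # inspect a manageable batch manually
    print(test_df.loc[idx])
    print("given:", label_issues.loc[idx, "given_label"],
          "| suggested:", label_issues.loc[idx, "predicted_label"])
```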

docs/source/tutorials/improving_ml_performance.ipynb (outdated review thread, resolved)
"id": "045f5e46-8985-4a7c-bc6f-9f7be509b787",
"metadata": {},
"source": [
"## 6. Compute out-of-sample predicted probabilities for training data"
Contributor

How is this different from step 4?

I start to lose track of what we're doing and why at this point in the tutorial

Member

This is for train data; step 4 was for test data.
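For context, "out-of-sample" here typically means cross-validated predictions, so every training example is scored by a model that never saw it during fitting. A hedged sketch (model choice and names are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Each row of pred_probs_train comes from the CV fold in which that example was held out.
clf = LogisticRegression(max_iter=1000)
pred_probs_train = cross_val_predict(clf, X_train, y_train, cv=5, method="predict_proba")
```

Step 4's test-set probabilities, by contrast, come from a single model fit on the full training data and applied to the held-out test features, which is why the two sections exist separately.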

Member

@mturk24 You made another deviation from the spec in Sec 4, which may contribute to Nelson's confusion. See my comment there

"id": "96e8e3fe-b15f-41e0-87dd-0efb786f2920",
"metadata": {},
"source": [
"We have made some basic edits to improve test performance, so now we will parameterize each one of these edits (eg. what fraction of each issue to delete) to automatically find the best combination of edits to achieve optimal test performance. \n",
Contributor

What basic edits did we make? I'll bet you $10 that at this point the average reader could not answer this question - we've done so much in this notebook!

"\n",
"We will use a basic hyperparameter-tuning approach to optimize over these edit-variants + model re-training on the edited datasets with our objective being test performance.\n",
"\n",
"In a real-world setting, this would ideally be done on cleaned validation data instead of test data, but we are simplifying the approach for this tutorial.\n",
Contributor

"cleaned validation data" - first time seeing this term. Confusing in comparison to the tutorial's first sentence of "In typical Machine Learning projects, we split our dataset into training data for fitting models and test (or validation) data"

I'd hazard the average reader couldn't explain back to you what is meant by this statement...what is a cleaned validation set vs cleaned

Member

Agreed that the terminology is confusing; what's your suggestion?
Here are our definitions:

cleaning = fixing Datalab-detected issues (e.g. relabeling data, dropping some data)

The point of this sentence is just to clarify that we are being suboptimal ML practitioners here, because we are going to start tuning some decisions against the test data. This callout is primarily to show we are not ML noobs and know this is not strictly kosher.

Contributor

Great, OK. I have two suggestions:

  1. Let's use {train/test/validation} terminology, which I think is well understood in the ML community: you fit a model (with feature selection) on the training set, evaluate it against the test set, and perform hyperparameter tuning on the validation set. The main point of my comment was that we start this tutorial saying that validation set == test set, but then here we revert back to the validation set having a meaning that is distinct from the test set.
  2. I like "cleaning". But then data that is in a state of not being cleaned should be "dirty" or "unclean", not "noisy". It gets a bit confusing when we're telling the reader: OK, we start with messy data, then we apply preprocessing steps, then we take the noisy data and find errors in it, which we can decide to auto-apply, and then we're done with cleaning... Basically, we should try our best (it's hard, I know!) to define adjectives like you do in your comment above. "Noisy" was one that I felt was unnecessarily confusing. Let's use "dirty" since it also ties in with branding, and let's not use "noisy" since that can mean many things (e.g. just a low signal-to-noise ratio, or too many columns).

Contributor

Oh, and I used the suggestion feature - thanks for recommending it @mturk24!

#1071 (comment)

"id": "cc306eff-f3b7-4098-9f7e-3d17d1d0016a",
"metadata": {},
"source": [
"`cleanlab` was able to successfully help us improve ML performance in this tutorial! We saw how `cleanlab` helped us find and manually fix different data issue types in our test data to clean it. \n",
Contributor

THIS IS AWESOME!

These paragraphs are my favorite. Super clear messaging.

I think the whole post could be improved by having a graph of the metrics - they're sort of print()'d throughout the post. It could then make a great header image.
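For what it's worth, the suggested graph could be as simple as a bar chart of test accuracy after each stage; the numbers below are placeholders, not the tutorial's actual results:

```python
import matplotlib.pyplot as plt

stages = ["noisy train,\nnoisy test", "cleaned test", "cleaned train\n+ test", "tuned edits"]
test_accuracy = [0.80, 0.82, 0.85, 0.87]   # placeholder values for illustration only

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(stages, test_accuracy)
ax.set_ylabel("Test accuracy")
ax.set_title("Accuracy after each data-cleaning stage (placeholder data)")
fig.tight_layout()
fig.savefig("accuracy_by_stage.png", dpi=150)
```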

"\n",
"We then were able to further improve model accuracy by optimizing for the exact amount of each issue type to drop from our data using hyperparameter optimization. \n",
"\n",
"To reiterate, here are the 2 main takeaways:\n",
Contributor

These aren't my main takeaways from reading this! My main takeaway is that we should use cleanlab to optimize the cleaning of a dataset.

Belaboring my point here, but instead of saying "DON'T DO X", can you find a way to make it positive, or at the very least explain why this is a bad idea?

I'm super happy to help - maybe a quick post like "The folly of algorithmically cleaning test data", although that's a kind of confusing message, because I think folks will assume part of the point of cleanlab is to automatically clean data.

Member

@nelsonauner Appreciate the detailed review!
I think you might be missing some context, so I'll provide it just in case you want to revise any suggestions.

Some users want to use cleanlab to improve their ML model by improving their data. However, the proper way to do this is highly subtle, hence we made this complex tutorial about it. The goal of the tutorial isn't to encourage users to use cleanlab to improve their ML model; it's to help users who already want to do that do it right.

We see some of these users curate (e.g. fix label issues in) their training set, then train an ML model and evaluate it on test data. But then they see no improvement in test-set accuracy. Why? Because they have actually introduced distribution shift by altering their training data. If the test data still has issues, those must also be fixed for a faithful model evaluation.

Other users realize this and curate their test data too. But some of them auto-fix their test data, which is dangerous! Cleanlab is imperfect (it is based on ML, which is inevitably imperfect). Test data should NOT be blindly auto-fixed; that risks making the model evaluation wrong.

Instead, the recommended approach is the multi-step workflow outlined at the top of the tutorial, where less algorithmic/automated correction is applied to test data than to training data (focus your manual effort/reviewing budget on curating test data rather than train).

Contributor

Thanks for the additional context.

I feel that your explanation in the above comment is incredibly useful and deserves to be somewhere in the tutorial, ideally next to the text where we tell them not to do it. But it's not a hill I'd die on!

@nelsonauner
Contributor

nelsonauner commented Jun 19, 2024

Overall feedback:

  1. Try to reduce the number of unique/special terms that the reader has to grok: "noisy", "preprocess", "validation set", etc. Everyone has their own idea of what these terms mean, so we should carefully define them and keep them to a minimum.
  ~~2. People on the internet have short attention spans. Be very explicit on _what_ we are doing and _why_. The fewer sentences/words to explain a point succinctly, the better.~~

I take this feedback back :) I realize this is a tutorial, not a blog post.

mturk24 and others added 3 commits June 20, 2024 10:20
…at were outdated with newest version of cleanlab package, and some wording in markdown