Add option to choose which fold to use as a final predictor #614

Open
sami-ka wants to merge 7 commits into master

Conversation

@sami-ka sami-ka commented Apr 26, 2023

Hi @pplonski ,

Thanks for this great package!
I use it pretty often so I wanted to add my contribution to it.

I needed to test the difference between taking the average of models fitted on each fold, and looking at the prediction of only the last fold.
This was especially interesting in my case as it was a time series split, and I wanted my final model to be the one trained on the most recent data.

I added a parameter to the AutoML class called chosen_fold, which I ultimately set to -1 in my case to get the model of the last fold.

It's a bit linked to #475.

Feel free to tell me if I should continue working on this evolution!

P.S.: I think the changes in requirements_dev.txt are needed because the latest click versions are no longer compatible with the pinned version of black (see https://stackoverflow.com/questions/71673404/importerror-cannot-import-name-unicodefun-from-click).
Upgrading black would probably also be a good move.

@pplonski
Contributor

Hi @drskd!

Thank you for the contribution. You are the first person who has asked for this feature. If more users need it, then I will merge it.

@brainmosaik

brainmosaik commented Jul 24, 2023

This sounds nice, but isn't it the same as just making a shorter test split? Can I ask what you use it for? Do you use it only for prediction and not for training?

@sami-ka
Author

sami-ka commented Jul 24, 2023

@brainmosaik In my case I did a time series split with 4 folds, let's say one for each season over the last 12 months.
It's reassuring to find model hyperparameters that work well on average for every season, but if I'm currently in summer, I would like my predictions to come from the most recent model, i.e. the one whose weights were computed on data from the last months.
For now, the behaviour is to take as the final prediction the average prediction over the 4 folds. In my example, that means partly relying on a prediction made by a model fitted on data from the winter period, which is not completely satisfying if there is some sort of seasonality.
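
For concreteness, the kind of split I mean looks something like this (just a sketch using scikit-learn's TimeSeriesSplit; the exact split definition can be any custom cv you pass in):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 4 expanding-window folds over roughly the last 12 months of daily data,
# so each validation block covers about one season
X = np.arange(365).reshape(-1, 1)  # placeholder daily feature
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, valid_idx) in enumerate(tscv.split(X)):
    print(fold, len(train_idx), len(valid_idx))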

@brainmosaik

brainmosaik commented Jul 24, 2023

That sounds good. So the workflow for this should be:
1. Train with chosen_fold=None to find model hyperparameters that work well on average for every season.
2. Then train a new model with the found hyperparameters and set chosen_fold=-1, to have a good model for the last fold (most recent season), but one that has more generalized hyperparameters.

So it should be better on new upcoming days?


Or should it be used like this?
1. Train with chosen_fold=None to find model hyperparameters that work well on average for every season.
2. Reload the model but set chosen_fold=-1, to get the model for the last fold (most recent season).

Because MLJAR is already saving the models for each fold? So we can skip the re-training?


Big thanks for this idea and implementation. Just trying to get my head around it.

Edit:
I looked a bit deeper into the code, and it looks like it is only used for the final prediction, not for training. So we can just reload/create the model with chosen_fold = -1 to get predictions based on the last season.
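
Something like this, I guess (just a sketch; I'm assuming chosen_fold can also be set when pointing AutoML at an existing results folder, and the "AutoML_1" results_path is only an example):

from supervised.automl import AutoML

# reuse the already-trained results folder, only change how the final
# prediction is assembled (last fold instead of the average over folds)
automl = AutoML(results_path="AutoML_1", chosen_fold=-1)
predictions = automl.predict(X_new)  # X_new is a placeholder for new data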

@sami-ka
Author

sami-ka commented Jul 25, 2023

The chosen_fold parameter is used only at the prediction stage because, as you already noticed, if I do not want to train on some part of the data I should just discard it.

Here, as is already the case, AutoML will train each chosen model and set of hyperparameters on the 4 splits, then look at the average of the chosen validation metric over the 4 validation sets to rank models on the leaderboard, and finally predict using the model at the top of the leaderboard.
Each model in the leaderboard in this example is in fact 4 different models, each trained on a different dataset. For instance, you could have a decision tree with max_depth=5 as the winner, but if you train it on 4 different datasets you get 4 different predictions, since the decision trees would not necessarily all be the same. AutoML returns the average prediction of these 4 models.

The chosen_fold parameter would impact only the prediction step. Each model in the leaderboard would still be trained on 4 different datasets, but the prediction would come from only the model trained on the last split.

Classic usage:

from supervised.automl import AutoML  # mljar-supervised; X, y and tscv are assumed to be defined already

automl = AutoML(
    validation_strategy={"validation_type": "custom"},
    chosen_fold=None,  # default value
)
automl.fit(X, y, cv=tscv)  # no behaviour change
automl.predict_proba(X)  # returns the average predicted proba of 4 models, each trained with the winning hyperparameters of the leaderboard on the splits defined in cv

Custom usage:

automl = AutoML(
    validation_strategy={"validation_type": "custom"},
    chosen_fold=-1,  # only used at prediction time
)
automl.fit(X, y, cv=tscv)  # no behaviour change
automl.predict_proba(X)  # returns the predicted proba of the model trained with the winning hyperparameters of the leaderboard on the last split only
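
Conceptually, the change in the prediction step boils down to something like this (a simplified sketch with illustrative names, not the actual mljar-supervised internals; the real code also applies each fold's preprocessing):

def predict_from_folds(learners, X, chosen_fold=None):
    # chosen_fold=None -> average the predictions of all fold models (current behaviour)
    if chosen_fold is None:
        preds = [learner.predict(X) for learner in learners]
        return sum(preds) / len(preds)
    # chosen_fold=-1 -> use only the model trained on the last split
    return learners[chosen_fold].predict(X)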

I could indeed use only the last split if I'm interested in the most recent part of the data, but since the model search is really powerful, having only one split to validate on is risky in terms of overfitting. The additional regularization on hyperparameter selection that comes from using more splits helps to limit that risk.

@mosaikme

mosaikme commented Dec 27, 2023

I must say thanks again for this commit. I think this should be in the main branch, it could be really useful. I wonder if it has been implemented in the main branch by now? I have some additional ideas to make it take a weighted prediction, something like:


# Create linearly increasing weights, one per fold model (most recent fold weighted highest)
WEIGHTS = np.linspace(1.0, len(self.learners), len(self.learners))

y_predicted = None
for ind, learner in enumerate(self.learners):
    # preprocessing goes here
    X_data, _, _ = self.preprocessings[ind].transform(X.copy(), None)

    # for binary classification, one would take probabilities here instead
    # (e.g. learner.predict_proba(X_data)[:, 1] for the positive class)
    y_p = learner.predict(X_data)
    y_p = self.preprocessings[ind].inverse_scale_target(y_p)

    # multiply the prediction by its weight
    y_p_weighted = y_p * WEIGHTS[ind]

    y_predicted = y_p_weighted if y_predicted is None else y_predicted + y_p_weighted

# divide by the total weight to get the weighted average
avg_pred_result = y_predicted / np.sum(WEIGHTS)

Or this, but I think I have a thinking error in it?

# Create weights between 0 and 1 (note: the first fold gets weight 0.0,
# so its prediction is effectively ignored)
WEIGHTS = np.linspace(0.0, 1.0, len(self.learners))

# optionally make the weights exponential
# WEIGHTS = np.exp(WEIGHTS * chosen_fold)

# normalize the weights to sum to 1, so no division is needed at the end
WEIGHTS = WEIGHTS / np.sum(WEIGHTS)

y_predicted = None
for ind, learner in enumerate(self.learners):
    # preprocessing goes here
    X_data, _, _ = self.preprocessings[ind].transform(X.copy(), None)

    y_p = learner.predict(X_data)
    y_p = self.preprocessings[ind].inverse_scale_target(y_p)

    # multiply the prediction by its weight
    y_p_weighted = y_p * WEIGHTS[ind]

    y_predicted = y_p_weighted if y_predicted is None else y_predicted + y_p_weighted

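For example, to avoid giving the first (oldest) fold a weight of exactly zero, the weights could start above zero and still sum to 1 (just a sketch, shown outside the class for readability):

import numpy as np

n_learners = 4  # e.g. len(self.learners)
# strictly positive, increasing weights: the most recent fold gets the largest weight
WEIGHTS = np.linspace(1.0, float(n_learners), n_learners)
WEIGHTS = WEIGHTS / WEIGHTS.sum()  # normalize so no division is needed afterwards
print(WEIGHTS)  # [0.1, 0.2, 0.3, 0.4]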