Added AutoML #51

Open: wants to merge 2 commits into master

Conversation

@earino (Contributor) commented May 29, 2017

As we've discussed in Slack, H2O has recently released some very interesting AutoML functionality. In this case, the leader is the StackedEnsemble generated from a GBM grid, a DL grid, a DRF, and an XRT model. On 100k records it trained for a while on some small cloud hardware and produced a respectable AUC of 0.7284624.
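
A minimal sketch of the kind of call that produces the output below (frame and column names are assumed from the code later in this thread, not taken from my exact script):

library(h2o)
h2o.init()

# assumed names, matching the script discussed later in this thread
dx_train <- h2o.importFile(path = "train-0.1m.csv")
Xnames <- names(dx_train)[which(names(dx_train) != "dep_delayed_15min")]

# with no leaderboard frame supplied, AutoML internally splits
# off part of dx_train to score the leaderboard
md <- h2o.automl(x = Xnames, y = "dep_delayed_15min",
                 training_frame = dx_train)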

> md
An object of class "H2OAutoML"
Slot "project_name":
[1] "<default>"

Slot "leader":
Model Details:
==============

H2OBinomialModel: stackedensemble
Model ID:  StackedEnsemble_model_1496028880431_2818 
NULL


H2OBinomialMetrics: stackedensemble
** Reported on training data. **

MSE:  0.06495612
RMSE:  0.2548649
LogLoss:  0.2435769
Mean Per-Class Error:  0.07056041
AUC:  0.9872952
Gini:  0.9745905

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           N     Y    Error         Rate
N      54777  1849 0.032653  =1849/56626
Y       1450 11918 0.108468  =1450/13368
Totals 56227 13767 0.047133  =3299/69994

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.299564 0.878423 218
2                       max f2  0.243801 0.912848 242
3                 max f0point5  0.362489 0.896238 193
4                 max accuracy  0.313673 0.953653 213
5                max precision  0.974294 1.000000   0
6                   max recall  0.132957 1.000000 309
7              max specificity  0.974294 1.000000   0
8             max absolute_mcc  0.299564 0.849339 218
9   max min_per_class_accuracy  0.253667 0.943118 237
10 max mean_per_class_accuracy  0.247323 0.944984 240

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: stackedensemble
** Reported on validation data. **

MSE:  0.1327237
RMSE:  0.3643127
LogLoss:  0.4226191
Mean Per-Class Error:  0.3271404
AUC:  0.7433911
Gini:  0.4867822

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           N    Y    Error         Rate
N       9287 2974 0.242558  =2974/12261
Y       1166 1666 0.411723   =1166/2832
Totals 10453 4640 0.274299  =4140/15093

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.196506 0.445931 257
2                       max f2  0.114152 0.591573 329
3                 max f0point5  0.307013 0.439652 188
4                 max accuracy  0.579457 0.822434  82
5                max precision  0.950060 1.000000   0
6                   max recall  0.048541 1.000000 396
7              max specificity  0.950060 1.000000   0
8             max absolute_mcc  0.272812 0.299325 207
9   max min_per_class_accuracy  0.165504 0.672539 281
10 max mean_per_class_accuracy  0.156244 0.677032 289

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`


Slot "leaderboard":
                                             model_id      auc  logloss
1            StackedEnsemble_model_1496028880431_2818 0.742023 0.424990
2  GBM_grid__a70036165806366cd146a852765f4af0_model_3 0.724540 0.472045
3  GBM_grid__a70036165806366cd146a852765f4af0_model_1 0.722181 0.438297
4  GBM_grid__a70036165806366cd146a852765f4af0_model_0 0.720750 0.475918
5                           DRF_model_1496028880431_4 0.718733 0.471836
6                         XRT_model_1496028880431_366 0.718564 0.439938
7   DL_grid__a70036165806366cd146a852765f4af0_model_0 0.715729 0.453427
8   DL_grid__a70036165806366cd146a852765f4af0_model_1 0.715312 0.453516
9  GBM_grid__a70036165806366cd146a852765f4af0_model_8 0.712989 0.443795
10 GBM_grid__a70036165806366cd146a852765f4af0_model_4 0.711725 0.457926
11  DL_grid__a70036165806366cd146a852765f4af0_model_2 0.711247 0.472706
12 GLM_grid__a70036165806366cd146a852765f4af0_model_0 0.709769 0.443991
13 GLM_grid__a70036165806366cd146a852765f4af0_model_1 0.709769 0.443991
14 GBM_grid__a70036165806366cd146a852765f4af0_model_6 0.705461 0.468157
15 GBM_grid__a70036165806366cd146a852765f4af0_model_2 0.703969 0.444650
16 GBM_grid__a70036165806366cd146a852765f4af0_model_5 0.697802 0.483724
17  DL_grid__a70036165806366cd146a852765f4af0_model_4 0.691404 0.497545
18 GBM_grid__a70036165806366cd146a852765f4af0_model_7 0.668311 0.897990
19  DL_grid__a70036165806366cd146a852765f4af0_model_3 0.658246 0.647369

Commits:
- AUC of 0.7284624 for train-0.1m.csv
- Create h2o.R for newly released h2o AutoML
@ledell left a comment

The leaderboard_frame is used to generate performance metrics on a test set. If you don't provide a leaderboard_frame, AutoML will chop off some of the training data to use for this purpose.

The way your code is currently written, some valuable training data (15%) goes to waste scoring the leaderboard. You can fix this by adding leaderboard_frame = dx_test to the h2o.automl() call.

Modified:

library(h2o)

# start a local H2O cluster: 60 GB of memory, all available cores
h2o.init(max_mem_size = "60g", nthreads = -1)

dx_train <- h2o.importFile(path = "train-0.1m.csv")
dx_test <- h2o.importFile(path = "test.csv")

# all columns except the target are predictors
Xnames <- names(dx_train)[which(names(dx_train) != "dep_delayed_15min")]

system.time({
  # leaderboard_frame = dx_test keeps AutoML from carving 15%
  # off the training data to score the leaderboard
  md <- h2o.automl(x = Xnames, y = "dep_delayed_15min",
                   training_frame = dx_train,
                   leaderboard_frame = dx_test)
})

system.time({
  # re-scores the test set and reports the leader model's AUC
  print(h2o.auc(h2o.performance(md@leader, dx_test)))
})

# alternative way to get the leader model's AUC
system.time({
  print(md@leaderboard$auc[1])
})

@szilard (Owner) commented May 30, 2017

Ensembles (the new Java implementation) + AutoML have been on my list of things to look at (I've already done some of that).

However, I think I should keep this repo to the basic algos only and create new repos for looking at things built on top of those (also, 99% of the training time in ensembles/AutoML is spent in the building blocks, so there is not much to benchmark on speed, while the increase in AUC will be very much dataset dependent).

I already included ensembles in the course I'm teaching at UCLA, see here.

I might create a repo for AutoML, though that's also trivial: the code above changes 2 lines vs. the original. I would probably run it on 1M records though.

I actually already factored GBMs out of this benchmark in order to keep up with the newest and best tools (added LightGBM) and forget about mediocre tools such as Spark. This new repo will have a more targeted focus (only 1M/10M records and only the best GBM tools), but I might be able to update it with new versions more regularly (+ add GPUs).

@szilard (Owner) commented May 30, 2017

PS: I also started a deep learning repo a few months ago, but did not get too far (yet).

@earino (Contributor, Author) commented May 30, 2017

Following @ledell's advice, the code gives an AUC of 0.7286668, so some improvement, but not drastic, on the 100k-row dataset. I'm running it on the 1M overnight.

@ledell commented May 30, 2017

@earino How long did you run it for? If it was the default, then it probably ran for 10 minutes. We changed the default to 1 hour very recently, so if you re-run on a newer version, you should make a note of the change. In your results above, it looks like StackedEnsemble_model_1496028880431_2818 had a test AUC of ~0.74, not ~0.72...?
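
To make runs comparable across versions, the time budget can be pinned explicitly; a minimal sketch using h2o.automl()'s max_runtime_secs argument, with the frame names from the code above:

# pin the AutoML time budget so results don't shift when the
# library default changes (10 minutes vs. 1 hour)
md <- h2o.automl(x = Xnames, y = "dep_delayed_15min",
                 training_frame = dx_train,
                 leaderboard_frame = dx_test,
                 max_runtime_secs = 3600)  # exactly 1 hour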

@earino (Contributor, Author) commented May 30, 2017 via email

@earino (Contributor, Author) commented May 30, 2017

@ledell Very explicitly, this is the exact line I'm using to get the performance number. Is it the wrong thing? print(h2o.auc(h2o.performance(md@leader, dx_test)))

@ledell commented May 31, 2017

@earino That line will also work, but it requires re-computing all the performance metrics on the test set. They are already computed as part of the h2o.automl() function and stored in the Leaderboard.
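
The two approaches side by side (a minimal sketch, assuming the md object from the code above; row 1 of the leaderboard is the leader model):

# slower: re-scores the whole test set from scratch
print(h2o.auc(h2o.performance(md@leader, dx_test)))

# cheaper: reads the AUC already computed by h2o.automl()
print(md@leaderboard$auc[1])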
