
More datasets and regression problems #53

Open
PhilippPro opened this issue Feb 12, 2018 · 4 comments

PhilippPro commented Feb 12, 2018

Have you considered using more datasets?

And how about regression problems?

There is, for example, this benchmarking suite, accessible via the OpenML packages: https://arxiv.org/abs/1708.03731
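
For reference, a minimal sketch of pulling that suite with the openml Python package; the "OpenML100" alias is an assumption, so check openml.org for the current suite IDs if it does not resolve:

```python
# Minimal sketch: fetch the benchmark suite from arXiv:1708.03731 via the
# openml Python package. The "OpenML100" alias is an assumption -- check
# openml.org for the exact suite ID if it does not resolve.
import openml

suite = openml.study.get_suite("OpenML100")
print(f"{len(suite.tasks)} tasks in the suite")

for task_id in suite.tasks[:5]:          # peek at the first few tasks
    task = openml.tasks.get_task(task_id)
    dataset = task.get_dataset()
    print(task_id, dataset.name)
```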

PhilippPro changed the title from "More datasets" to "More datasets and regression problems" on Feb 12, 2018
szilard (Owner) commented Feb 12, 2018

Re more datasets: szilard/GBM-perf#4 (comment)

My focus now is on the top GBM implementations (including on GPUs): doing more by doing less. I dockerized the most important things in a separate repo: https://github.com/szilard/GBM-perf

Also read this summary I wrote recently: https://github.com/szilard/benchm-ml#summary

PhilippPro (Author) commented

I just watched your talk, very interesting.

In my opinion, one of the directions that should be developed further (and that you already mentioned) is AutoML: packages for automatic tuning, automatic ensembling, automatic feature engineering, etc., in a time-efficient way.

szilard (Owner) commented Feb 13, 2018

Oh, I forgot to say in my last comment: re OpenML, those datasets are ridiculously small: https://gist.github.com/szilard/b82635fa9060227514af3423b3225a29

There is also another set of datasets, but those are also too small: https://gist.github.com/szilard/d8279374646fb5f372317db2a4074f2f

I would want a set of datasets with sizes from 1K to 10M rows, with a median size of 100K (so it should cover 1K-10K-100K-1M-10M).
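
A hedged sketch of how one might screen OpenML for datasets in that size range, using openml-python's dataset listing; the `NumberOfInstances` column name comes from OpenML's dataset qualities and is worth verifying against the current API:

```python
# Hedged sketch: screen OpenML's dataset listing for the 1K-10M row range.
# The NumberOfInstances column name comes from OpenML's dataset qualities
# and is worth verifying against the current openml-python API.
import openml

datasets = openml.datasets.list_datasets(output_format="dataframe")
in_range = datasets[
    (datasets["NumberOfInstances"] >= 1_000)
    & (datasets["NumberOfInstances"] <= 10_000_000)
]
print(len(in_range), "candidate datasets")
print(in_range["NumberOfInstances"].median())  # ideally ~100K for the suite described above
```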

Re AutoML: indeed, that's super interesting. However, benchmarking that is way more difficult because of the tricky tradeoff between computation time and accuracy. I've been looking at a few solutions, but nothing formal (just tried them out). Btw, most of them have GBMs as building blocks, so benchmarking the components can already give you some idea of performance.
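
To illustrate "benchmarking the components": a minimal timing sketch in the spirit of benchm-ml, training one GBM implementation and recording wall time and AUC; the synthetic data and hyperparameters here are placeholders, not the repo's actual setup:

```python
# Minimal sketch of "benchmarking the components": time one GBM
# implementation and record wall time and AUC, in the spirit of benchm-ml.
# The synthetic data and hyperparameters are placeholders, not the repo's setup.
import time

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = lgb.LGBMClassifier(n_estimators=100, num_leaves=512, learning_rate=0.1)
start = time.time()
model.fit(X_train, y_train)
elapsed = time.time() - start

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"train time: {elapsed:.1f}s  AUC: {auc:.4f}")
```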

Btw, when you say my talk, is it the KDD one? That's probably the most up to date, though my experiments with AutoML and a few other things/results happened after the talk.

PhilippPro (Author) commented Feb 14, 2018

OK, there are only a few datasets with more than 10K rows in the OpenML or PMLB benchmarking suites.

AutoML solutions should have a time-constraint parameter, so that one can, e.g., compare the results of these algorithms after 1 hour. Of course, in reality they often lack this feature.
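
A hedged example of such a time-budgeted run, using H2O AutoML, which does expose a runtime cap via `max_runtime_secs`; the CSV path and target handling are placeholders, and other AutoML tools may or may not offer an equivalent parameter:

```python
# Hedged example of a time-budgeted AutoML run with H2O AutoML, which does
# expose a runtime cap (max_runtime_secs). The CSV path and target handling
# are placeholders; other AutoML tools may lack an equivalent parameter.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")      # placeholder dataset
x, y = train.columns[:-1], train.columns[-1]
train[y] = train[y].asfactor()            # treat target as categorical

aml = H2OAutoML(max_runtime_secs=3600, seed=1)   # hard 1-hour budget
aml.train(x=x, y=y, training_frame=train)
print(aml.leaderboard)
```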

Yes, the KDD one, quite inspiring.
