Simple rules for building robust machine learning models

by Kyriakos Chatzidimitriou | May 16, 2019 09:29 | talks

presentation, rda, machine learning, talk

This is the title of my invited talk in the Ask Me Anything (AMA) call of the Research Data Alliance (RDA) Early Career and Engagement Interest Group. The minutes of the call will be posted here.

The rules are summarized as follows:

Always have 3 sets:
- training
- validation
- test
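
As a minimal sketch of the three-set rule, assuming the examples are independent so that a random shuffle is acceptable, the split could look like this. The function name `three_way_split`, the 60/20/20 proportions and the toy `rows` list are illustrative assumptions, not values from the talk.

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows once and cut them into training, validation and test sets."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

rows = list(range(100))  # stand-in for 100 labelled examples (hypothetical data)
train, val, test = three_way_split(rows)
print(len(train), len(val), len(test))
```

For data with a time component, a chronological cut (oldest rows for training, newest for validation and test) may be a better fit than a random shuffle.
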
Validation and test sets should reflect the data you expect to see in the future
Follow dataset size heuristics
Choose one metric to iterate faster and have more focus
Always do your exploratory data analysis (EDA):
- density plots
- correlation plots
- box plots
When preprocessing use statistics based only on the training set
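
A small sketch of what this can mean for standardization: the mean and standard deviation are learned from the training set and then reused, unchanged, on any other set. The helper names `fit_standardizer` and `standardize` and the feature values are assumptions made for illustration.

```python
import statistics

def fit_standardizer(train_values):
    """Learn mean and standard deviation from the training set only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # constant feature: avoid division by zero
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics; nothing is refit on these values."""
    return [(v - mean) / std for v in values]

train_feature = [2.0, 4.0, 6.0, 8.0]  # hypothetical training values
test_feature = [5.0, 11.0]            # hypothetical unseen values

mean, std = fit_standardizer(train_feature)
train_scaled = standardize(train_feature, mean, std)
test_scaled = standardize(test_feature, mean, std)  # reuses the training mean and std
print(train_scaled, test_scaled)
```

Computing the statistics on the full dataset instead would leak information from the validation and test sets into training.
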
Increase the number of times you repeat 10-fold CV to get even more accurate estimates of performance
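
One way to read this rule is repeated k-fold cross-validation: reshuffle the data, run 10-fold CV again, and average over all folds of all repeats. The sketch below assumes that reading; `repeated_kfold_indices`, `repeated_cv_score`, the toy labels and the majority-class baseline are all hypothetical stand-ins for a real dataset and model.

```python
import random
import statistics

def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for n_repeats reshuffled k-fold runs."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Fold i takes every n_splits-th shuffled index, so fold sizes differ by at most 1.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
            yield train_idx, test_idx

def repeated_cv_score(score_fn, n_samples, **kwargs):
    """Average a user-supplied score over every fold of every repeat."""
    scores = [score_fn(tr, te) for tr, te in repeated_kfold_indices(n_samples, **kwargs)]
    return statistics.fmean(scores), statistics.stdev(scores), len(scores)

labels = [1 if i % 3 == 0 else 0 for i in range(60)]  # toy labels, purely illustrative

def majority_baseline_accuracy(train_idx, test_idx):
    """Hypothetical model: always predict the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

mean_acc, std_acc, n_scores = repeated_cv_score(
    majority_baseline_accuracy, len(labels), n_splits=10, n_repeats=3
)
print(f"{mean_acc:.3f} +/- {std_acc:.3f} over {n_scores} folds")
```

More repeats give a smoother estimate at the price of proportionally more training runs.
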
Use the Wilcoxon statistical test to choose between two models
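
The sketch below assumes the paired variant of the test (the Wilcoxon signed-rank test) applied to per-fold scores of two models evaluated on the same folds. It computes an exact two-sided p-value by enumerating sign patterns, which is only practical for a small number of pairs. The function name and the per-fold counts are made up for illustration and are not results from the talk.

```python
from itertools import product

def wilcoxon_signed_rank(scores_a, scores_b):
    """Exact two-sided Wilcoxon signed-rank test on paired scores.

    Zero differences are dropped and tied absolute differences share their
    average rank. Returns (statistic, p_value), where the statistic is the
    smaller of the positive and negative rank sums.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    if n > 20:
        raise ValueError("exact enumeration is only practical for small n")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        avg_rank = (start + end) / 2 + 1  # mean of the 1-based ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg_rank
        start = end + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    statistic = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n

# Hypothetical counts of correctly classified examples per fold (out of 100).
model_a = [81, 79, 84, 80, 83, 78, 82, 85, 80, 81]
model_b = [78, 77, 80, 79, 79, 79, 78, 80, 77, 76]

statistic, p_value = wilcoxon_signed_rank(model_a, model_b)
print(statistic, p_value)
```

A small p-value suggests the difference between the two models is unlikely to be due to chance alone; with larger samples a library routine based on the normal approximation would be the usual choice.
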
Time is money (person-months and cloud computing): start with a small dataset, debug, and then increase the size
If you don't have enough data, find or create more data
Decide if you strive for performance or interpretability
Learn the strong points of each ML model
Become a knowledgeable trader of bias-variance
Finish off with an ensemble
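
Ensembles come in many forms; as one simple illustration, assuming classifiers that each output a class label per example, a majority vote could be sketched as follows. The function `majority_vote` and the three prediction lists are hypothetical.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class predictions from several models by per-example majority vote."""
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest count for this example
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions of three models on the same five examples.
model_1 = [1, 0, 1, 1, 0]
model_2 = [1, 1, 0, 1, 0]
model_3 = [0, 0, 1, 1, 1]

ensemble = majority_vote([model_1, model_2, model_3])
print(ensemble)
```

Using an odd number of models avoids ties in the binary case.
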
Tune hyperparameters ... but up to a point
Start with a simple waterfall-like process:
- Study the problem
- EDA
- Define optimization strategy
- Do feature engineering
- Modelling
- Ensembling