This is the title of my invited talk at the Ask Me Anything (AMA) call of the Research Data Alliance (RDA) Early Career and Engagement Interest Group. The minutes of the call will be posted here.
The rules are summarized as follows:
- Always have 3 sets: training, validation, and test
- Validation and test sets should reflect the data you expect to see in the future
- Follow dataset size heuristics
- Choose one metric to iterate faster and have more focus
- Always do your exploratory data analysis (EDA):
  - density plots
  - correlation plots
  - box plots
- When preprocessing, use statistics computed only on the training set
- Repeat 10-fold CV more times to get more reliable estimates of performance
- Use the Wilcoxon statistical test to choose between two models
- Time is money (person-months and cloud computing): start with a small dataset, debug, and then increase the size
- If you don't have enough data, find or create more data
- Decide if you strive for performance or interpretability
- Learn the strong points of each ML model
- Become a knowledgeable trader of bias-variance
- Finish off with an ensemble
- Tune hyperparameters ... but up to a point
- Start with a simple waterfall-like process:
  - Study the problem
  - EDA
  - Define optimization strategy
  - Do feature engineering
  - Modelling
  - Ensembling
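
Three of the rules above (always have 3 sets, preprocess with training-set statistics only, and repeat 10-fold CV) can be sketched in a few lines. This is a minimal illustration, not the code from the talk: it assumes a toy one-feature dataset, uses only the Python standard library, and every function name in it is hypothetical. It also assumes a purely random split, which only satisfies the "reflect the data you expect to see in the future" rule when the data are independent and identically distributed.

```python
import random
import statistics


def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the rows and split them into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def fit_standardizer(train_values):
    """Compute the mean and standard deviation from the training set only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero variance
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics to any set (train, validation, or test)."""
    return [(v - mean) / std for v in values]


def repeated_kfold_indices(n, k=10, repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold CV, repeated with fresh shuffles."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, test_idx


# Toy data (an assumption for illustration): one numeric feature, 100 rows.
data = [float(x) for x in range(100)]

train, val, test = three_way_split(data)

# The statistics come from the training set and are reused unchanged elsewhere.
mean, std = fit_standardizer(train)
train_s = standardize(train, mean, std)
val_s = standardize(val, mean, std)
test_s = standardize(test, mean, std)

# 10-fold CV on the training set, repeated 3 times, gives 30 train/test index pairs.
splits = list(repeated_kfold_indices(len(train), k=10, repeats=3))
```

Raising `repeats` yields more performance estimates per model, at the cost of more training runs, so it is worth doing on the small debugging dataset first.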
Enjoy!