Simple rules for building robust machine learning models

by Kyriakos Chatzidimitriou | May 16, 2019 12:29 | talks

Tags: presentation, rda, machine learning, talk

This is the title of my invited talk in the Ask Me Anything (AMA) call of the Research Data Alliance (RDA) Early Career and Engagement Interest Group. The minutes of the call will be posted here.

The rules are summarized as follows:

  1. Always have 3 sets:

    • training
    • validation
    • test
  2. Validation and test sets should reflect the data you expect to see in the future
  3. Follow dataset size heuristics
  4. Choose a single metric so you can iterate faster and stay focused
  5. Always do your exploratory data analysis:

    • density plots
    • correlation plots
    • box plots
  6. When preprocessing, use statistics based only on the training set
  7. Repeat 10-fold CV more times to get even more accurate estimates of performance
  8. Use the Wilcoxon statistical test to choose between two models
  9. Time is money (person-months and cloud computing): start with a small dataset, debug, and then increase the size
  10. If you don't have enough data, find or create more data
  11. Decide if you strive for performance or interpretability
  12. Learn the strong points of each ML model
  13. Become a knowledgeable trader of bias-variance
  14. Finish off with an ensemble
  15. Tune hyperparameters ... but up to a point
  16. Start with a simple waterfall-like process:

    • Study the problem
    • EDA
    • Define optimization strategy
    • Do feature engineering
    • Modelling
    • Ensembling
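
To make rules 1, 6 and 7 a bit more concrete, here is a minimal sketch in plain Python (standard library only). It is not code from the talk: the function names (three_way_split, fit_standardizer, standardize, repeated_kfold), the 60/20/20 split fractions, the three repeats and the toy one-feature dataset are all assumptions made for illustration. Note that a random shuffle is only reasonable when the shuffled data resembles what you expect to see in the future (rule 2); for time-ordered data a chronological split would be the safer choice.

```python
import random
import statistics


def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def fit_standardizer(train_values):
    """Compute mean and standard deviation from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid division by zero for constant features
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(v - mean) / std for v in values]


def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (train_idx, held_out_idx) pairs for `repeats` reshuffled runs of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i, held_out in enumerate(folds):
            train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train_idx, held_out


# Hypothetical toy dataset: 100 samples with a single numeric feature.
data = [float(x) for x in range(100)]

# Rule 1: three sets.
train, val, test = three_way_split(data)

# Rule 6: statistics come from the training set and are reused elsewhere.
mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
val_z = standardize(val, mean, std)
test_z = standardize(test, mean, std)

# Rule 7: 10-fold CV on the training set, repeated with fresh shuffles.
splits = list(repeated_kfold(len(train), k=10, repeats=3))
```

Each repeat reshuffles the training indices before cutting them into 10 folds, so the 30 resulting scores (one per split, once a model is plugged in) are averaged over different partitions rather than over a single lucky or unlucky one. The validation and test sets are never touched when fitting the standardizer, which is the point of rule 6.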