I am writing this post in response to Kent C. Dodds' Call for Action in the area of intentional career building. In that post, Kent C. Dodds discusses (possibly reproducible) ideas on how he built his career (by creating and communicating value). One of the proposed actions was "Answer your co-worker's question in a public space (YouTube, gist, etc.) and share it".
In this blog post I am going ahead and answering a student's question in a public space. In particular, a student asked me whether one should eliminate collinearity - using the Variance Inflation Factor (VIF), for example - before using a feature selection algorithm. I'll do my best to provide an insightful answer, and to do that I will combine my own knowledge, some experimentation, and different resources I found on the Internet.
More posts like this will follow. It is a way of finding ideas and writing posts that help you become a better communicator of ideas and concepts, creating content and value and helping others along the way. But I am stalling, so let's start.
Resources
I read and used the following resources on the subject:
Collinearity
Intercorrelation or multicollinearity is the existence of predictor variables that are (highly) correlated among themselves. For example, when we try to predict family food expenditures, predictors such as family income, family savings, and age of the head of household are correlated among themselves: the older you are, the more money and savings you probably have, and vice versa.
In the presence of perfect collinearity, i.e. a feature X1 = α + β∗X2 (1), many different sets of coefficient values would predict the response variable Y equally well after performing Ordinary Least Squares (OLS). Given all these solutions, one would not be able to say anything definite about the effect X1 and X2 have on Y. In addition, if a new sample arrives that we want to predict and it does not follow equation (1), the prediction error will probably be very large. Finally, the regression coefficients of multicollinear predictor variables can differ greatly from what they would be if the predictors were not correlated. In practice, though, perfect collinearity is rarely the case, since there is usually also an error component in the relationship between the features.
In plain words: if there is a "nice" relationship between X1 and X2 in the training data, but new, disturbed data arrive that break it, don't expect good predictions. On the other hand, if the relationship between X1 and X2 is fuzzy and it continues to be fuzzy, then you won't have a problem. In general, collinearity causes problems for the interpretability of the model. Prediction is not hurt as long as the new samples follow the same (multi)collinear pattern.
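To make this concrete, here is a minimal sketch with simulated (hypothetical) data: two almost perfectly collinear predictors give unstable individual coefficients, even though the overall fit is fine. Refitting on a fresh sample will typically give very different values for the two coefficients.
# Minimal sketch with simulated data: x2 is almost a perfect linear function of x1
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 2 * x1 + rnorm(n, sd = 0.01)   # near-perfect collinearity: x2 is roughly 2 * x1
y  <- 3 * x1 + rnorm(n)
coef(lm(y ~ x1 + x2))   # the individual coefficients are unstable and hard to interpret
cor(x1, x2)             # essentially 1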
Variance Inflation Factor (VIF)
The formal method for detecting the presence of multicollinearity is the Variance Inflation Factor (VIF). VIF measures how much the variances of the estimated regression coefficients are inflated compared to when the predictor variables are not linearly related. The VIF of a predictor Xn is 1 when Xn is not linearly related to the other predictors, and greater than 1 in the presence of intercorrelations with other features. A VIF of more than 10 is a common heuristic indicating that collinearity is influencing the regression.
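For reference, the VIF of a predictor Xj equals 1 / (1 - R^2), where R^2 comes from regressing Xj on all the other predictors. Below is a small sketch of that computation; the function name is mine and it assumes a data frame that contains only the predictor columns (later in the notebook I simply use car::vif, which should give the same values for a linear model).
# Sketch: VIF of one predictor computed "by hand" as 1 / (1 - R^2),
# where R^2 comes from regressing that predictor on all the other predictors.
# `df` is assumed to be a data frame containing only predictor columns.
vif_by_hand <- function(df, target) {
  others <- setdiff(names(df), target)
  r2 <- summary(lm(reformulate(others, response = target), data = df))$r.squared
  1 / (1 - r2)
}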
One can drop one or more collinear variables from the model, but 1) you get no further insight into whether the dropped variables help or hurt the prediction and 2) the coefficients of the remaining variables will change. One can also apply Principal Components Analysis (PCA), which provides new, uncorrelated variables. Of course, the new variables will not have any physical meaning whatsoever, which again hurts interpretability. Another remedial measure against serious collinearity is Ridge Regression, which through regularization gives preference to one solution over the others.
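As a quick illustration of the PCA option, here is a minimal sketch; `predictors` is a placeholder name for a numeric data frame of the correlated features.
# Sketch: replace correlated predictors with uncorrelated principal components.
# `predictors` is assumed to be a numeric data frame of the original features.
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)
summary(pca)           # proportion of variance explained by each component
scores <- pca$x        # the new, uncorrelated variables (component scores)
round(cor(scores), 2)  # off-diagonal correlations are (numerically) zero
One would then keep the first few components as the new predictors, at the cost of losing the physical meaning of the original variables.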
On the other hand, in Machine Learning we most often care about robust predictions and models that generalize well, rather than about the interpretability of the models. (Sidenote: I believe this will change, because we often want to know why a machine learning model produced this or that prediction... think of bank credit scoring systems, or autonomous cars using deep neural networks, which must adhere to certain laws as well.) The right amount of regularization (a form of the bias-variance trade-off; examples include the λ factor in Ridge Regression, the number of variables sampled in Random Forests, or the regularization parameter C in Support Vector Machines) is usually found through cross-validation. Of course, it is always good to know about the existence of collinearity, since in principle, when the collinearity relationship changes, we can get large prediction errors.
To conclude, if you are interested in the effects of the predictor variables on the response and in the interpretability of the model, do care about collinearity. If you are only interested in the predictive ability of a model, then you can skip it and follow the regular machine learning workflow. But it is always good to check.
Below you can find experiments on various cases, along with my comments, done in R to support the arguments above. The Rmd document can be found here and the rendered HTML document here.
Rmd document for Collinearity and Feature Selection
Intro
This notebook is an online appendix of my blog post, On Collinearity and Feature Selection, where I play with the concepts using R code.
The dataset
We will use the auto-mpg dataset, where we will try to predict the miles per gallon (mpg) consumption given some car-related features like horsepower, weight, etc.
set.seed(1234)
fileURL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
download.file(fileURL, destfile="auto-mpg.data", method="curl")
data <- read.table("auto-mpg.data", na.strings = "?", quote='"', dec=".", header=F)
# remove instances with missing values and the name of the car
data <- data[complete.cases(data),-9]
summary(data)
## V1 V2 V3 V4
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
## V5 V6 V7 V8
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2978 Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
Preprocessing
Let's normalize the dataset (so that values lie in the interval [0,1]), an operation that preserves the correlation between the variables, and split it into training and test sets.
normalize <- function(x) {
(x - min(x, na.rm=TRUE))/(max(x,na.rm=TRUE) - min(x, na.rm=TRUE))
}
normData <- cbind(data[,1], as.data.frame(lapply(data[,-1], normalize)))
# name variables
names(normData) <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year", "origin")
# check correlation
cat("Cor between disp. and weight before norm.:", cor(data$V3, data$V5), "\n")
## Cor between disp. and weight before norm.: 0.9329944
cat("After norm.:", cor(normData$displacement, normData$weight))
## After norm.: 0.9329944
# Train/Test split
library(tidyverse)
library(caret)
training.samples <- normData$mpg %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- normData[training.samples, ]
test.data <- normData[-training.samples, ]
Modelling
Linear Regression
linearModel <- lm(mpg ~., data = train.data)
summary(linearModel)
##
## Call:
## lm(formula = mpg ~ ., data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4147 -2.1845 -0.1875 1.7702 12.8931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.4017 1.3012 19.521 < 2e-16 ***
## cylinders -2.5022 1.8711 -1.337 0.1821
## displacement 7.7778 3.2280 2.409 0.0166 *
## horsepower -2.8404 2.8077 -1.012 0.3125
## weight -22.9293 2.5394 -9.029 < 2e-16 ***
## acceleration 2.0678 1.8409 1.123 0.2622
## model_year 9.3472 0.7023 13.309 < 2e-16 ***
## origin 2.9604 0.6383 4.638 5.21e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.389 on 307 degrees of freedom
## Multiple R-squared: 0.8175, Adjusted R-squared: 0.8134
## F-statistic: 196.5 on 7 and 307 DF, p-value: < 2.2e-16
One can see that weight is a very important factor, both in terms of the coefficient value and in terms of statistical significance (it is unlikely to observe such a relationship between weight and mpg due to chance). Notice the negative coefficient (more weight, fewer miles per gallon), which can be explained by the laws of physics. But also notice that even though weight and displacement have a correlation of 0.93 (almost collinear), their coefficients have different signs. Based on common knowledge, though, they should have had the same sign. Collinearity is bad when you try to explain the outputs of models. Let's examine the VIF values:
library(car)
vif(linearModel)
## cylinders displacement horsepower weight acceleration
## 10.797995 19.995685 8.952301 10.191612 2.424953
## model_year origin
## 1.223735 1.794710
We observe that 3 predictors have a VIF value of more than 10, which is a concern regarding the existence of collinearity (or rather multicollinearity in this case). So let's drop displacement and create a second model:
linearModelMinusDisp <- lm(mpg ~.-displacement, data = train.data)
summary(linearModelMinusDisp)
##
## Call:
## lm(formula = mpg ~ . - displacement, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4776 -2.2073 -0.1473 1.7625 12.9679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.4241 1.3113 19.388 < 2e-16 ***
## cylinders 0.5072 1.4040 0.361 0.718
## horsepower -1.1352 2.7382 -0.415 0.679
## weight -20.6134 2.3688 -8.702 < 2e-16 ***
## acceleration 1.5673 1.8434 0.850 0.396
## model_year 9.2684 0.7070 13.110 < 2e-16 ***
## origin 2.4812 0.6112 4.060 6.24e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.415 on 308 degrees of freedom
## Multiple R-squared: 0.8141, Adjusted R-squared: 0.8104
## F-statistic: 224.7 on 6 and 308 DF, p-value: < 2.2e-16
vif(linearModelMinusDisp)
## cylinders horsepower weight acceleration model_year
## 5.986398 8.383554 8.731646 2.394082 1.221086
## origin
## 1.620491
and let's check the predictive ability of the two:
predLM <- linearModel %>% predict(test.data)
predLMMD <- linearModelMinusDisp %>% predict(test.data)
cat("Full model:", RMSE(predLM, test.data$mpg), "\n")
## Full model: 3.098026
cat("Minus disp. model:", RMSE(predLMMD, test.data$mpg))
## Minus disp. model: 3.125825
As one can see, I now have a more "understandable" model, with a somewhat "worse" predictive ability (slightly higher error).
Now let's remove all the variables that had a VIF value of more than 10:
linearModelSimpler <- lm(mpg ~.-displacement-cylinders-weight, data = train.data)
summary(linearModelSimpler)
##
## Call:
## lm(formula = mpg ~ . - displacement - cylinders - weight, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.832 -2.415 -0.533 1.930 12.794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.7618 1.4362 20.027 < 2e-16 ***
## horsepower -24.3357 1.6960 -14.349 < 2e-16 ***
## acceleration -6.9758 1.8700 -3.730 0.000227 ***
## model_year 8.0722 0.8034 10.048 < 2e-16 ***
## origin 4.9410 0.6324 7.813 8.76e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.938 on 310 degrees of freedom
## Multiple R-squared: 0.7511, Adjusted R-squared: 0.7479
## F-statistic: 233.9 on 4 and 310 DF, p-value: < 2.2e-16
vif(linearModelSimpler)
## horsepower acceleration model_year origin
## 2.418236 1.852449 1.185461 1.304459
predLMS <- linearModelSimpler %>% predict(test.data)
cat("Even simpler model - RMSE:", RMSE(predLMS, test.data$mpg), "\n")
## Even simpler model - RMSE: 3.575336
Now the model is simpler and more explainable, without any collinearities (low VIF values), but not as good in terms of RMSE as the previous ones.
Feature Selection
To also check a feature selection method, let's run stepwise feature selection using the Akaike Information Criterion (AIC):
require(leaps)
require(MASS)
step.model <- stepAIC(linearModel, direction = "both", trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = mpg ~ displacement + weight + acceleration + model_year +
## origin, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1253 -2.1573 -0.1547 1.8907 12.8220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.4535 1.0334 23.663 < 2e-16 ***
## displacement 4.3108 2.3178 1.860 0.0639 .
## weight -24.4212 2.2367 -10.918 < 2e-16 ***
## acceleration 3.1711 1.4748 2.150 0.0323 *
## model_year 9.5092 0.6811 13.963 < 2e-16 ***
## origin 2.7723 0.6225 4.453 1.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.392 on 309 degrees of freedom
## Multiple R-squared: 0.816, Adjusted R-squared: 0.813
## F-statistic: 274 on 5 and 309 DF, p-value: < 2.2e-16
linearModelAIC <- lm(as.formula(step.model), data = train.data)
vif(linearModelAIC)
## displacement weight acceleration model_year origin
## 10.288515 7.890980 1.553344 1.148540 1.703980
predLMAIC <- linearModelAIC %>% predict(test.data)
cat("AIC model:", RMSE(predLMAIC, test.data$mpg), "\n")
## AIC model: 3.114763
Through this example we can see that even though we have collinearities involved (a VIF value of 10+), we obtain a low RMSE of 3.11. Or, using another feature selection package:
# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model2 <- train(mpg ~., data = train.data,
method = "leapBackward",
tuneGrid = data.frame(nvmax = 1:7),
trControl = train.control
)
step.model2$results
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 4.387909 0.6950849 3.378332 0.9854225 0.09849010 0.6697571
## 2 2 3.407376 0.8137023 2.661535 0.9123314 0.06485389 0.5518240
## 3 3 3.333900 0.8224665 2.549990 0.8779426 0.06157152 0.5311275
## 4 4 3.371991 0.8188230 2.592064 0.8832272 0.06214344 0.5410169
## 5 5 3.395429 0.8170493 2.618763 0.8325787 0.05722386 0.4970754
## 6 6 3.404866 0.8163436 2.633234 0.7979961 0.05357786 0.4526265
## 7 7 3.384003 0.8180447 2.618266 0.8100225 0.05401179 0.4619242
summary(step.model2$finalModel)
## Subset selection object
## 7 Variables (and intercept)
## Forced in Forced out
## cylinders FALSE FALSE
## displacement FALSE FALSE
## horsepower FALSE FALSE
## weight FALSE FALSE
## acceleration FALSE FALSE
## model_year FALSE FALSE
## origin FALSE FALSE
## 1 subsets of each size up to 3
## Selection Algorithm: backward
## cylinders displacement horsepower weight acceleration model_year
## 1 ( 1 ) " " " " " " "*" " " " "
## 2 ( 1 ) " " " " " " "*" " " "*"
## 3 ( 1 ) " " " " " " "*" " " "*"
## origin
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) "*"
coef(step.model2$finalModel, 3)
## (Intercept) weight model_year origin
## 26.190907 -21.244932 9.522434 2.370324
The best model has 3 predictors: weight, model_year and origin. So, fitting one final linear regression model and predicting the mpg in the test set, we have:
linearModelBest <- lm(mpg ~weight+model_year+origin, data = train.data)
summary(linearModelBest)
##
## Call:
## lm(formula = mpg ~ weight + model_year + origin, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8638 -2.1894 -0.0388 1.7413 13.0971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.1909 0.6744 38.834 < 2e-16 ***
## weight -21.2449 1.0134 -20.965 < 2e-16 ***
## model_year 9.5224 0.6649 14.321 < 2e-16 ***
## origin 2.3703 0.5942 3.989 8.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.412 on 311 degrees of freedom
## Multiple R-squared: 0.8126, Adjusted R-squared: 0.8108
## F-statistic: 449.6 on 3 and 311 DF, p-value: < 2.2e-16
vif(linearModelBest)
## weight model_year origin
## 1.601095 1.082269 1.534712
predLMBest <- linearModelBest %>% predict(test.data)
cat("Best model:", RMSE(predLMBest, test.data$mpg), "\n")
## Best model: 3.096694
3.09!!! The lowest error, with only 3 predictors and no collinearities involved. In this case, feature selection went along with a final model that is interpretable as well.
Ridge Regression
One solution to the collinearity problem (without doing any feature selection) is to apply Ridge Regression, which "constrains" the many possible solutions for the beta coefficients into a single one. In this case we have the hyperparameter lambda to optimize, for which we will apply 10-fold cross-validation on the training set to find the best value, and then use that value to train the model and predict mpg on the test set.
library(glmnet)
y <- train.data$mpg
x <- train.data %>% dplyr::select(-starts_with("mpg")) %>% data.matrix()
lambdas <- 10^seq(3, -2, by = -.1)
fit <- glmnet(x, y, alpha = 0, lambda = lambdas)
cv_fit <- cv.glmnet(x, y, alpha = 0, lambda = lambdas, nfolds = 10)
# uncomment the plot to see how lambda changes the error.
#plot(cv_fit)
opt_lambda <- cv_fit$lambda.min
x_test <- test.data %>% dplyr::select(-starts_with("mpg")) %>% data.matrix()
y_predicted <- predict(fit, s = opt_lambda, newx = x_test)
cat("Ridge RMSE:", RMSE(y_predicted, test.data$mpg))
## Ridge RMSE: 3.096991
Ridge Regression produced one of the lowest errors, and without dropping any of the predictors. As for the coefficients' values:
coef(cv_fit)
## 8 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 26.7194742
## cylinders -2.1671592
## displacement -1.6690043
## horsepower -5.5116891
## weight -11.7117393
## acceleration -0.3212087
## model_year 7.8818066
## origin 2.6205643
which, as we can see, gives a much more reasonable and physically explainable model. The more cylinders, displacement, horsepower, weight and acceleration... the fewer miles per gallon you can go, while the more recent the model year and the closer the origin is to 2 (Europe) or 3 (Japan), the more miles per gallon you get. As for the RMSE, it is close both to that of the full model and to that of the optimized model using feature selection.
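One small detail worth noting: as far as I know, coef() on a cv.glmnet object defaults to s = "lambda.1se", not to the lambda.min value used for the predictions above. To inspect the coefficients at that exact lambda, one can pass s explicitly:
# Coefficients at the lambda actually used for the predictions above (lambda.min);
# coef() on a cv.glmnet object otherwise defaults to s = "lambda.1se".
coef(cv_fit, s = "lambda.min")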
Discussion
Collinearity is important if you need an understandable model. If you don't, and you just care about predictive ability, you can be more brute-force about it and just look at the numbers.