On collinearity and feature selection

by Kyriakos Chatzidimitriou | Mar 22, 2019 15:00 | tutorials

collinearityfeature selectionridge regressionvifvariable inflation factormulticollinearityr

I am writing this post as an action to Kent C. Dodds Call for Action in the area of intentional carreer building. In that post Kent C. Dodds discusses (possibly reproducible) ideas on how he built his carreer (by creating and communicating value). One of the proposed actions was "Answer your co-worker's question in a public space (YouTube, gist, etc.) and share it".

In this blog post I am going ahead and answering a student's question in public space. In particular, I got asked by a student, whether one should elliminate collinearity - using Variance Inflation Factor (VIF) for example - before using a feature selection algorithm. I'll do my best to provide an insightful answer and to do that I will be fusing my knowledge, experimentation and different resources I found on the Internet.

More posts like that will follow. It is a way of finding ideas and writing posts that help you become a better communicator of ideas and concepts, creating content and value and helping other along the way. But I am stalling, so let's start.

## Resources

I read and used the following resources on the subject:

## Collinearity

Intercorrelation or multi-collinearity is the existence of predictor variables that are (highly) correlated among themselves. For example family income, family savings, age of head of household are correlated among themselves in an example when we try to predict family food expenditures. The older you are, you probably have more money and more savings and vice-versa.

In the presence of perfect collinearity, i.e. feature $X_1 = \alpha + \beta * X_2$ (1), we could have many different coefficient values of features that would predict response variable $Y$ equally well after performing Ordinary Least Squares (OLS). So given all these solutions, one would not be able to say something regarding the effect $X_1$ and $X_2$ have on Y. In addition, if a new sample arrives that we want to predict and does not follow equation (1), the prediction error will be probably very big. Finally, the regression coefficients of any multicollinear predictor variables can be very different in the presence of non-correlated variables. In practice though, this is rarelly the case since there is also an error component to the relationship.

In plain words: If there is a "nice" relationaship between $X_1$ and $X_2$, when new disturbed data arrive, don't expect to have good predictions. On the other hand, if the relationship between $X_1$ and $X_2$ is fuzzy and it continuous to be fuzzy then you won't have a problem. In general collinearity causes problems to the interpretability of the model. Prediction is not hurt as long as the new samples that arrive follow the same (multi)collinear pattern.

## Variance Inflation Factor (VIF)

The formal method for detecting the presence of multicollinearity is Variance Inflation Factor (VIF). VIF measures how much the variances of the estimated regression coefficients are inflated as compared to when the predictor variables are not linearly related. VIF is 1 when $X_n$ is not linearly related to other predictors and greater than 1 in the presence of intercorrelations with other features. A VIF of more than 10 is a heuristic for the indication that colinearity is influencing regression.

One can drop one or more collinear variables from the model, but 1) you get no other insight on whether the dropped variables hurt or not the prediction 2) the coefficients of the remaining variables will change. One can also do Principal Components Analysis, which will provide new uncorrelated variables. Of course the new variables will not have any physical meaning though whatsoever, hurting again interpretability. Remedial measures against serious collinearity may be Ridge Regression, which through regularization it gives preference to one solution over the others.

On the other hand, in Machine Learning we most often care about robust predictions and models that can generalize well, rather than issues regarding the interpretability of the models. (Sidenote: I believe this will change because we would often like to know why a machine leanring model provided this or that prediction...think of bank credit scoring systems or autonomous cars using deep neural networks that must adhere to certain laws as well.) The balance of how much regularization is needed (a form of a bias-variance trade-off with examples: the $\lambda$ factor in Ridge Regression, number of variables sampled in Random Forrests or regularization parameter C in Support Vector Machines), is useually found through cross-validation. Of course it is always good to know about the existence of collinearity, since in principle when the collinearity equation changes, we can have large prediction errors.

To conclude, if you are interested in the effects of the predictor variables to the response and the interpretability of the model, do care about collinearity. If you are interested about the predictive abilities of a model then you can skip it and follow regular machine learning flows. But it is always good to check.

Below one can find an experimentation of various cases along with comments I've done using R to prove the arguments above. The Rmd document can be found here and the rendered html document here.

# Rmd document for Collinearity and Feature Selection

## Intro

This notebook is an online appendix of my blog post: On Collinearity and Feature Selection, where I play with the concepts using R code.

## The dataset

We will use the auto-mpg dataset, where we will try to predict the miles per galon (mpg) consumption given some car related features like horsepower, weight etc.

set.seed(1234)
fileURL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
# remove instances with missing values and the name of the car
data <- data[complete.cases(data),-9]
summary(data)

##        V1              V2              V3              V4
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0
##        V5             V6              V7              V8
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000
##  1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000
##  Median :2804   Median :15.50   Median :76.00   Median :1.000
##  Mean   :2978   Mean   :15.54   Mean   :75.98   Mean   :1.577
##  3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000


## Preprocessing

Let's normalize the dataset (values to be in the interval [0,1]), an operation that will maintain the correllation between the variables, and split between training and testing.

normalize <- function(x) {
(x - min(x, na.rm=TRUE))/(max(x,na.rm=TRUE) - min(x, na.rm=TRUE))
}
normData <- cbind(data[,1], as.data.frame(lapply(data[,-1], normalize)))

# name variables
names(normData) <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year", "origin")

# check correlation
cat("Cor between disp. and weight before norm.:", cor(data$V3, data$V5), "\n")

## Cor between disp. and weight before norm.: 0.9329944

cat("After norm.:", cor(normData$displacement, normData$weight))

## After norm.: 0.9329944

# Train/Test split
library(tidyverse)
library(caret)
training.samples <- normData$mpg %>% createDataPartition(p = 0.8, list = FALSE) train.data <- normData[training.samples, ] test.data <- normData[-training.samples, ]  ## Modelling ### Linear Regression linearModel <- lm(mpg ~., data = train.data) summary(linearModel) ## ## Call: ## lm(formula = mpg ~ ., data = train.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.4147 -2.1845 -0.1875 1.7702 12.8931 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 25.4017 1.3012 19.521 < 2e-16 *** ## cylinders -2.5022 1.8711 -1.337 0.1821 ## displacement 7.7778 3.2280 2.409 0.0166 * ## horsepower -2.8404 2.8077 -1.012 0.3125 ## weight -22.9293 2.5394 -9.029 < 2e-16 *** ## acceleration 2.0678 1.8409 1.123 0.2622 ## model_year 9.3472 0.7023 13.309 < 2e-16 *** ## origin 2.9604 0.6383 4.638 5.21e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.389 on 307 degrees of freedom ## Multiple R-squared: 0.8175, Adjusted R-squared: 0.8134 ## F-statistic: 196.5 on 7 and 307 DF, p-value: < 2.2e-16  One can see that weight is a very important factor both in terms of the coefficient value and in terms of statistical significance (unlikely to observe a relationship between weight and mpg due to change). Notice the negative coefficient (more weight, less miles per gallon), which can be explained by the laws of physics. But also notice that even though weight and diplacement have a correlation o 0.93 (almost collinear), their coefficients have different signs. Based on common knowledge though they should have had the same signs. Collinearity is bad when you try and explain the outputs of models. Let's examine the VIF values: library(car) vif(linearModel) ## cylinders displacement horsepower weight acceleration ## 10.797995 19.995685 8.952301 10.191612 2.424953 ## model_year origin ## 1.223735 1.794710  We observe that 3 predictors have a value more than 10 that is a concern for the existence of collinearity (or multicollinearity in this case). So let's drop displacement and create a second model: linearModelMinusDisp <- lm(mpg ~.-displacement, data = train.data) summary(linearModelMinusDisp) ## ## Call: ## lm(formula = mpg ~ . - displacement, data = train.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.4776 -2.2073 -0.1473 1.7625 12.9679 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 25.4241 1.3113 19.388 < 2e-16 *** ## cylinders 0.5072 1.4040 0.361 0.718 ## horsepower -1.1352 2.7382 -0.415 0.679 ## weight -20.6134 2.3688 -8.702 < 2e-16 *** ## acceleration 1.5673 1.8434 0.850 0.396 ## model_year 9.2684 0.7070 13.110 < 2e-16 *** ## origin 2.4812 0.6112 4.060 6.24e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.415 on 308 degrees of freedom ## Multiple R-squared: 0.8141, Adjusted R-squared: 0.8104 ## F-statistic: 224.7 on 6 and 308 DF, p-value: < 2.2e-16 vif(linearModelMinusDisp) ## cylinders horsepower weight acceleration model_year ## 5.986398 8.383554 8.731646 2.394082 1.221086 ## origin ## 1.620491  and let's check the predictive ability of the two: predLM <- linearModel %>% predict(test.data) predLMMD <- linearModelMinusDisp %>% predict(test.data) cat("Full model:", RMSE(predLM, test.data$mpg), "\n")

## Full model: 3.098026

cat("Minus disp. model:", RMSE(predLMMD, test.data$mpg)) ## Minus disp. model: 3.125825  As one can see I now have a more "understandable" model, with kind of a "worse" predictive ability (slightly higher error). Now let's remove all the variables that had a VIF value of more than 10 linearModelSimpler <- lm(mpg ~.-displacement-cylinders-weight, data = train.data) summary(linearModelSimpler) ## ## Call: ## lm(formula = mpg ~ . - displacement - cylinders - weight, data = train.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.832 -2.415 -0.533 1.930 12.794 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 28.7618 1.4362 20.027 < 2e-16 *** ## horsepower -24.3357 1.6960 -14.349 < 2e-16 *** ## acceleration -6.9758 1.8700 -3.730 0.000227 *** ## model_year 8.0722 0.8034 10.048 < 2e-16 *** ## origin 4.9410 0.6324 7.813 8.76e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.938 on 310 degrees of freedom ## Multiple R-squared: 0.7511, Adjusted R-squared: 0.7479 ## F-statistic: 233.9 on 4 and 310 DF, p-value: < 2.2e-16 vif(linearModelSimpler) ## horsepower acceleration model_year origin ## 2.418236 1.852449 1.185461 1.304459 predLMS <- linearModelSimpler %>% predict(test.data) cat("Even simpler model - RMSE:", RMSE(predLMS, test.data$mpg), "\n")

## Even simpler model - RMSE: 3.575336


Now the model is simple, more explainable, without any colinearities (low VIF values) but not as good in terms of RMSE as the previous ones.

## Feature Selection

To check also a feature selection method, stepwise feature selection using the Akaike Information Criterion (AIC):

require(leaps)
require(MASS)
step.model <- stepAIC(linearModel, direction = "both", trace = FALSE)
summary(step.model)

##
## Call:
## lm(formula = mpg ~ displacement + weight + acceleration + model_year +
##     origin, data = train.data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.1253 -2.1573 -0.1547  1.8907 12.8220
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   24.4535     1.0334  23.663  < 2e-16 ***
## displacement   4.3108     2.3178   1.860   0.0639 .
## weight       -24.4212     2.2367 -10.918  < 2e-16 ***
## acceleration   3.1711     1.4748   2.150   0.0323 *
## model_year     9.5092     0.6811  13.963  < 2e-16 ***
## origin         2.7723     0.6225   4.453 1.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.392 on 309 degrees of freedom
## Multiple R-squared:  0.816,  Adjusted R-squared:  0.813
## F-statistic:   274 on 5 and 309 DF,  p-value: < 2.2e-16

linearModelAIC <- lm(as.formula(step.model), data = train.data)
vif(linearModelAIC)

## displacement       weight acceleration   model_year       origin
##    10.288515     7.890980     1.553344     1.148540     1.703980

predLMAIC <- linearModelAIC %>% predict(test.data)
cat("AIC model:", RMSE(predLMAIC, test.data$mpg), "\n") ## AIC model: 3.114763  Through this example we can see than even though we have colinearities involved (VIF value of 10+), we obtain a low RMSE of 3.11. Or using another feature selection package: # Set up repeated k-fold cross-validation train.control <- trainControl(method = "cv", number = 10) # Train the model step.model2 <- train(mpg ~., data = train.data, method = "leapBackward", tuneGrid = data.frame(nvmax = 1:7), trControl = train.control ) step.model2$results

##   nvmax     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
## 1     1 4.387909 0.6950849 3.378332 0.9854225 0.09849010 0.6697571
## 2     2 3.407376 0.8137023 2.661535 0.9123314 0.06485389 0.5518240
## 3     3 3.333900 0.8224665 2.549990 0.8779426 0.06157152 0.5311275
## 4     4 3.371991 0.8188230 2.592064 0.8832272 0.06214344 0.5410169
## 5     5 3.395429 0.8170493 2.618763 0.8325787 0.05722386 0.4970754
## 6     6 3.404866 0.8163436 2.633234 0.7979961 0.05357786 0.4526265
## 7     7 3.384003 0.8180447 2.618266 0.8100225 0.05401179 0.4619242

summary(step.model2$finalModel) ## Subset selection object ## 7 Variables (and intercept) ## Forced in Forced out ## cylinders FALSE FALSE ## displacement FALSE FALSE ## horsepower FALSE FALSE ## weight FALSE FALSE ## acceleration FALSE FALSE ## model_year FALSE FALSE ## origin FALSE FALSE ## 1 subsets of each size up to 3 ## Selection Algorithm: backward ## cylinders displacement horsepower weight acceleration model_year ## 1 ( 1 ) " " " " " " "*" " " " " ## 2 ( 1 ) " " " " " " "*" " " "*" ## 3 ( 1 ) " " " " " " "*" " " "*" ## origin ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) "*" coef(step.model2$finalModel, 3)

## (Intercept)      weight  model_year      origin
##   26.190907  -21.244932    9.522434    2.370324


The best model has 3 predictors: weight, model_year and origin. So making one more final linear regression model and predicting the mpg in the test set we have:

linearModelBest <- lm(mpg ~weight+model_year+origin, data = train.data)
summary(linearModelBest)

##
## Call:
## lm(formula = mpg ~ weight + model_year + origin, data = train.data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.8638 -2.1894 -0.0388  1.7413 13.0971
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  26.1909     0.6744  38.834  < 2e-16 ***
## weight      -21.2449     1.0134 -20.965  < 2e-16 ***
## model_year    9.5224     0.6649  14.321  < 2e-16 ***
## origin        2.3703     0.5942   3.989 8.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.412 on 311 degrees of freedom
## Multiple R-squared:  0.8126, Adjusted R-squared:  0.8108
## F-statistic: 449.6 on 3 and 311 DF,  p-value: < 2.2e-16

vif(linearModelBest)

##     weight model_year     origin
##   1.601095   1.082269   1.534712

predLMBest <- linearModelBest %>% predict(test.data)
cat("Best model:", RMSE(predLMBest, test.data$mpg), "\n") ## Best model: 3.096694  3.09!!! Lowest error, with only 3 predictors and no collinearities involved. In this case feature selection went along with a fina model that is interpretable as well. ## Ridge Regression One solution to the collinearity problem (without doing any feature selection) is to apply Ridge Regression and to try to "constrain" the number of solution of the beta coefficients into a single solution. In this case since we have the hyperparameter lambda to optimize for which we will apply 10-fold cross-validation on the training set to find the best value and then use that to train the model and predict mpg in the testing dataset. library(glmnet) y <- train.data$mpg
x <- train.data %>% dplyr::select(-starts_with("mpg")) %>% data.matrix()
lambdas <- 10^seq(3, -2, by = -.1)
fit <- glmnet(x, y, alpha = 0, lambda = lambdas)
cv_fit <- cv.glmnet(x, y, alpha = 0, lambda = lambdas, nfolds = 10)
# uncomment the plot to see how lambda changes the error.
#plot(cv_fit)
opt_lambda <- cv_fit$lambda.min x_test <- test.data %>% dplyr::select(-starts_with("mpg")) %>% data.matrix() y_predicted <- predict(fit, s = opt_lambda, newx = x_test) cat("Ridge RMSE:", RMSE(y_predicted, test.data$mpg))

## Ridge RMSE: 3.096991


The Ridge Regression produced one of the lowest error and without dropping any of the coefficients. And as for the coefficients' values:

coef(cv_fit)

## 8 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)   26.7194742
## cylinders     -2.1671592
## displacement  -1.6690043
## horsepower    -5.5116891
## weight       -11.7117393
## acceleration  -0.3212087
## model_year     7.8818066
## origin         2.6205643


which as we can see gave a much more reasonable and physically explainable model. The more cyclinders, displacement, horsepower, weight and accellaration...the less miles per gallon you can drive, while the more recent the model, which originated from origin 2 (Europe) or 3 (Japan) the more miles on the gallon you can go. As for the RMSE, it is close both to the full model and to the optimized model using feature selection.

## Discussion

Collinearity is important if you need to have an understandable model. If you don't, and you just care for predictive ability you can be more brute and care about the numbers.