Support Vector Machines in R

Introduction

The goal of this notebook is to introduce the Support Vector Machine (SVM) model in R, together with the k-fold cross-validation method. For this example we use the Breast Cancer Wisconsin (Original) Data Set, and in particular the breast-cancer-wisconsin.data file, from the UCI Machine Learning Repository.

First, use the same steps as in the Loading the data section of the Decision Trees in R notebook to load the data:

In [31]:
# load the data, treating "?" as a missing value
data <- read.table('breast-cancer-wisconsin.data', na.strings = "?", sep=",")
# drop the sample code number (ID) column
data <- data[,-1]
names(data) <- c("ClumpThickness", 
                "UniformityCellSize", 
                "UniformityCellShape", 
                "MarginalAdhesion",
                "SingleEpithelialCellSize",
                "BareNuclei",
                "BlandChromatin",
                "NormalNucleoli",
                "Mitoses",
                "Class")
data$Class <- factor(data$Class, levels=c(2,4), labels=c("benign", "malignant"))
# split into training (~70%) and validation (~30%) sets
set.seed(1234)
ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.7, 0.3))
trainData <- data[ind==1,]
validationData <- data[ind==2,]
# remove cases with missing data
trainData <- trainData[complete.cases(trainData),]
validationData <- validationData[complete.cases(validationData),]
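A quick sanity check of the split (a sketch, assuming the cell above has run) can confirm the partition sizes and the class balance:

```r
# number of rows/columns in each partition
dim(trainData)
dim(validationData)

# class distribution in the training set
table(trainData$Class)
```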

Training

In [32]:
# load the e1071 package, which provides svm() and the tuning helpers
library(e1071)

Let's train the classifier with some initial, hand-picked hyperparameters (cost = 100, gamma = 1):

In [33]:
svm.model <- svm(Class ~ ., data = trainData, cost = 100, gamma = 1)
summary(svm.model)
Out[33]:
Call:
svm(formula = Class ~ ., data = trainData, cost = 100, gamma = 1)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  100 
      gamma:  1 

Number of Support Vectors:  216

 ( 50 166 )


Number of Classes:  2 

Levels: 
 benign malignant


Evaluation

In [34]:
cat("\nConfusion matrix:\n")
# drop the Class column (column 10) before predicting
prediction = predict(svm.model, validationData[,-10])
xtab = table(prediction, validationData$Class)
print(xtab)
cat("\nEvaluation:\n\n")
# "benign" (the first level) is treated as the positive class
accuracy = sum(prediction == validationData$Class)/length(validationData$Class)
precision = xtab[1,1]/sum(xtab[1,])  # TP / predicted positive
recall = xtab[1,1]/sum(xtab[,1])     # TP / actual positive
f = 2 * (precision * recall) / (precision + recall)
cat(paste("Accuracy:\t", format(accuracy, digits=2), "\n",sep=" "))
cat(paste("Precision:\t", format(precision, digits=2), "\n",sep=" "))
cat(paste("Recall:\t\t", format(recall, digits=2), "\n",sep=" "))
cat(paste("F-measure:\t", format(f, digits=2), "\n",sep=" "))
Confusion matrix:
           
prediction  benign malignant
  benign       135         0
  malignant      5        69

Evaluation:

Accuracy:	 0.98 
Precision:	 1 
Recall:		 0.96 
F-measure:	 0.98 
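Since the same metric code is needed again after tuning, it can be wrapped in a small helper. This is just a sketch: evaluate is a hypothetical name, the table is assumed to have predictions in rows and true classes in columns, and benign (the first level) is treated as the positive class.

```r
# compute accuracy, precision, recall and F-measure from a confusion
# matrix whose rows are predictions and whose columns are true classes;
# class 1 (benign) is treated as the positive class
evaluate <- function(xtab) {
  accuracy  <- sum(diag(xtab)) / sum(xtab)
  precision <- xtab[1,1] / sum(xtab[1,])  # TP / predicted positive
  recall    <- xtab[1,1] / sum(xtab[,1])  # TP / actual positive
  f <- 2 * precision * recall / (precision + recall)
  c(accuracy = accuracy, precision = precision,
    recall = recall, f.measure = f)
}

round(evaluate(xtab), 2)
```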

Hyperparameter optimization & cross-validation

One can perform a grid search over the gamma and cost parameters using the tune.svm function. By default the tuning uses 10-fold cross-validation, but the sampling scheme can be configured through the tune.control function.
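As a sketch of such a configuration (assuming the same trainData), the sampling scheme could for instance be switched to 5-fold cross-validation via the tunecontrol argument:

```r
# same grid search, but with 5-fold instead of 10-fold cross-validation
tuning5 <- tune.svm(Class ~ ., data = trainData,
                    cost = 10^(-1:3), gamma = c(0.1, 0.5, 1, 2, 5),
                    tunecontrol = tune.control(sampling = "cross", cross = 5))
summary(tuning5)
```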

In [35]:
tuning <- tune.svm(Class~., data = trainData, cost = 10^(-1:3), gamma = c(0.1,0.5,1,2,5))
summary(tuning)
Out[35]:
Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation 

- best parameters:
 gamma cost
   0.1    1

- best performance: 0.03798759 

- Detailed performance results:
   gamma  cost      error dispersion
1    0.1 1e-01 0.04015957 0.02892300
2    0.5 1e-01 0.05908688 0.02961359
3    1.0 1e-01 0.06959220 0.03141852
4    2.0 1e-01 0.11187943 0.04243410
5    5.0 1e-01 0.35833333 0.06485791
6    0.1 1e+00 0.03798759 0.02606263
7    0.5 1e+00 0.04210993 0.03139210
8    1.0 1e+00 0.05691489 0.03447532
9    2.0 1e+00 0.06959220 0.03141852
10   5.0 1e+00 0.10975177 0.04087436
11   0.1 1e+01 0.04410461 0.02852325
12   0.5 1e+01 0.04206560 0.02781897
13   1.0 1e+01 0.05053191 0.03309825
14   2.0 1e+01 0.06325355 0.03139631
15   5.0 1e+01 0.10762411 0.04038951
16   0.1 1e+02 0.04414894 0.03026631
17   0.5 1e+02 0.04206560 0.02781897
18   1.0 1e+02 0.05053191 0.03309825
19   2.0 1e+02 0.06325355 0.03139631
20   5.0 1e+02 0.10762411 0.04038951
21   0.1 1e+03 0.04414894 0.03026631
22   0.5 1e+03 0.04206560 0.02781897
23   1.0 1e+03 0.05053191 0.03309825
24   2.0 1e+03 0.06325355 0.03139631
25   5.0 1e+03 0.10762411 0.04038951

The best parameters found by the grid search can then be used to retrain the model on the full training set and validate it on the validation set.
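Instead of copying the winning values by hand, they can also be read off the tuning object itself; by default tune.svm additionally keeps a model refitted with the best parameters in best.model (a sketch, assuming the tuning cell above has run):

```r
# the grid-search winner as a one-row data frame, and its CV error
tuning$best.parameters
tuning$best.performance

# model already refitted on all of trainData with the best parameters
best.model <- tuning$best.model
prediction <- predict(best.model, validationData[,-10])
```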

In [36]:
svm.model <- svm(Class ~ ., data = trainData, cost = 1, gamma = 0.1)
cat("\nConfusion matrix:\n")
# drop the Class column (column 10) before predicting
prediction = predict(svm.model, validationData[,-10])
xtab = table(prediction, validationData$Class)
print(xtab)
cat("\nEvaluation:\n\n")
accuracy = sum(prediction == validationData$Class)/length(validationData$Class)
precision = xtab[1,1]/sum(xtab[1,])  # TP / predicted positive
recall = xtab[1,1]/sum(xtab[,1])     # TP / actual positive
f = 2 * (precision * recall) / (precision + recall)
cat(paste("Accuracy:\t", format(accuracy, digits=2), "\n",sep=" "))
cat(paste("Precision:\t", format(precision, digits=2), "\n",sep=" "))
cat(paste("Recall:\t\t", format(recall, digits=2), "\n",sep=" "))
cat(paste("F-measure:\t", format(f, digits=2), "\n",sep=" "))
Confusion matrix:
           
prediction  benign malignant
  benign       138         3
  malignant      2        66

Evaluation:

Accuracy:	 0.98 
Precision:	 0.98 
Recall:		 0.99 
F-measure:	 0.98