k-Nearest Neighbors in R


The goal of this notebook is to introduce the k-Nearest Neighbors instance-based learning model in R using the class package. For this example we are going to use the Breast Cancer Wisconsin (Original) Data Set and in particular the breast-cancer-wisconsin.data file from the UCI Machine Learning Repository.

First, use the same steps as in the Loading the data section of the Decision Trees in R notebook to load the data:

In [26]:
# download the file
fileURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
download.file(fileURL, destfile="breast-cancer-wisconsin.data", method="curl")
# read the data
data <- read.table('breast-cancer-wisconsin.data', na.strings = "?", sep=",")
data <- data[,-1]
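# column names follow the UCI attribute list
# (names other than "ClumpThickness" and "Class" are assumed)
names(data) <- c("ClumpThickness", "UniformityCellSize", "UniformityCellShape",
                 "MarginalAdhesion", "SingleEpithelialCellSize", "BareNuclei",
                 "BlandChromatin", "NormalNucleoli", "Mitoses", "Class")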
data$Class <- factor(data$Class, levels=c(2,4), labels=c("benign", "malignant"))
ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.7, 0.3))
trainData <- data[ind==1,]
validationData <- data[ind==2,]


There is a kNN algorithm in the class package.

In [27]:
library(class)

Because all the attributes take values between 1 and 10, there is no need to normalize them to the range 0 to 1: no attribute will dominate the others in the kNN distance calculation. The knn function expects the training and test datasets without the target column, which is passed separately as the third argument, so we need to rearrange the data accordingly (see the manual with ?knn). Also, because knn does not allow missing values, let's remove the incomplete rows too.

In [28]:
trainData <- trainData[complete.cases(trainData),]
validationData <- validationData[complete.cases(validationData),]

trainDataX <- trainData[,-ncol(trainData)]
trainDataY <- trainData$Class
validationDataX <- validationData[,-ncol(validationData)]
validationDataY <- validationData$Class
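
Normalization is skipped here only because every attribute shares the same 1 to 10 scale. For a dataset whose attributes are on different scales, a min-max rescaling could be applied before calling knn. The following is a minimal sketch under that assumption; minMaxScale is a hypothetical helper, not a function of the class package, and the toy data frame is made up for illustration.

```r
# Hypothetical helper (not part of the class package): rescale every column of
# `x` to [0, 1] using the minimum and maximum observed in `reference`.
minMaxScale <- function(x, reference = x) {
  mins <- sapply(reference, min)
  maxs <- sapply(reference, max)
  ranges <- maxs - mins
  ranges[ranges == 0] <- 1  # avoid division by zero for constant columns
  as.data.frame(scale(x, center = mins, scale = ranges))
}

# Toy data with two attributes on very different scales
toy <- data.frame(age = c(20, 40, 60), income = c(1000, 50000, 99000))
scaled <- minMaxScale(toy)
print(scaled)
```

When scaling a validation set, passing the training data as reference keeps both sets on the same scale, for example minMaxScale(validationDataX, reference = trainDataX).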

Let's predict right away: kNN needs no separate training step, because the training instances are the model.

In [29]:
prediction = knn(trainDataX, validationDataX, trainDataY, k = 1)
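
To make concrete what "the training instances are the model" means, here is a hand-written sketch of a kNN prediction for a single query point: compute the Euclidean distance to every training row, keep the k closest, and take a majority vote. The helper knnPredictOne and the toy data are hypothetical and for illustration only; the knn function of the class package is the implementation actually used in this notebook.

```r
# Hypothetical helper for illustration only; class::knn is the real implementation.
knnPredictOne <- function(trainX, trainY, query, k = 1) {
  # Euclidean distance from the query to every training row
  diffs <- sweep(as.matrix(trainX), 2, as.numeric(query))
  dists <- sqrt(rowSums(diffs^2))
  # labels of the k closest training instances
  nearest <- order(dists)[seq_len(k)]
  votes <- table(trainY[nearest])
  # majority vote (a tie goes to the first level here; knn breaks ties at random)
  names(votes)[which.max(votes)]
}

# Toy training set: two clusters, one per class
toyX <- data.frame(a = c(1, 2, 9, 10), b = c(1, 2, 9, 10))
toyY <- factor(c("benign", "benign", "malignant", "malignant"))
toyPrediction <- knnPredictOne(toyX, toyY, query = c(8, 8), k = 3)
print(toyPrediction)
```

Nothing is fitted in advance: all the work happens at prediction time, which is why the whole training set has to be passed to every call.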

You can play with the values of k to look for a better model.
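
One way to explore different values of k is to loop over a few candidates and compare the accuracy on the validation data. The sketch below assumes a hypothetical helper accuracyByK and runs on the built-in iris data only so that it is self-contained; the candidate values of k are arbitrary.

```r
library(class)

# Hypothetical helper (not part of the class package): validation accuracy
# of knn for each candidate value of k.
accuracyByK <- function(trainX, trainY, validX, validY, ks = c(1, 3, 5, 7, 9)) {
  acc <- sapply(ks, function(k) {
    pred <- knn(trainX, validX, trainY, k = k)
    mean(pred == validY)
  })
  setNames(acc, paste0("k=", ks))
}

# Self-contained demonstration on the built-in iris data (an assumption made
# only so the sketch runs on its own).
set.seed(42)
idx <- sample(nrow(iris), 100)
acc <- accuracyByK(iris[idx, 1:4], iris$Species[idx],
                   iris[-idx, 1:4], iris$Species[-idx])
print(acc)

# With the variables of this notebook the call would be:
# accuracyByK(trainDataX, trainDataY, validationDataX, validationDataY)
```

Odd values of k avoid tied votes when there are only two classes. Keep in mind that an accuracy obtained after picking the best k on the validation data is likely to be somewhat optimistic.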


Print the confusion matrix for the predictions on the validation dataset, together with the accuracy, precision, recall and F-measure:

In [30]:
cat("Confusion matrix:\n")
# rows are the predictions, columns are the actual classes
xtab = table(prediction, validationData$Class)
print(xtab)
cat("\n\n")
accuracy = sum(prediction == validationData$Class)/length(validationData$Class)
# "benign" (first level) is treated as the positive class
precision = xtab[1,1]/sum(xtab[1,])  # correct among everything predicted benign
recall = xtab[1,1]/sum(xtab[,1])     # found among everything actually benign
f = 2 * (precision * recall) / (precision + recall)
cat(paste("Accuracy:\t", format(accuracy, digits=2), "\n",sep=" "))
cat(paste("Precision:\t", format(precision, digits=2), "\n",sep=" "))
cat(paste("Recall:\t\t", format(recall, digits=2), "\n",sep=" "))
cat(paste("F-measure:\t", format(f, digits=2), "\n",sep=" "))
Confusion matrix:
prediction  benign malignant
  benign       137         6
  malignant      3        63


Accuracy:	 0.96 
Precision:	 0.96 
Recall:		 0.98 
F-measure:	 0.97
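
The metrics above treat benign as the positive class. In a diagnostic setting the figures for malignant are often of more interest, so the following sketch computes precision, recall and F-measure for any chosen class. classMetrics is a hypothetical helper; it assumes a confusion matrix with predictions in rows and actual classes in columns, and the matrix is copied by hand from the output above.

```r
# Hypothetical helper: precision, recall and F-measure for one class, given a
# confusion matrix with predictions in rows and actual classes in columns.
classMetrics <- function(xtab, positive) {
  tp <- xtab[positive, positive]
  precision <- tp / sum(xtab[positive, ])  # of everything predicted positive
  recall <- tp / sum(xtab[, positive])     # of everything actually positive
  f <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f = f)
}

# Confusion matrix copied from the output above (filled column by column)
confusion <- matrix(c(137, 3, 6, 63), nrow = 2,
                    dimnames = list(prediction = c("benign", "malignant"),
                                    actual = c("benign", "malignant")))
malignantMetrics <- classMetrics(confusion, "malignant")
print(round(malignantMetrics, 2))
```

In the notebook itself the same helper could be applied directly to xtab, for example classMetrics(xtab, "malignant").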