Decision Trees in R

Introduction

The goal of this notebook is to introduce how to induce decision trees in R using the party and rpart packages. For this example we are going to use the Breast Cancer Wisconsin (Original) Data Set, and in particular the breast-cancer-wisconsin.data file, from the UCI Machine Learning Repository.

Loading the data

First we are going to load the dataset as a data frame. We assume the script is in the same directory as the dataset. We pass the sep option because read.table's default separator is white space, while this file is comma-separated. In addition, as the dataset documentation notes, missing values are denoted with ?, so we set na.strings accordingly. To check the documentation of the read.table function use the command ?read.table.

In [37]:
# download the file
fileURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
download.file(fileURL, destfile="breast-cancer-wisconsin.data", method="curl")
# read the data
data <- read.table('breast-cancer-wisconsin.data', na.strings = "?", sep=",")
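
As a quick sanity check (our addition, not part of the original notebook), we can confirm that the file was parsed into eleven columns and that the ? entries were converted to NA:

# peek at the data: 11 columns (id, 9 attributes, class with 2 = benign, 4 = malignant)
head(data)
# 699 rows and 11 columns
dim(data)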

We won't be needing the id number (1st column), because it is not a very informative attribute, so let's remove it.

In [22]:
data <- data[,-1]

Let's assign names to the columns:

In [23]:
names(data) <- c("ClumpThickness", 
                "UniformityCellSize", 
                "UniformityCellShape", 
                "MarginalAdhesion",
                "SingleEpithelialCellSize",
                "BareNuclei",
                "BlandChromatin",
                "NormalNucleoli",
                "Mitoses",
                "Class")

Let's also make the Class column categorical, i.e. a factor in R:

In [24]:
data$Class <- factor(data$Class, levels=c(2,4), labels=c("benign", "malignant"))

Now let's see a summary of the data:

In [25]:
print(summary(data))
 ClumpThickness   UniformityCellSize UniformityCellShape MarginalAdhesion
 Min.   : 1.000   Min.   : 1.000     Min.   : 1.000      Min.   : 1.000  
 1st Qu.: 2.000   1st Qu.: 1.000     1st Qu.: 1.000      1st Qu.: 1.000  
 Median : 4.000   Median : 1.000     Median : 1.000      Median : 1.000  
 Mean   : 4.418   Mean   : 3.134     Mean   : 3.207      Mean   : 2.807  
 3rd Qu.: 6.000   3rd Qu.: 5.000     3rd Qu.: 5.000      3rd Qu.: 4.000  
 Max.   :10.000   Max.   :10.000     Max.   :10.000      Max.   :10.000  
                                                                         
 SingleEpithelialCellSize   BareNuclei     BlandChromatin   NormalNucleoli  
 Min.   : 1.000           Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
 1st Qu.: 2.000           1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.000  
 Median : 2.000           Median : 1.000   Median : 3.000   Median : 1.000  
 Mean   : 3.216           Mean   : 3.545   Mean   : 3.438   Mean   : 2.867  
 3rd Qu.: 4.000           3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.000  
 Max.   :10.000           Max.   :10.000   Max.   :10.000   Max.   :10.000  
                          NA's   :16                                        
    Mitoses             Class    
 Min.   : 1.000   benign   :458  
 1st Qu.: 1.000   malignant:241  
 Median : 1.000                  
 Mean   : 1.589                  
 3rd Qu.: 1.000                  
 Max.   :10.000                  
                                 
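
The summary shows that only BareNuclei contains missing values (16 of them). A quick per-column check (again our addition) is:

# count missing values in each column; only BareNuclei should be non-zero
colSums(is.na(data))

We leave the missing values in place: rpart, for instance, handles missing predictor values through surrogate splits, so no imputation is needed for this walkthrough.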

The next step is to split the dataset into a training and a validation set. To compare different models later (or the same model trained with different parameters), we will reuse the same training and validation sets. Since the split is random, we set a seed so that it stays the same throughout our experiments.

In [26]:
set.seed(1234)
ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.7, 0.3))
trainData <- data[ind==1,]
validationData <- data[ind==2,]
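
To verify the split (a check of ours, not in the original), we can look at how many rows landed in each partition and whether the class proportions are comparable; with this seed, 485 rows go to training and 214 to validation.

# partition sizes
nrow(trainData)
nrow(validationData)
# class proportions should be roughly similar in both partitions
prop.table(table(trainData$Class))
prop.table(table(validationData$Class))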

Training

Now we install (if necessary) and load the rpart and party libraries.

In [27]:
{
    # silence installation output (writes to /dev/null, so Unix-like systems only)
    sink("/dev/null")
    # pick a CRAN mirror by country code and protocol
    country.code <- 'de'  # use yours
    url.pattern <- 'https://'  # use http if you want
    repo.data.frame <- subset(getCRANmirrors(), CountryCode == country.code & grepl(url.pattern, URL))
    options(repos = repo.data.frame$URL)
    # install and load the two packages
    suppressMessages(install.packages("rpart"))
    library(rpart)
    suppressMessages(install.packages("party"))
    library(party)
    sink()
}
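
Installing the packages on every run is wasteful. A lighter pattern (a sketch on our part, not taken from the notebook) installs a package only when it is missing and then loads it:

# install rpart/party only if they are not already available, then load them
for (pkg in c("rpart", "party")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg)
    }
    library(pkg, character.only = TRUE)
}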

rpart

Let's produce a decision tree by training the induction algorithm on the training set. Check out the options of rpart with the command ?rpart.

In [28]:
tree = rpart(Class ~ ., data=trainData, method="class")

One can also use a different split criterion, such as the entropy (information gain) rule:

In [29]:
entTree = rpart(Class ~ ., data=trainData, method="class", parms=list(split="information"))

The tree in text form:

In [30]:
print(tree)
n= 485 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 485 171 benign (0.647422680 0.352577320)  
   2) UniformityCellSize< 2.5 292   9 benign (0.969178082 0.030821918)  
     4) BareNuclei< 4.5 280   1 benign (0.996428571 0.003571429) *
     5) BareNuclei>=4.5 12   4 malignant (0.333333333 0.666666667) *
   3) UniformityCellSize>=2.5 193  31 malignant (0.160621762 0.839378238)  
     6) UniformityCellShape< 2.5 15   3 benign (0.800000000 0.200000000) *
     7) UniformityCellShape>=2.5 178  19 malignant (0.106741573 0.893258427)  
      14) UniformityCellSize< 4.5 45  14 malignant (0.311111111 0.688888889)  
        28) BareNuclei< 2.5 10   2 benign (0.800000000 0.200000000) *
        29) BareNuclei>=2.5 35   6 malignant (0.171428571 0.828571429) *
      15) UniformityCellSize>=4.5 133   5 malignant (0.037593985 0.962406015) *

And in a visual representation:

In [31]:
plot(tree)
text(tree)
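
The default plot is rather bare; if the rpart.plot package is available, rpart.plot(tree) draws a considerably more readable tree. Another optional step, which this notebook does not perform, is pruning: rpart keeps a complexity table from its internal cross-validation, and the tree can be cut back at the cp value with the lowest cross-validated error. The snippet below is a sketch of that idea:

# inspect the complexity parameter table from rpart's internal cross-validation
printcp(tree)
# prune at the cp value with the smallest cross-validated error (xerror)
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
prunedTree <- prune(tree, cp = best.cp)
print(prunedTree)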

party

Let's also try the party library and the ctree function:

In [32]:
# store the fitted tree under a name that does not shadow the ctree() function
cTree = ctree(Class ~ ., data=trainData)
# print it
print(cTree)
# visualize it
plot(cTree, type="simple")
	 Conditional inference tree with 7 terminal nodes

Response:  Class 
Inputs:  ClumpThickness, UniformityCellSize, UniformityCellShape, MarginalAdhesion, SingleEpithelialCellSize, BareNuclei, BlandChromatin, NormalNucleoli, Mitoses 
Number of observations:  485 

1) UniformityCellSize <= 2; criterion = 1, statistic = 323.402
  2) BareNuclei <= 4; criterion = 1, statistic = 169.767
    3) SingleEpithelialCellSize <= 2; criterion = 1, statistic = 29.834
      4)*  weights = 264 
    3) SingleEpithelialCellSize > 2
      5)*  weights = 16 
  2) BareNuclei > 4
    6)*  weights = 12 
1) UniformityCellSize > 2
  7) BareNuclei <= 3; criterion = 1, statistic = 42.361
    8) UniformityCellSize <= 4; criterion = 1, statistic = 23.999
      9)*  weights = 27 
    8) UniformityCellSize > 4
      10)*  weights = 20 
  7) BareNuclei > 3
    11) BlandChromatin <= 4; criterion = 0.996, statistic = 12.225
      12)*  weights = 50 
    11) BlandChromatin > 4
      13)*  weights = 96 
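
The behaviour of ctree can be tuned through ctree_control; for example, raising mincriterion (the required 1 - p-value for accepting a split) or the minimum node sizes produces smaller, more conservative trees. The call below is only an illustrative variation with arbitrary settings, not part of the original analysis:

# a more conservative conditional inference tree (illustrative settings)
smallTree <- ctree(Class ~ ., data=trainData,
                   controls=ctree_control(mincriterion=0.99, minsplit=40))
plot(smallTree, type="simple")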

Evaluation

The final step is to evaluate the models on the validation set. The helper below prints the confusion matrix together with accuracy, precision, recall and F-measure (computed for the benign class):

In [33]:
evaluation <- function(model, data, atype) {
    cat("\nConfusion matrix:\n")
    prediction = predict(model, data, type=atype)
    # rows of the table are predictions, columns are the true classes
    xtab = table(prediction, data$Class)
    print(xtab)
    cat("\nEvaluation:\n\n")
    accuracy = sum(prediction == data$Class)/length(data$Class)
    # precision and recall are computed for the "benign" (first) class
    precision = xtab[1,1]/sum(xtab[1,])  # TP / (TP + FP), i.e. over predicted benign
    recall = xtab[1,1]/sum(xtab[,1])     # TP / (TP + FN), i.e. over actual benign
    f = 2 * (precision * recall) / (precision + recall)
    cat(paste("Accuracy:\t", format(accuracy, digits=2), "\n",sep=" "))
    cat(paste("Precision:\t", format(precision, digits=2), "\n",sep=" "))
    cat(paste("Recall:\t\t", format(recall, digits=2), "\n",sep=" "))
    cat(paste("F-measure:\t", format(f, digits=2), "\n",sep=" "))
}

evaluation(tree, validationData, "class")
Confusion matrix:
           
prediction  benign malignant
  benign       138         6
  malignant      6        64

Evaluation:

Accuracy:	 0.94 
Precision:	 0.96 
Recall:		 0.96 
F-measure:	 0.96 

and for the second tree:

In [34]:
evaluation(entTree, validationData, "class")
Confusion matrix:
           
prediction  benign malignant
  benign       138         5
  malignant      6        65

Evaluation:

Accuracy:	 0.95 
Precision:	 0.97 
Recall:		 0.96 
F-measure:	 0.96 

The evaluation of the ctree:

In [35]:
evaluation(cTree, validationData, "response")
Confusion matrix:
           
prediction  benign malignant
  benign       138         6
  malignant      6        64

Evaluation:

Accuracy:	 0.94 
Precision:	 0.96 
Recall:		 0.96 
F-measure:	 0.96
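
Note that precision and recall above treat benign (the first factor level) as the positive class. If catching malignant cases is what matters most, the same confusion matrix can be read with malignant as the positive class; a small sketch of ours for the first rpart tree:

# recompute precision/recall with "malignant" as the positive class
pred <- predict(tree, validationData, type="class")
xtab <- table(pred, validationData$Class)  # rows = prediction, columns = truth
precisionMalignant <- xtab["malignant", "malignant"] / sum(xtab["malignant", ])
recallMalignant <- xtab["malignant", "malignant"] / sum(xtab[, "malignant"])
c(precision=precisionMalignant, recall=recallMalignant)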