Introduction to R

The basics

R is an open source programming language and a free environment, mainly used for statistical computing and graphics. You can find information about R on the official website. Searching for the keyword R together with topic-specific words on a search engine such as Google also turns up additional information in blog posts, tutorials, documents, etc.

Even though R comes with its own environment (command-line and graphical interfaces), one can use the popular RStudio, which offers additional graphical functionality.

When in the R environment (the R prompt is >), one can exit by calling the quit() function, or q() for short. When asked whether you want to save the workspace, replying with y for yes saves all the variables of the current R session into a file named .RData in the current working directory. If you later start R in the same directory, the variables and their names will be loaded automatically.
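
The save-and-restore behaviour described above can also be tried by hand. The following is a minimal sketch, not part of the original session: it uses save() and load() on a temporary file, and the variable name answer and the file name example.RData are made up for illustration.

```r
# Hypothetical variable and file name, chosen only for illustration
answer <- 42
ws_file <- file.path(tempdir(), "example.RData")

save(answer, file = ws_file)   # write the variable to disk
rm(answer)                     # remove it from the current session

load(ws_file)                  # restore the variable under its original name
answer                         # 42
```

Writing to tempdir() keeps the experiment away from any real .RData file in your working directory.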

To check your current working directory, you can enter:

In [3]:
getwd()
'/home/kyrcha/Workspaces/github/ml-tutorials/R/Introduction'

To set the working directory one can use the setwd function:

In [4]:
setwd("~/Desktop")

What you type at the R prompt is an expression, which R attempts to evaluate, printing the result. For example, getwd() is an expression that is evaluated by calling the function getwd with no arguments. The same holds for 42:

In [5]:
42
42

and the same for

In [6]:
(100 * 2 - 12 ^ 2) / 7 * 5 + 2
42

There are also predefined constants like pi (Euler's number e is not predefined, but it is available as exp(1)):

In [7]:
sin(pi/2)
1
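
As a small illustrative sketch of working with constants: base R ships pi, while Euler's number has to be computed with exp(1). The variable name e below is our own choice, not something R defines.

```r
# 'e' is a name we pick ourselves; R does not predefine it
e <- exp(1)
e          # prints 2.718282

log(e)     # natural logarithm of e: 1
cos(pi)    # -1
```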

To find the documentation of a specific function you can enter ?sum or help(sum). To search for functions, use help.search("sin"). For certain functions one can see examples of use with the expression example(plot). Comments start with #, and to assign values to variables you can use <- or =. For example:

In [8]:
a <- 42
b <- (42 + a) / 2
print(a)
print(b)
[1] 42
[1] 42

With ls() one can check all the variables existing in the current R session.

In [9]:
ls()
# while to delete all the variables in the current session you can use the call: 
rm(list=ls())
  1. 'a'
  2. 'b'

Vectors

Create the vector a = (10, 5, 3, 100, -2, 5, -50)

In [10]:
a <- c(10, 5, 3, 100, -2, 5, -50)
a
  1. 10
  2. 5
  3. 3
  4. 100
  5. -2
  6. 5
  7. -50

Select the elements of the vector with indices 1, 3, and 4:

In [11]:
a[c(1,3:4)]
  1. 10
  2. 3
  3. 100

The above expression uses the c() function for combining values and the : operator that generates sequences from:to with step 1. Another easy way of specifying sequences is to use the seq function.

In [12]:
c(1, 2, 7, 10)
1:10
seq(1, 6, by=1)
seq(1,6, by=2)
seq(1, by=2, length.out=6)
  1. 1
  2. 2
  3. 7
  4. 10
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  1. 1
  2. 3
  3. 5
  1. 1
  2. 3
  3. 5
  4. 7
  5. 9
  6. 11

Type ?seq to get to know the function.
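
To complement the calls above, here is a short sketch of other common ways to build sequences. The specific numbers are arbitrary examples.

```r
seq(0, 1, length.out = 5)   # 5 evenly spaced values: 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)         # descending sequence: 10 7 4 1
seq_len(4)                  # 1 2 3 4
```

seq_len(n) is often preferred over 1:n inside functions because it returns an empty vector when n is 0, whereas 1:0 yields the two values 1 and 0.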

To check the type of a variable there is the class function:

In [13]:
class(a)
'numeric'

To check which elements of a have a value greater than 5:

In [14]:
a > 5
which(a>5)
# returns the indices for which the values are TRUE
  1. TRUE
  2. FALSE
  3. FALSE
  4. TRUE
  5. FALSE
  6. FALSE
  7. FALSE
  1. 1
  2. 4

To get the positive elements of a:

In [15]:
b <- a > 0
positives <- a[b]
positives
# or more succinctly
positives <- a[a>0]
positives
  1. 10
  2. 5
  3. 3
  4. 100
  5. 5
  1. 10
  2. 5
  3. 3
  4. 100
  5. 5
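
Logical conditions can also be combined before they are used as an index. The following sketch reuses the same vector (redefined so the block is self-contained) and the element-wise operators & and |.

```r
a <- c(10, 5, 3, 100, -2, 5, -50)

a[a > 0 & a < 10]    # positive and below 10: 5 3 5
a[a < 0 | a > 50]    # negative or above 50: 100 -2 -50
sum(a > 5)           # TRUE counts as 1, so this counts the matches: 2
```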

To check the length of a vector:

In [16]:
length(a)
7

One can also bind vectors by column (cbind()) or by row (rbind()):

In [17]:
c <- 1:7
rbind(a,c)
cbind(a,c)
a  10  5  3  100  -2  5  -50
c   1  2  3    4   5  6    7

   a  c
  10  1
   5  2
   3  3
 100  4
  -2  5
   5  6
 -50  7

Matrices

To create matrices, use the matrix() function. This section covers creating matrices, rowSums, colSums, mean, and multiplication.

In [18]:
matrix(10,3, 2)
# or
matrix(c(1,2,3,4,5,6), 3, 2)
# or 
matrix(c(1,2,3), 3, 2)
10  10
10  10
10  10

1  4
2  5
3  6

1  1
2  2
3  3

But let's examine how we are calling the matrix function:

In [19]:
args(matrix)
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL) 
NULL

So the first argument is the data, the nrow and ncol arguments declare the number of rows and columns, and the byrow argument declares whether the matrix is filled column-by-column (byrow=FALSE) or row-by-row (byrow=TRUE). In the above calls we didn't use the byrow argument because the function matrix has the default value byrow=FALSE, as we can also check in the documentation, ?matrix.

In [20]:
m = matrix(1:9, byrow = TRUE, nrow=3)
m
1  2  3
4  5  6
7  8  9

Here we have filled in a matrix with values 1 to 9, by row, with the number of rows equal to 3. This gives us a square 3x3 matrix. R is pretty smart in knowing that the number of columns should be 3 as well!
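
The effect of byrow is easiest to see side by side. This is a small sketch with made-up variable names (by_col, by_row) that fills the same six values in both orders.

```r
by_col <- matrix(1:6, nrow = 2)                 # default: byrow = FALSE
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)

dim(by_col)    # 2 3 (the number of columns is inferred from the data length)
by_col[1, ]    # first row when filled column-by-column: 1 3 5
by_row[1, ]    # first row when filled row-by-row: 1 2 3
```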

We can also call cbind and rbind and other functions like rowSums, colSums, mean, t for transpose etc.

In [21]:
m2 <- rbind(m, m)
m2
rowSums(m2)
colSums(m2)
mean(m2)
1  2  3
4  5  6
7  8  9
1  2  3
4  5  6
7  8  9
  1. 6
  2. 15
  3. 24
  4. 6
  5. 15
  6. 24
  1. 24
  2. 30
  3. 36
5
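
Alongside rowSums and colSums, base R also provides rowMeans and colMeans. The sketch below rebuilds the matrix m2 from the cell above so that it can run on its own.

```r
m <- matrix(1:9, byrow = TRUE, nrow = 3)
m2 <- rbind(m, m)

rowMeans(m2)   # mean of each row: 2 5 8 2 5 8
colMeans(m2)   # mean of each column: 4 5 6
dim(t(m2))     # the transpose swaps the dimensions: 3 6
```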

For element-wise multiplication one can use the * operator, while for matrix multiplication one can use the %*% operator.

In [22]:
am <- matrix(10:18, byrow = TRUE, nrow = 3)
am
bm <- matrix(c(3,6,7,10,8,1,2,3,2), byrow = TRUE, nrow = 3)
bm
am * bm
am %*% bm
t(am)
10  11  12
13  14  15
16  17  18

 3  6  7
10  8  1
 2  3  2

 30   66  84
130  112  15
 32   51  36

164  184  105
209  235  135
254  286  165

10  13  16
11  14  17
12  15  18
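
To make the difference between the two operators concrete, the following sketch recomputes a single entry by hand, using the same am and bm as above (redefined so the block is self-contained). Entry [1, 1] of the matrix product is the sum of the element-wise products of row 1 of am and column 1 of bm.

```r
am <- matrix(10:18, byrow = TRUE, nrow = 3)
bm <- matrix(c(3, 6, 7, 10, 8, 1, 2, 3, 2), byrow = TRUE, nrow = 3)

am[1, ] * bm[, 1]         # 10*3, 11*10, 12*2: 30 110 24
sum(am[1, ] * bm[, 1])    # 164
(am %*% bm)[1, 1]         # 164, the same value from matrix multiplication
(am * bm)[1, 1]           # 30, element-wise multiplication only uses am[1, 1] and bm[1, 1]
```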

Data frames

Unlike matrices, data frames can store values of different types in their columns. They are used extensively in R for data analysis. Rows usually hold the observations (or samples) and columns hold the characteristics (or attributes or features). When we read from a file, the result is returned as a data frame.

This section covers downloading the data and saving it on the Desktop, read.csv, head, class, dim, str, summary, selecting values, max, min, apply, and plot.

Download the zip file r-novice-inflammation.zip and unzip it on the Desktop. Examine the file inflammation-01.csv with a text editor to see what we are going to be loading. Then read the file:

In [23]:
data <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
# Notice the use of the path including data/ since we previously set the working directory as the Desktop
getwd()
dir()
'/home/kyrcha/Desktop'
'data'

The dir function returns the files and directories in the current working directory. The argument header=FALSE lets the read.csv function know that there is no header row to give the columns names.

With head(data) I can check whether the data were loaded correctly. It returns the first few rows:
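
If you do not have the inflammation data at hand, the effect of header=FALSE can be tried on a tiny file. The sketch below writes a made-up two-line CSV to a temporary location; the file contents and the variable names csv_path and toy are assumptions for illustration only.

```r
# Create a small CSV file without a header row (made-up contents)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("0,1,2", "3,4,5"), csv_path)

toy <- read.csv(file = csv_path, header = FALSE)
names(toy)   # R generates the column names "V1" "V2" "V3"
dim(toy)     # 2 rows, 3 columns
toy[2, 3]    # 5
```

With header=TRUE (the default of read.csv) the first line would have been consumed as column names instead of data.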

In [24]:
head(data)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
 0  0  1  3  1  2  4  7  8   3 ...   4   4   5   7   3   4   2   3   0   0
 0  1  2  1  2  1  3  2  2   6 ...   3   5   4   4   5   5   1   1   0   1
 0  1  1  3  3  2  6  2  5   9 ...  10   5   4   2   2   3   2   2   1   1
 0  0  2  0  4  2  2  1  6   7 ...   3   5   6   3   3   4   2   3   2   1
 0  1  1  3  3  1  3  5  2   4 ...   9   6   3   2   2   4   2   0   1   1
 0  0  1  2  2  4  2  1  6   4 ...   8   4   7   3   5   4   4   3   2   1

Other functions I can use are:

In [25]:
# type of the variable
class(data)
# dimensions
dim(data)
# structure
str(data)
# statistical summarization of the data frame
summary(data)
'data.frame'
  1. 60
  2. 40
'data.frame':	60 obs. of  40 variables:
 $ V1 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ V2 : int  0 1 1 0 1 0 0 0 0 1 ...
 $ V3 : int  1 2 1 2 1 1 2 1 0 1 ...
 $ V4 : int  3 1 3 0 3 2 2 2 3 2 ...
 $ V5 : int  1 2 3 4 3 2 4 3 1 1 ...
 $ V6 : int  2 1 2 2 1 4 2 1 5 3 ...
 $ V7 : int  4 3 6 2 3 2 2 2 6 5 ...
 $ V8 : int  7 2 2 1 5 1 5 3 5 3 ...
 $ V9 : int  8 2 5 6 2 6 5 5 5 5 ...
 $ V10: int  3 6 9 7 4 4 8 3 8 8 ...
 $ V11: int  3 10 5 10 4 7 6 7 2 6 ...
 $ V12: int  3 11 7 7 7 6 5 8 4 8 ...
 $ V13: int  10 5 4 9 6 6 11 8 11 12 ...
 $ V14: int  5 9 5 13 5 9 9 5 12 5 ...
 $ V15: int  7 4 4 8 3 9 4 10 10 13 ...
 $ V16: int  4 4 15 8 10 15 13 9 11 6 ...
 $ V17: int  7 7 5 15 8 4 5 15 9 13 ...
 $ V18: int  7 16 11 10 10 16 12 11 10 8 ...
 $ V19: int  12 8 9 10 6 18 10 18 17 16 ...
 $ V20: int  18 6 10 7 17 12 6 19 11 8 ...
 $ V21: int  6 18 19 17 9 12 9 20 6 18 ...
 $ V22: int  13 4 14 4 14 5 17 8 16 15 ...
 $ V23: int  11 12 12 4 9 18 15 5 12 16 ...
 $ V24: int  11 5 17 7 7 9 8 13 6 14 ...
 $ V25: int  7 12 7 6 13 5 9 15 8 12 ...
 $ V26: int  7 7 12 15 9 3 3 10 14 7 ...
 $ V27: int  4 11 11 6 12 10 13 6 6 3 ...
 $ V28: int  6 5 7 4 6 3 7 10 13 8 ...
 $ V29: int  8 11 4 9 7 12 8 6 10 9 ...
 $ V30: int  8 3 2 11 7 7 2 7 11 11 ...
 $ V31: int  4 3 10 3 9 8 8 4 4 2 ...
 $ V32: int  4 5 5 5 6 4 8 9 6 5 ...
 $ V33: int  5 4 4 6 3 7 4 3 4 4 ...
 $ V34: int  7 4 2 3 2 3 2 5 7 5 ...
 $ V35: int  3 5 2 3 2 5 3 2 6 1 ...
 $ V36: int  4 5 3 4 4 4 5 5 3 4 ...
 $ V37: int  2 1 2 2 2 4 4 3 2 1 ...
 $ V38: int  3 1 2 3 0 3 1 2 1 2 ...
 $ V39: int  0 0 1 2 1 2 1 2 0 0 ...
 $ V40: int  0 1 1 1 1 1 1 1 0 0 ...
       V1          V2             V3              V4             V5       
 Min.   :0   Min.   :0.00   Min.   :0.000   Min.   :0.00   Min.   :1.000  
 1st Qu.:0   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:1.000  
 Median :0   Median :0.00   Median :1.000   Median :2.00   Median :2.000  
 Mean   :0   Mean   :0.45   Mean   :1.117   Mean   :1.75   Mean   :2.433  
 3rd Qu.:0   3rd Qu.:1.00   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:3.000  
 Max.   :0   Max.   :1.00   Max.   :2.000   Max.   :3.00   Max.   :4.000  
       V6             V7            V8              V9             V10       
 Min.   :1.00   Min.   :1.0   Min.   :1.000   Min.   :2.000   Min.   :2.000  
 1st Qu.:2.00   1st Qu.:2.0   1st Qu.:2.000   1st Qu.:4.000   1st Qu.:3.750  
 Median :3.00   Median :4.0   Median :4.000   Median :5.000   Median :6.000  
 Mean   :3.15   Mean   :3.8   Mean   :3.883   Mean   :5.233   Mean   :5.517  
 3rd Qu.:4.00   3rd Qu.:5.0   3rd Qu.:5.250   3rd Qu.:7.000   3rd Qu.:7.000  
 Max.   :5.00   Max.   :6.0   Max.   :7.000   Max.   :8.000   Max.   :9.000  
      V11             V12             V13             V14        
 Min.   : 2.00   Min.   : 2.00   Min.   : 3.00   Min.   : 3.000  
 1st Qu.: 4.00   1st Qu.: 3.75   1st Qu.: 5.00   1st Qu.: 5.000  
 Median : 6.00   Median : 5.50   Median : 9.50   Median : 8.000  
 Mean   : 5.95   Mean   : 5.90   Mean   : 8.35   Mean   : 7.733  
 3rd Qu.: 9.00   3rd Qu.: 8.00   3rd Qu.:11.00   3rd Qu.:10.000  
 Max.   :10.00   Max.   :11.00   Max.   :12.00   Max.   :13.000  
      V15              V16            V17              V18       
 Min.   : 3.000   Min.   : 3.0   Min.   : 4.000   Min.   : 5.00  
 1st Qu.: 5.000   1st Qu.: 6.0   1st Qu.: 6.750   1st Qu.: 7.75  
 Median : 8.000   Median :10.0   Median : 8.500   Median :11.00  
 Mean   : 8.367   Mean   : 9.5   Mean   : 9.583   Mean   :10.63  
 3rd Qu.:12.000   3rd Qu.:13.0   3rd Qu.:13.000   3rd Qu.:13.00  
 Max.   :14.000   Max.   :15.0   Max.   :16.000   Max.   :17.00  
      V19             V20             V21             V22       
 Min.   : 5.00   Min.   : 5.00   Min.   : 5.00   Min.   : 4.00  
 1st Qu.: 8.00   1st Qu.: 8.75   1st Qu.: 9.00   1st Qu.: 8.00  
 Median :11.50   Median :13.00   Median :14.00   Median :13.00  
 Mean   :11.57   Mean   :12.35   Mean   :13.25   Mean   :11.97  
 3rd Qu.:15.00   3rd Qu.:16.00   3rd Qu.:16.25   3rd Qu.:15.25  
 Max.   :18.00   Max.   :19.00   Max.   :20.00   Max.   :19.00  
      V23             V24             V25            V26        
 Min.   : 4.00   Min.   : 4.00   Min.   : 4.0   Min.   : 3.000  
 1st Qu.: 7.75   1st Qu.: 6.75   1st Qu.: 7.0   1st Qu.: 5.750  
 Median :11.00   Median :10.00   Median :10.5   Median : 9.000  
 Mean   :11.03   Mean   :10.17   Mean   :10.0   Mean   : 8.667  
 3rd Qu.:15.00   3rd Qu.:13.25   3rd Qu.:13.0   3rd Qu.:12.000  
 Max.   :18.00   Max.   :17.00   Max.   :16.0   Max.   :15.000  
      V27             V28             V29              V30        
 Min.   : 3.00   Min.   : 3.00   Min.   : 3.000   Min.   : 2.000  
 1st Qu.: 7.00   1st Qu.: 5.00   1st Qu.: 5.000   1st Qu.: 3.000  
 Median :10.00   Median : 7.00   Median : 7.000   Median : 7.000  
 Mean   : 9.15   Mean   : 7.25   Mean   : 7.333   Mean   : 6.583  
 3rd Qu.:12.00   3rd Qu.: 9.00   3rd Qu.:10.000   3rd Qu.:10.000  
 Max.   :14.00   Max.   :13.00   Max.   :12.000   Max.   :11.000  
      V31              V32            V33             V34           V35     
 Min.   : 2.000   Min.   :2.00   Min.   :2.000   Min.   :1.0   Min.   :1.0  
 1st Qu.: 4.000   1st Qu.:4.00   1st Qu.:4.000   1st Qu.:2.0   1st Qu.:2.0  
 Median : 6.000   Median :6.00   Median :5.000   Median :4.0   Median :3.0  
 Mean   : 6.067   Mean   :5.95   Mean   :5.117   Mean   :3.6   Mean   :3.3  
 3rd Qu.: 8.000   3rd Qu.:8.00   3rd Qu.:6.000   3rd Qu.:5.0   3rd Qu.:5.0  
 Max.   :10.000   Max.   :9.00   Max.   :8.000   Max.   :7.0   Max.   :6.0  
      V36             V37             V38           V39             V40        
 Min.   :1.000   Min.   :1.000   Min.   :0.0   Min.   :0.000   Min.   :0.0000  
 1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0.0   1st Qu.:0.000   1st Qu.:0.0000  
 Median :4.000   Median :2.000   Median :1.0   Median :1.000   Median :1.0000  
 Mean   :3.567   Mean   :2.483   Mean   :1.5   Mean   :1.133   Mean   :0.5667  
 3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:3.0   3rd Qu.:2.000   3rd Qu.:1.0000  
 Max.   :5.000   Max.   :4.000   Max.   :3.0   Max.   :2.000   Max.   :1.0000  

There are also different ways to select slices of data from the data frame:

In [26]:
# first value in data
data[1, 1]

# middle value in data
data[30, 20]

# first four rows and first ten columns
data[1:4, 1:10]

# doesn't have to start from the beginning
data[5:10, 1:10]

# specific rows and columns
data[c(3, 8, 37, 56), c(10, 14, 29)]

# All columns from row 5
data[5, ]

# All rows from column 16
data[, 16]
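0
16

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
 0  0  1  3  1  2  4  7  8   3
 0  1  2  1  2  1  3  2  2   6
 0  1  1  3  3  2  6  2  5   9
 0  0  2  0  4  2  2  1  6   7

   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
5   0  1  1  3  3  1  3  5  2   4
6   0  0  1  2  2  4  2  1  6   4
7   0  0  2  2  4  2  2  5  5   8
8   0  0  1  2  3  1  2  3  5   3
9   0  0  0  3  1  5  6  5  5   8
10  0  1  1  2  1  3  5  3  5   8

   V10 V14 V29
3    9   5   4
8    3   5   6
37   6   9  10
56   7  11   9

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
5  0  1  1  3  3  1  3  5  2   4 ...   9   6   3   2   2   4   2   0   1   1
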
  1. 4
  2. 4
  3. 15
  4. 8
  5. 10
  6. 15
  7. 13
  8. 9
  9. 11
  10. 6
  11. 3
  12. 8
  13. 12
  14. 3
  15. 5
  16. 10
  17. 11
  18. 4
  19. 11
  20. 13
  21. 15
  22. 5
  23. 14
  24. 13
  25. 4
  26. 9
  27. 13
  28. 6
  29. 7
  30. 6
  31. 14
  32. 3
  33. 15
  34. 4
  35. 15
  36. 11
  37. 7
  38. 10
  39. 15
  40. 6
  41. 5
  42. 6
  43. 15
  44. 11
  45. 15
  46. 6
  47. 11
  48. 15
  49. 14
  50. 4
  51. 10
  52. 15
  53. 11
  54. 6
  55. 13
  56. 8
  57. 4
  58. 13
  59. 12
  60. 9

A subtle point is that the last selection returned a vector instead of a data frame. This is because we selected only a single column. If you don't want this behavior, use the drop=FALSE argument:

In [27]:
# All columns from row 5
d1 <- data[5, ]
class(d1)

# All rows from column 16
d2 <- data[, 16]
class(d2)

d3 <- data[, 16, drop=FALSE]
class(d3)
'data.frame'
'integer'
'data.frame'
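
The same distinction shows up with the other ways of selecting a single column. The following sketch uses a small made-up data frame named toy, not the inflammation data.

```r
toy <- data.frame(V1 = c(1, 2, 3), V2 = c(4, 5, 6))

class(toy[, 2])                 # "numeric": the column is simplified to a vector
class(toy[, 2, drop = FALSE])   # "data.frame": the simplification is switched off
class(toy["V2"])                # "data.frame": single-bracket selection by name
class(toy[["V2"]])              # "numeric": double brackets extract the column itself
```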

Other functions you can call are min, max, mean, sd and median to get statistical values of interest:

In [28]:
# first row, all of the columns
patient_1 <- data[1, ]

# max inflammation for patient 1
max(patient_1)

# max inflammation for patient 2
max(data[2, ])

# minimum inflammation on day 7
min(data[, 7])

# mean inflammation on day 7
mean(data[, 7])

# median inflammation on day 7
median(data[, 7])

# standard deviation of inflammation on day 7
sd(data[, 7])
18
18
1
3.8
4
1.72518729025016

To do more complex calculations, like the maximum inflammation for each patient or the average inflammation for each day, we need to apply the function max or mean per row or per column, respectively. Luckily, there is the function apply, which applies a function over one of the "margins" of the data: 1 for rows and 2 for columns:

In [29]:
args(apply) # args() returns a function with the same arguments and a NULL body, which is why NULL is printed in the output

max_patient_inflammation <- apply(data, 1, max)
max_patient_inflammation

avg_day_inflammation <- apply(data, 2, mean)
avg_day_inflammation
function (X, MARGIN, FUN, ...) 
NULL
  1. 18
  2. 18
  3. 19
  4. 17
  5. 17
  6. 18
  7. 17
  8. 20
  9. 17
  10. 18
  11. 18
  12. 18
  13. 17
  14. 16
  15. 17
  16. 18
  17. 19
  18. 19
  19. 17
  20. 19
  21. 19
  22. 16
  23. 17
  24. 15
  25. 17
  26. 17
  27. 18
  28. 17
  29. 20
  30. 17
  31. 16
  32. 19
  33. 15
  34. 15
  35. 19
  36. 17
  37. 16
  38. 17
  39. 19
  40. 16
  41. 18
  42. 19
  43. 16
  44. 19
  45. 18
  46. 16
  47. 19
  48. 15
  49. 16
  50. 18
  51. 14
  52. 20
  53. 17
  54. 15
  55. 17
  56. 16
  57. 17
  58. 19
  59. 18
  60. 18
V1
0
V2
0.45
V3
1.11666666666667
V4
1.75
V5
2.43333333333333
V6
3.15
V7
3.8
V8
3.88333333333333
V9
5.23333333333333
V10
5.51666666666667
V11
5.95
V12
5.9
V13
8.35
V14
7.73333333333333
V15
8.36666666666667
V16
9.5
V17
9.58333333333333
V18
10.6333333333333
V19
11.5666666666667
V20
12.35
V21
13.25
V22
11.9666666666667
V23
11.0333333333333
V24
10.1666666666667
V25
10
V26
8.66666666666667
V27
9.15
V28
7.25
V29
7.33333333333333
V30
6.58333333333333
V31
6.06666666666667
V32
5.95
V33
5.11666666666667
V34
3.6
V35
3.3
V36
3.56666666666667
V37
2.48333333333333
V38
1.5
V39
1.13333333333333
V40
0.566666666666667
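
Since the real output is long, it can help to check apply on a matrix small enough to verify by eye. This is a sketch with made-up values; the anonymous function in the last call is our own example of passing a custom function.

```r
toy <- matrix(c(1, 5, 3,
                4, 2, 6), nrow = 2, byrow = TRUE)

apply(toy, 1, max)                            # maximum of each row: 5 6
apply(toy, 2, mean)                           # mean of each column: 2.5 3.5 4.5
apply(toy, 1, function(r) max(r) - min(r))    # spread of each row: 4 4
```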

Now let's do some plotting

In [30]:
plot(avg_day_inflammation)
In [31]:
max_day_inflammation <- apply(data, 2, max)
plot(max_day_inflammation)
In [32]:
min_day_inflammation <- apply(data, 2, min)
plot(min_day_inflammation)

plot is a function with many arguments so you will probably need to study a lot of examples to do what you want (change an axis, name an axis, change the plot points and/or lines, add title, add grids, add legend, color the graph, add arrows and text etc.)
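
As a starting point, here is a sketch that sets several of the options mentioned above in one call. The data are made-up values rather than the inflammation averages, and the labels and colour are arbitrary choices.

```r
# Made-up data for illustration
day <- 1:9
avg <- c(0, 1, 3, 6, 10, 6, 3, 1, 0)

plot(day, avg,
     type = "b",                       # both points and lines
     col = "blue",                     # colour of the points and lines
     xlab = "Day",                     # x-axis name
     ylab = "Average inflammation",    # y-axis name
     main = "Average inflammation per day")
grid()                                 # add grid lines to the existing plot
legend("topright", legend = "average", col = "blue", lty = 1, pch = 1)
```

grid() and legend() draw on top of the plot that is already open, so they must be called after plot().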

R Scripts

So far we have been typing directly into the R command line. What we could also do is save a sequence of commands in an R source file to run it at will. The way to do this is to have such a file with an .R extension and use the function source to run it.

If the source file contains an analysis from the beginning to the end it is a good practice to always clear your session of variables using rm(list=ls()). On the other hand if it is used as a library, for example to load some functions you have created, then you probably should not do it. You can also include other source files inside your current source file using the function (you guessed it): source
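
A minimal sketch of the library-style use of source follows. The script name helpers.R and the function square are hypothetical; the script is written to a temporary directory only so the example is self-contained.

```r
# Write a tiny "library" script to a temporary location (hypothetical file name)
script_path <- file.path(tempdir(), "helpers.R")
writeLines("square <- function(x) x^2", script_path)

# Run the script; the function it defines becomes available in the session
source(script_path)
square(4)   # 16
```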

It is good practice to follow coding guidelines/conventions that standardize the way you write your scripts and make them more readable. Google has published some: https://google.github.io/styleguide/Rguide.xml

Functions

Let's learn how to create functions by creating a function fahr_to_kelvin that converts temperatures from Fahrenheit to Kelvin:

In [33]:
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

To run a function:

In [34]:
# freezing point of water
fahr_to_kelvin(32)

# boiling point of water
fahr_to_kelvin(212)
273.15
373.15

Let's also create a function that converts Kelvin to Celsius:

In [35]:
kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

#absolute zero in Celsius
kelvin_to_celsius(0)
-273.15

We can also use functions inside functions

In [36]:
fahr_to_celsius <- function(temp) {
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}

# freezing point of water in Celsius
fahr_to_celsius(32.0)
0

or we can obtain this result by nesting the function calls:

In [37]:
# freezing point of water in Celsius
kelvin_to_celsius(fahr_to_kelvin(32.0))
0
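
Because arithmetic in R is vectorized, a function like this also accepts a whole vector of temperatures. The sketch below restates the conversion in a compact form (the value of the last expression is returned, so return() is optional); the vector temps_f is a made-up input.

```r
fahr_to_kelvin <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15   # value of the last expression is returned
}

temps_f <- c(32, 212, -459.67)       # freezing, boiling, absolute zero (in Fahrenheit)
fahr_to_kelvin(temps_f)              # approximately 273.15 373.15 0.00
```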

For loops

Like most programming languages, R has for loops for repetitive tasks. In general the syntax is:

for (variable in collection) {
  do things with variable
}

Let's do an example:

In [38]:
best_practice <- c("Let", "the", "computer", "do", "the", "work")

print_words <- function(sentence) {
  for (word in sentence) {
    print(word)
  }
}

print_words(best_practice)
[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] "work"

or another example:

In [39]:
len <- 0
vowels <- c("a", "e", "i", "o", "u")
for (v in vowels) {
  len <- len + 1
}
# Number of vowels
len
5
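
When the position of each element is needed as well as its value, a common pattern is to loop over indices with seq_along. The following is a sketch on the same vowels vector (redefined so the block runs on its own).

```r
vowels <- c("a", "e", "i", "o", "u")

# Loop over the positions 1..5 and look up each element by index
for (i in seq_along(vowels)) {
  cat(i, ":", vowels[i], "\n")
}

# Counting elements does not need a loop at all
length(vowels)   # 5
```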

Making decisions (if & else)

To make decisions in your R scripts, R provides you with the standard if-else conditional statements

In [40]:
num <- 37
if (num > 100) {
  print("greater")
} else {
  print("not greater")
}
print("done")

# or let's create a function that uses conditionals

sign <- function(num) {
  if (num > 0) {
    return(1)
  } else if (num == 0) {
    return(0)
  } else {
    return(-1)
  }
}

sign(-3)
[1] "not greater"
[1] "done"
-1

You can make decisions using the comparison operators:

  • greater than (>),
  • less than (<),
  • equal to (==),
  • greater than or equal to (>=),
  • less than or equal to (<=),
  • and not equal to (!=).

We can also combine tests. An ampersand, &, symbolizes “and”. A vertical bar, |, symbolizes “or”.
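
Combined tests can be sketched as follows. The function in_range and its arguments are hypothetical names introduced for this example.

```r
# Hypothetical helper: is num between low and high (inclusive)?
in_range <- function(num, low, high) {
  if (num >= low & num <= high) {
    "inside"
  } else {
    "outside"
  }
}

in_range(37, 0, 100)    # "inside"
in_range(137, 0, 100)   # "outside"

num <- -3
if (num < 0 | num > 100) {
  print("out of bounds")
}
```

For single TRUE/FALSE values, as in an if condition, R also offers the short-circuit forms && and ||, which stop evaluating as soon as the result is known.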

Datasets and statistics

By running the function data() we can see the datasets that are included in the R installation. To inspect the iris dataset, for example, we can use the head function:

In [41]:
head(iris)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2   setosa
         4.9          3.0           1.4          0.2   setosa
         4.7          3.2           1.3          0.2   setosa
         4.6          3.1           1.5          0.2   setosa
         5.0          3.6           1.4          0.2   setosa
         5.4          3.9           1.7          0.4   setosa

To see the structure of it:

In [42]:
str(iris)
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

For a statistical summary

In [43]:
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

For the Species column, one can observe that the summary function did not do the standard statistical calculations as it did for the other variables; it counted the observations per level instead. From the str function we can see that the Species column is a factor, which is what R uses to represent categorical or ordinal values. More on that in the next section.

We can also extract a whole column by name, for example Sepal.Length:

In [44]:
iris$Sepal.Length
  1. 5.1
  2. 4.9
  3. 4.7
  4. 4.6
  5. 5
  6. 5.4
  7. 4.6
  8. 5
  9. 4.4
  10. 4.9
  11. 5.4
  12. 4.8
  13. 4.8
  14. 4.3
  15. 5.8
  16. 5.7
  17. 5.4
  18. 5.1
  19. 5.7
  20. 5.1
  21. 5.4
  22. 5.1
  23. 4.6
  24. 5.1
  25. 4.8
  26. 5
  27. 5
  28. 5.2
  29. 5.2
  30. 4.7
  31. 4.8
  32. 5.4
  33. 5.2
  34. 5.5
  35. 4.9
  36. 5
  37. 5.5
  38. 4.9
  39. 4.4
  40. 5.1
  41. 5
  42. 4.5
  43. 4.4
  44. 5
  45. 5.1
  46. 4.8
  47. 5.1
  48. 4.6
  49. 5.3
  50. 5
  51. 7
  52. 6.4
  53. 6.9
  54. 5.5
  55. 6.5
  56. 5.7
  57. 6.3
  58. 4.9
  59. 6.6
  60. 5.2
  61. 5
  62. 5.9
  63. 6
  64. 6.1
  65. 5.6
  66. 6.7
  67. 5.6
  68. 5.8
  69. 6.2
  70. 5.6
  71. 5.9
  72. 6.1
  73. 6.3
  74. 6.1
  75. 6.4
  76. 6.6
  77. 6.8
  78. 6.7
  79. 6
  80. 5.7
  81. 5.5
  82. 5.5
  83. 5.8
  84. 6
  85. 5.4
  86. 6
  87. 6.7
  88. 6.3
  89. 5.6
  90. 5.5
  91. 5.5
  92. 6.1
  93. 5.8
  94. 5
  95. 5.6
  96. 5.7
  97. 5.7
  98. 6.2
  99. 5.1
  100. 5.7
  101. 6.3
  102. 5.8
  103. 7.1
  104. 6.3
  105. 6.5
  106. 7.6
  107. 4.9
  108. 7.3
  109. 6.7
  110. 7.2
  111. 6.5
  112. 6.4
  113. 6.8
  114. 5.7
  115. 5.8
  116. 6.4
  117. 6.5
  118. 7.7
  119. 7.7
  120. 6
  121. 6.9
  122. 5.6
  123. 7.7
  124. 6.3
  125. 6.7
  126. 7.2
  127. 6.2
  128. 6.1
  129. 6.4
  130. 7.2
  131. 7.4
  132. 7.9
  133. 6.4
  134. 6.3
  135. 6.1
  136. 7.7
  137. 6.3
  138. 6.4
  139. 6
  140. 6.9
  141. 6.7
  142. 6.9
  143. 5.8
  144. 6.8
  145. 6.7
  146. 6.7
  147. 6.3
  148. 6.5
  149. 6.2
  150. 5.9

and run the standard statistics:

In [45]:
print("mean")
mean(iris$Sepal.Length)
print("median")
median(iris$Sepal.Length)
print("min")
min(iris$Sepal.Length)
print("max")
max(iris$Sepal.Length)
print("sd")
sd(iris$Sepal.Length)
print("var")
var(iris$Sepal.Length)
print("range")
range(iris$Sepal.Length)

# or other functions like:
print("sort")
sort(iris$Sepal.Length)
print("length")
length(iris$Sepal.Length)
[1] "mean"
5.84333333333333
[1] "median"
5.8
[1] "min"
4.3
[1] "max"
7.9
[1] "sd"
0.828066127977863
[1] "var"
0.685693512304251
[1] "range"
  1. 4.3
  2. 7.9
[1] "sort"
  1. 4.3
  2. 4.4
  3. 4.4
  4. 4.4
  5. 4.5
  6. 4.6
  7. 4.6
  8. 4.6
  9. 4.6
  10. 4.7
  11. 4.7
  12. 4.8
  13. 4.8
  14. 4.8
  15. 4.8
  16. 4.8
  17. 4.9
  18. 4.9
  19. 4.9
  20. 4.9
  21. 4.9
  22. 4.9
  23. 5
  24. 5
  25. 5
  26. 5
  27. 5
  28. 5
  29. 5
  30. 5
  31. 5
  32. 5
  33. 5.1
  34. 5.1
  35. 5.1
  36. 5.1
  37. 5.1
  38. 5.1
  39. 5.1
  40. 5.1
  41. 5.1
  42. 5.2
  43. 5.2
  44. 5.2
  45. 5.2
  46. 5.3
  47. 5.4
  48. 5.4
  49. 5.4
  50. 5.4
  51. 5.4
  52. 5.4
  53. 5.5
  54. 5.5
  55. 5.5
  56. 5.5
  57. 5.5
  58. 5.5
  59. 5.5
  60. 5.6
  61. 5.6
  62. 5.6
  63. 5.6
  64. 5.6
  65. 5.6
  66. 5.7
  67. 5.7
  68. 5.7
  69. 5.7
  70. 5.7
  71. 5.7
  72. 5.7
  73. 5.7
  74. 5.8
  75. 5.8
  76. 5.8
  77. 5.8
  78. 5.8
  79. 5.8
  80. 5.8
  81. 5.9
  82. 5.9
  83. 5.9
  84. 6
  85. 6
  86. 6
  87. 6
  88. 6
  89. 6
  90. 6.1
  91. 6.1
  92. 6.1
  93. 6.1
  94. 6.1
  95. 6.1
  96. 6.2
  97. 6.2
  98. 6.2
  99. 6.2
  100. 6.3
  101. 6.3
  102. 6.3
  103. 6.3
  104. 6.3
  105. 6.3
  106. 6.3
  107. 6.3
  108. 6.3
  109. 6.4
  110. 6.4
  111. 6.4
  112. 6.4
  113. 6.4
  114. 6.4
  115. 6.4
  116. 6.5
  117. 6.5
  118. 6.5
  119. 6.5
  120. 6.5
  121. 6.6
  122. 6.6
  123. 6.7
  124. 6.7
  125. 6.7
  126. 6.7
  127. 6.7
  128. 6.7
  129. 6.7
  130. 6.7
  131. 6.8
  132. 6.8
  133. 6.8
  134. 6.9
  135. 6.9
  136. 6.9
  137. 6.9
  138. 7
  139. 7.1
  140. 7.2
  141. 7.2
  142. 7.2
  143. 7.3
  144. 7.4
  145. 7.6
  146. 7.7
  147. 7.7
  148. 7.7
  149. 7.7
  150. 7.9
[1] "length"
150
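
The relationships between these statistics can be checked directly. This sketch uses the built-in iris data and all.equal, which compares numbers up to a small floating-point tolerance.

```r
x <- iris$Sepal.Length

# The variance is the square of the standard deviation
all.equal(sd(x)^2, var(x))                                    # TRUE

# var() uses the sample formula with n - 1 in the denominator
all.equal(var(x), sum((x - mean(x))^2) / (length(x) - 1))     # TRUE

# range() is just the minimum and the maximum together
identical(range(x), c(min(x), max(x)))                        # TRUE
```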

Factors

The factor() command is used to create and modify factors in R:

In [46]:
sex <- factor(c("male", "female", "female", "male"))

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can check this by using the function levels(), and check the number of levels using nlevels():

In [47]:
levels(sex)
nlevels(sex)
  1. 'female'
  2. 'male'
2

Sometimes the order of the levels does not matter; other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or because it is required by a particular type of analysis. Additionally, declaring the levels as ordered allows us to compare them:

In [48]:
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)
  1. 'high'
  2. 'low'
  3. 'medium'
In [49]:
food <- factor(food, levels = c("low", "medium", "high"))
levels(food)
  1. 'low'
  2. 'medium'
  3. 'high'
In [50]:
min(food) ## doesn't work
Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), .Label = c("low", : ‘min’ not meaningful for factors
Traceback:

1. Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), .Label = c("low", 
 . "medium", "high"), class = "factor"), na.rm = FALSE)
2. stop(gettextf("%s not meaningful for factors", sQuote(.Generic)))
In [ ]:
food <- factor(food, levels = c("low", "medium", "high"), ordered=TRUE)
levels(food)
In [ ]:
min(food) ## works!
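
Since the last two cells above are shown without output, here is a self-contained sketch of what an ordered factor allows. It rebuilds food in a single call and is meant as an illustration, not as the original session's output.

```r
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
               levels = c("low", "medium", "high"), ordered = TRUE)

min(food)            # the smallest level present: low
food[1] < food[2]    # "low" < "high": TRUE
as.integer(food)     # the underlying integer codes: 1 3 2 3 1 2 3
table(food)          # counts per level: low 2, medium 2, high 3
```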

Challenge

Use the dataset AirPassengers that comes with R; it holds the number of airline passengers, in thousands, that traveled each month from 1949 to 1960. Because the dataset is a time series (class ts), you can convert it to a data.frame with the following commands:

> dn = list(paste("Y", as.character(1949:1960), sep = ""), month.abb)
> airmat = matrix(AirPassengers, 12, byrow = TRUE, dimnames = dn)
> air = as.data.frame(t(airmat))
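
Before tackling the questions, it may help to confirm the shape of the result. The sketch below repeats the three commands without the prompt characters (so it can be pasted as is) and inspects the resulting data frame: months are rows and years are columns.

```r
dn <- list(paste("Y", as.character(1949:1960), sep = ""), month.abb)
airmat <- matrix(AirPassengers, 12, byrow = TRUE, dimnames = dn)
air <- as.data.frame(t(airmat))

dim(air)               # 12 months by 12 years
rownames(air)[1:3]     # "Jan" "Feb" "Mar"
colnames(air)[1:3]     # "Y1949" "Y1950" "Y1951"
air["Jan", "Y1949"]    # passengers (in thousands) in January 1949: 112
```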

Then try to answer the next questions/problems:

  • Use the help functionality to try and learn about the functions used above.
  • How many passengers traveled on average in the year 1951?
  • What is the maximum number of passengers for the months January and February?
  • Calculate the summation per year and assign the result to a vector.
  • Plot the vector nicely (names in axes, point and lines for the graph, title the graph, add grid lines)
  • Repeat the last two bullets for every month for all the years.

Tip: to transform a row of the data frame to a vector you can use unlist (e.g. unlist(air["Jan",])).

Acknowledgements

This tutorial, besides drawing on the author's knowledge of R, is also derived from other material, specifically: an introductory R leaflet from the Pattern Recognition course of the Electrical and Computer Engineering Department of the Aristotle University of Thessaloniki (Fall semester 2016, author: Themistoklis Diamantopoulos, professor: Andreas L. Symeonidis), the leaflet from the course CS545 Machine Learning (Fall 2008, author and professor: Charles W. Anderson) and the Software Carpentry lesson Programming in R, from which the inflammation data were also taken.

License

This work is made available under the Creative Commons Attribution license.