# Introduction to R

## The basics

R is an open-source programming language and a free software environment, used mainly for statistical computing and graphics. Information about R can be found on the official website. Searching for the keyword R together with topic-specific words on sites like Google turns up additional information in blog posts, tutorials, documents, etc.

Even though R comes with its own environment (command-line and graphical interfaces), one can use the popular RStudio, which offers additional graphical functionality.

When in the R environment (the R prompt is >), one can exit by calling the quit() function, or q() for short. When asked whether you want to save the workspace, replying y (yes) saves all the variables of the current R session into a file named .RData in the current working directory. If you later start R in the same directory, the variables are loaded automatically. To find out the current working directory, use the getwd function:

In [3]:
getwd()

'/home/kyrcha/Workspaces/github/ml-tutorials/R/Introduction'

To set the working directory one can use the setwd function:

In [4]:
setwd("~/Desktop")


What you type at the R prompt is an expression, which R attempts to evaluate, printing the result. For example, getwd() is an expression that is evaluated by calling the function getwd with no arguments. The same holds for 42:

In [5]:
42

42

and the same for

In [6]:
(100 * 2 - 12 ^ 2) / 7 * 5 + 2

42

There are also predefined constants like pi (Euler's number e is not built in, but exp(1) returns it):

In [7]:
sin(pi/2)

1
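As a small sketch of working with constants: pi is built in, while Euler's number is usually obtained through exp(1). The variable names circle_area and e below are chosen only for illustration.

```r
# pi is a built-in constant
circle_area <- pi * 2^2   # area of a circle with radius 2

# Euler's number is obtained through the exponential function
e <- exp(1)

circle_area
log(e)                    # natural logarithm of e
```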

To read the documentation of a specific function, enter ?sum or help(sum). To search for functions, use help.search("sin"). For certain functions one can see usage examples with example(plot). Comments start with #, and values are assigned to variables with <- or =. For example:

In [8]:
a <- 42
b <- (42 + a) / 2
print(a)
print(b)

[1] 42
[1] 42


With ls() one can check all the variables existing in the current R session.

In [9]:
ls()
# to delete all the variables in the current session, use:
rm(list=ls())

1. 'a'
2. 'b'
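Besides saving the whole workspace on exit, individual variables can be saved and restored explicitly. The following is a minimal sketch using save() and load(); the temporary file and the variable name answer are assumptions made for the example.

```r
# write to a temporary file so nothing in the working directory is touched
tmp_file <- tempfile(fileext = ".RData")

answer <- 42
save(answer, file = tmp_file)   # store the variable on disk

rm(answer)                      # remove it from the session
"answer" %in% ls()              # check whether it is still listed

load(tmp_file)                  # restore it under its original name
answer
```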

## Vectors

Create the vector a = (10, 5, 3, 100, -2, 5, -50)

In [10]:
a <- c(10, 5, 3, 100, -2, 5, -50)
a

1. 10
2. 5
3. 3
4. 100
5. -2
6. 5
7. -50

Select the elements of the vector with indices 1, 3 and 4:

In [11]:
a[c(1,3:4)]

1. 10
2. 3
3. 100

The above expression uses the c() function for combining values and the : operator that generates sequences from:to with step 1. Another easy way of specifying sequences is to use the seq function.

In [12]:
c(1, 2, 7, 10)
1:10
seq(1, 6, by = 1)
seq(1, 6, by = 2)
seq(1, by = 2, length.out = 6)

1. 1
2. 2
3. 7
4. 10
1. 1
2. 2
3. 3
4. 4
5. 5
6. 6
7. 7
8. 8
9. 9
10. 10
1. 1
2. 2
3. 3
4. 4
5. 5
6. 6
1. 1
2. 3
3. 5
1. 1
2. 3
3. 5
4. 7
5. 9
6. 11

Type ?seq to get to know the function.
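A few more ways of generating sequences, as a short sketch (the variable names are only illustrative):

```r
# a fixed number of equally spaced values between two end points
quarters <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# the integers 1..n
first_four <- seq_len(4)                # 1 2 3 4

# the : operator also counts downwards
countdown <- 10:7                       # 10 9 8 7

# repeat a pattern
pattern <- rep(c(1, 2), times = 3)      # 1 2 1 2 1 2

quarters
first_four
countdown
pattern
```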

To check the type of a variable there is the class function:

In [13]:
class(a)

'numeric'

To check which elements of a have a value greater than 5:

In [14]:
a > 5
which(a>5)
# returns the indices for which the values are TRUE

1. TRUE
2. FALSE
3. FALSE
4. TRUE
5. FALSE
6. FALSE
7. FALSE
1. 1
2. 4

To get the positive elements of a:

In [15]:
b <- a > 0
positives <- a[b]
positives
# or more succinctly
positives <- a[a>0]
positives

1. 10
2. 5
3. 3
4. 100
5. 5
1. 10
2. 5
3. 3
4. 100
5. 5
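Logical vectors can also be combined and summarized. The sketch below redefines the vector a so that it runs on its own; small_positives is a name introduced only for this example.

```r
a <- c(10, 5, 3, 100, -2, 5, -50)

# combine two conditions element-wise with &
small_positives <- a[a > 0 & a < 10]   # 5 3 5

# TRUE counts as 1, so sum() counts the matching elements
sum(a > 5)                             # 2

# any() and all() reduce a logical vector to a single value
any(a < 0)                             # TRUE
all(a > 0)                             # FALSE
```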

To check the length of a vector:

In [16]:
length(a)

7

One can also bind vectors as columns (cbind()) or as rows (rbind()):

In [17]:
c <- 1:7
rbind(a,c)
cbind(a,c)

a  10  5  3  100  -2  5  -50
c   1  2  3    4   5  6    7

a    c
10   1
5    2
3    3
100  4
-2   5
5    6
-50  7

## Matrices

To create matrices, use the matrix() function.

This section covers creating matrices, row and column sums (rowSums, colSums), the mean, and multiplication.

In [18]:
matrix(10,3, 2)
# or
matrix(c(1,2,3,4,5,6), 3, 2)
# or
matrix(c(1,2,3), 3, 2)

10 10
10 10
10 10

1 4
2 5
3 6

1 1
2 2
3 3

But let's examine how we are calling the matrix function:

In [19]:
args(matrix)

function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
NULL

So the first argument is the data; with the nrow and ncol arguments we declare the number of rows and columns; and with the byrow argument we declare whether to fill in the matrix column-by-column (byrow=FALSE) or row-by-row (byrow=TRUE). In the above calls we didn't pass byrow because it has the default value byrow=FALSE, as we can also check in the documentation, ?matrix.

In [20]:
m = matrix(1:9, byrow = TRUE, nrow=3)
m

1 2 3
4 5 6
7 8 9

Here we have filled in a matrix with the values 1 to 9, by row, with the number of rows equal to 3. This gives us a square 3x3 matrix. R infers that the number of columns should be 3 as well, from the length of the data and the number of rows.
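To make the effect of byrow concrete, here is a small sketch that fills the same values both ways (by_col and by_row are names chosen for the example):

```r
# default: filled column-by-column
by_col <- matrix(1:6, nrow = 2)

# filled row-by-row
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)

by_col[1, ]   # first row: 1 3 5
by_row[1, ]   # first row: 1 2 3

# the number of columns is inferred from the data length and nrow
dim(by_col)   # 2 3
```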

We can also call cbind and rbind and other functions like rowSums, colSums, mean, t for transpose etc.

In [21]:
m2 <- rbind(m, m)
m2
rowSums(m2)
colSums(m2)
mean(m2)

1 2 3
4 5 6
7 8 9
1 2 3
4 5 6
7 8 9
1. 6
2. 15
3. 24
4. 6
5. 15
6. 24
1. 24
2. 30
3. 36
5

For element-wise multiplication one can use the * operator, while for matrix multiplication one can use the %*% operator.

In [22]:
am <- matrix(10:18, byrow = TRUE, nrow = 3)
am
bm <- matrix(c(3,6,7,10,8,1,2,3,2), byrow = TRUE, nrow = 3)
bm
am * bm
am %*% bm
t(am)

10 11 12
13 14 15
16 17 18

3 6 7
10 8 1
2 3 2

30 66 84
130 112 15
32 51 36

164 184 105
209 235 135
254 286 165

10 13 16
11 14 17
12 15 18
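To see what %*% computes, one entry of the product can be reproduced by hand: entry (1, 1) is the sum of the element-wise products of row 1 of the first matrix and column 1 of the second. A minimal sketch, redefining am and bm so that it is self-contained (entry_11 is an illustrative name):

```r
am <- matrix(10:18, byrow = TRUE, nrow = 3)
bm <- matrix(c(3, 6, 7, 10, 8, 1, 2, 3, 2), byrow = TRUE, nrow = 3)

# row 1 of am times column 1 of bm: 10*3 + 11*10 + 12*2
entry_11 <- sum(am[1, ] * bm[, 1])
entry_11              # 164

# the same value as computed by the matrix product
(am %*% bm)[1, 1]     # 164
```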

## Data frames

Unlike matrices, data frames can store values of different types in their columns. They are used extensively in R for data analysis. Rows usually hold the observations (or samples) and columns hold the characteristics (or attributes or features). When we read tabular data from a file with a function like read.csv, the result is a data frame. Download the zip file r-novice-inflammation.zip and unzip it on the Desktop. Examine the file inflammation-01.csv with a text editor to see what we are going to load. Then read the file:

In [23]:
data <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
# Notice the use of the path including data/ since we previously set the working directory as the Desktop
getwd()
dir()

'/home/kyrcha/Desktop'
'data'

The dir function returns the files and directories in the current working directory. The argument header=FALSE lets the read.csv function know that there is no header row to give the columns names.
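The effect of header=FALSE can be tried without the inflammation files. The sketch below writes a tiny, made-up CSV to a temporary file (the file and the name toy are assumptions for the example) and reads it back:

```r
# create a two-line CSV file without a header row
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("0,1,2", "3,4,5"), tmp_csv)

# header = FALSE: both lines are data, columns get the names V1, V2, V3
toy <- read.csv(file = tmp_csv, header = FALSE)
names(toy)
dim(toy)      # 2 3

# with the default header = TRUE the first line would be used up as column names
toy_header <- read.csv(file = tmp_csv)
nrow(toy_header)   # 1
```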

With head(data) we can check whether the data were loaded correctly. It returns the first few rows:

In [24]:
head(data)

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
0  0  1  3  1  2  4  7  8  3   ... 4   4   5   7   3   4   2   3   0   0
0  1  2  1  2  1  3  2  2  6   ... 3   5   4   4   5   5   1   1   0   1
0  1  1  3  3  2  6  2  5  9   ... 10  5   4   2   2   3   2   2   1   1
0  0  2  0  4  2  2  1  6  7   ... 3   5   6   3   3   4   2   3   2   1
0  1  1  3  3  1  3  5  2  4   ... 9   6   3   2   2   4   2   0   1   1
0  0  1  2  2  4  2  1  6  4   ... 8   4   7   3   5   4   4   3   2   1

Other functions we can use are:

In [25]:
# type of the variable
class(data)
# dimensions
dim(data)
# structure
str(data)
# statistical summarization of the data frame
summary(data)

'data.frame'
1. 60
2. 40
'data.frame':	60 obs. of  40 variables:
$ V1 : int  0 0 0 0 0 0 0 0 0 0 ...
$ V2 : int  0 1 1 0 1 0 0 0 0 1 ...
$ V3 : int  1 2 1 2 1 1 2 1 0 1 ...
$ V4 : int  3 1 3 0 3 2 2 2 3 2 ...
$ V5 : int  1 2 3 4 3 2 4 3 1 1 ...
$ V6 : int  2 1 2 2 1 4 2 1 5 3 ...
$ V7 : int  4 3 6 2 3 2 2 2 6 5 ...
$ V8 : int  7 2 2 1 5 1 5 3 5 3 ...
$ V9 : int  8 2 5 6 2 6 5 5 5 5 ...
$ V10: int  3 6 9 7 4 4 8 3 8 8 ...
$ V11: int  3 10 5 10 4 7 6 7 2 6 ...
$ V12: int  3 11 7 7 7 6 5 8 4 8 ...
$ V13: int  10 5 4 9 6 6 11 8 11 12 ...
$ V14: int  5 9 5 13 5 9 9 5 12 5 ...
$ V15: int  7 4 4 8 3 9 4 10 10 13 ...
$ V16: int  4 4 15 8 10 15 13 9 11 6 ...
$ V17: int  7 7 5 15 8 4 5 15 9 13 ...
$ V18: int  7 16 11 10 10 16 12 11 10 8 ...
$ V19: int  12 8 9 10 6 18 10 18 17 16 ...
$ V20: int  18 6 10 7 17 12 6 19 11 8 ...
$ V21: int  6 18 19 17 9 12 9 20 6 18 ...
$ V22: int  13 4 14 4 14 5 17 8 16 15 ...
$ V23: int  11 12 12 4 9 18 15 5 12 16 ...
$ V24: int  11 5 17 7 7 9 8 13 6 14 ...
$ V25: int  7 12 7 6 13 5 9 15 8 12 ...
$ V26: int  7 7 12 15 9 3 3 10 14 7 ...
$ V27: int  4 11 11 6 12 10 13 6 6 3 ...
$ V28: int  6 5 7 4 6 3 7 10 13 8 ...
$ V29: int  8 11 4 9 7 12 8 6 10 9 ...
$ V30: int  8 3 2 11 7 7 2 7 11 11 ...
$ V31: int  4 3 10 3 9 8 8 4 4 2 ...
$ V32: int  4 5 5 5 6 4 8 9 6 5 ...
$ V33: int  5 4 4 6 3 7 4 3 4 4 ...
$ V34: int  7 4 2 3 2 3 2 5 7 5 ...
$ V35: int  3 5 2 3 2 5 3 2 6 1 ...
$ V36: int  4 5 3 4 4 4 5 5 3 4 ...
$ V37: int  2 1 2 2 2 4 4 3 2 1 ...
$ V38: int  3 1 2 3 0 3 1 2 1 2 ...
$ V39: int  0 0 1 2 1 2 1 2 0 0 ...
$ V40: int  0 1 1 1 1 1 1 1 0 0 ...

       V1          V2             V3              V4             V5
Min.   :0   Min.   :0.00   Min.   :0.000   Min.   :0.00   Min.   :1.000
1st Qu.:0   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:1.000
Median :0   Median :0.00   Median :1.000   Median :2.00   Median :2.000
Mean   :0   Mean   :0.45   Mean   :1.117   Mean   :1.75   Mean   :2.433
3rd Qu.:0   3rd Qu.:1.00   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:3.000
Max.   :0   Max.   :1.00   Max.   :2.000   Max.   :3.00   Max.   :4.000
V6             V7            V8              V9             V10
Min.   :1.00   Min.   :1.0   Min.   :1.000   Min.   :2.000   Min.   :2.000
1st Qu.:2.00   1st Qu.:2.0   1st Qu.:2.000   1st Qu.:4.000   1st Qu.:3.750
Median :3.00   Median :4.0   Median :4.000   Median :5.000   Median :6.000
Mean   :3.15   Mean   :3.8   Mean   :3.883   Mean   :5.233   Mean   :5.517
3rd Qu.:4.00   3rd Qu.:5.0   3rd Qu.:5.250   3rd Qu.:7.000   3rd Qu.:7.000
Max.   :5.00   Max.   :6.0   Max.   :7.000   Max.   :8.000   Max.   :9.000
V11             V12             V13             V14
Min.   : 2.00   Min.   : 2.00   Min.   : 3.00   Min.   : 3.000
1st Qu.: 4.00   1st Qu.: 3.75   1st Qu.: 5.00   1st Qu.: 5.000
Median : 6.00   Median : 5.50   Median : 9.50   Median : 8.000
Mean   : 5.95   Mean   : 5.90   Mean   : 8.35   Mean   : 7.733
3rd Qu.: 9.00   3rd Qu.: 8.00   3rd Qu.:11.00   3rd Qu.:10.000
Max.   :10.00   Max.   :11.00   Max.   :12.00   Max.   :13.000
V15              V16            V17              V18
Min.   : 3.000   Min.   : 3.0   Min.   : 4.000   Min.   : 5.00
1st Qu.: 5.000   1st Qu.: 6.0   1st Qu.: 6.750   1st Qu.: 7.75
Median : 8.000   Median :10.0   Median : 8.500   Median :11.00
Mean   : 8.367   Mean   : 9.5   Mean   : 9.583   Mean   :10.63
3rd Qu.:12.000   3rd Qu.:13.0   3rd Qu.:13.000   3rd Qu.:13.00
Max.   :14.000   Max.   :15.0   Max.   :16.000   Max.   :17.00
V19             V20             V21             V22
Min.   : 5.00   Min.   : 5.00   Min.   : 5.00   Min.   : 4.00
1st Qu.: 8.00   1st Qu.: 8.75   1st Qu.: 9.00   1st Qu.: 8.00
Median :11.50   Median :13.00   Median :14.00   Median :13.00
Mean   :11.57   Mean   :12.35   Mean   :13.25   Mean   :11.97
3rd Qu.:15.00   3rd Qu.:16.00   3rd Qu.:16.25   3rd Qu.:15.25
Max.   :18.00   Max.   :19.00   Max.   :20.00   Max.   :19.00
V23             V24             V25            V26
Min.   : 4.00   Min.   : 4.00   Min.   : 4.0   Min.   : 3.000
1st Qu.: 7.75   1st Qu.: 6.75   1st Qu.: 7.0   1st Qu.: 5.750
Median :11.00   Median :10.00   Median :10.5   Median : 9.000
Mean   :11.03   Mean   :10.17   Mean   :10.0   Mean   : 8.667
3rd Qu.:15.00   3rd Qu.:13.25   3rd Qu.:13.0   3rd Qu.:12.000
Max.   :18.00   Max.   :17.00   Max.   :16.0   Max.   :15.000
V27             V28             V29              V30
Min.   : 3.00   Min.   : 3.00   Min.   : 3.000   Min.   : 2.000
1st Qu.: 7.00   1st Qu.: 5.00   1st Qu.: 5.000   1st Qu.: 3.000
Median :10.00   Median : 7.00   Median : 7.000   Median : 7.000
Mean   : 9.15   Mean   : 7.25   Mean   : 7.333   Mean   : 6.583
3rd Qu.:12.00   3rd Qu.: 9.00   3rd Qu.:10.000   3rd Qu.:10.000
Max.   :14.00   Max.   :13.00   Max.   :12.000   Max.   :11.000
V31              V32            V33             V34           V35
Min.   : 2.000   Min.   :2.00   Min.   :2.000   Min.   :1.0   Min.   :1.0
1st Qu.: 4.000   1st Qu.:4.00   1st Qu.:4.000   1st Qu.:2.0   1st Qu.:2.0
Median : 6.000   Median :6.00   Median :5.000   Median :4.0   Median :3.0
Mean   : 6.067   Mean   :5.95   Mean   :5.117   Mean   :3.6   Mean   :3.3
3rd Qu.: 8.000   3rd Qu.:8.00   3rd Qu.:6.000   3rd Qu.:5.0   3rd Qu.:5.0
Max.   :10.000   Max.   :9.00   Max.   :8.000   Max.   :7.0   Max.   :6.0
V36             V37             V38           V39             V40
Min.   :1.000   Min.   :1.000   Min.   :0.0   Min.   :0.000   Min.   :0.0000
1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0.0   1st Qu.:0.000   1st Qu.:0.0000
Median :4.000   Median :2.000   Median :1.0   Median :1.000   Median :1.0000
Mean   :3.567   Mean   :2.483   Mean   :1.5   Mean   :1.133   Mean   :0.5667
3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:3.0   3rd Qu.:2.000   3rd Qu.:1.0000
Max.   :5.000   Max.   :4.000   Max.   :3.0   Max.   :2.000   Max.   :1.0000  

There are also different ways to select slices of data from the data frame:

In [26]:
# first value in data
data[1, 1]

# middle value in data
data[30, 20]

# first four rows and first ten columns
data[1:4, 1:10]

# doesn't have to start from the beginning
data[5:10, 1:10]

# specific rows and columns
data[c(3, 8, 37, 56), c(10, 14, 29)]

# All columns from row 5
data[5, ]

# All rows from column 16
data[, 16]

0
16
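V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
0  0  1  3  1  2  4  7  8  3
0  1  2  1  2  1  3  2  2  6
0  1  1  3  3  2  6  2  5  9
0  0  2  0  4  2  2  1  6  7

    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
5   0  1  1  3  3  1  3  5  2  4
6   0  0  1  2  2  4  2  1  6  4
7   0  0  2  2  4  2  2  5  5  8
8   0  0  1  2  3  1  2  3  5  3
9   0  0  0  3  1  5  6  5  5  8
10  0  1  1  2  1  3  5  3  5  8

    V10 V14 V29
3   9   5   4
8   3   5   6
37  6   9   10
56  7   11  9

   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
5  0  1  1  3  3  1  3  5  2  4   ... 9   6   3   2   2   4   2   0   1   1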
1. 4
2. 4
3. 15
4. 8
5. 10
6. 15
7. 13
8. 9
9. 11
10. 6
11. 3
12. 8
13. 12
14. 3
15. 5
16. 10
17. 11
18. 4
19. 11
20. 13
21. 15
22. 5
23. 14
24. 13
25. 4
26. 9
27. 13
28. 6
29. 7
30. 6
31. 14
32. 3
33. 15
34. 4
35. 15
36. 11
37. 7
38. 10
39. 15
40. 6
41. 5
42. 6
43. 15
44. 11
45. 15
46. 6
47. 11
48. 15
49. 14
50. 4
51. 10
52. 15
53. 11
54. 6
55. 13
56. 8
57. 4
58. 13
59. 12
60. 9

A subtle point is that the last selection returned a vector instead of a data frame. This is because we selected only a single column. If you don't want this behavior, pass drop=FALSE:

In [27]:
# All columns from row 5
d1 <- data[5, ]
class(d1)

# All rows from column 16
d2 <- data[, 16]
class(d2)

d3 <- data[, 16, drop=FALSE]
class(d3)

'data.frame'
'integer'
'data.frame'
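The same distinction shows up with the other ways of selecting a single column. A short sketch on a made-up data frame (toy and its column names are assumptions for the example):

```r
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

class(toy[, "score"])                 # "numeric": simplified to a vector
class(toy[, "score", drop = FALSE])   # "data.frame": dimensions are kept
class(toy["score"])                   # "data.frame": single-bracket list-style selection
class(toy[["score"]])                 # "numeric": extracts the column itself
class(toy$score)                      # "numeric": same as [[ ]] by name
```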

Other functions you can call are min, max, mean, sd and median to get statistical values of interest:

In [28]:
# first row, all of the columns
patient_1 <- data[1, ]

# max inflammation for patient 1
max(patient_1)

# max inflammation for patient 2
max(data[2, ])

# minimum inflammation on day 7
min(data[, 7])

# mean inflammation on day 7
mean(data[, 7])

# median inflammation on day 7
median(data[, 7])

# standard deviation of inflammation on day 7
sd(data[, 7])

18
18
1
3.8
4
1.72518729025016

To do more complex calculations, like the maximum inflammation for each patient or the average for each day, we need to apply the function max or mean per row or per column respectively. Luckily, the function apply applies a function over one of the "margins" of the data: 1 for rows and 2 for columns:

In [29]:
args(apply) # args() returns a function with the same arguments and a NULL body, which is why NULL is printed below

max_patient_inflammation <- apply(data, 1, max)
max_patient_inflammation

avg_day_inflammation <- apply(data, 2, mean)
avg_day_inflammation

function (X, MARGIN, FUN, ...)
NULL
1. 18
2. 18
3. 19
4. 17
5. 17
6. 18
7. 17
8. 20
9. 17
10. 18
11. 18
12. 18
13. 17
14. 16
15. 17
16. 18
17. 19
18. 19
19. 17
20. 19
21. 19
22. 16
23. 17
24. 15
25. 17
26. 17
27. 18
28. 17
29. 20
30. 17
31. 16
32. 19
33. 15
34. 15
35. 19
36. 17
37. 16
38. 17
39. 19
40. 16
41. 18
42. 19
43. 16
44. 19
45. 18
46. 16
47. 19
48. 15
49. 16
50. 18
51. 14
52. 20
53. 17
54. 15
55. 17
56. 16
57. 17
58. 19
59. 18
60. 18
V1
0
V2
0.45
V3
1.11666666666667
V4
1.75
V5
2.43333333333333
V6
3.15
V7
3.8
V8
3.88333333333333
V9
5.23333333333333
V10
5.51666666666667
V11
5.95
V12
5.9
V13
8.35
V14
7.73333333333333
V15
8.36666666666667
V16
9.5
V17
9.58333333333333
V18
10.6333333333333
V19
11.5666666666667
V20
12.35
V21
13.25
V22
11.9666666666667
V23
11.0333333333333
V24
10.1666666666667
V25
10
V26
8.66666666666667
V27
9.15
V28
7.25
V29
7.33333333333333
V30
6.58333333333333
V31
6.06666666666667
V32
5.95
V33
5.11666666666667
V34
3.6
V35
3.3
V36
3.56666666666667
V37
2.48333333333333
V38
1.5
V39
1.13333333333333
V40
0.566666666666667
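The margins are easier to follow on a small example. The sketch below uses a made-up 2x3 data frame (toy is an assumption for illustration) and also shows that rowMeans and colMeans give the same result as apply with mean:

```r
toy <- data.frame(V1 = c(1, 4), V2 = c(2, 5), V3 = c(3, 6))

# margin 1: one result per row
row_max <- apply(toy, 1, max)     # 3 and 6

# margin 2: one result per column
col_mean <- apply(toy, 2, mean)   # V1 = 2.5, V2 = 3.5, V3 = 4.5

row_max
col_mean

# dedicated helpers exist for the common cases of sums and means
all.equal(col_mean, colMeans(toy))
```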

Now let's do some plotting:

In [30]:
plot(avg_day_inflammation)

In [31]:
max_day_inflammation <- apply(data, 2, max)
plot(max_day_inflammation)

In [32]:
min_day_inflammation <- apply(data, 2, min)
plot(min_day_inflammation)


plot is a function with many arguments, so you will probably need to study a lot of examples to do what you want (change an axis, name an axis, change the plot points and/or lines, add a title, add grids, add a legend, color the graph, add arrows and text, etc.).
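As a starting point, here is a minimal sketch that uses a few of those arguments. The plotted values are simply the squares of 1 to 10, chosen for illustration, and the labels are placeholders.

```r
days <- 1:10
values <- days^2

# type = "b" draws both points and lines
plot(days, values,
     type = "b",
     col = "blue",
     xlab = "Day",
     ylab = "Value",
     main = "Squares of the first ten integers")

# add grid lines and a legend to the existing plot
grid()
legend("topleft", legend = "value", col = "blue", lty = 1, pch = 1)
```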

## R Scripts

So far we have been typing directly into the R command line. We can also save a sequence of commands in an R source file and run it at will. To do this, save the commands in a file with an .R extension and use the function source to run it.

If the source file contains an analysis from beginning to end, it is good practice to always clear your session of variables using rm(list=ls()). On the other hand, if it is used as a library, for example to load some functions you have created, then you probably should not do it. You can also include other source files inside your current source file using the function (you guessed it) source.

It is good practice to follow coding guidelines/conventions that standardize the way you write your scripts and make them more readable. Google publishes some for R: https://google.github.io/styleguide/Rguide.xml
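A minimal sketch of the source workflow: the script is written to a temporary file here only so that the example is self-contained (normally you would create the .R file in an editor), and the names square and result are hypothetical.

```r
# create a small script file
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "square <- function(x) x^2",
  "result <- square(4)"
), script_path)

# run the script; the objects it defines become available in the session
source(script_path)
result    # 16
```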

## Functions

Let's learn how to create functions by creating a function fahr_to_kelvin that converts temperatures from Fahrenheit to Kelvin:

In [33]:
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}


To run a function:

In [34]:
# freezing point of water
fahr_to_kelvin(32)

# boiling point of water
fahr_to_kelvin(212)

273.15
373.15

Let's also create a function that converts Kelvin to Celsius:

In [35]:
kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

#absolute zero in Celsius
kelvin_to_celsius(0)

-273.15

We can also use functions inside functions:

In [36]:
fahr_to_celsius <- function(temp) {
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}

# freezing point of water in Celsius
fahr_to_celsius(32.0)

0

or we can obtain this result by chaining (nesting) the function calls:

In [37]:
# freezing point of water in Celsius
kelvin_to_celsius(fahr_to_kelvin(32.0))

0
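Functions can also take default argument values, and because arithmetic in R is vectorized they accept whole vectors at once. A sketch under those assumptions (the function name fahr_to_celsius_rounded and the digits parameter are introduced only for illustration):

```r
# digits has a default value, so it can be omitted when calling the function
fahr_to_celsius_rounded <- function(temp_f, digits = 1) {
  celsius <- (temp_f - 32) * 5 / 9
  round(celsius, digits)   # the value of the last expression is returned
}

# a whole vector of temperatures is converted in one call
fahr_to_celsius_rounded(c(32, 98.6, 212))   # 0 37 100

# override the default by naming the argument
fahr_to_celsius_rounded(100, digits = 3)
```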

## For loops

Like most programming languages, R has for loops for repeating tasks. In general the syntax is:

for (variable in collection) {
  do things with variable
}



Let's do an example:

In [38]:
best_practice <- c("Let", "the", "computer", "do", "the", "work")

print_words <- function(sentence) {
  for (word in sentence) {
    print(word)
  }
}

print_words(best_practice)

[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] "work"


or another example:

In [39]:
len <- 0
vowels <- c("a", "e", "i", "o", "u")
for (v in vowels) {
  len <- len + 1
}
# Number of vowels
len

5
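When the position of each element is needed as well, a common pattern is to loop over the indices with seq_along. The sketch below also accumulates a running total and compares it with the vectorized sum; the variable names are illustrative.

```r
vowels <- c("a", "e", "i", "o", "u")

# loop over the indices 1..length(vowels)
for (i in seq_along(vowels)) {
  cat(i, vowels[i], "\n")
}

# accumulate a running total in a loop
total <- 0
for (n in 1:5) {
  total <- total + n
}
total       # 15

# many loops like this have a vectorized equivalent
sum(1:5)    # 15
```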

## Making decisions (if & else)

To make decisions in your R scripts, R provides you with the standard if-else conditional statements:

In [40]:
num <- 37
if (num > 100) {
  print("greater")
} else {
  print("not greater")
}
print("done")

# or let's create a function that uses conditionals

sign <- function(num) {
  if (num > 0) {
    return(1)
  } else if (num == 0) {
    return(0)
  } else {
    return(-1)
  }
}

sign(-3)

[1] "not greater"
[1] "done"

-1

You can make decisions using the comparison operators

• greater than (>),
• less than (<),
• equal (==),
• greater than or equal to (>=),
• less than or equal to (<=),
• and not equal to (!=).

We can also combine tests. An ampersand, &, symbolizes “and”. A vertical bar, |, symbolizes “or”.
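A short sketch of combining tests. The operators & and | work element by element on vectors, while the doubled forms && and || compare single values and are the usual choice inside if conditions. The names x and msg are chosen for the example.

```r
x <- c(-5, 20, 150)

# element-wise "and" / "or" on vectors
x > 0 & x < 100    # FALSE TRUE FALSE
x < 0 | x > 100    # TRUE FALSE TRUE

# single-value tests inside an if statement
num <- 37
if (num > 0 && num < 100) {
  msg <- "between 0 and 100"
} else {
  msg <- "outside the range"
}
msg
```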

## Datasets and statistics

By running the function data() we can list the datasets that are included in the R installation. To inspect, for example, the iris dataset, we can use the function head:

In [41]:
head(iris)

Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

To see the structure of it:

In [42]:
str(iris)

'data.frame':	150 obs. of  5 variables:
$ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

For a statistical summary:

In [43]:
summary(iris)

Sepal.Length    Sepal.Width     Petal.Length    Petal.Width     Species
Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

For the Species column, one can observe that the summary function did not do the standard statistical calculations as it did with the other variables; it reported the number of observations per value instead. From the str function we can see that the Species column is a factor, which is what R uses to represent categorical or ordinal values. More on that in the next section. We can also get the full attribute of Sepal.Length, for example, by name:

In [44]:
iris$Sepal.Length

1. 5.1
2. 4.9
3. 4.7
4. 4.6
5. 5
6. 5.4
7. 4.6
8. 5
9. 4.4
10. 4.9
11. 5.4
12. 4.8
13. 4.8
14. 4.3
15. 5.8
16. 5.7
17. 5.4
18. 5.1
19. 5.7
20. 5.1
21. 5.4
22. 5.1
23. 4.6
24. 5.1
25. 4.8
26. 5
27. 5
28. 5.2
29. 5.2
30. 4.7
31. 4.8
32. 5.4
33. 5.2
34. 5.5
35. 4.9
36. 5
37. 5.5
38. 4.9
39. 4.4
40. 5.1
41. 5
42. 4.5
43. 4.4
44. 5
45. 5.1
46. 4.8
47. 5.1
48. 4.6
49. 5.3
50. 5
51. 7
52. 6.4
53. 6.9
54. 5.5
55. 6.5
56. 5.7
57. 6.3
58. 4.9
59. 6.6
60. 5.2
61. 5
62. 5.9
63. 6
64. 6.1
65. 5.6
66. 6.7
67. 5.6
68. 5.8
69. 6.2
70. 5.6
71. 5.9
72. 6.1
73. 6.3
74. 6.1
75. 6.4
76. 6.6
77. 6.8
78. 6.7
79. 6
80. 5.7
81. 5.5
82. 5.5
83. 5.8
84. 6
85. 5.4
86. 6
87. 6.7
88. 6.3
89. 5.6
90. 5.5
91. 5.5
92. 6.1
93. 5.8
94. 5
95. 5.6
96. 5.7
97. 5.7
98. 6.2
99. 5.1
100. 5.7
101. 6.3
102. 5.8
103. 7.1
104. 6.3
105. 6.5
106. 7.6
107. 4.9
108. 7.3
109. 6.7
110. 7.2
111. 6.5
112. 6.4
113. 6.8
114. 5.7
115. 5.8
116. 6.4
117. 6.5
118. 7.7
119. 7.7
120. 6
121. 6.9
122. 5.6
123. 7.7
124. 6.3
125. 6.7
126. 7.2
127. 6.2
128. 6.1
129. 6.4
130. 7.2
131. 7.4
132. 7.9
133. 6.4
134. 6.3
135. 6.1
136. 7.7
137. 6.3
138. 6.4
139. 6
140. 6.9
141. 6.7
142. 6.9
143. 5.8
144. 6.8
145. 6.7
146. 6.7
147. 6.3
148. 6.5
149. 6.2
150. 5.9

and run the standard statistics:

In [45]:
print("mean")
mean(iris$Sepal.Length)
print("median")
median(iris$Sepal.Length)
print("min")
min(iris$Sepal.Length)
print("max")
max(iris$Sepal.Length)
print("sd")
sd(iris$Sepal.Length)
print("var")
var(iris$Sepal.Length)
print("range")
range(iris$Sepal.Length)
# or other functions like:
print("sort")
sort(iris$Sepal.Length)
print("length")
length(iris$Sepal.Length)

[1] "mean"

5.84333333333333
[1] "median"

5.8
[1] "min"

4.3
[1] "max"

7.9
[1] "sd"

0.828066127977863
[1] "var"

0.685693512304251
[1] "range"

1. 4.3
2. 7.9
[1] "sort"

1. 4.3
2. 4.4
3. 4.4
4. 4.4
5. 4.5
6. 4.6
7. 4.6
8. 4.6
9. 4.6
10. 4.7
11. 4.7
12. 4.8
13. 4.8
14. 4.8
15. 4.8
16. 4.8
17. 4.9
18. 4.9
19. 4.9
20. 4.9
21. 4.9
22. 4.9
23. 5
24. 5
25. 5
26. 5
27. 5
28. 5
29. 5
30. 5
31. 5
32. 5
33. 5.1
34. 5.1
35. 5.1
36. 5.1
37. 5.1
38. 5.1
39. 5.1
40. 5.1
41. 5.1
42. 5.2
43. 5.2
44. 5.2
45. 5.2
46. 5.3
47. 5.4
48. 5.4
49. 5.4
50. 5.4
51. 5.4
52. 5.4
53. 5.5
54. 5.5
55. 5.5
56. 5.5
57. 5.5
58. 5.5
59. 5.5
60. 5.6
61. 5.6
62. 5.6
63. 5.6
64. 5.6
65. 5.6
66. 5.7
67. 5.7
68. 5.7
69. 5.7
70. 5.7
71. 5.7
72. 5.7
73. 5.7
74. 5.8
75. 5.8
76. 5.8
77. 5.8
78. 5.8
79. 5.8
80. 5.8
81. 5.9
82. 5.9
83. 5.9
84. 6
85. 6
86. 6
87. 6
88. 6
89. 6
90. 6.1
91. 6.1
92. 6.1
93. 6.1
94. 6.1
95. 6.1
96. 6.2
97. 6.2
98. 6.2
99. 6.2
100. 6.3
101. 6.3
102. 6.3
103. 6.3
104. 6.3
105. 6.3
106. 6.3
107. 6.3
108. 6.3
109. 6.4
110. 6.4
111. 6.4
112. 6.4
113. 6.4
114. 6.4
115. 6.4
116. 6.5
117. 6.5
118. 6.5
119. 6.5
120. 6.5
121. 6.6
122. 6.6
123. 6.7
124. 6.7
125. 6.7
126. 6.7
127. 6.7
128. 6.7
129. 6.7
130. 6.7
131. 6.8
132. 6.8
133. 6.8
134. 6.9
135. 6.9
136. 6.9
137. 6.9
138. 7
139. 7.1
140. 7.2
141. 7.2
142. 7.2
143. 7.3
144. 7.4
145. 7.6
146. 7.7
147. 7.7
148. 7.7
149. 7.7
150. 7.9
[1] "length"

150
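The same statistics can be computed per group. As a sketch, tapply applies a function to the values of one column split by the levels of a factor, here the mean sepal length per species (mean_by_species is an illustrative name):

```r
# split Sepal.Length by Species and take the mean of each group
mean_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
mean_by_species
# setosa 5.006, versicolor 5.936, virginica 6.588

# number of observations per species
table(iris$Species)
```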

## Factors

The factor() command is used to create and modify factors in R:

In [46]:
sex <- factor(c("male", "female", "female", "male"))


R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can check this by using the function levels(), and check the number of levels using nlevels():

In [47]:
levels(sex)
nlevels(sex)

1. 'female'
2. 'male'
2

Sometimes the order of the levels does not matter; other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or because it is required by a particular type of analysis. Additionally, specifying the order of the levels allows us to compare levels:

In [48]:
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)

1. 'high'
2. 'low'
3. 'medium'
In [49]:
food <- factor(food, levels = c("low", "medium", "high"))
levels(food)

1. 'low'
2. 'medium'
3. 'high'
In [50]:
min(food) ## doesn't work

Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), .Label = c("low", : ‘min’ not meaningful for factors
Traceback:

1. Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), .Label = c("low",
. "medium", "high"), class = "factor"), na.rm = FALSE)
2. stop(gettextf("%s not meaningful for factors", sQuote(.Generic)))
In [ ]:
food <- factor(food, levels = c("low", "medium", "high"), ordered=TRUE)
levels(food)

In [ ]:
min(food) ## works!
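A self-contained sketch of what an ordered factor allows, repeating the food example with the order declared at creation time:

```r
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)

min(food)             # low
max(food)             # high

# individual elements can be compared
food[2] > food[1]     # TRUE, because "high" comes after "low"

# counts per level, reported in the declared order
table(food)           # low 2, medium 2, high 3

# the underlying integer codes follow the level order
as.integer(food)      # 1 3 2 3 1 2 3
```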


## Challenge

Use the dataset AirPassengers that comes with R; it records the number of airline passengers (in thousands) that traveled each month from 1949 to 1960. Because the dataset is a time series (class ts), you can convert it to a data.frame with the following commands:

> dn = list(paste("Y", as.character(1949:1960), sep = ""), month.abb)
> airmat = matrix(AirPassengers, 12, byrow = TRUE, dimnames = dn)
> air = as.data.frame(t(airmat))



Then try to answer the next questions/problems:

• Use the help functionality to try and learn about the functions used above.
• How many passengers traveled on average in the year 1951?
• Which is the maximum number of passengers for the months January and February?
• Calculate the summation per year and assign the result to a vector.
• Plot the vector nicely (names on the axes, points and lines for the graph, a title for the graph, grid lines)
• Repeat the last two bullets for every month for all the years.

Tip: to transform a row of the data frame to a vector you can use unlist (e.g. unlist(air["Jan", ])).

## Acknowledgements

This tutorial, besides including the author's knowledge of R, is also derived from other material, more specifically: an introductory R leaflet from the Pattern Recognition course of the Electrical and Computer Engineering Department of the Aristotle University of Thessaloniki (Fall semester 2016, author: Themistoklis Diamantopoulos, professor: Andreas L. Symeonidis), the leaflet from the course CS545 Machine Learning (Fall 2008, author and professor: Charles W. Anderson) and the Software Carpentry lesson Programming in R, from which the inflammation data were also taken.