5 An Introduction to the R language

5.1 Background

R is a computer programming language and environment especially useful for graphic visualization and statistical analysis of data. It is an offshoot of a language developed in 1976 at Bell Laboratories called S. R is an interpreted language, meaning that every time code is run it must be translated to machine language by the R interpreter, as opposed to being compiled prior to running. R is the premier computational platform for statistical analysis thanks to its GNU open-source status and countless packages contributed by diverse members of the scientific community.

5.2 Why use `R`?

Good general scripting tool for statistics and mathematics
Powerful and flexible and free
Runs on all computer platforms
New packages released all the time
Superb data management & graphics capabilities
Reproducibility - can keep your scripts to see exactly what was done
Can embed your R analyses in dynamic, polished files using R markdown
You can write your own functions
Lots of online help available
Can use a nice IDE such as RStudio

5.3 Important `R` terms and definitions

From Logan, M. 2010. *Biostatistical Design and Analysis Using R*

Operators are symbols in programming that have a specific meaning

5.4 Getting started with `R` via the RStudio Environment

To begin working with R, open RStudio. You should first see something that looks like this: Alt text

To open a new script editor (where you will keep track of your code and notes), go to File > New File > R Script. Note that there are other options for file types, which we will be using in the future. For now, though, we want a plain script, which when saved will have the extention .R.

It is easy to run code directly from the script editor. For single lines of code, simply make sure your cursor is on that line, and hit Ctrl-Enter. For multiple lines, highlight the block of code you want to run and hit Ctrl-Enter.

Now your display should look somehting like below (but without the red pane labels, of course): Alt text

Note that you can also type commands directly from the command line using the R Console (lower left pane), and the R interpreter will run them when you press Enter.

Any objects you define, and a summary of their values, will appear in the upper right pane, and the lower right pane differs in appearance depending on instructions you provide to R Studio. For instance, if you produce a plot, it will appear there by default. Another extremely important feature of R functions (we’ll get to them in a bit) is the help file. Recall from Chapter 5 our discussion of man pages for UNIX programs. Help files the equivalent for R functions. They contain almost everything you need to know about a given function, and most of them even include and example at the bottom. These help files will appear in the lower right RStudio pane when you call them, for example when you run help(function_name) at the R Console.

5.4.1 R Programming Basics

For the code examples below, it might be useful for you to start your own RStudio session, open a new .R file and type/run code while reading.

Commands can be submitted through the terminal, console or scripts
In your scripts, anything that follows # symbol (aka hash) is just for humans
Notice on these slides I’m evaluating the code chunks and showing output
The output is shown here after the two # symbols and the number of output items is in []
Also notice that R follows the normal priority of mathematical evaluation

4*4

[1] 16

(4+3*2^2)

[1] 16

5.4.1.1 A note on `R` Markdown

This format provides a much better way to embed code and output, in an easily readable, reproducible manner. We will dive into R Markdown next week, so for now just be aware that it exists.

http://kbroman.org/knitr_knutshell/pages/Rmarkdown.html
You can insert R chunks into Rmarkdown documents

5.4.1.2 Assigning Variables

To “store” information for later use, like the arithmetic operation above, we can assign variables in R.
Variables are assigned values using the <- operator.
Variable names must begin with a letter, and should not contain spaces or R operators (see above) but other than that, just about anything goes. It is good practice to avoid periods in variable names, as they have other functionality in related programming languages, like Python.
Do keep in mind that R is case sensitive.

x <- 2
x * 3

[1] 6

y <- x * 3
y - 2

[1] 4

These do not work

3y <- 3
3*y <- 3

5.4.1.3 Arithmetic operations with functions

Arithmetic operations can be used with functions as well as numbers.
Try the following, and then your own.

x+2
x^2
log(x) + log(x+1)

Note that the last of these - log() - is a built in function of R, and therefore the argument for the function (in this case “x” or “x+1”) needs to be put in parentheses.
These parentheses will be important, and we’ll come back to them later when we add other arguments after the object in the parentheses.
The outcome of calculations can be assigned to new variables as well, and the results can be checked using the print() function.

y <- 67
print(y)

[1] 67

x <- 124
z <- (x*y)^2
print(z)

[1] 69022864

5.4.1.4 Strings

Assignments and operations can be performed on characters as well.
Note that characters need to be set off by quotation marks to differentiate them from numeric objects.
The c(function) stands for ‘concatenate’.
Note that we are using the same variable names as we did previously, which means that we’re overwriting our previous assignment.
A good general rule is to use new names for each variable, and make them short but still descriptive

x <- "I Love"
print (x)

[1] "I Love"

y <- "Biostatistics"
print (y)

[1] "Biostatistics"

z <- c(x,y)
print (z)

[1] "I Love"        "Biostatistics"

The variable z is now a vector of character objects.

5.4.1.5 Factors

Sometimes we would like to treat character objects as if they were units for subsequent calculations.
These are called factors, and we can redefine our character object as one of class factor.
This might seem a bit strange, but it’s important for statistical analyses where we might want to calculate the mean or variance for two different treatments. In that case the two different treatments would be coded as two different “levels” of a factor we designate in our metadata. This will become clear when we get into hypothesis testing in R.

z_factor <- as.factor(z)
print(z_factor)
class(z_factor)

Note that factor levels are reported alphabetically. I used the class() function to ask R what type of object “z_factor” is. class() is one of the most important tools at your disposal. Often times you can debug your code simply by changing the class of an object. Because functions are written to work with specific classes, changing the class of a given object is crucial in many cases.

5.4.1.6 Null values

R (and many programming languages) have special strings that mean ‘no value’, or ‘null’. In R, the most common is NA, although there are others as well (NULL and NaN)
Typically, NA is used to indicate a lack of data for a given observation, or a missing value where there normally should be one.
Any instance of a blank entry in your data file will be read into R as an NA.
NA is a technically a logical data type, and is not equivalent to an empty string or the numeric 0. It is also a reserved word and can’t be used as a variable name.

class(NA)

[1] "logical"

Many functions in R (e.g. mean()) will not work by default if passed any NA values as an argument. So if we want to determine the mean of a vector of numeric values, we need to ensure there are either no NA values in the vector, or specify an additional ‘argument’ to the function telling it to ignore NA. Additionally NA, like other ‘null’ values, are operated on by a number of unique functions in R.

num <- c(0,1,2,NA,4)
mean(num)

[1] NA

mean(num, na.rm = TRUE)

[1] 1.75

is.na(num)

[1] FALSE FALSE FALSE  TRUE FALSE

5.4.1.7 Vectors

In general R thinks in terms of vectors (a list of characters factors or numerical values) and it will benefit any R user to try to write programs with that in mind.
The simplest vectors in R are ‘atomic’ vectors, meaning that they consist of only one data type.
R operations, and therefore functions, are vectorized.
This means an operation or function will be performed for each element in a vector.
Vectors can be assigned directly using the ‘c()’ function and then entering the exact values.

x <- c(2,3,4,2,1,2,4,5,10,8,9)
print(x)

 [1]  2  3  4  2  1  2  4  5 10  8  9

x_plus <- x+1
print(x_plus)

 [1]  3  4  5  3  2  3  5  6 11  9 10

Creating vectors of new data by entering it by hand can be a drag.
However, it is also very easy to use functions such as seq() and sample().
Try the examples below. Can you figure out what the three arguments in the parentheses mean?
Within reason, try varying the arguments to see what happens

seq_1 <- seq(0.0, 10.0, by = 0.1)
print(seq_1)

  [1]  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3  1.4
 [16]  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7  2.8  2.9
 [31]  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1  4.2  4.3  4.4
 [46]  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9
 [61]  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4
 [76]  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9
 [91]  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9 10.0

seq_2 <- seq(10.0, 0.0, by = -0.1)
print(seq_2)

  [1] 10.0  9.9  9.8  9.7  9.6  9.5  9.4  9.3  9.2  9.1  9.0  8.9  8.8  8.7  8.6
 [16]  8.5  8.4  8.3  8.2  8.1  8.0  7.9  7.8  7.7  7.6  7.5  7.4  7.3  7.2  7.1
 [31]  7.0  6.9  6.8  6.7  6.6  6.5  6.4  6.3  6.2  6.1  6.0  5.9  5.8  5.7  5.6
 [46]  5.5  5.4  5.3  5.2  5.1  5.0  4.9  4.8  4.7  4.6  4.5  4.4  4.3  4.2  4.1
 [61]  4.0  3.9  3.8  3.7  3.6  3.5  3.4  3.3  3.2  3.1  3.0  2.9  2.8  2.7  2.6
 [76]  2.5  2.4  2.3  2.2  2.1  2.0  1.9  1.8  1.7  1.6  1.5  1.4  1.3  1.2  1.1
 [91]  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3  0.2  0.1  0.0

seq_square <- (seq_2)*(seq_2)
print(seq_square)

  [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
 [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
 [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
 [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
 [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
 [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
 [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
 [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
 [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
 [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
[101]   0.00

seq_square_new <- (seq_2)^2
print(seq_square_new)

  [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
 [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
 [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
 [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
 [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
 [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
 [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
 [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
 [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
 [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
[101]   0.00

Here is a way to create your own data sets that are random samples.
Again, on your own, play around with the arguments in the parentheses to see what happens.

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
plot(x,y)

You’ve probably figured out that “y” from the last example is a draw of numbers with equal probability (what we call a flat, or uniform distribution).
What if you want to draw from a defined probability distribution, like the normal distribution?
Again, play around with the arguments in the parentheses to see what happens.

x <-rnorm(100, 0, 100)
print (x)

  [1]  -69.152544   96.145976   56.104805  -58.674296  -65.802408  143.361974
  [7]  196.461792   39.712626   53.335655  110.085311  -10.930155  -22.449469
 [13]  191.115620   41.084773  -77.531952   18.914509  -25.713070 -113.668051
 [19]  -19.125146  104.130430    7.942500   99.681345   14.731046  -28.472885
 [25]  119.213117  145.053291  -19.392547 -146.421620    5.430961  -63.914529
 [31]   21.815178   41.631799  155.791378   -8.433466 -107.611430  244.806042
 [37]  -13.875018  132.652141   28.746530  -21.163993    3.171919  137.423183
 [43]  -83.661419  -36.391923  -43.017289 -134.112450   23.113455   70.923486
 [49] -218.833128   88.060395   23.631610  -61.146499  -89.732732  -25.999610
 [55]  -50.176918  -53.760480   81.046838 -150.460702   70.544618  146.220913
 [61]  -87.257098    5.596045  206.623867  -55.348222  -89.462636   68.832957
 [67] -251.692491    5.364223 -138.681812   83.987936  126.391156   21.335142
 [73]   84.681852  -88.451959 -115.129575   12.903592 -128.683746   11.379143
 [79]  -57.038187   31.974334  -43.430315  155.461792   19.436525   88.920168
 [85]  -19.305196   11.366543  -40.936774   37.594176   94.811145   18.084675
 [91]   67.008880  -57.332647    9.749443 -212.214776   20.286108   35.831822
 [97] -150.734772   -3.847387  101.847398 -129.478030

hist(x, xlim = c(-50,50))

hist(x, xlim = c(-500,500))

Can you figure out what the three rnorm() arguments represent?

5.4.1.8 Basic Summary Statistics

We will get into the details regarding summary statistics later, but for now, check out several of the R functions that calculate them.

mean(x)
median(x)
var(x)
log(x)
ln(x)
sqrt(x)
sum(x)
length(x)
sample(x, replace = T)

Notice that the last function (sample) has an argument (replace=T)
Arguments simply modify or direct the function in some way
There are many arguments for each function, some of which are defaults

5.4.1.9 Getting help to understand functions

Getting help on any function is very easy - just type a question mark and the name of the function.
There are functions for just about anything within R and it is easy enough to write your own functions if none already exist to do what you want to do.
In general, function calls have a simple structure: a function name, a set of parentheses and an optional set of arguments you assign parameters to and send to the function.
Help pages exist for all functions that, at a minimum, explain what parameters exist for the function.
Help can be accessed a few ways - try them :

- help(mean)
- ?mean
- example(mean)
- help.search("mean")
- apropos("mean")
- args(mean)

5.5 Additional learning resources:

Logan, M. 2010. Biostatistical Design and Analysis Using R. - A great intro to R for statistical analysis
http://library.open.oregonstate.edu/computationalbiology/ - O’Neil, S.T. 2017. A Primer for Computational Biology