5 R and RStudio

Figure 5.1: The RStudio integrated development environment

5.1 What is R?

R is a programming language and environment for statistical analysis and data visualization. It is an offshoot of S, a language developed at Bell Laboratories in 1976. R is an interpreted language: the R interpreter evaluates code as it runs, rather than requiring a separate compilation step beforehand. R is among the most widely used platforms for statistical analysis, thanks to its free, open-source license (it is part of the GNU project) and the many packages contributed by members of the scientific community.

5.2 Why R?

R is a programming language designed specifically for statistical computing and graphics. Created in the early 1990s as an open-source implementation of the S language, R has become the lingua franca of statistical analysis in academia and is widely used in industry as well.

Several features make R particularly well-suited for data analysis. It provides an extensive collection of statistical and graphical techniques built into the language. It is powerful, flexible, and completely free. It runs on Windows, Mac, and Linux, so most code runs unchanged across platforms. New capabilities are constantly being added through packages contributed by the community, with thousands of packages available for specialized analyses.

R excels at reproducibility. You can keep your scripts to document exactly what analyses you performed. Unlike point-and-click software where actions leave no trace, R code provides a complete record of your analytical workflow. This record can be shared with collaborators, included in publications, and revisited years later when you need to remember how you produced a particular result.

You can write your own functions in R, extending the language to meet your specific needs. Extensive online help and active user communities mean that answers to most questions are a web search away. The RStudio integrated development environment makes working with R much more pleasant, especially for newcomers. And with tools like R Markdown and Quarto, you can embed your analyses in polished documents, presentations, websites, and books—this book itself was created with these tools.

5.3 Installing R and RStudio

R must be installed before RStudio. Download R from https://www.r-project.org, selecting the version appropriate for your operating system. Follow the installation instructions for your platform.

RStudio is an integrated development environment (IDE) that makes working with R much easier. Download the free RStudio Desktop from https://www.rstudio.com. RStudio provides a console for running R commands, an editor for writing scripts, tools for viewing plots and data, and integration with version control systems.

After installing both programs, launch RStudio. You will see a window divided into panes, each serving a different purpose. The console pane is where R commands are executed. The source pane is where you edit scripts and documents. The environment pane shows what objects currently exist in your R session. The files/plots/packages/help pane provides access to various utilities.

5.4 R Basics

R evaluates expressions and returns results. You can use it as a calculator by typing arithmetic expressions at the console.

Code

4 * 4

[1] 16

Code

(4 + 3 * 2^2)

[1] 16

Notice that R follows standard mathematical order of operations: exponentiation before multiplication and division, which come before addition and subtraction. Parentheses can override this ordering.
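As a small illustration of how grouping changes a result, the sketch below evaluates the same numbers with and without parentheses. The variable names are chosen only for this example.

```r
# Same numbers, different groupings
a <- 4 + 3 * 2^2     # exponent first, then multiplication, then addition
b <- (4 + 3) * 2^2   # parentheses force the addition first
c_val <- -2^2        # exponentiation binds tighter than the unary minus
d <- (-2)^2          # parentheses make the base negative

a
b
c_val
d
```

The third line is a common surprise: R reads -2^2 as -(2^2), so wrap a negative base in parentheses when that is what you mean.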

5.5 Variables and Assignment

More useful than evaluating isolated expressions is storing values in variables for later use. Variables are assigned using the <- operator (a less-than sign followed by a hyphen).

Code

x <- 2
x * 3

[1] 6

Code

y <- x * 3
y - 2

[1] 4

Variable names must begin with a letter (or a period that is not followed by a digit); after the first character they can contain letters, numbers, periods, and underscores. R is case-sensitive, so myVariable, MyVariable, and myvariable are three different names. Choose descriptive names that make your code readable. It is good practice to avoid periods in variable names, because periods have other meanings in related programming languages such as Python, where they access attributes and methods.

Invalid Variable Names

Variable names cannot begin with numbers or contain operators. The following will produce errors:

Code

3y <- 3    # cannot start with a number
3*y <- 3   # cannot include operators
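If you are unsure whether a name is valid, base R's make.names() shows how R would repair it. The sketch below also demonstrates case sensitivity; the variable names are illustrative only.

```r
# Case matters: these are two separate variables
sample_mass <- 1.5
Sample_mass <- 2.5
sample_mass == Sample_mass

# make.names() converts arbitrary strings into syntactically valid names
fixed <- make.names(c("3y", "my var", "ok_name"))
fixed
```

Functions such as read.csv() apply the same repair to column headers by default, which is why a column labeled "my var" in a spreadsheet typically appears as my.var in R.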

Reserved Words

R has reserved words that cannot be used as variable names because they have special meaning in the language:

Reserved Words	Purpose
`if`, `else`	Conditional statements
`for`, `while`, `repeat`	Loops
`function`	Function definition
`in`, `next`, `break`	Loop control
`TRUE`, `FALSE`	Logical constants
`NULL`, `NA`, `NaN`, `Inf`	Special values

R also has semi-reserved names—built-in functions and constants that you can technically overwrite but should avoid:

Code

# These work but are dangerous:
T <- 5       # Overwrites TRUE abbreviation
c <- "text"  # Shadows the c() function
mean <- 42   # Shadows mean()

# If you accidentally overwrite something, remove it:
rm(c)        # Removes your variable; c again refers only to c()

Avoid Common Name Collisions

Never name variables T, F (abbreviations for TRUE/FALSE), c, t, mean, sum, data, or df. T and F are built-in constants and the rest are commonly used R functions; shadowing any of them leads to confusing errors.

Note that when you assign a value to a variable, R does not print anything. To see a variable’s value, type its name alone or use the print() function.

Code

z <- 100
z

[1] 100

Code

print(z)

[1] 100

5.6 Understanding R Objects

A fundamental principle of R is that everything is an object. Numbers, text, datasets, functions—all are stored as objects with specific properties. Understanding this helps you debug problems and write better code.

Every object has a class (which determines how functions treat it) and a type (its underlying storage mode). Use class() and typeof() to examine objects:

Code

# Numbers are objects
x <- 42
class(x)

[1] "numeric"

Code

typeof(x)

[1] "double"

Code

# Text strings are objects
name <- "Gene Expression"
class(name)

[1] "character"

Code

# Even functions are objects!
class(mean)

[1] "function"

The str() function (structure) provides a compact display of any object’s structure—it is one of the most useful diagnostic tools in R:

Code

# Examine a vector
str(c(1, 2, 3, 4, 5))

 num [1:5] 1 2 3 4 5

Code

# Examine a data frame
str(head(iris))

'data.frame':   6 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1

When functions produce errors or unexpected results, checking the class of your objects is often the first step toward understanding what went wrong.
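A typical case, sketched here with made-up values, is a column of numbers that was stored as text. Checking the class reveals the cause, and an explicit conversion fixes it.

```r
# Values that look numeric but were stored as text (hypothetical data)
measurement <- c("3.2", "4.1", "5.0")
class(measurement)

# Convert explicitly once the cause is clear
measurement_num <- as.numeric(measurement)
class(measurement_num)
mean(measurement_num)
```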

5.7 Functions

Functions are the workhorses of R. A function takes inputs (called arguments), performs some operation, and returns an output. R has many built-in functions, and packages provide thousands more.

Code

log(10)

[1] 2.302585

Code

sqrt(16)

[1] 4

Code

exp(1)

[1] 2.718282

Functions are called by typing their name followed by parentheses containing their arguments. Many functions accept multiple arguments, separated by commas. Arguments can be specified by position or by name.

Code

round(3.14159, digits = 2)

[1] 3.14

Code

round(3.14159, 2)  # same result, argument specified by position

[1] 3.14
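When arguments are named, their order no longer matters. The short sketch below uses log(), whose arguments are x and base, to show three equivalent calls; the variable names are only for illustration.

```r
# log() takes x and base; named arguments may appear in any order
by_position <- log(100, 10)
by_name     <- log(x = 100, base = 10)
reordered   <- log(base = 10, x = 100)
c(by_position, by_name, reordered)

# args() shows a function's argument names and defaults
args(round)
```

Naming arguments beyond the first one or two is usually worth the extra typing, since it makes the call readable without consulting the documentation.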

To learn about a function, use the help system. Type ?functionname or help(functionname) to open the documentation.

Code

?round
help(sqrt)

5.8 Vectors

The fundamental data structure in R is the vector, an ordered collection of values of the same type. You create vectors using the c() function (for concatenate or combine).

Code

numbers <- c(1, 2, 3, 4, 5)
numbers

[1] 1 2 3 4 5

Code

people <- c("Alice", "Bob", "Carol")  # avoid shadowing the names() function
people

[1] "Alice" "Bob"   "Carol"

Many operations in R are vectorized, meaning they operate on entire vectors at once rather than requiring you to loop through elements.

Code

numbers * 2

[1]  2  4  6  8 10

Code

numbers + 10

[1] 11 12 13 14 15

Code

numbers^2

[1]  1  4  9 16 25

You can access individual elements using square brackets with an index (R uses 1-based indexing, so the first element is at position 1).

Code

numbers[1]

[1] 1

Code

numbers[3]

[1] 3

Code

numbers[c(1, 3, 5)]

[1] 1 3 5

5.9 Creating Sequences

R provides convenient functions for creating regular sequences.

Code

1:10

 [1]  1  2  3  4  5  6  7  8  9 10

Code

seq(0, 10, by = 2)

[1]  0  2  4  6  8 10

Code

seq(0, 1, length.out = 5)

[1] 0.00 0.25 0.50 0.75 1.00

Code

rep(1, times = 5)

[1] 1 1 1 1 1

Code

rep(c(1, 2), times = 3)

[1] 1 2 1 2 1 2

5.10 Generating Random Numbers

R can generate random numbers from various probability distributions, which is invaluable for simulation and understanding statistical concepts.

Code

# Draw 1000 values from a normal distribution with mean 0 and SD 10
x <- rnorm(1000, mean = 0, sd = 10)
hist(x)

Figure 5.2: Histogram of 1000 random draws from a normal distribution with mean 0 and standard deviation 10

Code

# Draw from a binomial distribution: 1000 experiments, 20 trials each, p=0.5
heads <- rbinom(n = 1000, size = 20, prob = 0.5)
hist(heads)

Figure 5.3: Histogram of binomial distribution results from 1000 experiments of 20 coin flips each

The set.seed() function allows you to make random simulations reproducible by initializing the random number generator to a known state.

Code

set.seed(42)
rnorm(5)

[1]  1.3709584 -0.5646982  0.3631284  0.6328626  0.4042683

Code

set.seed(42)  # same seed produces same "random" numbers
rnorm(5)

[1]  1.3709584 -0.5646982  0.3631284  0.6328626  0.4042683

5.11 Data Frames

Data frames are R’s structure for tabular data—rows of observations and columns of variables. Each column can contain a different type of data (numeric, character, logical), but all values within a column must be the same type.

Code

# Create a data frame from vectors
hydrogel_concentration <- factor(c("low", "high", "high", "high", 
                                    "medium", "medium", "medium", "low"))
compression <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5)
conductivity <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)

mydata <- data.frame(hydrogel_concentration, compression, conductivity)
mydata

  hydrogel_concentration compression conductivity
1                    low         3.4          0.0
2                   high         3.4          9.2
3                   high         8.4          3.8
4                   high         3.0          5.0
5                 medium         5.6          5.6
6                 medium         8.1          4.1
7                 medium         8.3          7.1
8                    low         4.5          5.3

Access columns using the $ operator or square brackets.

Code

mydata$compression

[1] 3.4 3.4 8.4 3.0 5.6 8.1 8.3 4.5

Code

mydata[, 2]  # second column

[1] 3.4 3.4 8.4 3.0 5.6 8.1 8.3 4.5

Code

mydata[1, ]  # first row

  hydrogel_concentration compression conductivity
1                    low         3.4            0

Code

mydata[1, 2] # first row, second column

[1] 3.4
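The same bracket notation also accepts a logical condition in the row position, which is a common way to filter rows. The sketch below uses a small hypothetical data frame (gels) rather than mydata so that it runs on its own.

```r
# A small data frame with hypothetical measurements
gels <- data.frame(
  concentration = c("low", "high", "medium", "high"),
  compression   = c(3.4, 8.4, 5.6, 3.0)
)

# Keep only rows where compression exceeds 5
stiff <- gels[gels$compression > 5, ]
stiff

# Add a derived column with the $ operator
gels$compression_doubled <- gels$compression * 2
ncol(gels)
```

Note the comma after the condition: leaving the column position empty keeps every column.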

5.12 Reading and Writing Data

Real analyses typically begin by reading data from external files. R provides functions for various file formats.

Code

# Read comma-separated values
data <- read.csv("mydata.csv")

# Read tab-separated values
data <- read.table("mydata.txt", header = TRUE, sep = "\t")

# Read Excel files (requires readxl package)
library(readxl)
data <- read_excel("mydata.xlsx")

Similarly, you can write data to files.

Code

write.csv(mydata, "output.csv", row.names = FALSE)
write.table(mydata, "output.txt", sep = "\t", row.names = FALSE)
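One way to convince yourself that writing and reading are inverse operations is a round trip through a temporary file. This is a minimal sketch; the data frame and the use of tempfile() are assumptions made for the example.

```r
# Write a small data frame to a temporary CSV file and read it back
original <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))
csv_path <- tempfile(fileext = ".csv")

write.csv(original, csv_path, row.names = FALSE)
restored <- read.csv(csv_path)

str(restored)
all.equal(original, restored)
unlink(csv_path)  # remove the temporary file
```

Without row.names = FALSE, write.csv() adds a column of row names that comes back as an extra unnamed column when the file is read.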

5.13 Basic Plotting

R has extensive graphics capabilities. The base plot() function creates scatterplots and other basic visualizations.

Code

x <- 1:10
y <- x^2
plot(x, y,
     xlab = "X values",
     ylab = "X squared",
     main = "A Simple Plot",
     col = "blue",
     pch = 19)

Figure 5.4: A simple scatterplot showing the relationship between x and x squared

Histograms visualize the distribution of a single variable.

Code

data <- rnorm(1000)
hist(data, breaks = 30, col = "lightblue", main = "Normal Distribution")

Figure 5.5: Histogram of 1000 random samples from a standard normal distribution

Boxplots compare distributions across groups.

Code

boxplot(compression ~ hydrogel_concentration, data = mydata,
        xlab = "Concentration", ylab = "Compression")

Figure 5.6: Boxplot comparing compression values across hydrogel concentration levels

We will explore the more sophisticated ggplot2 package for graphics in a later chapter.

5.14 Scripts and Reproducibility

While you can type commands directly at the console, for anything beyond simple explorations you should write scripts—text files containing R commands that can be saved, edited, and rerun.

In RStudio, create a new script with File > New File > R Script. Type your commands in the script editor, and run them by placing your cursor on a line and pressing Ctrl+Enter (Cmd+Enter on Mac) or by selecting code and clicking Run.

Scripts should be self-contained, including all the commands needed to reproduce your analysis from start to finish. Begin scripts by loading required packages, then reading data, then performing analyses. Add comments (lines beginning with #) to explain what your code does and why.

Code

# Analysis of hydrogel mechanical properties
# Author: Your Name
# Date: 2025-04-01

# Load required packages
library(tidyverse)

# Read data
data <- read.csv("hydrogel_data.csv")

# Calculate summary statistics
summary(data)

# Create visualization
ggplot(data, aes(x = concentration, y = compression)) +
  geom_boxplot()

5.15 Getting Help

When you encounter problems, R provides several resources. The ? operator opens documentation for functions. The help.search() function searches the help system for topics. The example() function runs examples from a function’s documentation.

Code

?mean
help.search("regression")
example(plot)

Beyond R’s built-in help, the internet offers vast resources. Stack Overflow has answers to almost any R question you can imagine. Package vignettes provide tutorials for specific packages. The RStudio community forums are welcoming to beginners.

When asking for help online, provide a minimal reproducible example—the smallest piece of code that demonstrates your problem, including sample data. This makes it much easier for others to understand and solve your issue.
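A reproducible example might look like the sketch below. The data frame toy and the problem it illustrates are hypothetical; the point is the shape of the example, not the specific question.

```r
# Minimal reproducible example (hypothetical problem)
toy <- data.frame(group = c("a", "a", "b"), score = c(1, NA, 3))

# dput() prints code that recreates the object exactly, ready to paste
dput(toy)

# The smallest call that shows the unexpected behavior
result <- mean(toy$score)
result

# Report your R version alongside the example
R.version.string
```

Pasting the dput() output into your question lets others rebuild your data without needing your files.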

5.16 Data Types in R

R has several fundamental data types that you will work with frequently.

Character Strings

Assignments and operations can be performed on characters as well as numbers. Characters need to be set off by quotation marks to differentiate them from numeric objects or variable names.

Code

x <- "I Love"
print(x)

[1] "I Love"

Code

y <- "Biostatistics"
print(y)

[1] "Biostatistics"

Code

# Combine strings using c()
z <- c(x, y)
print(z)

[1] "I Love"        "Biostatistics"

The variable z is now a vector of character objects. Note that we are overwriting our previous numeric assignments—a good general rule is to use descriptive, unique names for each variable.

Factors

Sometimes we would like to treat character objects as if they were categorical units for subsequent calculations. These are called factors, and we can convert a character vector to factor class.

Code

z_factor <- as.factor(z)
print(z_factor)

[1] I Love        Biostatistics
Levels: Biostatistics I Love

Code

class(z_factor)

[1] "factor"

Note that factor levels are reported alphabetically. The class() function tells us what type of object we are working with—it is one of the most important diagnostic tools in R. Often you can debug your code simply by checking and changing the class of an object.

Factors are especially important for statistical analyses where we might want to calculate the mean or variance for different experimental treatments. In that case, the treatments would be coded as different levels of a factor.
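Because alphabetical order rarely matches an experimental design, it often helps to set the level order explicitly. The sketch below uses hypothetical dose labels.

```r
dose <- c("low", "high", "medium", "low", "high")

# Default: levels sorted alphabetically
levels(factor(dose))

# Explicit order that matches the experimental design
dose_f <- factor(dose, levels = c("low", "medium", "high"))
levels(dose_f)
table(dose_f)
```

The level order controls how groups are arranged in tables and plots, and which level is treated as the reference in many models.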

Missing Values (NA)

R uses special values to represent missing or undefined data. The most common is NA, which stands for “Not Available.”

Code

class(NA)

[1] "logical"

NA is a logical data type and is distinct from the character string “NA”, the numeric 0, or an empty string. It is also a reserved word and cannot be used as a variable name.

Blank entries in numeric columns of a data file are read into R as NA. Blank entries in text columns are read as empty strings unless you pass na.strings = "" to read.csv(). Many functions in R return NA by default if passed any NA values:

Code

num <- c(0, 1, 2, NA, 4)
mean(num)

[1] NA

Code

# Use na.rm = TRUE to ignore missing values
mean(num, na.rm = TRUE)

[1] 1.75

Code

# Check for missing values
is.na(num)

[1] FALSE FALSE FALSE  TRUE FALSE
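Because is.na() returns a logical vector, it combines naturally with sum(), which() and bracket indexing. A brief sketch using the same values:

```r
num <- c(0, 1, 2, NA, 4)

n_missing <- sum(is.na(num))   # number of missing values
position <- which(is.na(num))  # where they are
complete <- num[!is.na(num)]   # the vector without them

n_missing
position
complete
```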

Floating-Point Precision

A common source of confusion involves floating-point arithmetic. Computers represent decimal numbers with limited precision, which can lead to unexpected results:

Code

# This seems wrong, but is due to how computers store decimals
0.1 + 0.2 == 0.3

[1] FALSE

Code

# The actual values differ slightly
print(0.1 + 0.2, digits = 20)

[1] 0.30000000000000004441

Code

print(0.3, digits = 20)

[1] 0.2999999999999999889

Never use == to compare floating-point numbers directly. Instead, use all.equal() which checks if values are “nearly equal” within a small tolerance:

Code

# Safe comparison for floating-point numbers
all.equal(0.1 + 0.2, 0.3)

[1] TRUE

Code

# Use isTRUE() if you need a logical result
isTRUE(all.equal(0.1 + 0.2, 0.3))

[1] TRUE

The tidyverse provides dplyr::near() as a convenient alternative, especially when filtering data frames:

Code

# Works well in filter operations
library(dplyr)
data |> filter(near(value, target_value))

Floating-Point Comparisons

Always use all.equal() or near() instead of == when comparing decimal calculations. This is a common source of bugs in data analysis code.
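The same issue appears with results that should be exact in principle, such as squaring a square root. The sketch below compares the three approaches in base R; the tolerance 1e-8 is an arbitrary choice for illustration.

```r
a <- sqrt(2)^2
b <- 2

exact <- a == b                      # exact comparison
tolerant <- isTRUE(all.equal(a, b))  # tolerance-based comparison
by_hand <- abs(a - b) < 1e-8         # the same idea written out

exact
tolerant
by_hand
```

Writing the tolerance by hand makes the threshold explicit, which can be useful when the default used by all.equal() does not suit the scale of your data.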

5.17 More on Vectors

Indexing Vectors

Isolating specific elements from vectors is called indexing. R uses 1-based indexing with square brackets [].

Code

x <- c(10, 20, 30, 40, 50, 100, 200)

# First element
x[1]

[1] 10

Code

# Third element
x[3]

[1] 30

Code

# Series of consecutive elements
x[1:4]

[1] 10 20 30 40

Code

# Last four elements
x[4:7]

[1]  40  50 100 200

Code

# Non-consecutive elements using c()
x[c(1:3, 5)]

[1] 10 20 30 50

Code

# All elements EXCEPT the first two
x[-c(1:2)]

[1]  30  40  50 100 200
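Square brackets also accept a logical vector, which selects the elements where the condition is TRUE. This sketch reuses the same x:

```r
x <- c(10, 20, 30, 40, 50, 100, 200)

x > 40                         # one TRUE/FALSE per element
large <- x[x > 40]             # elements greater than 40
middle <- x[x > 10 & x < 100]  # combine conditions with &
positions <- which(x > 40)     # positions rather than values

large
middle
positions
```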

Useful Functions for Vectors

Functions that provide information about vectors:

head(): returns the first elements of an object
tail(): returns the last elements of an object
length(): returns the number of elements in a vector
class(): returns the class of elements in a vector

Functions that modify or generate vectors:

sort(): returns a sorted vector
seq(): creates a sequence of values
rep(): repeats values

Code

rep(1, 5)

[1] 1 1 1 1 1

Code

rep("treatment", 5)

[1] "treatment" "treatment" "treatment" "treatment" "treatment"

Functions for random sampling:

sample(): randomly selects elements from a vector
rnorm(): draws values from a normal distribution
rbinom(): draws values from a binomial distribution
set.seed(): sets the random number generator seed for reproducibility

Functions to change data types:

as.numeric(): converts to numeric class
as.factor(): converts to factor class
as.character(): converts to character class

5.18 Lists

Lists in R are collections of objects that may differ in type and length.

Code

vec1 <- c(10, 20, 30, 40, 50, 100, 200)
vec2 <- c("happy", "sad", "grumpy")
vec3 <- factor(c("high", "low"))

mylist <- list(vec1, vec2, vec3)
print(mylist)

[[1]]
[1]  10  20  30  40  50 100 200

[[2]]
[1] "happy"  "sad"    "grumpy"

[[3]]
[1] high low 
Levels: high low

Code

class(mylist)

[1] "list"

Code

str(mylist)

List of 3
 $ : num [1:7] 10 20 30 40 50 100 200
 $ : chr [1:3] "happy" "sad" "grumpy"
 $ : Factor w/ 2 levels "high","low": 1 2

Elements of lists are indexed with double square brackets [[]]. To access the second element of mylist:

Code

mylist[[2]]

[1] "happy"  "sad"    "grumpy"

Code

# The second item of the second element
mylist[[2]][2]

[1] "sad"
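List elements can also be given names, which makes them accessible with the $ operator. The sketch below uses a hypothetical list called experiment and contrasts single and double brackets.

```r
experiment <- list(
  values = c(10, 20, 30),
  moods  = c("happy", "sad", "grumpy")
)

experiment$moods               # access by name
experiment[["values"]]         # the same idea with double brackets
class(experiment["values"])    # single brackets return a list
class(experiment[["values"]])  # double brackets return the element itself
```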

The str() function (for “structure”) is extremely useful for understanding complex R objects.

5.19 Matrices

Matrices in R are two-dimensional arrays where all elements must be the same type. They are indexed by [row, column].

Code

# Create a 3x3 matrix
matrix(1:9, nrow = 3, ncol = 3)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Useful matrix functions include:

dim(): returns the dimensions (rows and columns)
t(): transposes a matrix (swaps rows and columns)
cbind(): combines columns
rbind(): combines rows
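The following sketch exercises these functions on a small matrix; the object names are chosen only for this example.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)  # filled column by column
m

m[2, 3]      # row 2, column 3
dim(m)
dim(t(m))    # transposing swaps the dimensions

wider  <- cbind(m, c(7, 8))     # add a column
taller <- rbind(m, c(7, 8, 9))  # add a row
dim(wider)
dim(taller)
```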

5.20 Installing and Using Packages

Base R includes many useful functions, but the real power comes from packages—collections of functions contributed by the community. Packages are distributed via the Comprehensive R Archive Network (CRAN).

Code

# Install a package (only need to do once)
install.packages("name_of_package")

# Check if a package is installed (returns TRUE or FALSE)
"name_of_package" %in% rownames(installed.packages())

# Load package for use (needed each session)
library(name_of_package)

Note that install.packages() requires the package name in quotation marks, while library() does not.
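A common pattern in scripts is to install a package only when it is missing. The sketch below is one way to write that; it uses the stats package as a stand-in because it is always available, so substitute the package you actually need.

```r
pkg <- "stats"  # placeholder name; replace with the package you need

# requireNamespace() returns FALSE instead of stopping when a package is absent
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that pkg holds the name as a string
library(pkg, character.only = TRUE)
```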

Namespace Conflicts

When you load multiple packages, function names can collide. If two packages define a function with the same name, the most recently loaded package “wins,” and its version masks the earlier one. R warns you when this happens:

Code

library(dplyr)
# Attaching package: 'dplyr'
# The following objects are masked from 'package:stats':
#     filter, lag

This message indicates that dplyr’s filter() and lag() functions are now masking the functions of the same names in the stats package, which ships with base R. If you need the masked version, use the package prefix:

Code

# Use dplyr's filter (now the default after loading dplyr)
data |> filter(x > 5)

# Explicitly use the stats package's filter (a 3-point moving average of x)
stats::filter(x, filter = rep(1/3, 3), method = "convolution")

# You can use the prefix even without loading a package
stringr::str_detect(text, "pattern")

Common conflicts occur between:

dplyr::filter() and stats::filter()
dplyr::lag() and stats::lag()
dplyr::select() and MASS::select()

Avoiding Conflicts

The :: notation explicitly specifies which package’s function to use. When writing scripts, it is good practice to use package::function() for functions that commonly conflict, making your code’s behavior explicit and predictable.
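As a runnable sketch of the prefix in action, the example below calls stats::filter() to compute a three-point moving average. The input vector is made up, and the call works the same whether or not dplyr is loaded.

```r
x <- c(1, 2, 3, 4, 5)

# Each output value is the average of a point and its two neighbors
smoothed <- stats::filter(x, filter = rep(1/3, 3), method = "convolution")
as.numeric(smoothed)
```

The first and last values are NA because those points lack a neighbor on one side.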

5.21 The Split-Apply-Combine Approach

A common pattern in data analysis is to split data by groups, apply a function to each group, and combine the results. R provides several functions for this workflow.

The replicate() Function

Repeats an expression multiple times and collects the results:

Code

# Shuffle integers 1-10 five times
replicate(5, sample(1:10, size = 10, replace = FALSE))

      [,1] [,2] [,3] [,4] [,5]
 [1,]    3    9    9    3    5
 [2,]    1    2   10    8    4
 [3,]    8    3    3    6    9
 [4,]    9    6    4    9    1
 [5,]   10    5    2    4   10
 [6,]    7    4    1    7    7
 [7,]    4    1    5    5    6
 [8,]    5   10    8   10    2
 [9,]    6    8    6    2    8
[10,]    2    7    7    1    3

The apply() Family

The apply() function applies a function to rows or columns of a matrix or data frame:

Code

# Create sample matrix
m <- matrix(1:12, nrow = 3, ncol = 4)
m

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Code

# Sum across rows (MARGIN = 1)
apply(m, 1, sum)

[1] 22 26 30

Code

# Sum across columns (MARGIN = 2)
apply(m, 2, sum)

[1]  6 15 24 33
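Other members of the family work on lists and vectors rather than matrices. The sketch below shows lapply(), which always returns a list, and sapply(), which simplifies the result to a vector when it can; the list samples is hypothetical.

```r
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

lapply(samples, mean)             # returns a list
means <- sapply(samples, mean)    # simplifies to a named vector
sizes <- sapply(samples, length)

means
sizes
```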

The tapply() Function

Applies a function to subsets of a vector, grouped by a factor:

Code

# Find maximum petal length for each species
tapply(iris$Petal.Length, iris$Species, max)

    setosa versicolor  virginica 
       1.9        5.1        6.9

The aggregate() Function

Summarizes multiple variables by groups:

Code

# Mean of each variable by species
aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

5.22 Conditional Statements with ifelse()

The ifelse() function provides vectorized conditional logic. The first argument is a logical test, the second is the value if TRUE, and the third is the value if FALSE.

Code

# Create a character vector
treatment <- c(rep("treatment", 5), rep("control", 3),
               rep("treatment", 4), rep("control", 6))

# Assign colors based on treatment
colors <- ifelse(treatment == "treatment", "red", "blue")
print(colors)

 [1] "red"  "red"  "red"  "red"  "red"  "blue" "blue" "blue" "red"  "red" 
[11] "red"  "red"  "blue" "blue" "blue" "blue" "blue" "blue"
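Calls to ifelse() can be nested to produce more than two categories, and missing values pass through as NA. The thresholds and values in this sketch are arbitrary.

```r
expression_level <- c(0.5, 2.1, 7.8, NA, 4.0)

category <- ifelse(expression_level > 5, "high",
                   ifelse(expression_level > 1, "medium", "low"))
category
```

For more than three or four categories, nested calls become hard to read, and functions such as cut() are usually a better fit.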

5.23 For Loops

For loops iterate through a sequence, executing code for each value. However, R is vectorized, so many operations that would require loops in other languages can be done more efficiently without them.

When loops are necessary, pre-allocate output objects for better performance:

Code

# Pre-allocate a numeric vector
results <- numeric(5)

for (i in 1:5) {
  results[i] <- i^2
}
results

[1]  1  4  9 16 25

Avoiding Loops

Before writing a loop, consider whether the task can be accomplished with vectorized operations or the apply family of functions. These approaches are often faster and more readable.
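To make the comparison concrete, the sketch below computes the same squares three ways: with a loop, with a vectorized expression, and with sapply(). The variable names are illustrative.

```r
values <- 1:5

# Loop with a pre-allocated result
loop_result <- numeric(length(values))
for (i in seq_along(values)) {
  loop_result[i] <- values[i]^2
}

vectorized <- values^2                      # one expression, no loop
applied <- sapply(values, function(v) v^2)  # apply-family version

all.equal(loop_result, vectorized)
all.equal(loop_result, applied)
```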

5.24 More on Plotting

Customizing Plots with par()

Many plotting parameters are controlled by the par() function. Understanding par() dramatically increases your plotting capabilities.

Code

# Create multiple panels
par(mfrow = c(1, 2))  # 1 row, 2 columns

seq_1 <- seq(0, 10, by = 0.1)
seq_2 <- seq(10, 0, by = -0.1)

plot(seq_1, xlab = "Index", ylab = "Value", type = "p", col = "red",
     main = "Increasing Sequence")
plot(seq_2, xlab = "Index", ylab = "Value", type = "l", col = "blue",
     main = "Decreasing Sequence")

Figure 5.7: Multiple plot panels showing increasing (points) and decreasing (lines) sequences
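Settings made with par() persist for the rest of the session, so later plots keep the two-panel layout until you change it. A common idiom, sketched below, saves the previous settings and restores them afterwards.

```r
old_par <- par(mfrow = c(1, 2))  # par() returns the previous settings

plot(1:10, main = "Left panel")
plot(10:1, main = "Right panel")

par(old_par)                     # restore the single-panel layout
par("mfrow")
```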

Vectorized Graphical Parameters

Graphical parameters like col, pch (point character), and cex (character expansion) are vectorized:

Code

seq_1 <- seq(0, 10, by = 0.1)
seq_2 <- seq(10, 0, by = -0.1)

# First 10 points blue, rest red
colors <- c(rep("blue", 10), rep("red", 91))

plot(seq_1, seq_2, xlab = "Sequence 1", ylab = "Sequence 2",
     col = colors, pch = 19,
     main = "Two-Color Scatterplot")

Figure 5.8: Scatterplot demonstrating vectorized graphical parameters with two colors

Useful Plotting Arguments

Key arguments for plot() and related functions:

main: plot title
xlab, ylab: axis labels
xlim, ylim: axis limits
col: color
pch: point character (0-25)
cex: character/point size multiplier
lwd: line width
type: “p” for points, “l” for lines, “b” for both

5.25 Introduction to R Markdown

R Markdown combines R code with formatted text to create reproducible documents. Files have the .Rmd extension and can be rendered (“knitted”) to HTML, PDF, or Word.

Getting Started

Install the rmarkdown package, then in RStudio: File → New File → R Markdown.

Basic Formatting

## Section Header
### Subsection Header

Text can be *italicized* or **bolded** or ***both***.

Links: [Link Text](https://example.com)

Code Chunks

R code is placed in code chunks delimited by three backticks:

```{r}
seq(1, 10, 1)
```

Chunk options control whether code is evaluated (eval), displayed (echo), and more:

```{r, eval = TRUE, echo = TRUE}
seq(1, 10, 1)
```

Knitting

Click the “Knit” button in RStudio to render your document. Start with HTML output, which has the fewest dependencies.

Learning More

For comprehensive R Markdown documentation, see the R Markdown introduction and R Markdown cheat sheet.

5.26 Practice Exercises

Exercise R.1: Exploring RStudio

Take a few minutes to familiarize yourself with the RStudio environment:

Locate the four main panes:
- The code editor (top left)
- The workspace and history (top right)
- The plots and files window (bottom right)
- The R console (bottom left)
In the plots and files window, click on the Packages and Help tabs to see what they offer
See what types of new files can be made in RStudio by clicking File → New File
Open a new R script and a new R Markdown file to see the difference

Exercise R.2: Basic Mathematics in R

Insert a code chunk and complete the following tasks:

Add and subtract numbers
Multiply and divide numbers
Raise a number to a power using the ^ symbol
Create a more complex equation involving all of these operations to convince yourself that R follows the normal priority of mathematical evaluation (PEMDAS)

Code

# Example:
(4 + 3 * 2^2) / 5 - 1

Exercise R.3: Assigning Variables and Functions

Assign three variables using basic mathematical operations
Take the log of your three variables using log()
Use the print() function to display your most complex variable
Use the c() (concatenate) function combined with paste() to create and print a sentence

Code

# Example:
x <- 10
y <- x * 2
z <- sqrt(x + y)
print(paste("The value of z is", z))

Exercise R.4: Vectors and Factors

Create a numeric vector using the c() function with at least 5 elements
Create a character vector and convert it to a factor using as.factor()

Code

# Example:
vec1 <- c("control", "treatment", "control", "treatment", "control")
fac1 <- as.factor(vec1)
print(fac1)

[1] control   treatment control   treatment control  
Levels: control treatment

Code

levels(fac1)

[1] "control"   "treatment"

Use str() and class() to evaluate your variables
What is the difference between a character vector and a factor?

Exercise R.5: Basic Statistics

Create a numeric vector with at least 10 elements
Calculate the mean(), sd(), sum(), length(), and var() of your vector
Use the log() and sqrt() functions on your vector
What happens when you try to apply mean() to a factor? Try it and explain the result

Code

# Example:
my_vector <- c(12, 15, 18, 22, 25, 28, 31, 35, 38, 42)
mean(my_vector)
sd(my_vector)

Exercise R.6: Creating Sequences and Random Sampling

Set the random seed for reproducibility, then:

Code

set.seed(42)

Create a vector with 100 elements using seq() and calculate the mean and standard deviation
Create a variable and sample() it with equal probability—experiment with the size and replace arguments
Create a normally distributed variable of 10000 elements using rnorm(), then sample that distribution with and without replacement
Use hist() to plot your normally distributed variable

Exercise R.7: Basic Visualization

Create visualizations with proper axis labels and colors:

Create a sequence variable using seq() and make two different plots by changing the type argument ("p" for points, "l" for lines, "b" for both)
Create a normally distributed variable using rnorm() and make histograms with different breaks values—what does breaks control?
Use par(mfrow = c(2, 2)) to create a 2×2 grid of plots

Code

par(mfrow = c(2, 2))
x <- seq(1, 100, by = 1)
plot(x, type = "p", main = "Points", col = "blue")
plot(x, type = "l", main = "Lines", col = "red")
y <- rnorm(1000)
hist(y, breaks = 10, main = "10 Breaks", col = "lightblue")
hist(y, breaks = 50, main = "50 Breaks", col = "lightgreen")

Exercise R.8: Creating Data Frames

Create a data frame with at least three columns: one character/factor, one numeric, and one logical
Assign row names to your data frame using rownames()
Examine your data frame structure using str()
Calculate the mean of each numeric variable
Use head() and tail() to view portions of your data frame

Code

# Example:
treatment <- c("control", "low", "medium", "high", "control", "low")
response <- c(12.3, 15.6, 18.9, 24.2, 11.8, 16.1)
significant <- c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
my_data <- data.frame(treatment, response, significant)
str(my_data)
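
The example stops at str(). A minimal sketch of the remaining tasks follows, assuming the same example data; the row labels and the names `numeric_cols` and `col_means` are hypothetical.

```r
treatment <- c("control", "low", "medium", "high", "control", "low")
response <- c(12.3, 15.6, 18.9, 24.2, 11.8, 16.1)
significant <- c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
my_data <- data.frame(treatment, response, significant)

# Row names must be unique; these labels are hypothetical
rownames(my_data) <- paste0("sample_", 1:6)

# Mean of a single numeric column
mean(my_data$response)

# Mean of every numeric column: find them, then apply mean() to each
numeric_cols <- sapply(my_data, is.numeric)
col_means <- sapply(my_data[numeric_cols], mean)
col_means

head(my_data, n = 3)   # first three rows
tail(my_data, n = 2)   # last two rows
```

Selecting the numeric columns first means the same code keeps working if you add more columns later.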

Exercise R.9: Data Import and Indexing

Create a simple CSV file or use a built-in dataset like iris
Use read.csv() to read in your file (or access iris directly)
Use str() and head() to examine the data structure
Use $ and [ ] operators to select different parts of the data frame
Create a plot of two numeric variables
Use tapply() to calculate summary statistics grouped by a categorical variable
Export your data frame using write.csv()

# Example with iris:
data(iris)
str(iris)
head(iris)
iris$Sepal.Length[1:5]  # First 5 sepal lengths
iris[1:3, ]  # First 3 rows
plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Species)
tapply(iris$Sepal.Length, iris$Species, mean)
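
The iris example skips the file steps. The sketch below shows one way to practice the full round trip without leaving files behind: it writes iris to a temporary file, reads it back, and indexes the result. The names `csv_path` and `iris_back` are illustrative, and the temporary location is an assumption; in your own work you would use a real file path.

```r
# Write iris to a temporary CSV file, then read it back
csv_path <- tempfile(fileext = ".csv")   # hypothetical location for the file
write.csv(iris, csv_path, row.names = FALSE)

# stringsAsFactors = TRUE turns the Species text column back into a factor
iris_back <- read.csv(csv_path, stringsAsFactors = TRUE)
str(iris_back)

# Two equivalent ways to pull out one column
identical(iris_back$Petal.Width, iris_back[, "Petal.Width"])

# Rows and columns together: first 3 rows, columns 1 and 5
iris_back[1:3, c(1, 5)]

# Group summary from the re-imported data
tapply(iris_back$Petal.Width, iris_back$Species, median)

unlink(csv_path)   # remove the temporary file
```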

Exercise R.10: Understanding Object Types

Explore how R handles different data types:

Create variables of different classes: numeric, character, logical, and factor
What happens when you try to perform arithmetic on character data?
Experiment with type coercion using as.numeric(), as.character(), and as.factor()
What happens when you add a character element to a numeric vector?
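
A few lines you might use to begin experimenting are sketched below; the variable names are made up for illustration. The last two lines show a pitfall that the questions above do not ask about directly: converting a factor with as.numeric() returns its internal level codes rather than the numbers its labels spell out.

```r
# Mixing types in c() coerces everything to the most general type
mixed <- c(1, 2, "three")
class(mixed)

# Arithmetic on character data is an error; tryCatch() captures the message
msg <- tryCatch("1" + 1, error = function(e) conditionMessage(e))
msg

# Text that looks like a number converts cleanly; other text becomes NA
as.numeric("3.14")
suppressWarnings(as.numeric("abc"))

# Pitfall: as.numeric() on a factor returns the internal level codes
f <- as.factor(c("10", "20", "10"))
as.numeric(f)                 # level codes, not the labels
as.numeric(as.character(f))   # the values the labels spell out
```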

5.27 Additional Resources

Logan (2010) - A comprehensive introduction to R for statistical analysis
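A Primer for Computational Biology (http://library.open.oregonstate.edu/computationalbiology/) - Free online textbook by S.T. O’Neil
R Colors Reference (http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf) - Visual guide to R colors
Introduction to Colors in R (https://www.stat.ubc.ca/~jenny/STAT545A/block14_colors.html) - Tutorial on using colors effectively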
