Figure 5.1: The RStudio integrated development environment

5.1 What is R?

R is a programming language and environment especially useful for statistical analysis and graphical visualization of data. It is an offshoot of S, a language developed at Bell Laboratories in 1976. R is an interpreted language: code is translated by the R interpreter each time it runs rather than being compiled ahead of time. R is a leading computational platform for statistical analysis thanks to its open-source status as part of the GNU project and the thousands of packages contributed by members of the scientific community.

5.2 Why R?

R is a programming language designed specifically for statistical computing and graphics. Created in the early 1990s as an open-source implementation of the S language, R has become the lingua franca of statistical analysis in academia and is widely used in industry as well.

Several features make R particularly well-suited for data analysis. It provides an extensive collection of statistical and graphical techniques built into the language. It is powerful, flexible, and completely free. It runs on Windows, Mac, and Linux, so your code will work across platforms. New capabilities are constantly being added through packages contributed by the community, with thousands of packages available for specialized analyses.

R excels at reproducibility. You can keep your scripts to document exactly what analyses you performed. Unlike point-and-click software where actions leave no trace, R code provides a complete record of your analytical workflow. This record can be shared with collaborators, included in publications, and revisited years later when you need to remember how you produced a particular result.

You can write your own functions in R, extending the language to meet your specific needs. Extensive online help and active user communities mean that answers to most questions are a web search away. The RStudio integrated development environment makes working with R much more pleasant, especially for newcomers. And with tools like R Markdown and Quarto, you can embed your analyses in polished documents, presentations, websites, and books—this book itself was created with these tools.

5.3 Installing R and RStudio

R must be installed before RStudio. Download R from https://www.r-project.org, selecting the version appropriate for your operating system. Follow the installation instructions for your platform.

RStudio is an integrated development environment (IDE) that makes working with R much easier. Download the free RStudio Desktop from https://www.rstudio.com. RStudio provides a console for running R commands, an editor for writing scripts, tools for viewing plots and data, and integration with version control systems.

After installing both programs, launch RStudio. You will see a window divided into panes, each serving a different purpose. The console pane is where R commands are executed. The source pane is where you edit scripts and documents. The environment pane shows what objects currently exist in your R session. The files/plots/packages/help pane provides access to various utilities.

5.4 R Basics

R evaluates expressions and returns results. You can use it as a calculator by typing arithmetic expressions at the console.

Code
4 * 4
[1] 16
Code
(4 + 3 * 2^2)
[1] 16

Notice that R follows standard mathematical order of operations: exponentiation before multiplication and division, which come before addition and subtraction. Parentheses can override this ordering.
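
As a quick check of these precedence rules, the sketch below evaluates a few variations on the expression above. The specific expressions are illustrative additions, not part of the original example; the unary minus case is included because it often surprises newcomers.

```r
# Exponentiation binds tighter than multiplication, which binds tighter than addition
4 + 3 * 2^2        # 4 + (3 * 4) = 16

# Parentheses change the grouping
(4 + 3) * 2^2      # 7 * 4 = 28
((4 + 3) * 2)^2    # 14^2 = 196

# Unary minus is applied after exponentiation
-2^2               # -(2^2) = -4
(-2)^2             # 4
```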

5.5 Variables and Assignment

More useful than evaluating isolated expressions is storing values in variables for later use. Variables are assigned using the <- operator (a less-than sign followed by a hyphen).

Code
x <- 2
x * 3
[1] 6
Code
y <- x * 3
y - 2
[1] 4

Variable names must begin with a letter (or a period that is not followed by a digit) and can contain letters, numbers, periods, and underscores after the first character. R is case-sensitive, so myVariable, MyVariable, and myvariable are three different names. Choose descriptive names that make your code readable. It is good practice to avoid periods in variable names: R itself uses them when naming methods (for example, print.data.frame), and in related languages such as Python the period is an operator.

Invalid Variable Names

Variable names cannot begin with numbers or contain operators. The following will produce errors:

Code
3y <- 3    # cannot start with a number
3*y <- 3   # cannot include operators

Reserved Words

R has reserved words that cannot be used as variable names because they have special meaning in the language:

Reserved words          Purpose
if, else                Conditional statements
for, while, repeat      Loops
function                Function definition
in, next, break         Loop control
TRUE, FALSE             Logical constants
NULL, NA, NaN, Inf      Special values

R also has semi-reserved names—built-in functions and constants that you can technically overwrite but should avoid:

Code
# These work but are dangerous:
T <- 5       # Overwrites TRUE abbreviation
c <- "text"  # Shadows the c() function
mean <- 42   # Shadows mean()

# If you accidentally overwrite something, remove it:
rm(T, c, mean)  # Removes your copies so T, c(), and mean() refer to the built-ins again
Avoid Common Name Collisions

Never name variables T, F (abbreviations for TRUE/FALSE), c, t, mean, sum, data, or df. T and F are ordinary variables that many scripts rely on, and the rest are built-in R functions; shadowing any of them leads to confusing errors.

Note that when you assign a value to a variable, R does not print anything. To see a variable’s value, type its name alone or use the print() function.

Code
z <- 100
z
[1] 100
Code
print(z)
[1] 100

5.6 Understanding R Objects

A fundamental principle of R is that everything is an object. Numbers, text, datasets, functions—all are stored as objects with specific properties. Understanding this helps you debug problems and write better code.

Every object has a class (which determines how functions treat it) and a type (its underlying storage mode). Use class() and typeof() to examine objects:

Code
# Numbers are objects
x <- 42
class(x)
[1] "numeric"
Code
typeof(x)
[1] "double"
Code
# Text strings are objects
name <- "Gene Expression"
class(name)
[1] "character"
Code
# Even functions are objects!
class(mean)
[1] "function"

The str() function (structure) provides a compact display of any object’s structure—it is one of the most useful diagnostic tools in R:

Code
# Examine a vector
str(c(1, 2, 3, 4, 5))
 num [1:5] 1 2 3 4 5
Code
# Examine a data frame
str(head(iris))
'data.frame':   6 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1

When functions produce errors or unexpected results, checking the class of your objects is often the first step toward understanding what went wrong.
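
A minimal sketch of that debugging step, assuming a value that looks numeric but was stored as text. The variable names reading and reading_num are hypothetical, and try() is used only so the example keeps running after the error.

```r
# A numeric-looking value stored as text (hypothetical example)
reading <- "42"
class(reading)                 # "character"

# Arithmetic on a character value fails; try() lets the script continue
result <- try(reading + 1, silent = TRUE)
inherits(result, "try-error")  # TRUE

# Converting to numeric fixes the problem
reading_num <- as.numeric(reading)
reading_num + 1                # 43
```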

5.7 Functions

Functions are the workhorses of R. A function takes inputs (called arguments), performs some operation, and returns an output. R has many built-in functions, and packages provide thousands more.

Code
log(10)
[1] 2.302585
Code
sqrt(16)
[1] 4
Code
exp(1)
[1] 2.718282

Functions are called by typing their name followed by parentheses containing their arguments. Many functions accept multiple arguments, separated by commas. Arguments can be specified by position or by name.

Code
round(3.14159, digits = 2)
[1] 3.14
Code
round(3.14159, 2)  # same result, argument specified by position
[1] 3.14
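
To make the difference between positional and named matching concrete, here is a short sketch using round() and log(). The particular input values are illustrative only.

```r
# Named arguments may be given in any order
round(digits = 2, x = 3.14159)   # 3.14

# Arguments you leave out fall back to their defaults; round() defaults to digits = 0
round(3.14159)                   # 3

# log() takes an optional base argument (natural log by default)
log(100, base = 10)              # 2
log(8, 2)                        # 3, base matched by position
```

Naming arguments costs a few keystrokes but makes a call easier to read, especially for functions with many optional arguments.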

To learn about a function, use the help system. Type ?functionname or help(functionname) to open the documentation.

Code
?round
help(sqrt)

5.8 Vectors

The fundamental data structure in R is the vector, an ordered collection of values of the same type. You create vectors using the c() function (for concatenate or combine).

Code
numbers <- c(1, 2, 3, 4, 5)
numbers
[1] 1 2 3 4 5
Code
people <- c("Alice", "Bob", "Carol")  # avoids shadowing the built-in names() function
people
[1] "Alice" "Bob"   "Carol"

Many operations in R are vectorized, meaning they operate on entire vectors at once rather than requiring you to loop through elements.

Code
numbers * 2
[1]  2  4  6  8 10
Code
numbers + 10
[1] 11 12 13 14 15
Code
numbers^2
[1]  1  4  9 16 25
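
The sketch below compares a vectorized calculation with the equivalent explicit loop, to show what "operating on entire vectors at once" replaces. The weights vector and the products name are made up for illustration.

```r
numbers <- c(1, 2, 3, 4, 5)
weights <- c(10, 20, 30, 40, 50)

# Element-wise arithmetic on two vectors of equal length
numbers * weights          # 10 40 90 160 250

# The same calculation written as an explicit loop
products <- numeric(length(numbers))
for (i in seq_along(numbers)) {
  products[i] <- numbers[i] * weights[i]
}
identical(products, numbers * weights)   # TRUE

# Summary functions take the whole vector at once
sum(numbers)               # 15
mean(numbers)              # 3
```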

You can access individual elements using square brackets with an index (R uses 1-based indexing, so the first element is at position 1).

Code
numbers[1]
[1] 1
Code
numbers[3]
[1] 3
Code
numbers[c(1, 3, 5)]
[1] 1 3 5

5.9 Creating Sequences

R provides convenient functions for creating regular sequences.

Code
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
Code
seq(0, 10, by = 2)
[1]  0  2  4  6  8 10
Code
seq(0, 1, length.out = 5)
[1] 0.00 0.25 0.50 0.75 1.00
Code
rep(1, times = 5)
[1] 1 1 1 1 1
Code
rep(c(1, 2), times = 3)
[1] 1 2 1 2 1 2

5.10 Generating Random Numbers

R can generate random numbers from various probability distributions, which is invaluable for simulation and understanding statistical concepts.

Code
# Draw 1000 values from a normal distribution with mean 0 and SD 10
x <- rnorm(1000, mean = 0, sd = 10)
hist(x)
Figure 5.2: Histogram of 1000 random draws from a normal distribution with mean 0 and standard deviation 10
Code
# Draw from a binomial distribution: 1000 experiments, 20 trials each, p=0.5
heads <- rbinom(n = 1000, size = 20, prob = 0.5)
hist(heads)
Figure 5.3: Histogram of binomial distribution results from 1000 experiments of 20 coin flips each

The set.seed() function allows you to make random simulations reproducible by initializing the random number generator to a known state.

Code
set.seed(42)
rnorm(5)
[1]  1.3709584 -0.5646982  0.3631284  0.6328626  0.4042683
Code
set.seed(42)  # same seed produces same "random" numbers
rnorm(5)
[1]  1.3709584 -0.5646982  0.3631284  0.6328626  0.4042683

5.11 Data Frames

Data frames are R’s structure for tabular data—rows of observations and columns of variables. Each column can contain a different type of data (numeric, character, logical), but all values within a column must be the same type.

Code
# Create a data frame from vectors
hydrogel_concentration <- factor(c("low", "high", "high", "high", 
                                    "medium", "medium", "medium", "low"))
compression <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5)
conductivity <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)

mydata <- data.frame(hydrogel_concentration, compression, conductivity)
mydata
  hydrogel_concentration compression conductivity
1                    low         3.4          0.0
2                   high         3.4          9.2
3                   high         8.4          3.8
4                   high         3.0          5.0
5                 medium         5.6          5.6
6                 medium         8.1          4.1
7                 medium         8.3          7.1
8                    low         4.5          5.3

Access columns using the $ operator or square brackets.

Code
mydata$compression
[1] 3.4 3.4 8.4 3.0 5.6 8.1 8.3 4.5
Code
mydata[, 2]  # second column
[1] 3.4 3.4 8.4 3.0 5.6 8.1 8.3 4.5
Code
mydata[1, ]  # first row
  hydrogel_concentration compression conductivity
1                    low         3.4            0
Code
mydata[1, 2] # first row, second column
[1] 3.4
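
Square brackets also accept logical conditions and column names, which is often more readable than numeric positions. The following sketch rebuilds the same data frame so that it runs on its own; the stiff and ratio names, and the ratio itself, are illustrative assumptions rather than part of the original analysis.

```r
mydata <- data.frame(
  hydrogel_concentration = factor(c("low", "high", "high", "high",
                                    "medium", "medium", "medium", "low")),
  compression  = c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5),
  conductivity = c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)
)

# Rows where compression exceeds 5 (all columns kept)
stiff <- mydata[mydata$compression > 5, ]
nrow(stiff)                          # 4

# Select columns by name instead of position
mydata[, c("compression", "conductivity")]

# Add a derived column (the ratio is only for illustration)
mydata$ratio <- mydata$conductivity / mydata$compression
dim(mydata)                          # 8 rows, 4 columns
```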

5.12 Reading and Writing Data

Real analyses typically begin by reading data from external files. R provides functions for various file formats.

Code
# Read comma-separated values
data <- read.csv("mydata.csv")

# Read tab-separated values
data <- read.table("mydata.txt", header = TRUE, sep = "\t")

# Read Excel files (requires readxl package)
library(readxl)
data <- read_excel("mydata.xlsx")

Similarly, you can write data to files.

Code
write.csv(mydata, "output.csv", row.names = FALSE)
write.table(mydata, "output.txt", sep = "\t", row.names = FALSE)
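
A quick way to convince yourself that reading and writing work as expected is a round trip: write a small data frame, read it back, and compare. This sketch uses a temporary file so it can run anywhere; the samples data frame and its values are made up.

```r
# A small data frame to save (values are made up)
samples <- data.frame(id = 1:3, response = c(12.3, 15.6, 18.9))

# tempfile() gives a throwaway path so nothing in your project is overwritten
path <- tempfile(fileext = ".csv")
write.csv(samples, path, row.names = FALSE)

# Read the file back and compare with the original
reloaded <- read.csv(path)
all.equal(samples, reloaded)   # TRUE
str(reloaded)

# Clean up
unlink(path)
```

Without row.names = FALSE, write.csv() would add a column of row numbers, and the reloaded data frame would no longer match the original.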

5.13 Basic Plotting

R has extensive graphics capabilities. The base plot() function creates scatterplots and other basic visualizations.

Code
x <- 1:10
y <- x^2
plot(x, y,
     xlab = "X values",
     ylab = "Y squared",
     main = "A Simple Plot",
     col = "blue",
     pch = 19)
Figure 5.4: A simple scatterplot showing the relationship between x and x squared

Histograms visualize the distribution of a single variable.

Code
data <- rnorm(1000)
hist(data, breaks = 30, col = "lightblue", main = "Normal Distribution")
Figure 5.5: Histogram of 1000 random samples from a standard normal distribution

Boxplots compare distributions across groups.

Code
boxplot(compression ~ hydrogel_concentration, data = mydata,
        xlab = "Concentration", ylab = "Compression")
Figure 5.6: Boxplot comparing compression values across hydrogel concentration levels

We will explore the more sophisticated ggplot2 package for graphics in a later chapter.

5.14 Scripts and Reproducibility

While you can type commands directly at the console, for anything beyond simple explorations you should write scripts—text files containing R commands that can be saved, edited, and rerun.

In RStudio, create a new script with File > New File > R Script. Type your commands in the script editor, and run them by placing your cursor on a line and pressing Ctrl+Enter (Cmd+Enter on Mac) or by selecting code and clicking Run.

Scripts should be self-contained, including all the commands needed to reproduce your analysis from start to finish. Begin scripts by loading required packages, then reading data, then performing analyses. Add comments (lines beginning with #) to explain what your code does and why.

Code
# Analysis of hydrogel mechanical properties
# Author: Your Name
# Date: 2025-04-01

# Load required packages
library(tidyverse)

# Read data
data <- read.csv("hydrogel_data.csv")

# Calculate summary statistics
summary(data)

# Create visualization
ggplot(data, aes(x = concentration, y = compression)) +
  geom_boxplot()

5.15 Getting Help

When you encounter problems, R provides several resources. The ? operator opens documentation for functions. The help.search() function searches the help system for topics. The example() function runs examples from a function’s documentation.

Code
?mean
help.search("regression")
example(plot)

Beyond R’s built-in help, the internet offers vast resources. Stack Overflow has answers to almost any R question you can imagine. Package vignettes provide tutorials for specific packages. The RStudio community forums are welcoming to beginners.

When asking for help online, provide a minimal reproducible example—the smallest piece of code that demonstrates your problem, including sample data. This makes it much easier for others to understand and solve your issue.
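
One possible shape for such an example is sketched below. It relies on dput(), which prints R code that recreates an object, and sessionInfo(), which reports your R version and loaded packages. The data frame df_small and its values are hypothetical.

```r
# A small made-up data frame that reproduces the issue
df_small <- data.frame(group = c("a", "b", "a"), value = c(1.5, NA, 3))

# dput() prints R code that recreates the object exactly;
# paste that output into your question
dput(df_small)

# The problem itself, in as few lines as possible
mean(df_small$value)   # NA, because of the missing value

# sessionInfo() reports your R version and loaded packages
sessionInfo()
```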

5.16 Data Types in R

R has several fundamental data types that you will work with frequently.

Character Strings

Assignments and operations can be performed on characters as well as numbers. Characters need to be set off by quotation marks to differentiate them from numeric objects or variable names.

Code
x <- "I Love"
print(x)
[1] "I Love"
Code
y <- "Biostatistics"
print(y)
[1] "Biostatistics"
Code
# Combine strings using c()
z <- c(x, y)
print(z)
[1] "I Love"        "Biostatistics"

The variable z is now a vector of character objects. Note that we are overwriting our previous numeric assignments—a good general rule is to use descriptive, unique names for each variable.

Factors

Sometimes we would like to treat character objects as if they were categorical units for subsequent calculations. These are called factors, and we can convert a character vector to factor class.

Code
z_factor <- as.factor(z)
print(z_factor)
[1] I Love        Biostatistics
Levels: Biostatistics I Love
Code
class(z_factor)
[1] "factor"

Note that factor levels are reported alphabetically. The class() function tells us what type of object we are working with—it is one of the most important diagnostic tools in R. Often you can debug your code simply by checking and changing the class of an object.

Factors are especially important for statistical analyses where we might want to calculate the mean or variance for different experimental treatments. In that case, the treatments would be coded as different levels of a factor.
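
As a small sketch of that use, the code below computes a mean for each level of a treatment factor with tapply(), and then shows how to set level order explicitly when alphabetical order is not what you want. The measurements and the dose factor are hypothetical.

```r
treatment <- factor(c("control", "drug", "control", "drug", "drug"))
response  <- c(4.1, 6.3, 3.9, 5.8, 6.1)   # hypothetical measurements

levels(treatment)            # "control" "drug"
table(treatment)             # counts per level: control 2, drug 3

# Mean response within each level of the factor
group_means <- tapply(response, treatment, mean)
group_means

# Levels default to alphabetical order; set them explicitly when order matters
dose <- factor(c("low", "high", "medium"), levels = c("low", "medium", "high"))
levels(dose)                 # "low" "medium" "high"
```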

Missing Values (NA)

R uses special values to represent missing or undefined data. The most common is NA, which stands for “Not Available.”

Code
class(NA)
[1] "logical"

NA is a logical data type and is distinct from the character string “NA”, the numeric 0, or an empty string. It is also a reserved word and cannot be used as a variable name.

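Blank entries in numeric columns of a data file are read into R as NA; blank entries in text columns are read as empty strings unless you specify otherwise with the na.strings argument. Many functions return NA by default if any of their input values are NA: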

Code
num <- c(0, 1, 2, NA, 4)
mean(num)
[1] NA
Code
# Use na.rm = TRUE to ignore missing values
mean(num, na.rm = TRUE)
[1] 1.75
Code
# Check for missing values
is.na(num)
[1] FALSE FALSE FALSE  TRUE FALSE
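
A few more patterns for working with missing values are sketched below. The inline CSV text is a made-up stand-in for a data file, passed through the text argument of read.csv() so the example needs no external file.

```r
num <- c(0, 1, 2, NA, 4)

# NA propagates through arithmetic and comparisons
NA + 1            # NA
NA == NA          # NA, so use is.na() rather than == to test for missingness

# Count missing values
sum(is.na(num))   # 1

# Keep only the non-missing values
num[!is.na(num)]  # 0 1 2 4

# A blank field in a numeric column is read as NA
csv_text <- "id,score\n1,10\n2,\n3,30"
scores <- read.csv(text = csv_text)
scores$score      # 10 NA 30
```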

Floating-Point Precision

A common source of confusion involves floating-point arithmetic. Computers represent decimal numbers with limited precision, which can lead to unexpected results:

Code
# This seems wrong, but is due to how computers store decimals
0.1 + 0.2 == 0.3
[1] FALSE
Code
# The actual values differ slightly
print(0.1 + 0.2, digits = 20)
[1] 0.30000000000000004441
Code
print(0.3, digits = 20)
[1] 0.2999999999999999889

Never use == to compare floating-point numbers directly. Instead, use all.equal() which checks if values are “nearly equal” within a small tolerance:

Code
# Safe comparison for floating-point numbers
all.equal(0.1 + 0.2, 0.3)
[1] TRUE
Code
# Use isTRUE() if you need a logical result
isTRUE(all.equal(0.1 + 0.2, 0.3))
[1] TRUE

The tidyverse provides dplyr::near() as a convenient alternative, especially when filtering data frames:

Code
# Works well in filter operations
library(dplyr)
data |> filter(near(value, target_value))
Floating-Point Comparisons

Always use all.equal() or near() instead of == when comparing decimal calculations. This is a common source of bugs in data analysis code.

5.17 More on Vectors

Indexing Vectors

Isolating specific elements from vectors is called indexing. R uses 1-based indexing with square brackets [].

Code
x <- c(10, 20, 30, 40, 50, 100, 200)

# First element
x[1]
[1] 10
Code
# Third element
x[3]
[1] 30
Code
# Series of consecutive elements
x[1:4]
[1] 10 20 30 40
Code
# Last four elements
x[4:7]
[1]  40  50 100 200
Code
# Non-consecutive elements using c()
x[c(1:3, 5)]
[1] 10 20 30 50
Code
# All elements EXCEPT the first two
x[-c(1:2)]
[1]  30  40  50 100 200
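
Besides positions, square brackets accept a logical vector, which lets you select elements by condition. This is a brief sketch using the same x; the thresholds are arbitrary.

```r
x <- c(10, 20, 30, 40, 50, 100, 200)

# A comparison returns one TRUE/FALSE per element
x > 40                 # FALSE FALSE FALSE FALSE TRUE TRUE TRUE

# Use that logical vector as an index to keep matching elements
x[x > 40]              # 50 100 200

# which() reports the positions of the TRUE values
which(x > 40)          # 5 6 7

# Conditions can be combined with & (and) and | (or)
x[x >= 20 & x <= 50]   # 20 30 40 50
```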

Useful Functions for Vectors

Functions that provide information about vectors:

  • head(): returns the first elements of an object
  • tail(): returns the last elements of an object
  • length(): returns the number of elements in a vector
  • class(): returns the class of elements in a vector

Functions that modify or generate vectors:

  • sort(): returns a sorted vector
  • seq(): creates a sequence of values
  • rep(): repeats values
Code
rep(1, 5)
[1] 1 1 1 1 1
Code
rep("treatment", 5)
[1] "treatment" "treatment" "treatment" "treatment" "treatment"

Functions for random sampling:

  • sample(): randomly selects elements from a vector
  • rnorm(): draws values from a normal distribution
  • rbinom(): draws values from a binomial distribution
  • set.seed(): sets the random number generator seed for reproducibility

Functions to change data types:

  • as.numeric(): converts to numeric class
  • as.factor(): converts to factor class
  • as.character(): converts to character class
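
The conversion functions deserve a short demonstration, because converting a factor straight to numeric is a well-known pitfall. The values below are illustrative; suppressWarnings() is used only to keep the output quiet.

```r
# Text that looks numeric converts cleanly
as.numeric(c("1.5", "2", "10"))        # 1.5 2.0 10.0

# Text that cannot be parsed becomes NA, with a warning
suppressWarnings(as.numeric("abc"))    # NA

# Pitfall: as.numeric() on a factor returns the internal level codes
f <- factor(c("10", "20", "10", "30"))
as.numeric(f)                          # 1 2 1 3

# Convert through character to recover the labels as numbers
as.numeric(as.character(f))            # 10 20 10 30
```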

5.18 Lists

A list is an ordered collection of objects that, unlike the elements of a vector, can differ in type and length.

Code
vec1 <- c(10, 20, 30, 40, 50, 100, 200)
vec2 <- c("happy", "sad", "grumpy")
vec3 <- factor(c("high", "low"))

mylist <- list(vec1, vec2, vec3)
print(mylist)
[[1]]
[1]  10  20  30  40  50 100 200

[[2]]
[1] "happy"  "sad"    "grumpy"

[[3]]
[1] high low 
Levels: high low
Code
class(mylist)
[1] "list"
Code
str(mylist)
List of 3
 $ : num [1:7] 10 20 30 40 50 100 200
 $ : chr [1:3] "happy" "sad" "grumpy"
 $ : Factor w/ 2 levels "high","low": 1 2

Elements of lists are indexed with double square brackets [[]]. To access the second element of mylist:

Code
mylist[[2]]
[1] "happy"  "sad"    "grumpy"
Code
# The second item of the second element
mylist[[2]][2]
[1] "sad"

The str() function (for “structure”) is extremely useful for understanding complex R objects.
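
List elements can also be given names, which usually makes code easier to read than numeric positions. The following is a sketch with a hypothetical list called experiment; it also shows the difference between single and double brackets.

```r
experiment <- list(
  doses  = c(10, 20, 30),
  moods  = c("happy", "sad", "grumpy"),
  passed = TRUE
)

# Access elements by name with $ or [[ ]]
experiment$doses          # 10 20 30
experiment[["moods"]][2]  # "sad"

# Single brackets return a smaller list, not the element itself
class(experiment["doses"])    # "list"
class(experiment[["doses"]])  # "numeric"

names(experiment)         # "doses" "moods" "passed"
```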

5.19 Matrices

Matrices in R are two-dimensional arrays where all elements must be the same type. They are indexed by [row, column].

Code
# Create a 3x3 matrix
matrix(1:9, nrow = 3, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Useful matrix functions include:

  • dim(): returns the dimensions (rows and columns)
  • t(): transposes a matrix (swaps rows and columns)
  • cbind(): combines columns
  • rbind(): combines rows
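
The functions in this list can be tried on a small matrix, as in the sketch below. The matrix m and the values bound onto it are arbitrary.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
dim(m)         # 2 3
m[2, 3]        # 6 (row 2, column 3)
t(m)           # 3 x 2 transpose
dim(t(m))      # 3 2

# cbind() adds a column; rbind() adds a row
dim(cbind(m, c(7, 8)))     # 2 4
dim(rbind(m, c(7, 8, 9)))  # 3 3
```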

5.20 Installing and Using Packages

Base R includes many useful functions, but the real power comes from packages—collections of functions contributed by the community. Packages are distributed via the Comprehensive R Archive Network (CRAN).

Code
# Install a package (only need to do once)
install.packages("name_of_package")

# Check if package is installed (returns TRUE or FALSE)
"name_of_package" %in% rownames(installed.packages())

# Load package for use (needed each session)
library(name_of_package)

Note that install.packages() requires the package name in quotation marks, while library() does not.
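
In scripts that other people will run, it can help to test for a package before using it. One way, sketched here, is requireNamespace(), which returns TRUE or FALSE rather than stopping with an error. The example checks the stats package only because it ships with R; substitute the package you actually need.

```r
pkg <- "stats"   # ships with R, so this example runs anywhere

# requireNamespace() returns TRUE or FALSE instead of stopping with an error
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is available")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\")")
}
```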

Namespace Conflicts

When you load multiple packages, function names can collide. If two packages define a function with the same name, the most recently loaded package “wins,” and its version masks the earlier one. R warns you when this happens:

Code
library(dplyr)
# Attaching package: 'dplyr'
# The following objects are masked from 'package:stats':
#     filter, lag

This message indicates that dplyr’s filter() and lag() functions now mask the functions of the same name in the stats package, which is loaded with base R. If you need the masked version, use the package prefix:

Code
# Use dplyr's filter (now the default after loading dplyr)
data |> filter(x > 5)

# Explicitly use the stats version (a 3-point moving average; the filter argument is required)
stats::filter(x, filter = rep(1/3, 3), method = "convolution")

# You can use the prefix even without loading a package
stringr::str_detect(text, "pattern")

Common conflicts occur between:

  • dplyr::filter() and stats::filter()
  • dplyr::lag() and stats::lag()
  • dplyr::select() and MASS::select()
Avoiding Conflicts

The :: notation explicitly specifies which package’s function to use. When writing scripts, it is good practice to use package::function() for functions that commonly conflict, making your code’s behavior explicit and predictable.

5.21 The Split-Apply-Combine Approach

A common pattern in data analysis is to split data by groups, apply a function to each group, and combine the results. R provides several functions for this workflow.

The replicate() Function

Repeats an expression multiple times and collects the results:

Code
# Shuffle integers 1-10 five times
replicate(5, sample(1:10, size = 10, replace = FALSE))
      [,1] [,2] [,3] [,4] [,5]
 [1,]    3    9    9    3    5
 [2,]    1    2   10    8    4
 [3,]    8    3    3    6    9
 [4,]    9    6    4    9    1
 [5,]   10    5    2    4   10
 [6,]    7    4    1    7    7
 [7,]    4    1    5    5    6
 [8,]    5   10    8   10    2
 [9,]    6    8    6    2    8
[10,]    2    7    7    1    3

The apply() Family

The apply() function applies a function to rows or columns of a matrix or data frame:

Code
# Create sample matrix
m <- matrix(1:12, nrow = 3, ncol = 4)
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Code
# Sum across rows (MARGIN = 1)
apply(m, 1, sum)
[1] 22 26 30
Code
# Sum across columns (MARGIN = 2)
apply(m, 2, sum)
[1]  6 15 24 33

The tapply() Function

Applies a function to subsets of a vector, grouped by a factor:

Code
# Find maximum petal length for each species
tapply(iris$Petal.Length, iris$Species, max)
    setosa versicolor  virginica 
       1.9        5.1        6.9 

The aggregate() Function

Summarizes multiple variables by groups:

Code
# Mean of each variable by species
aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
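
The three steps named at the start of this section can also be written out one at a time in base R. The sketch below uses split(), lapply(), and unlist() on the built-in iris data; the intermediate names pieces and maxima are only for illustration.

```r
# Split: one vector of petal lengths per species
pieces <- split(iris$Petal.Length, iris$Species)
names(pieces)                 # "setosa" "versicolor" "virginica"

# Apply: compute the maximum within each piece
maxima <- lapply(pieces, max)

# Combine: collapse the list of results into a named vector
unlist(maxima)                # setosa 1.9, versicolor 5.1, virginica 6.9

# sapply() does the apply and combine steps in one call
sapply(pieces, max)
```

The result matches the tapply() call shown earlier; tapply() and aggregate() simply bundle the three steps into one function.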

5.22 Conditional Statements with ifelse()

The ifelse() function provides vectorized conditional logic. The first argument is a logical test, the second is the value if TRUE, and the third is the value if FALSE.

Code
# Create a character vector
treatment <- c(rep("treatment", 5), rep("control", 3),
               rep("treatment", 4), rep("control", 6))

# Assign colors based on treatment
colors <- ifelse(treatment == "treatment", "red", "blue")
print(colors)
 [1] "red"  "red"  "red"  "red"  "red"  "blue" "blue" "blue" "red"  "red" 
[11] "red"  "red"  "blue" "blue" "blue" "blue" "blue" "blue"
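
When there are more than two outcomes, ifelse() calls can be nested. The sketch below also contrasts ifelse() with the ordinary if / else statement, which tests a single value rather than a whole vector. The scores and cutoffs are hypothetical.

```r
scores <- c(45, 72, 88, 60)   # hypothetical values

# Nested ifelse() handles more than two outcomes, element by element
grade <- ifelse(scores >= 80, "high",
                ifelse(scores >= 60, "medium", "low"))
grade                          # "low" "medium" "high" "medium"

# if / else, by contrast, tests a single TRUE or FALSE value
if (mean(scores) > 65) {
  verdict <- "above threshold"
} else {
  verdict <- "below threshold"
}
verdict                        # mean is 66.25, so "above threshold"
```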

5.23 For Loops

For loops iterate through a sequence, executing code for each value. However, R is vectorized, so many operations that would require loops in other languages can be done more efficiently without them.

When loops are necessary, pre-allocate output objects for better performance:

Code
# Pre-allocate a numeric vector
results <- numeric(5)

for (i in 1:5) {
  results[i] <- i^2
}
results
[1]  1  4  9 16 25
Avoiding Loops

Before writing a loop, consider whether the task can be accomplished with vectorized operations or the apply family of functions. These approaches are often faster and more readable.
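
For comparison, here is the loop from this section next to a vectorized version and a vapply() version. This is a sketch to show that all three give the same result; the names vectorized and applied are illustrative.

```r
# Loop version: squares of 1 to 5
results <- numeric(5)
for (i in 1:5) {
  results[i] <- i^2
}

# Vectorized version: arithmetic applies to every element at once
vectorized <- (1:5)^2

# vapply() version: the third argument declares the type of each result
applied <- vapply(1:5, function(i) i^2, numeric(1))

all.equal(results, vectorized)   # TRUE
all.equal(results, applied)      # TRUE
```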

5.24 More on Plotting

Customizing Plots with par()

Many plotting parameters are controlled by the par() function. Understanding par() dramatically increases your plotting capabilities.

Code
# Create multiple panels
par(mfrow = c(1, 2))  # 1 row, 2 columns

seq_1 <- seq(0, 10, by = 0.1)
seq_2 <- seq(10, 0, by = -0.1)

plot(seq_1, xlab = "Index", ylab = "Value", type = "p", col = "red",
     main = "Increasing Sequence")
plot(seq_2, xlab = "Index", ylab = "Value", type = "l", col = "blue",
     main = "Decreasing Sequence")
Figure 5.7: Multiple plot panels showing increasing (points) and decreasing (lines) sequences
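
Settings made with par() stay in effect for the rest of the session, so later plots keep the multi-panel layout unless you reset it. Because par() returns the previous values of whatever it changes, one common pattern is to save and restore them, as sketched here. The margin values and panel contents are arbitrary.

```r
# par() returns the previous settings, so they can be restored later
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

plot(1:10, main = "Left panel")
plot(10:1, main = "Right panel")

# Restore the earlier layout so later plots fill the whole device again
par(old_par)
par("mfrow")   # back to 1 1 if that was the layout before
```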

Vectorized Graphical Parameters

Graphical parameters like col, pch (point character), and cex (character expansion) are vectorized:

Code
seq_1 <- seq(0, 10, by = 0.1)
seq_2 <- seq(10, 0, by = -0.1)

# First 10 points blue, rest red
colors <- c(rep("blue", 10), rep("red", 91))

plot(seq_1, seq_2, xlab = "Sequence 1", ylab = "Sequence 2",
     col = colors, pch = 19,
     main = "Two-Color Scatterplot")
Figure 5.8: Scatterplot demonstrating vectorized graphical parameters with two colors

Useful Plotting Arguments

Key arguments for plot() and related functions:

  • main: plot title
  • xlab, ylab: axis labels
  • xlim, ylim: axis limits
  • col: color
  • pch: point character (0-25)
  • cex: character/point size multiplier
  • lwd: line width
  • type: “p” for points, “l” for lines, “b” for both

5.25 Introduction to R Markdown

R Markdown combines R code with formatted text to create reproducible documents. Files have the .Rmd extension and can be rendered (“knitted”) to HTML, PDF, or Word.

Getting Started

Install the rmarkdown package, then in RStudio: File → New File → R Markdown.

Basic Formatting

## Section Header
### Subsection Header

Text can be *italicized* or **bolded** or ***both***.

Links: [Link Text](https://example.com)

Code Chunks

R code is placed in code chunks delimited by three backticks:

```{r}
seq(1, 10, 1)
```

Chunk options control whether code is evaluated (eval), displayed (echo), and more:

```{r, eval = TRUE, echo = TRUE}
seq(1, 10, 1)
```

Knitting

Click the “Knit” button in RStudio to render your document. Start with HTML output, which has the fewest dependencies.

Learning More

For comprehensive R Markdown documentation, see the R Markdown introduction and R Markdown cheat sheet.

5.26 Practice Exercises

Exercise R.1: Exploring RStudio

Take a few minutes to familiarize yourself with the RStudio environment:

  1. Locate the four main panes:

    • The code editor (top left)
    • The workspace and history (top right)
    • The plots and files window (bottom right)
    • The R console (bottom left)
  2. In the plots and files window, click on the Packages and Help tabs to see what they offer

  3. See what types of new files can be made in RStudio by clicking File → New File

  4. Open a new R script and a new R Markdown file to see the difference

Exercise R.2: Basic Mathematics in R

Insert a code chunk and complete the following tasks:

  1. Add and subtract numbers
  2. Multiply and divide numbers
  3. Raise a number to a power using the ^ symbol
  4. Create a more complex equation involving all of these operations to convince yourself that R follows the normal priority of mathematical evaluation (PEMDAS)
Code
# Example:
(4 + 3 * 2^2) / 5 - 1
Exercise R.3: Assigning Variables and Functions
  1. Assign three variables using basic mathematical operations
  2. Take the log of your three variables using log()
  3. Use the print() function to display your most complex variable
  4. Use the c() (concatenate) function combined with paste() to create and print a sentence
Code
# Example:
x <- 10
y <- x * 2
z <- sqrt(x + y)
print(paste("The value of z is", z))
Exercise R.4: Vectors and Factors
  1. Create a numeric vector using the c() function with at least 5 elements
  2. Create a character vector and convert it to a factor using as.factor()
Code
# Example:
vec1 <- c("control", "treatment", "control", "treatment", "control")
fac1 <- as.factor(vec1)
print(fac1)
[1] control   treatment control   treatment control  
Levels: control treatment
Code
levels(fac1)
[1] "control"   "treatment"
  3. Use str() and class() to evaluate your variables
  4. What is the difference between a character vector and a factor?
Exercise R.5: Basic Statistics
  1. Create a numeric vector with at least 10 elements
  2. Calculate the mean(), sd(), sum(), length(), and var() of your vector
  3. Use the log() and sqrt() functions on your vector
  4. What happens when you try to apply mean() to a factor? Try it and explain the result
Code
# Example:
my_vector <- c(12, 15, 18, 22, 25, 28, 31, 35, 38, 42)
mean(my_vector)
sd(my_vector)
Exercise R.6: Creating Sequences and Random Sampling

Set the random seed for reproducibility, then:

Code
set.seed(42)
  1. Create a vector with 100 elements using seq() and calculate the mean and standard deviation
  2. Create a variable and sample() it with equal probability—experiment with the size and replace arguments
  3. Create a normally distributed variable of 10000 elements using rnorm(), then sample that distribution with and without replacement
  4. Use hist() to plot your normally distributed variable
Exercise R.7: Basic Visualization

Create visualizations with proper axis labels and colors:

  1. Create a sequence variable using seq() and make two different plots by changing the type argument ("p" for points, "l" for lines, "b" for both)

  2. Create a normally distributed variable using rnorm() and make histograms with different breaks values—what does breaks control?

  3. Use par(mfrow = c(2, 2)) to create a 2×2 grid of plots

Code
par(mfrow = c(2, 2))
x <- seq(1, 100, by = 1)
plot(x, type = "p", main = "Points", col = "blue")
plot(x, type = "l", main = "Lines", col = "red")
y <- rnorm(1000)
hist(y, breaks = 10, main = "10 Breaks", col = "lightblue")
hist(y, breaks = 50, main = "50 Breaks", col = "lightgreen")
Exercise R.8: Creating Data Frames
  1. Create a data frame with at least three columns: one character/factor, one numeric, and one logical
  2. Assign row names to your data frame using rownames()
  3. Examine your data frame structure using str()
  4. Calculate the mean of each numeric variable
  5. Use head() and tail() to view portions of your data frame
Code
# Example:
treatment <- c("control", "low", "medium", "high", "control", "low")
response <- c(12.3, 15.6, 18.9, 24.2, 11.8, 16.1)
significant <- c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
my_data <- data.frame(treatment, response, significant)
str(my_data)
Exercise R.9: Data Import and Indexing
  1. Create a simple CSV file or use a built-in dataset like iris
  2. Use read.csv() to read in your file (or access iris directly)
  3. Use str() and head() to examine the data structure
  4. Use $ and [ ] operators to select different parts of the data frame
  5. Create a plot of two numeric variables
  6. Use tapply() to calculate summary statistics grouped by a categorical variable
  7. Export your data frame using write.csv()
Code
# Example with iris:
data(iris)
str(iris)
head(iris)
iris$Sepal.Length[1:5]  # First 5 sepal lengths
iris[1:3, ]  # First 3 rows
plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Species)
tapply(iris$Sepal.Length, iris$Species, mean)
Exercise R.10: Understanding Object Types

Explore how R handles different data types:

  1. Create variables of different classes: numeric, character, logical, and factor
  2. What happens when you try to perform arithmetic on character data?
  3. Experiment with type coercion using as.numeric(), as.character(), and as.factor()
  4. What happens when you add a character element to a numeric vector?

5.27 Additional Resources