10 Data Visualization

10.1 Why Visualize Data?

Before diving into the mechanics of creating plots, consider why visualization matters. The human visual system excels at detecting patterns, spotting outliers, and perceiving relationships—abilities that summary statistics cannot replace.

Consider Anscombe’s Quartet—four datasets with nearly identical summary statistics (same mean, variance, and correlation) but completely different patterns:

Code

# Load the tidyverse (ggplot2, tidyr, dplyr), used throughout this chapter
library(tidyverse)

# Reshape Anscombe's built-in dataset
anscombe_long <- anscombe |>
  pivot_longer(everything(),
               names_to = c(".value", "set"),
               names_pattern = "(.)(.)")

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_point(size = 2, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "coral") +
  facet_wrap(~set, ncol = 2) +
  labs(title = "Same Mean, Variance, and Correlation—Different Stories") +
  theme_minimal()

Figure 10.1: Anscombe’s Quartet: Four datasets with nearly identical summary statistics but very different patterns

All four datasets have nearly identical statistical summaries, yet they represent fundamentally different phenomena: a linear relationship, a curved relationship, an outlier-driven relationship, and a vertical cluster with one outlier. Summary statistics alone would suggest these datasets are equivalent—only visualization reveals the truth.

Always Visualize Your Data

Never trust summary statistics alone. Before running statistical tests, visualize your data to check assumptions, identify outliers, and understand the underlying patterns.

10.2 Choosing the Right Chart Type

Different questions call for different visualizations. Matching your question to the right chart type is the first step toward effective communication:

Question	Chart Type	Why
How are values distributed?	Histogram, density plot	Shows shape, center, spread
How do groups compare?	Box plot, bar chart	Side-by-side comparison
How do two variables relate?	Scatter plot	Shows correlation, patterns
How does a value change over time?	Line plot	Connects sequential observations
What is the composition?	Stacked bar chart	Shows parts of a whole

Start with the Question

Before creating any visualization, ask yourself: “What question am I trying to answer?” The chart type should emerge from the question, not the other way around.

10.3 The Grammar of Graphics

Data visualization is both an art and a science. A well-designed graphic can reveal patterns, communicate findings, and guide analysis in ways that tables of numbers cannot. The ggplot2 package implements a coherent system for creating graphics based on Leland Wilkinson’s “Grammar of Graphics”—a framework that describes the fundamental components from which all statistical graphics can be built.

Just as grammar provides rules for constructing sentences from words, the grammar of graphics provides rules for constructing visualizations from components. Every graphic is composed of data, aesthetic mappings that connect variables to visual properties, and geometric objects that represent data points. Additional components like scales, statistical transformations, coordinate systems, and facets allow for sophisticated customizations.

Figure 10.2: The grammar of graphics breaks visualizations into fundamental components
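
As a rough illustration of how these components fit together, the sketch below builds a single plot and labels each line with the grammar component it supplies. The specific choices (a linear smoother, a log x scale, the y limits, faceting by drv) are assumptions made only to show one example of each component.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +   # data + aesthetic mappings
  geom_point(alpha = 0.6) +                   # geometric object
  stat_smooth(method = "lm", se = FALSE) +    # statistical transformation
  scale_x_log10() +                           # scale
  coord_cartesian(ylim = c(10, 45)) +         # coordinate system
  facet_wrap(~ drv)                           # facets

p
```

Removing or swapping any one line changes only that component, which is the practical benefit of the layered grammar.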

10.4 Building Plots with ggplot2

The basic structure of a ggplot2 call begins with the ggplot() function, which creates a coordinate system. You add layers to this foundation using the + operator.

Code

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

Figure 10.3: A basic scatterplot of highway fuel efficiency versus engine displacement

This creates a scatterplot of highway fuel efficiency against engine displacement using the mpg dataset that ships with ggplot2. The aes() function establishes the aesthetic mapping, that is, which variables map to which visual properties. Here, displ maps to the x-axis and hwy to the y-axis. The geom_point() function adds a layer of points.

10.5 Aesthetic Mappings

Aesthetics are visual properties of the plot. Beyond position (x and y), common aesthetics include color, size, shape, and transparency (alpha). You can map variables to these aesthetics to encode additional information.

Code

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 3, alpha = 0.7)

Figure 10.4: Scatterplot with color mapped to vehicle class

Now the color of each point indicates the vehicle class. The legend is created automatically. Note that aesthetics defined inside aes() are mapped to variables, while those defined outside (like size = 3) apply uniformly to all points.
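
A common slip is to put a fixed value inside aes(). The sketch below contrasts the two cases and uses layer_data() to inspect the colors ggplot2 actually draws. The object names p_set and p_map are illustrative.

```r
library(ggplot2)

# Setting: color is a fixed value, supplied outside aes()
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Mapping by mistake: "blue" inside aes() is treated as a one-level
# categorical variable, so ggplot2 picks a color from its default
# palette and draws a legend for it
p_map <- ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point()

# Inspect the colors actually used in each plot's point layer
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

Only the first plot draws blue points. The second draws every point in the first color of the default discrete palette.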

10.6 Geometric Objects

Geometric objects, or geoms, determine what type of plot you create. Different geoms represent data in different ways.

Scatterplots with geom_point()

Points are good for showing the relationship between two continuous variables:

Code

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

Figure 10.5: Simple scatterplot using geom_point()

Line Plots with geom_line() and geom_smooth()

geom_line() connects observations in order of the x variable, which suits time series. geom_smooth() instead summarizes the overall trend with a fitted curve:

Code

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

Figure 10.6: Scatterplot with smoothed trend line and confidence interval

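The geom_smooth() function adds a smoothed conditional mean with a confidence band. For datasets with fewer than 1,000 observations it fits a loess curve by default; pass method = "lm" for a straight-line fit.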
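
For completeness, here is a minimal geom_line() sketch. It assumes the economics dataset bundled with ggplot2 (monthly US economic indicators) as a stand-in time series, and adds a straight-line variant of the smoother for comparison.

```r
library(ggplot2)

# A time series: number of unemployed over time, one line through the months
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# A straight-line fit instead of the default smoother
p_lm <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

p_line
p_lm
```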

Bar Charts with geom_bar()

Bar charts show counts or summaries of categorical data. By default, geom_bar() counts the number of rows in each category:

Code

ggplot(diamonds, aes(x = cut)) +
  geom_bar()

Figure 10.7: Bar chart showing diamond counts by cut quality

Use fill to color bars by another variable:

Code

ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge")

Figure 10.8: Grouped bar chart showing diamond counts by cut and clarity
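
When the bar heights should show a summary rather than a count, one option is to compute the summary first and draw it with geom_col(). The sketch below uses mean price by cut as an illustrative summary; mean_price is a hypothetical helper table.

```r
library(ggplot2)

# Pre-compute one summary value per category (mean price by cut)
mean_price <- aggregate(price ~ cut, data = diamonds, FUN = mean)

# geom_col() draws bar heights directly from the y values supplied
p_col <- ggplot(mean_price, aes(x = cut, y = price)) +
  geom_col()

p_col
```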

Histograms with geom_histogram()

Histograms show the distribution of a continuous variable:

Code

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "steelblue", color = "white")

Figure 10.9: Histogram showing the distribution of diamond carat weights
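
A density plot is the smoothed counterpart of a histogram. A minimal sketch on the same variable, with fill and transparency chosen only for illustration:

```r
library(ggplot2)

# Kernel density estimate of carat; no binwidth to choose, but the
# amount of smoothing (bandwidth) plays a similar role
p_density <- ggplot(diamonds, aes(x = carat)) +
  geom_density(fill = "steelblue", alpha = 0.5)

p_density
```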

Boxplots with geom_boxplot()

Boxplots summarize distributions and highlight outliers:

Code

ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

Figure 10.10: Boxplot comparing highway fuel efficiency across vehicle classes

10.7 Combining Multiple Geoms

You can layer multiple geoms to create richer visualizations:

Code

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE, color = "black")

Figure 10.11: Combining multiple geoms: points colored by class with a black trend line

10.8 Faceting

Faceting creates small multiples—separate panels for subsets of the data. This is powerful for comparing patterns across groups.

Code

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)

Figure 10.12: Small multiples showing the displacement-efficiency relationship for each vehicle class

Use facet_grid() for two-variable faceting:

Code

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

Figure 10.13: Two-variable faceting by drive type (rows) and number of cylinders (columns)

10.9 Labels and Titles

Add informative labels with the labs() function:

Code

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(
    title = "Fuel Efficiency Decreases with Engine Size",
    subtitle = "Data from EPA fuel economy tests",
    caption = "Source: fueleconomy.gov",
    x = "Engine Displacement (liters)",
    y = "Highway Fuel Efficiency (mpg)",
    color = "Vehicle Class"
  )

Figure 10.14: A well-labeled plot with title, subtitle, caption, and axis labels

10.10 Themes

Themes control the non-data aspects of the plot—background, grid lines, fonts, etc. ggplot2 includes several built-in themes:

Code

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme_classic()

Figure 10.15: Using theme_classic() for a clean, publication-ready appearance

Other built-in themes include theme_minimal(), theme_bw(), theme_light(), and theme_dark(). The ggthemes package provides many additional themes.
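
Built-in themes can be adjusted further with theme(). The sketch below starts from theme_minimal() and changes a few elements; the particular settings are illustrative choices, not recommendations.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  theme_minimal(base_size = 14) +          # larger base font size
  theme(
    legend.position = "bottom",            # move the legend below the panel
    panel.grid.minor = element_blank()     # drop minor grid lines
  )

p_theme
```

Because theme() is added after theme_minimal(), its settings override the corresponding defaults of the built-in theme.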

10.11 Choosing the Right Plot

Choosing an appropriate visualization depends on the types of variables you want to display and the message you want to convey.

Figure 10.16: Choosing the right visualization depends on your data types and question

For one categorical variable, use bar charts. For one continuous variable, use histograms or density plots. For two continuous variables, use scatterplots. For one continuous and one categorical, use boxplots or violin plots. For two categorical variables, use stacked or grouped bar charts or heat maps.

10.12 Principles of Effective Visualization

Edward Tufte articulated principles of graphical excellence that remain influential: “Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.”

Key principles include:

Show the data. Above all else, make the data visible. Avoid chart junk that obscures what you are trying to communicate.

Encourage comparison. Design graphics to facilitate comparison of different groups or conditions.

Represent magnitudes honestly. The visual representation should be proportional to the numerical quantities being represented. Avoid truncated axes that exaggerate differences.

Minimize clutter. Remove unnecessary grid lines, borders, and decorations. Every element should serve a purpose.

Make displays easy to interpret. Use clear labels, appropriate colors, and logical organization.

Order Categories Meaningfully

By default, ggplot2 arranges the categories of a character variable alphabetically, which is rarely the most informative arrangement. Use reorder() to order categories by a meaningful value:

Code

# Create sample data
sample_data <- tibble(
  treatment = c("Control", "Low Dose", "Medium Dose", "High Dose"),
  response = c(12, 18, 25, 31)
)

# Default alphabetical order (not ideal)
ggplot(sample_data, aes(x = treatment, y = response)) +
  geom_col(fill = "steelblue") +
  labs(title = "Default Order (Alphabetical)",
       x = "Treatment", y = "Response") +
  theme_minimal()

Figure 10.17: Bar chart with default alphabetical ordering (often not ideal)

Code

# Order by response value (more meaningful)
sample_data |>
  mutate(treatment = reorder(treatment, response)) |>
  ggplot(aes(x = treatment, y = response)) +
  geom_col(fill = "steelblue") +
  labs(title = "Ordered by Value (More Meaningful)",
       x = "Treatment", y = "Response") +
  theme_minimal()

Figure 10.18: Bar chart with categories ordered by value for easier comparison

The reorder() function takes a categorical variable and a numeric variable, reordering the categories by the numeric values. For horizontal bar charts (which are often easier to read), add coord_flip():

Code

sample_data |>
  mutate(treatment = reorder(treatment, response)) |>
  ggplot(aes(x = treatment, y = response)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Horizontal Bars (Good for Long Labels)",
       x = NULL, y = "Response") +
  theme_minimal()

Figure 10.19: Horizontal bar chart with coord_flip() for easier reading of long labels
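
Ordering by value happens to match dose order in the example above because the response rises with dose. When categories have their own natural order, such as dose levels, it may be safer to fix that order explicitly with factor(). A sketch with hypothetical, non-monotonic response values:

```r
library(ggplot2)

dose_data <- data.frame(
  treatment = c("Control", "Low Dose", "Medium Dose", "High Dose"),
  response = c(12, 25, 31, 18)   # hypothetical values, not monotonic in dose
)

# Fix the order explicitly: lowest to highest dose
dose_levels <- c("Control", "Low Dose", "Medium Dose", "High Dose")
dose_data$treatment <- factor(dose_data$treatment, levels = dose_levels)

levels(dose_data$treatment)

p_dose <- ggplot(dose_data, aes(x = treatment, y = response)) +
  geom_col(fill = "steelblue")

p_dose
```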

10.13 Visual Encoding: How We Perceive Data

Effective visualization depends on understanding how humans perceive visual information. We encode data using visual cues—properties like position, length, color, and shape. But not all visual cues are equally effective.

The Hierarchy of Visual Encoding

Research by Cleveland and McGill established that we perceive some visual encodings more accurately than others. From most to least accurate:

Position along a common scale (scatterplots, dot plots)
Position along non-aligned scales (small multiples)
Length (bar charts)
Angle/slope (some line charts)
Area (bubble charts, treemaps)
Volume (3D charts—generally avoid)
Color saturation/hue (choropleth maps, heatmaps)

This hierarchy explains why bar charts work better than pie charts for comparing quantities—we judge lengths more accurately than angles or areas.

Position Is Most Powerful

When possible, encode your most important data using position. Scatterplots, line graphs, and dot plots all use position effectively:

Code

# Compare a pie chart vs. bar chart for the same data
library(patchwork)

category_data <- data.frame(
  category = c("Engineering", "Medicine", "Natural Sciences", "Social Sciences"),
  funding = c(35, 28, 22, 15)
)

# Bar chart - easy to compare
p_bar <- ggplot(category_data, aes(x = reorder(category, funding), y = funding)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Bar Chart: Easy Comparison", x = "", y = "Funding (%)") +
  theme_minimal()

# Pie chart - harder to compare
p_pie <- ggplot(category_data, aes(x = "", y = funding, fill = category)) +
  geom_col(width = 1) +
  coord_polar("y") +
  labs(title = "Pie Chart: Harder to Compare", fill = "Category") +
  theme_void() +
  theme(legend.position = "bottom")

p_bar + p_pie

Figure 10.20: Comparing bar charts and pie charts: position-based encoding is easier to judge than angles

The bar chart makes it immediately obvious that Engineering has the most funding. With the pie chart, you must work harder to compare the slice sizes.

10.14 When Should Zero Be Included?

A contentious issue in data visualization is whether the y-axis should always start at zero. The answer depends on the type of chart and what you’re trying to show.

Bar Charts: Always Include Zero

For bar charts, the length of the bar represents the magnitude of the value. If the axis doesn’t start at zero, the visual representation misrepresents the data:

Code

# Demonstration of misleading truncated axis
gdp_data <- data.frame(
  country = c("A", "B", "C"),
  gdp = c(45000, 47000, 49000)
)

p_trunc <- ggplot(gdp_data, aes(x = country, y = gdp)) +
  geom_col(fill = "steelblue") +
  coord_cartesian(ylim = c(44000, 50000)) +
  labs(title = "Truncated Axis: Misleading",
       subtitle = "Differences appear huge",
       y = "GDP per capita") +
  theme_minimal()

p_full <- ggplot(gdp_data, aes(x = country, y = gdp)) +
  geom_col(fill = "steelblue") +
  labs(title = "Full Axis: Honest",
       subtitle = "Differences in proper context",
       y = "GDP per capita") +
  theme_minimal()

p_trunc + p_full

Figure 10.21: Truncated axes in bar charts can mislead readers about the magnitude of differences

Scatterplots and Line Charts: Context Matters

For position-based encodings like scatterplots and line charts, zero doesn’t need to be included if it would waste space and obscure meaningful variation:

Code

# Temperature data - zero would be meaningless
set.seed(42)
temp_data <- data.frame(
  day = 1:30,
  temp = rnorm(30, mean = 72, sd = 5)
)

p_zero <- ggplot(temp_data, aes(x = day, y = temp)) +
  geom_line(color = "firebrick") +
  ylim(0, 100) +
  labs(title = "Including Zero: Wastes Space", y = "Temperature (°F)") +
  theme_minimal()

p_auto <- ggplot(temp_data, aes(x = day, y = temp)) +
  geom_line(color = "firebrick") +
  labs(title = "Natural Range: Shows Variation", y = "Temperature (°F)") +
  theme_minimal()

p_zero + p_auto

Figure 10.22: For line charts, including zero depends on whether it has meaning for the variable

The key question is: what would zero mean for this variable? For temperature in Fahrenheit, zero has no special significance for daily weather data. For proportions or counts, zero is meaningful and often should be included.

The Zero Rule

Bar charts: Always include zero, because the bar length represents magnitude.
Line charts and scatterplots: Include zero if it’s meaningful; otherwise, show the natural range of the data.
When in doubt: Ask whether excluding zero could mislead readers about the magnitude of differences.
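
When zero is meaningful for a line chart, one way to include it without hard-coding both axis limits is expand_limits(). A minimal sketch, using made-up weekly counts:

```r
library(ggplot2)

set.seed(1)
count_data <- data.frame(
  week = 1:12,
  cases = rpois(12, lambda = 40)   # hypothetical weekly counts
)

p_cases <- ggplot(count_data, aes(x = week, y = cases)) +
  geom_line() +
  expand_limits(y = 0)    # extend the y-axis so that it includes zero

p_cases
```

The upper limit is still chosen from the data, so only the lower end of the axis changes.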

10.15 Data Transformations for Visualization

Sometimes the raw data doesn’t visualize well. Transformations can reveal patterns that are hidden in the original scale.

Log Transformations for Skewed Data

Many biological variables—gene expression, population sizes, concentrations—follow approximately log-normal distributions with long right tails. Log transformation can make patterns visible:

Code

# Simulated gene expression data
set.seed(123)
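gene_a <- rlnorm(200, meanlog = 2, sdlog = 1.5)
expression_data <- data.frame(
  gene_a = gene_a,
  # gene_b depends multiplicatively on gene_a (with log-normal noise),
  # so the relationship is linear on log-log axes
  gene_b = 3 * gene_a^0.8 * rlnorm(200, meanlog = 0, sdlog = 0.6)
)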

p_raw <- ggplot(expression_data, aes(x = gene_a, y = gene_b)) +
  geom_point(alpha = 0.5) +
  labs(title = "Original Scale",
       subtitle = "Pattern obscured by outliers",
       x = "Gene A Expression", y = "Gene B Expression") +
  theme_minimal()

p_log <- ggplot(expression_data, aes(x = gene_a, y = gene_b)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Log Scale",
       subtitle = "Relationship visible",
       x = "Gene A Expression (log)", y = "Gene B Expression (log)") +
  theme_minimal()

p_raw + p_log

Figure 10.23: Log transformation reveals patterns in skewed data that span multiple orders of magnitude

When to Use Log Scales

Consider log transformation when:

Data span several orders of magnitude
Relationships are multiplicative rather than additive
Distribution is strongly right-skewed
You’re comparing fold changes or ratios

Be sure to label axes clearly when using transformed scales, and remember that zero cannot be log-transformed.
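
The zero problem is easy to see directly. The sketch below uses a small hypothetical vector and shows a common workaround, adding a pseudocount before taking logs. Whether a pseudocount of 1 is appropriate depends on the scale of the data.

```r
# Hypothetical counts that include a zero
counts <- c(0, 1, 10, 100, 1000)

log10(counts)        # the zero becomes -Inf and cannot be plotted
log10(counts + 1)    # pseudocount of 1: the zero maps to 0
log1p(counts)        # natural-log equivalent, log(1 + x)
```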

10.16 Color in Data Visualization

Color is a powerful but often misused encoding. Effective use of color requires understanding perception and accessibility.

Types of Color Scales

Sequential: For ordered data from low to high. Use a single hue varying in lightness.

Diverging: For data with a meaningful midpoint. Two hues diverge from a neutral center.

Qualitative: For categorical data with no inherent order. Use distinct hues.
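
In ggplot2 the three types correspond to different scale functions. The sketch below shows one possible choice for each; the palettes and the midpoint (the mean of cty) are arbitrary picks for illustration.

```r
library(ggplot2)

# Sequential: a single-hue ramp for a continuous, ordered variable
p_seq <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  scale_color_distiller(palette = "Blues", direction = 1)

# Diverging: two hues around a midpoint (here the mean of cty)
p_div <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  scale_color_gradient2(midpoint = mean(mpg$cty))

# Qualitative: distinct hues for unordered categories
p_qual <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_brewer(palette = "Set2")

p_seq
p_div
p_qual
```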

Color Accessibility

Approximately 8% of men and 0.5% of women have some form of color vision deficiency. Design for accessibility:

Avoid red-green as the only distinguishing feature
Use the viridis color scales, designed for perceptual uniformity and colorblind accessibility
Supplement color with shape or pattern when possible

Code

# Good practice: color + shape
ggplot(mpg, aes(x = displ, y = hwy, color = drv, shape = drv)) +
  geom_point(size = 3) +
  scale_color_viridis_d() +
  labs(title = "Color + Shape: Accessible Design",
       x = "Engine Displacement (L)", y = "Highway MPG",
       color = "Drive Type", shape = "Drive Type") +
  theme_minimal()

Figure 10.24: Using both color and shape ensures accessibility for colorblind viewers

Avoid Rainbow Color Scales

Rainbow color scales (like MATLAB’s “jet” colormap, the default before R2014b) have serious problems:

They’re not perceptually uniform—yellow appears brighter than blue
They create false boundaries where colors change dramatically
They’re particularly problematic for colorblind viewers

Use viridis, plasma, or other perceptually uniform scales instead.
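
For continuous data, the viridis family is available through scale_fill_viridis_c() and scale_color_viridis_c(). A minimal sketch, assuming the faithfuld dataset bundled with ggplot2 (a 2D density estimate of Old Faithful eruptions):

```r
library(ggplot2)

# Heatmap of a continuous value, colored with the "plasma" variant
p_heat <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster() +
  scale_fill_viridis_c(option = "plasma")

p_heat
```

Omitting the option argument gives the default viridis palette.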

10.17 The Power of Small Multiples

Small multiples—the same chart repeated for different subsets of the data—are remarkably effective for comparison. Edward Tufte called them “the best design solution for a wide range of problems in data presentation.”

Code

# Small multiples example (requires the gapminder package)
ggplot(gapminder::gapminder |>
         filter(continent != "Oceania"),
       aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3, size = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "firebrick") +
  scale_x_log10(labels = scales::comma) +
  facet_grid(continent ~ cut(year, breaks = c(1950, 1970, 1990, 2010),
                              labels = c("1952-1970", "1971-1990", "1991-2007"))) +
  labs(title = "Life Expectancy vs. GDP Over Time by Continent",
       x = "GDP per Capita (log scale)",
       y = "Life Expectancy (years)") +
  theme_minimal() +
  theme(strip.text = element_text(size = 9))

Figure 10.25: Small multiples allow easy comparison of patterns across groups and time periods

Small multiples work because:

The eye can quickly scan and compare panels
Each panel has identical axes, making comparison fair
Patterns and outliers become visible through repetition

10.18 Common Visualization Mistakes

Beyond the principles discussed, watch out for these common errors:

Overplotting: Too many points obscure patterns. Use transparency, jittering, or density plots.

Code

# Overplotting solution
set.seed(42)
overplot_data <- data.frame(
  x = rnorm(5000),
  y = rnorm(5000)
)

p_over <- ggplot(overplot_data, aes(x, y)) +
  geom_point() +
  labs(title = "Overplotting: Points Hidden") +
  theme_minimal()

p_alpha <- ggplot(overplot_data, aes(x, y)) +
  geom_point(alpha = 0.1) +
  labs(title = "Transparency: Density Visible") +
  theme_minimal()

p_over + p_alpha

Figure 10.26: Using transparency (alpha) to reveal density when many points overlap
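
Transparency is only one of the remedies listed above. The sketch below illustrates the other two under stated assumptions: jittering for the rounded values in mpg, and 2D binning as a density-style display for a simulated point cloud. The jitter amounts and the number of bins are arbitrary.

```r
library(ggplot2)

# Jittering: cty and hwy are whole numbers, so many points coincide exactly;
# a small random offset separates them
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5)

# 2D binning: count points per rectangular bin and map the count to fill
set.seed(42)
dense <- data.frame(x = rnorm(5000), y = rnorm(5000))
p_bin <- ggplot(dense, aes(x, y)) +
  geom_bin2d(bins = 40) +
  scale_fill_viridis_c()

p_jitter
p_bin
```

Jittering changes the plotted positions slightly, so it suits exploration better than precise reading of values.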

Dual y-axes: These are almost always misleading. The relationship between the two scales is arbitrary and can be manipulated to show any desired pattern.

3D effects: Three-dimensional bar charts, pie charts, and similar decorations distort perception without adding information. Avoid them.

Excessive decoration: Gridlines, borders, backgrounds, and other “chart junk” should be minimized. Focus attention on the data.

10.19 Examples of Poor Graphics

Recognizing bad graphics helps you avoid making them.

Figure 10.27: Examples of poor graphic design that obscure rather than reveal patterns

Ticker-tape style displays make it hard to see patterns. Lines connecting unrelated points mislead. Pie charts make comparisons difficult because humans are poor at judging angles. Three-dimensional effects distort perception without adding information.

Figure 10.28: Pie charts and 3D effects make comparisons difficult and distort perception

10.20 A Famous Good Example

Charles Minard’s 1869 map of Napoleon’s Russian campaign is often cited as one of the best statistical graphics ever made. It displays six variables: the size of the army, its location (latitude and longitude), direction of movement, temperature, and date—all in a single coherent image.

Figure 10.29: Minard’s 1869 map of Napoleon’s Russian campaign displays six variables in a single coherent image

The graphic tells a story. You can see the army shrink as it advances, the devastating losses during the retreat, and the correlation with plummeting temperatures. No legend is needed; the meaning is immediately apparent.

10.21 Saving Plots

Save plots with ggsave():

Code

# Create and save a plot
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

ggsave("my_plot.png", p, width = 8, height = 6, dpi = 300)
ggsave("my_plot.pdf", p, width = 8, height = 6)

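The function infers the format from the file extension. Width and height are in inches by default, and dpi sets the resolution of raster formats such as PNG. Specify dimensions and resolution for publication-quality output.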
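
Journals often state figure sizes in centimetres or millimetres. The units argument handles this directly. The sketch below writes to a temporary file so that it leaves nothing behind; the 17 x 12 cm size is an arbitrary example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Temporary output path with a .png extension
out_file <- tempfile(fileext = ".png")

# Size given in centimetres rather than the default inches
ggsave(out_file, plot = p, width = 17, height = 12, units = "cm", dpi = 300)

file.exists(out_file)
```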

10.22 Practice Exercises

The best way to learn ggplot2 is to use it. Take a dataset you care about and try different visualizations. Experiment with aesthetics, geoms, and facets. Read error messages carefully—they often point directly to the problem.

Exercise V.1: Basic ggplot2

Create your first ggplot visualizations:

Code

library(ggplot2)
data(mpg)

# Basic scatterplot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

Create a scatterplot of engine displacement vs. highway mpg
Add color based on vehicle class
Add appropriate axis labels and a title
Try different themes (theme_minimal(), theme_classic(), theme_bw())

Exercise V.2: Geometric Objects

Practice with different geoms:

Create a histogram of highway mpg using geom_histogram()
Create a boxplot of highway mpg by vehicle class using geom_boxplot()
Create a bar chart showing the count of vehicles by manufacturer using geom_bar()
Create a line plot (use a time series dataset or create synthetic data)

Exercise V.3: Aesthetic Mappings

Explore different aesthetic mappings:

Map a continuous variable to color in a scatterplot
Map a categorical variable to shape
Set fixed aesthetics (like size = 3) outside of aes()
What is the difference between mapping a variable to an aesthetic inside aes() versus setting a fixed value outside?

Code

# Mapped aesthetic (variable determines color)
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 3)

# Fixed aesthetic (all points same color)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue", size = 3)

Exercise V.4: Faceting

Practice creating small multiples:

Create a scatterplot faceted by vehicle class using facet_wrap()
Create a grid of plots using facet_grid() with two variables
Experiment with the scales argument to allow different axis scales per facet

Code

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)
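
As a starting point for the scales item, the sketch below frees the y-axis so that each panel gets its own range. Other accepted values are "fixed" (the default), "free_x", and "free".

```r
library(ggplot2)

p_free <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2, scales = "free_y")

p_free
```

Free scales make within-panel patterns easier to see, at the cost of making panels harder to compare with one another.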

Exercise V.5: Combining Layers

Build complex visualizations by layering:

Create a scatterplot with a smoothed trend line
Add both points and a regression line
Use different colors for points and the trend line

Code

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class), alpha = 0.7) +
  geom_smooth(method = "lm", color = "black", se = TRUE) +
  labs(
    title = "Engine Size vs. Fuel Efficiency",
    x = "Engine Displacement (L)",
    y = "Highway MPG",
    color = "Vehicle Class"
  ) +
  theme_minimal()

Exercise V.6: Publication-Quality Figures

Create a polished figure suitable for publication:

Choose an appropriate chart type for your data
Add informative labels (title, subtitle, caption, axis labels)
Use an appropriate color palette
Adjust theme elements for clarity
Save the figure using ggsave() with appropriate dimensions and resolution

# Data Visualization {#sec-visualization} ```{r} #| echo: false #| message: false library(tidyverse) theme_set(theme_minimal()) ``` ## Why Visualize Data? Before diving into the mechanics of creating plots, consider why visualization matters. The human visual system excels at detecting patterns, spotting outliers, and perceiving relationships—abilities that summary statistics cannot replace. Consider **Anscombe's Quartet**—four datasets with nearly identical summary statistics (same mean, variance, and correlation) but completely different patterns: ```{r} #| label: fig-anscombes #| fig-cap: "Anscombe's Quartet: Four datasets with identical summary statistics but very different patterns" #| fig-height: 5 #| fig-width: 8 # Reshape Anscombe's built-in dataset anscombe_long <- anscombe |> pivot_longer(everything(), names_to = c(".value", "set"), names_pattern = "(.)(.)") ggplot(anscombe_long, aes(x = x, y = y)) + geom_point(size = 2, color = "steelblue") + geom_smooth(method = "lm", se = FALSE, color = "coral") + facet_wrap(~set, ncol = 2) + labs(title = "Same Mean, Variance, and Correlation—Different Stories") + theme_minimal() ``` All four datasets have nearly identical statistical summaries, yet they represent fundamentally different phenomena: a linear relationship, a curved relationship, an outlier-driven relationship, and a vertical cluster with one outlier. Summary statistics alone would suggest these datasets are equivalent—only visualization reveals the truth. ::: {.callout-important} ## Always Visualize Your Data Never trust summary statistics alone. Before running statistical tests, visualize your data to check assumptions, identify outliers, and understand the underlying patterns. ::: ## Choosing the Right Chart Type Different questions call for different visualizations. 
Matching your question to the right chart type is the first step toward effective communication: | Question | Chart Type | Why | |:---------|:-----------|:----| | How are values distributed? | Histogram, density plot | Shows shape, center, spread | | How do groups compare? | Box plot, bar chart | Side-by-side comparison | | How do two variables relate? | Scatter plot | Shows correlation, patterns | | How does a value change over time? | Line plot | Connects sequential observations | | What is the composition? | Stacked bar chart | Shows parts of a whole | ::: {.callout-tip} ## Start with the Question Before creating any visualization, ask yourself: "What question am I trying to answer?" The chart type should emerge from the question, not the other way around. ::: ## The Grammar of Graphics Data visualization is both an art and a science. A well-designed graphic can reveal patterns, communicate findings, and guide analysis in ways that tables of numbers cannot. The `ggplot2` package implements a coherent system for creating graphics based on Leland Wilkinson's "Grammar of Graphics"—a framework that describes the fundamental components from which all statistical graphics can be built. Just as grammar provides rules for constructing sentences from words, the grammar of graphics provides rules for constructing visualizations from components. Every graphic is composed of data, aesthetic mappings that connect variables to visual properties, and geometric objects that represent data points. Additional components like scales, statistical transformations, coordinate systems, and facets allow for sophisticated customizations. ![The grammar of graphics breaks visualizations into fundamental components](../images/ch09/ch09_grammar_graphics.jpeg){#fig-grammar-graphics fig-align="center"} ## Building Plots with ggplot2 The basic structure of a ggplot2 call begins with the `ggplot()` function, which creates a coordinate system. 
You add layers to this foundation using the `+` operator. ```{r} #| label: fig-basic-ggplot #| fig-cap: "A basic scatterplot of highway fuel efficiency versus engine displacement" #| fig-width: 7 #| fig-height: 5 ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() ``` This creates a scatterplot of highway fuel efficiency against engine displacement using the built-in `mpg` dataset. The `aes()` function establishes the aesthetic mapping—which variables map to which visual properties. Here, `displ` maps to the x-axis and `hwy` to the y-axis. The `geom_point()` function adds a layer of points. ## Aesthetic Mappings Aesthetics are visual properties of the plot. Beyond position (x and y), common aesthetics include color, size, shape, and transparency (alpha). You can map variables to these aesthetics to encode additional information. ```{r} #| label: fig-aesthetic-mapping #| fig-cap: "Scatterplot with color mapped to vehicle class" #| fig-width: 8 #| fig-height: 5 ggplot(mpg, aes(x = displ, y = hwy, color = class)) + geom_point(size = 3, alpha = 0.7) ``` Now the color of each point indicates the vehicle class. The legend is created automatically. Note that aesthetics defined inside `aes()` are mapped to variables, while those defined outside (like `size = 3`) apply uniformly to all points. ## Geometric Objects Geometric objects, or geoms, determine what type of plot you create. Different geoms represent data in different ways. 
### Scatterplots with geom_point()

Points are good for showing the relationship between two continuous variables:

```{r}
#| label: fig-geom-point
#| fig-cap: "Simple scatterplot using geom_point()"
#| fig-width: 7
#| fig-height: 4
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
```

### Line Plots with geom_line() and geom_smooth()

`geom_line()` connects observations in order of the x variable, which suits time series. `geom_smooth()` instead overlays a fitted trend on the data:

```{r}
#| label: fig-geom-smooth
#| fig-cap: "Scatterplot with smoothed trend line and confidence interval"
#| fig-width: 7
#| fig-height: 4
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
```

The `geom_smooth()` function adds a smoothed conditional mean with a confidence band. For a small dataset such as `mpg`, it fits a LOESS curve by default.

### Bar Charts with geom_bar()

Bar charts show counts or summaries of categorical data. By default, `geom_bar()` counts the rows in each category:

```{r}
#| label: fig-geom-bar
#| fig-cap: "Bar chart showing diamond counts by cut quality"
#| fig-width: 7
#| fig-height: 4
ggplot(diamonds, aes(x = cut)) +
  geom_bar()
```

Map `fill` to a second variable to color the bars by it, and set `position = "dodge"` to place the bars side by side:

```{r}
#| label: fig-bar-dodge
#| fig-cap: "Grouped bar chart showing diamond counts by cut and clarity"
#| fig-width: 8
#| fig-height: 4
ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge")
```

### Histograms with geom_histogram()

Histograms show the distribution of a continuous variable:

```{r}
#| label: fig-histogram
#| fig-cap: "Histogram showing the distribution of diamond carat weights"
#| fig-width: 7
#| fig-height: 4
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "steelblue", color = "white")
```

### Boxplots with geom_boxplot()

Boxplots summarize distributions and highlight outliers:

```{r}
#| label: fig-boxplot
#| fig-cap: "Boxplot comparing highway fuel efficiency across vehicle classes"
#| fig-width: 8
#| fig-height: 4
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()
```

## Combining Multiple Geoms

You can layer multiple geoms to create richer visualizations:

```{r}
#| label: fig-combined-geoms
#| fig-cap: "Combining multiple geoms: points colored by class with a black trend line"
#| fig-width: 7
#| fig-height: 5
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE, color = "black")
```

## Faceting

Faceting creates small multiples: separate panels for subsets of the data. This is powerful for comparing patterns across groups.

```{r}
#| label: fig-facet-wrap
#| fig-cap: "Small multiples showing the displacement-efficiency relationship for each vehicle class"
#| fig-width: 9
#| fig-height: 6
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)
```

Use `facet_grid()` for two-variable faceting:

```{r}
#| label: fig-facet-grid
#| fig-cap: "Two-variable faceting by drive type (rows) and number of cylinders (columns)"
#| fig-width: 9
#| fig-height: 6
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
```

## Labels and Titles

Add informative labels with the `labs()` function:

```{r}
#| label: fig-labs-demo
#| fig-cap: "A well-labeled plot with title, subtitle, caption, and axis labels"
#| fig-width: 8
#| fig-height: 5
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  # One overall trend line. Mapping color in the top-level aes() would fit a
  # separate curve per class, including classes with only a handful of cars.
  geom_smooth(se = FALSE, color = "black") +
  labs(
    title = "Fuel Efficiency Decreases with Engine Size",
    subtitle = "Data from EPA fuel economy tests",
    caption = "Source: fueleconomy.gov",
    x = "Engine Displacement (liters)",
    y = "Highway Fuel Efficiency (mpg)",
    color = "Vehicle Class"
  )
```

## Themes

Themes control the non-data aspects of the plot, such as the background, grid lines, and fonts. ggplot2 includes several built-in themes:

```{r}
#| label: fig-theme-classic
#| fig-cap: "Using theme_classic() for a clean, publication-ready appearance"
#| fig-width: 7
#| fig-height: 4
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme_classic()
```

Other built-in themes include `theme_minimal()`, `theme_bw()`, `theme_light()`, and `theme_dark()`. The `ggthemes` package provides many additional themes.

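Built-in themes can also be adjusted element by element with `theme()`. The following is a minimal sketch of that pattern, assuming ggplot2 is loaded; the specific settings and the name `p_custom` are illustrative choices, not recommendations from the chapter.

```{r}
#| eval: false
library(ggplot2)

p_custom <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Customized Theme") +
  theme_minimal(base_size = 12) +               # start from a built-in theme
  theme(
    panel.grid.minor = element_blank(),         # drop minor grid lines
    plot.title = element_text(face = "bold"),   # emphasize the title
    legend.position = "bottom"                  # move the legend below the panel
  )

p_custom
```

Because `theme()` is added after `theme_minimal()`, its settings override the corresponding defaults of the built-in theme.
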
## Choosing the Right Plot

Choosing an appropriate visualization depends on the types of variables you want to display and the message you want to convey.

![Choosing the right visualization depends on your data types and question](../images/ch09/ch09_chart_selection.jpeg){#fig-chart-selection fig-align="center"}

For one categorical variable, use bar charts. For one continuous variable, use histograms or density plots. For two continuous variables, use scatterplots. For one continuous and one categorical variable, use boxplots or violin plots. For two categorical variables, use stacked or grouped bar charts or heat maps.

## Principles of Effective Visualization

Edward Tufte articulated principles of graphical excellence that remain influential: "Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space."

Key principles include:

**Show the data.** Above all else, make the data visible. Avoid chart junk that obscures what you are trying to communicate.

**Encourage comparison.** Design graphics to facilitate comparison of different groups or conditions.

**Represent magnitudes honestly.** The visual representation should be proportional to the numerical quantities being represented. Avoid truncated axes that exaggerate differences.

**Minimize clutter.** Remove unnecessary grid lines, borders, and decorations. Every element should serve a purpose.

**Make displays easy to interpret.** Use clear labels, appropriate colors, and logical organization.

### Order Categories Meaningfully

By default, ggplot2 orders the categories of a character variable alphabetically, which is rarely the most informative arrangement.
Use `reorder()` to order categories by a meaningful value:

```{r}
#| label: fig-order-alphabetical
#| fig-cap: "Bar chart with default alphabetical ordering (often not ideal)"
#| fig-width: 7
#| fig-height: 4
# Create sample data
sample_data <- tibble(
  treatment = c("Control", "Low Dose", "Medium Dose", "High Dose"),
  response = c(12, 18, 25, 31)
)

# Default alphabetical order (not ideal)
ggplot(sample_data, aes(x = treatment, y = response)) +
  geom_col(fill = "steelblue") +
  labs(title = "Default Order (Alphabetical)",
       x = "Treatment",
       y = "Response") +
  theme_minimal()
```

```{r}
#| label: fig-order-meaningful
#| fig-cap: "Bar chart with categories ordered by value for easier comparison"
#| fig-width: 7
#| fig-height: 4
# Order by response value (more meaningful)
sample_data |>
  mutate(treatment = reorder(treatment, response)) |>
  ggplot(aes(x = treatment, y = response)) +
  geom_col(fill = "steelblue") +
  labs(title = "Ordered by Value (More Meaningful)",
       x = "Treatment",
       y = "Response") +
  theme_minimal()
```

The `reorder()` function takes a categorical variable and a numeric variable, and reorders the categories by a summary of the numeric values within each category (the mean by default). For horizontal bar charts (which are often easier to read), add `coord_flip()`:

```{r}
#| label: fig-horizontal-bars
#| fig-cap: "Horizontal bar chart with coord_flip() for easier reading of long labels"
#| fig-width: 7
#| fig-height: 4
sample_data |>
  mutate(treatment = reorder(treatment, response)) |>
  ggplot(aes(x = treatment, y = response)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Horizontal Bars (Good for Long Labels)",
       x = NULL,
       y = "Response") +
  theme_minimal()
```

## Visual Encoding: How We Perceive Data

Effective visualization depends on understanding how humans perceive visual information. We encode data using **visual cues**: properties like position, length, color, and shape. But not all visual cues are equally effective.

### The Hierarchy of Visual Encoding

Research by Cleveland and McGill established that we perceive some visual encodings more accurately than others. From most to least accurate:

1. **Position along a common scale** (scatterplots, dot plots)
2. **Position along non-aligned scales** (small multiples)
3. **Length** (bar charts)
4. **Angle/slope** (some line charts)
5. **Area** (bubble charts, treemaps)
6. **Volume** (3D charts; generally avoid)
7. **Color saturation/hue** (choropleth maps, heatmaps)

```{r}
#| fig-width: 9
#| fig-height: 4
#| echo: false
# Demonstration: comparing visual encodings
library(patchwork)

demo_data <- data.frame(
  category = c("A", "B", "C", "D"),
  value = c(25, 50, 75, 100)
)

# Position (dot plot) - most accurate
p1 <- ggplot(demo_data, aes(x = value, y = category)) +
  geom_point(size = 4, color = "steelblue") +
  labs(title = "Position: Most Accurate", x = "Value", y = "") +
  theme_minimal() +
  xlim(0, 110)

# Length (bar chart) - very accurate
p2 <- ggplot(demo_data, aes(x = category, y = value)) +
  geom_col(fill = "steelblue") +
  labs(title = "Length: Very Accurate", x = "", y = "Value") +
  theme_minimal()

# Area (bubble) - less accurate
p3 <- ggplot(demo_data, aes(x = category, y = 1, size = value)) +
  geom_point(color = "steelblue") +
  scale_size_area(max_size = 20) +
  labs(title = "Area: Less Accurate", x = "", y = "") +
  theme_minimal() +
  theme(axis.text.y = element_blank(), legend.position = "none")

p1 + p2 + p3
```

This hierarchy explains why bar charts work better than pie charts for comparing quantities: we judge lengths more accurately than angles or areas.

### Position Is Most Powerful

When possible, encode your most important data using position. Scatterplots, line graphs, and dot plots all use position effectively:

```{r}
#| label: fig-bar-vs-pie
#| fig-cap: "Comparing bar charts and pie charts: position-based encoding is easier to judge than angles"
#| fig-width: 8
#| fig-height: 4
# Compare a pie chart vs. bar chart for the same data
library(patchwork)

category_data <- data.frame(
  category = c("Engineering", "Medicine", "Natural Sciences", "Social Sciences"),
  funding = c(35, 28, 22, 15)
)

# Bar chart - easy to compare
p_bar <- ggplot(category_data, aes(x = reorder(category, funding), y = funding)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Bar Chart: Easy Comparison", x = "", y = "Funding (%)") +
  theme_minimal()

# Pie chart - harder to compare
p_pie <- ggplot(category_data, aes(x = "", y = funding, fill = category)) +
  geom_col(width = 1) +
  coord_polar("y") +
  labs(title = "Pie Chart: Harder to Compare", fill = "Category") +
  theme_void() +
  theme(legend.position = "bottom")

p_bar + p_pie
```

The bar chart makes it immediately obvious that Engineering has the most funding. With the pie chart, you must work harder to compare the slice sizes.

## When Should Zero Be Included?

A contentious issue in data visualization is whether the y-axis should always start at zero. The answer depends on the type of chart and what you're trying to show.

### Bar Charts: Always Include Zero

For bar charts, the length of the bar represents the magnitude of the value.
If the axis doesn't start at zero, the visual representation misrepresents the data:

```{r}
#| label: fig-truncated-axis
#| fig-cap: "Truncated axes in bar charts can mislead readers about the magnitude of differences"
#| fig-width: 8
#| fig-height: 4
# Demonstration of misleading truncated axis
gdp_data <- data.frame(
  country = c("A", "B", "C"),
  gdp = c(45000, 47000, 49000)
)

p_trunc <- ggplot(gdp_data, aes(x = country, y = gdp)) +
  geom_col(fill = "steelblue") +
  coord_cartesian(ylim = c(44000, 50000)) +
  labs(title = "Truncated Axis: Misleading",
       subtitle = "Differences appear huge",
       y = "GDP per capita") +
  theme_minimal()

p_full <- ggplot(gdp_data, aes(x = country, y = gdp)) +
  geom_col(fill = "steelblue") +
  labs(title = "Full Axis: Honest",
       subtitle = "Differences in proper context",
       y = "GDP per capita") +
  theme_minimal()

p_trunc + p_full
```

### Scatterplots and Line Charts: Context Matters

For position-based encodings like scatterplots and line charts, zero doesn't need to be included if it would waste space and obscure meaningful variation:

```{r}
#| label: fig-zero-context
#| fig-cap: "For line charts, including zero depends on whether it has meaning for the variable"
#| fig-width: 8
#| fig-height: 4
# Temperature data - zero would be meaningless
set.seed(42)
temp_data <- data.frame(
  day = 1:30,
  temp = rnorm(30, mean = 72, sd = 5)
)

p_zero <- ggplot(temp_data, aes(x = day, y = temp)) +
  geom_line(color = "firebrick") +
  ylim(0, 100) +
  labs(title = "Including Zero: Wastes Space", y = "Temperature (°F)") +
  theme_minimal()

p_auto <- ggplot(temp_data, aes(x = day, y = temp)) +
  geom_line(color = "firebrick") +
  labs(title = "Natural Range: Shows Variation", y = "Temperature (°F)") +
  theme_minimal()

p_zero + p_auto
```

The key question is: what would zero mean for this variable? For temperature in Fahrenheit, zero has no special significance for daily weather data. For proportions or counts, zero is meaningful and often should be included.

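When zero is meaningful, as with counts, one way to force it onto the axis without hard-coding an upper limit is `expand_limits()`. The sketch below assumes ggplot2 is loaded; the `visits` data frame is hypothetical and exists only for illustration.

```{r}
#| eval: false
library(ggplot2)

# Hypothetical weekly counts, made up for this example
visits <- data.frame(
  week  = 1:8,
  count = c(42, 45, 44, 48, 51, 49, 53, 55)
)

p_counts <- ggplot(visits, aes(x = week, y = count)) +
  geom_line() +
  geom_point() +
  expand_limits(y = 0) +   # extend the y-axis so that it includes zero
  labs(x = "Week", y = "Visits") +
  theme_minimal()

p_counts
```

Unlike `ylim(0, 100)`, this leaves the upper limit to be determined by the data.
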
::: {.callout-tip}
## The Zero Rule

**Bar charts**: Always include zero, because the bar length represents magnitude.

**Line charts and scatterplots**: Include zero if it's meaningful; otherwise, show the natural range of the data.

**When in doubt**: Ask whether excluding zero could mislead readers about the magnitude of differences.
:::

## Data Transformations for Visualization

Sometimes the raw data doesn't visualize well. Transformations can reveal patterns that are hidden in the original scale.

### Log Transformations for Skewed Data

Many biological variables (gene expression, population sizes, concentrations) follow approximately log-normal distributions with long right tails. Log transformation can make patterns visible:

```{r}
#| label: fig-log-transform
#| fig-cap: "Log transformation reveals patterns in skewed data that span multiple orders of magnitude"
#| fig-width: 8
#| fig-height: 4
# Simulated gene expression data.
# gene_a and gene_b are drawn independently, so the log-scale panel shows
# the spread of the data rather than a correlation between the two genes.
set.seed(123)
expression_data <- data.frame(
  gene_a = rlnorm(200, meanlog = 2, sdlog = 1.5),
  gene_b = rlnorm(200, meanlog = 3, sdlog = 1.2)
)

p_raw <- ggplot(expression_data, aes(x = gene_a, y = gene_b)) +
  geom_point(alpha = 0.5) +
  labs(title = "Original Scale",
       subtitle = "Pattern obscured by outliers",
       x = "Gene A Expression",
       y = "Gene B Expression") +
  theme_minimal()

p_log <- ggplot(expression_data, aes(x = gene_a, y = gene_b)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Log Scale",
       subtitle = "Full spread of the data visible",
       x = "Gene A Expression (log)",
       y = "Gene B Expression (log)") +
  theme_minimal()

p_raw + p_log
```

### When to Use Log Scales

Consider log transformation when:

- Data span several orders of magnitude
- Relationships are multiplicative rather than additive
- Distribution is strongly right-skewed
- You're comparing fold changes or ratios

Be sure to label axes clearly when using transformed scales, and remember that zero and negative values cannot be log-transformed.

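The caveat about zero is easy to see in a small example. The sketch below uses a hypothetical vector of read counts and shows one common workaround, `log1p()`, which computes log(1 + x). Whether that workaround is appropriate depends on the data, so treat it as an illustration rather than a recommendation.

```{r}
#| eval: false
library(ggplot2)

# Hypothetical read counts that include zeros, made up for this example
counts <- data.frame(reads = c(0, 0, 3, 12, 40, 150, 900, 5200))

log10(counts$reads)   # the zeros become -Inf
log1p(counts$reads)   # log(1 + x) maps zero to zero, so every value stays finite

# Plot the transformed values and say so in the axis label
p_log1p <- ggplot(counts, aes(x = log1p(reads))) +
  geom_histogram(bins = 8) +
  labs(x = "log(1 + reads)", y = "Count")

p_log1p
```
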
## Color in Data Visualization

Color is a powerful but often misused encoding. Effective use of color requires understanding perception and accessibility.

### Types of Color Scales

**Sequential**: For ordered data from low to high. Use a single hue varying in lightness, or a multi-hue ramp such as viridis whose lightness changes steadily.

```{r}
#| fig-width: 7
#| fig-height: 5
#| echo: false
# Sequential color scale example
ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_tile() +
  scale_fill_viridis_c(option = "plasma") +
  labs(title = "Sequential Scale: Low to High Values",
       x = "Waiting Time (min)",
       y = "Eruption Duration (min)") +
  theme_minimal()
```

**Diverging**: For data with a meaningful midpoint. Two hues diverge from a neutral center.

```{r}
#| fig-width: 7
#| fig-height: 5
#| echo: false
# Diverging color scale example
corr_matrix <- cor(mtcars[, 1:6])
corr_df <- as.data.frame(as.table(corr_matrix))
names(corr_df) <- c("Var1", "Var2", "Correlation")

ggplot(corr_df, aes(Var1, Var2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Diverging Scale: Positive and Negative Values") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

**Qualitative**: For categorical data with no inherent order. Use distinct hues.

### Color Accessibility

Approximately 8% of men and 0.5% of women have some form of color vision deficiency.
Design for accessibility:

- Avoid red-green as the only distinguishing feature
- Use the **viridis** color scales, designed for perceptual uniformity and colorblind accessibility
- Supplement color with shape or pattern when possible

```{r}
#| label: fig-accessible-design
#| fig-cap: "Using both color and shape ensures accessibility for colorblind viewers"
#| fig-width: 8
#| fig-height: 4
# Good practice: color + shape
ggplot(mpg, aes(x = displ, y = hwy, color = drv, shape = drv)) +
  geom_point(size = 3) +
  scale_color_viridis_d() +
  labs(title = "Color + Shape: Accessible Design",
       x = "Engine Displacement (L)",
       y = "Highway MPG",
       color = "Drive Type",
       shape = "Drive Type") +
  theme_minimal()
```

::: {.callout-warning}
## Avoid Rainbow Color Scales

Rainbow color scales (like MATLAB's "jet" colormap, which was the default before R2014b) have serious problems:

- They're not perceptually uniform: yellow appears brighter than blue
- They create false boundaries where colors change dramatically
- They're particularly problematic for colorblind viewers

Use viridis, plasma, or other perceptually uniform scales instead.
:::

## The Power of Small Multiples

**Small multiples**, the same chart repeated for different subsets of the data, are remarkably effective for comparison. Edward Tufte called them "the best design solution for a wide range of problems in data presentation."

```{r}
#| label: fig-small-multiples
#| fig-cap: "Small multiples allow easy comparison of patterns across groups and time periods"
#| fig-width: 9
#| fig-height: 6
# Small multiples example
ggplot(gapminder::gapminder |> filter(continent != "Oceania"),
       aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3, size = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "firebrick") +
  scale_x_log10(labels = scales::comma) +
  facet_grid(continent ~ cut(year,
                             breaks = c(1950, 1970, 1990, 2010),
                             labels = c("1952-1970", "1971-1990", "1991-2007"))) +
  labs(title = "Life Expectancy vs. GDP Over Time by Continent",
       x = "GDP per Capita (log scale)",
       y = "Life Expectancy (years)") +
  theme_minimal() +
  theme(strip.text = element_text(size = 9))
```

Small multiples work because:

- The eye can quickly scan and compare panels
- Each panel has identical axes, making comparison fair
- Patterns and outliers become visible through repetition

## Common Visualization Mistakes

Beyond the principles discussed, watch out for these common errors:

**Overplotting**: Too many points obscure patterns. Use transparency, jittering, or density plots.

```{r}
#| label: fig-overplotting
#| fig-cap: "Using transparency (alpha) to reveal density when many points overlap"
#| fig-width: 8
#| fig-height: 4
# Overplotting solution
set.seed(42)
overplot_data <- data.frame(
  x = rnorm(5000),
  y = rnorm(5000)
)

p_over <- ggplot(overplot_data, aes(x, y)) +
  geom_point() +
  labs(title = "Overplotting: Points Hidden") +
  theme_minimal()

p_alpha <- ggplot(overplot_data, aes(x, y)) +
  geom_point(alpha = 0.1) +
  labs(title = "Transparency: Density Visible") +
  theme_minimal()

p_over + p_alpha
```

**Dual y-axes**: These are almost always misleading. The relationship between the two scales is arbitrary and can be manipulated to show any desired pattern.

**3D effects**: Three-dimensional bar charts, pie charts, and similar decorations distort perception without adding information. Avoid them.

**Excessive decoration**: Gridlines, borders, backgrounds, and other "chart junk" should be minimized. Focus attention on the data.

## Examples of Poor Graphics

Recognizing bad graphics helps you avoid making them.

![Examples of poor graphic design that obscure rather than reveal patterns](../images/ch09/ch09_poor_graphics_1.jpeg){#fig-poor-graphics fig-align="center"}

Ticker-tape style displays make it hard to see patterns. Lines connecting unrelated points mislead. Pie charts make comparisons difficult because humans are poor at judging angles.
Three-dimensional effects distort perception without adding information.

![Pie charts and 3D effects make comparisons difficult and distort perception](../images/ch09/ch09_poor_graphics_2.jpeg){#fig-poor-graphics-2 fig-align="center"}

## A Famous Good Example

Charles Minard's 1869 map of Napoleon's Russian campaign is often cited as one of the best statistical graphics ever made. It displays six variables in a single coherent image: the size of the army, its location (latitude and longitude), direction of movement, temperature, and date.

![Minard's 1869 map of Napoleon's Russian campaign displays six variables in a single coherent image](../images/ch09/ch09_minard_map.jpeg){#fig-minard fig-align="center"}

The graphic tells a story. You can see the army shrink as it advances, the devastating losses during the retreat, and the correlation with plummeting temperatures. No legend is needed; the meaning is immediately apparent.

## Saving Plots

Save plots with `ggsave()`:

```{r}
#| eval: false
# Create and save a plot
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

ggsave("my_plot.png", p, width = 8, height = 6, dpi = 300)
ggsave("my_plot.pdf", p, width = 8, height = 6)
```

The function infers the format from the file extension. Specify dimensions (`width` and `height`, in inches by default) and resolution (`dpi`) for publication-quality output.

## Practice Exercises

The best way to learn ggplot2 is to use it. Take a dataset you care about and try different visualizations. Experiment with aesthetics, geoms, and facets. Read error messages carefully, because they often point directly to the problem.

::: {.callout-note}
### Exercise V.1: Basic ggplot2

Create your first ggplot visualizations:

```{r}
#| eval: false
library(ggplot2)
data(mpg)

# Basic scatterplot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
```

1. Create a scatterplot of engine displacement vs. highway mpg
2. Add color based on vehicle class
3. Add appropriate axis labels and a title
4. Try different themes (`theme_minimal()`, `theme_classic()`, `theme_bw()`)
:::

::: {.callout-note}
### Exercise V.2: Geometric Objects

Practice with different geoms:

1. Create a histogram of highway mpg using `geom_histogram()`
2. Create a boxplot of highway mpg by vehicle class using `geom_boxplot()`
3. Create a bar chart showing the count of vehicles by manufacturer using `geom_bar()`
4. Create a line plot (use a time series dataset or create synthetic data)
:::

::: {.callout-note}
### Exercise V.3: Aesthetic Mappings

Explore different aesthetic mappings:

1. Map a continuous variable to color in a scatterplot
2. Map a categorical variable to shape
3. Set fixed aesthetics (like `size = 3`) outside of `aes()`
4. What is the difference between mapping a variable to an aesthetic inside `aes()` versus setting a fixed value outside?

```{r}
#| fig-width: 8
#| fig-height: 5
#| eval: false
# Mapped aesthetic (variable determines color)
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 3)

# Fixed aesthetic (all points same color)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue", size = 3)
```
:::

::: {.callout-note}
### Exercise V.4: Faceting

Practice creating small multiples:

1. Create a scatterplot faceted by vehicle class using `facet_wrap()`
2. Create a grid of plots using `facet_grid()` with two variables
3. Experiment with the `scales` argument to allow different axis scales per facet

```{r}
#| fig-width: 9
#| fig-height: 6
#| eval: false
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)
```
:::

::: {.callout-note}
### Exercise V.5: Combining Layers

Build complex visualizations by layering:

1. Create a scatterplot with a smoothed trend line
2. Add both points and a regression line
3. Use different colors for points and the trend line

```{r}
#| fig-width: 7
#| fig-height: 5
#| eval: false
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class), alpha = 0.7) +
  geom_smooth(method = "lm", color = "black", se = TRUE) +
  labs(
    title = "Engine Size vs. Fuel Efficiency",
    x = "Engine Displacement (L)",
    y = "Highway MPG",
    color = "Vehicle Class"
  ) +
  theme_minimal()
```
:::

::: {.callout-note}
### Exercise V.6: Publication-Quality Figures

Create a polished figure suitable for publication:

1. Choose an appropriate chart type for your data
2. Add informative labels (title, subtitle, caption, axis labels)
3. Use an appropriate color palette
4. Adjust theme elements for clarity
5. Save the figure using `ggsave()` with appropriate dimensions and resolution
:::