17.1 Comparing Means

One of the most common questions in data analysis is whether two groups differ. Is the mean expression level different between treatment and control? Does the new material have different strength than the standard? Do patients on drug A have different outcomes than patients on drug B?

The t-test is the classic method for comparing means. It compares the observed difference between groups to the variability expected by chance, producing a test statistic that follows a t-distribution under the null hypothesis of no difference.

17.2 The T-Distribution

The t-distribution, introduced by William Sealy Gosset writing under the pseudonym “Student” (Student 1908), resembles the normal distribution but has heavier tails. This accounts for the extra uncertainty that comes from estimating the population standard deviation from sample data.

The t-distribution is characterized by its degrees of freedom (df). As df increases, the t-distribution approaches the normal distribution. For small samples, the heavier tails mean that extreme values are more likely, leading to wider confidence intervals and more conservative tests.

Code
# Compare t-distributions with different df
x <- seq(-4, 4, length.out = 200)
plot(x, dnorm(x), type = "l", lwd = 2, col = "black",
     xlab = "x", ylab = "Density",
     main = "T-distributions vs. Normal")
lines(x, dt(x, df = 3), lwd = 2, col = "red")
lines(x, dt(x, df = 10), lwd = 2, col = "blue")
legend("topright",
       legend = c("Normal", "t (df=3)", "t (df=10)"),
       col = c("black", "red", "blue"), lwd = 2)
Figure 17.1: The t-distribution has heavier tails than the normal, especially with low degrees of freedom
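The convergence to the normal is easy to see numerically: the 97.5th-percentile critical value used for a 95% confidence interval shrinks toward 1.96 as the degrees of freedom grow. A minimal check in base R:

Code
# Critical values for a 95% CI approach the normal value as df grows
df_values <- c(3, 10, 30, 100, 1000)
round(qt(0.975, df = df_values), 3)   # 3.182 2.228 2.042 1.984 1.962
round(qnorm(0.975), 3)                # 1.96, the normal limit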

Why Heavier Tails Matter in Practice

The heavier tails of the t-distribution have real practical consequences. When you estimate the standard deviation from a small sample, you might underestimate or overestimate the true value. The t-distribution accounts for this uncertainty by assigning more probability to extreme values.

Consider this concrete example: suppose you’re estimating voter support from a poll of 25 likely voters. With the true population proportion unknown and estimated from the sample, how wide should your confidence interval be?

Code
# Demonstrate the practical difference between normal and t-based intervals
set.seed(2016)

# Simulate: true support is 48.5%, poll 25 people
true_support <- 0.485
n_poll <- 25

# One poll result
poll_result <- rbinom(1, n_poll, true_support) / n_poll
poll_se <- sqrt(poll_result * (1 - poll_result) / n_poll)

# Compare critical values
z_crit <- qnorm(0.975)      # Normal: 1.96
t_crit <- qt(0.975, df = n_poll - 1)  # t with 24 df: 2.06

# Calculate intervals
normal_ci <- c(poll_result - z_crit * poll_se, poll_result + z_crit * poll_se)
t_ci <- c(poll_result - t_crit * poll_se, poll_result + t_crit * poll_se)

cat("Poll result:", round(poll_result * 100, 1), "%\n")
Poll result: 40 %
Code
cat("Standard error:", round(poll_se * 100, 1), "%\n")
Standard error: 9.8 %
Code
cat("\nNormal-based 95% CI: [", round(normal_ci[1]*100, 1), "%, ",
    round(normal_ci[2]*100, 1), "%]\n")

Normal-based 95% CI: [ 20.8 %,  59.2 %]
Code
cat("t-based 95% CI:      [", round(t_ci[1]*100, 1), "%, ",
    round(t_ci[2]*100, 1), "%]\n")
t-based 95% CI:      [ 19.8 %,  60.2 %]
Code
cat("\nDifference in width:", round((t_ci[2] - t_ci[1] - (normal_ci[2] - normal_ci[1]))*100, 2),
    "percentage points\n")

Difference in width: 2.04 percentage points

With only 25 observations, the t-distribution gives a critical value of about 2.06 instead of 1.96. This ~5% wider interval provides better coverage when the sample standard deviation might deviate substantially from the population value.

The difference matters most in the tails. For extreme values (like being 2.5+ standard errors away from the mean), the t-distribution assigns noticeably more probability:

Code
# Probability of being more than 2.5 SE from the mean
prob_extreme_normal <- 2 * pnorm(-2.5)
prob_extreme_t <- 2 * pt(-2.5, df = 24)

cat("P(|Z| > 2.5) with normal distribution:", round(prob_extreme_normal, 4), "\n")
P(|Z| > 2.5) with normal distribution: 0.0124 
Code
cat("P(|T| > 2.5) with t(df=24):", round(prob_extreme_t, 4), "\n")
P(|T| > 2.5) with t(df=24): 0.0197 
Code
cat("The t-distribution gives", round(prob_extreme_t/prob_extreme_normal, 1),
    "times higher probability to extreme values\n")
The t-distribution gives 1.6 times higher probability to extreme values

This is why using the normal distribution instead of the t-distribution for small samples leads to confidence intervals that are too narrow and p-values that are too small—both resulting in overconfident conclusions.

17.3 One-Sample T-Test

The one-sample t-test compares a sample mean to a hypothesized population value. The null hypothesis is that the population mean equals the specified value: \(H_0: \mu = \mu_0\).

The test statistic is:

\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]

This is the difference between the sample mean and hypothesized value, divided by the standard error of the mean. Under the null hypothesis, this statistic follows a t-distribution with \(n-1\) degrees of freedom.

Code
# One-sample t-test example
# Does this sample come from a population with mean = 100?
set.seed(42)
sample_data <- rnorm(25, mean = 105, sd = 15)

t.test(sample_data, mu = 100)

    One Sample t-test

data:  sample_data
t = 1.9936, df = 24, p-value = 0.05768
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
  99.72443 115.90166
sample estimates:
mean of x 
  107.813 

The output shows the t-statistic, degrees of freedom, p-value, confidence interval, and sample mean. Here the p-value (0.058) falls just above the conventional 0.05 threshold, so we fail to reject the null hypothesis at the 5% level even though the sample mean (107.8) is well above 100; consistent with this, the 95% confidence interval includes 100.
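To connect the output to the formula, we can reproduce the statistic and p-value by hand from the same sample:

Code
# Compute t = (xbar - mu0) / (s / sqrt(n)) and its two-tailed p-value manually
n <- length(sample_data)
t_stat <- (mean(sample_data) - 100) / (sd(sample_data) / sqrt(n))
p_val <- 2 * pt(-abs(t_stat), df = n - 1)
c(t = t_stat, p = p_val)   # matches t = 1.9936, p = 0.05768 above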

17.4 Two-Sample T-Test

The two-sample (independent samples) t-test compares means from two independent groups. The null hypothesis is that the population means are equal: \(H_0: \mu_1 = \mu_2\).

The test assumes:

  • Independence of observations within and between groups
  • Normally distributed populations (or large samples)
  • Equal variances in both groups (for the standard version)

Code
# Two-sample t-test example
set.seed(518)
treatment <- rnorm(n = 30, mean = 12, sd = 3)
control <- rnorm(n = 30, mean = 10, sd = 3)

# Visualize the data
par(mfrow = c(1, 2))
boxplot(treatment, control, names = c("Treatment", "Control"),
        col = c("lightblue", "lightgreen"), main = "Boxplot")

# Combined histogram
hist(treatment, col = rgb(0, 0, 1, 0.5), xlim = c(0, 20),
     main = "Histograms", xlab = "Value")
hist(control, col = rgb(0, 1, 0, 0.5), add = TRUE)
legend("topright", legend = c("Treatment", "Control"),
       fill = c(rgb(0, 0, 1, 0.5), rgb(0, 1, 0, 0.5)))
Figure 17.2: Visualization of two groups before performing a two-sample t-test
Code
# Perform the t-test
t.test(treatment, control)

    Welch Two Sample t-test

data:  treatment and control
t = 1.3224, df = 57.98, p-value = 0.1912
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.5256045  2.5718411
sample estimates:
mean of x mean of y 
 11.08437  10.06125 

17.5 Welch’s T-Test

The classic two-sample t-test assumes equal variances. When this assumption is violated, Welch’s t-test provides a better alternative. It adjusts the degrees of freedom to account for unequal variances.

R’s t.test() function uses Welch’s test by default. To use the equal-variance version, set var.equal = TRUE.

Code
# When variances are unequal
set.seed(42)
group1 <- rnorm(30, mean = 50, sd = 5)
group2 <- rnorm(30, mean = 52, sd = 15)

# Welch's test (default)
t.test(group1, group2)

    Welch Two Sample t-test

data:  group1 and group2
t = 0.055423, df = 37.98, p-value = 0.9561
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.095093  6.438216
sample estimates:
mean of x mean of y 
 50.34293  50.17137 
Code
# Equal variance assumed
t.test(group1, group2, var.equal = TRUE)

    Two Sample t-test

data:  group1 and group2
t = 0.055423, df = 58, p-value = 0.956
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.024786  6.367910
sample estimates:
mean of x mean of y 
 50.34293  50.17137 
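The only difference between the two outputs is the degrees of freedom. Welch's adjustment uses the Welch–Satterthwaite approximation, which we can compute by hand to verify the df reported above:

Code
# Welch–Satterthwaite degrees of freedom
v1 <- var(group1) / length(group1)
v2 <- var(group2) / length(group2)
(v1 + v2)^2 / (v1^2 / (length(group1) - 1) + v2^2 / (length(group2) - 1))
# about 37.98, matching the Welch output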

17.6 Paired T-Test

When observations in two groups are naturally paired—the same subjects measured twice, matched pairs, or before-and-after measurements—the paired t-test is more appropriate. It tests whether the mean difference within pairs is zero.

The paired t-test is more powerful than the two-sample test when pairs are positively correlated, because it removes between-subject variability.

Code
# Paired t-test example: before and after treatment
set.seed(123)
n <- 20
before <- rnorm(n, mean = 100, sd = 15)
# After measurements are correlated with before
after <- before + rnorm(n, mean = 5, sd = 5)

# Paired test (correct for this data)
t.test(after, before, paired = TRUE)

    Paired t-test

data:  after and before
t = 5.1123, df = 19, p-value = 6.19e-05
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 2.801598 6.685830
sample estimates:
mean difference 
       4.743714 
Code
# Compare to unpaired (less power)
t.test(after, before, paired = FALSE)

    Welch Two Sample t-test

data:  after and before
t = 1.0209, df = 37.992, p-value = 0.3138
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.663231 14.150660
sample estimates:
mean of x mean of y 
 106.8681  102.1244 

Notice that the paired test produces a smaller p-value because it accounts for the correlation between measurements on the same subject.
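The equivalence is worth knowing: a paired t-test is exactly a one-sample t-test on the within-pair differences, which you can confirm directly:

Code
# A paired t-test is a one-sample test on the differences
diffs <- after - before
t.test(diffs, mu = 0)   # t = 5.1123, df = 19; identical to the paired result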

17.7 One-Tailed vs. Two-Tailed Tests

By default, t.test() performs a two-tailed test. For a one-tailed test, specify the alternative hypothesis:

Code
# Two-tailed (default): H_A: treatment ≠ control
t.test(treatment, control, alternative = "two.sided")$p.value
[1] 0.1912327
Code
# One-tailed: H_A: treatment > control
t.test(treatment, control, alternative = "greater")$p.value
[1] 0.09561633
Code
# One-tailed: H_A: treatment < control
t.test(treatment, control, alternative = "less")$p.value
[1] 0.9043837

Use one-tailed tests only when you have a strong prior reason to expect an effect in a specific direction and would not act on an effect in the opposite direction.
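When the observed effect points in the hypothesized direction, the one-tailed p-value is half the two-tailed one. A quick check using the components of the test object:

Code
# One-sided p-value recovered from the two-sided test's statistic and df
res <- t.test(treatment, control)
pt(res$statistic, df = res$parameter, lower.tail = FALSE)   # 0.0956 = 0.1912 / 2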

17.8 Checking Assumptions

T-tests assume normally distributed data (or large samples) and, for the standard two-sample test, equal variances. Check these assumptions before interpreting results.

Normality: Use histograms, Q-Q plots, or formal tests like Shapiro-Wilk.

Code
# Check normality with Q-Q plot
qqnorm(treatment)
qqline(treatment, col = "red")
Figure 17.3: Q-Q plot for assessing normality: points following the line suggest approximate normality
Code
# Shapiro-Wilk test for normality
shapiro.test(treatment)

    Shapiro-Wilk normality test

data:  treatment
W = 0.9115, p-value = 0.01624

A non-significant Shapiro-Wilk test suggests the data are consistent with normality. In this case the test actually rejects (p ≈ 0.016) even though treatment was simulated from a normal distribution, a reminder that the test produces false positives at its nominal rate. Keep in mind also that the test has low power for small samples and may reject normality for trivial deviations with large samples.

Equal variances: Compare standard deviations or use Levene’s test.

Code
# Compare standard deviations
sd(treatment)
[1] 3.024138
Code
sd(control)
[1] 2.968592
Code
# Levene's test (from car package)
# car::leveneTest(c(treatment, control), 
#                 factor(rep(c("treatment", "control"), each = 30)))
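If you want to stay in base R, var.test() performs an F test for equality of variances; note that, unlike Levene's test, it is sensitive to departures from normality:

Code
# F test for equality of variances (base R; assumes normal data)
var.test(treatment, control)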

17.9 Effect Size: Cohen’s d

Statistical significance does not tell you how large an effect is. Cohen’s d measures effect size as the standardized difference between means:

\[d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}\]

where \(s_{pooled}\) is the pooled standard deviation.

Conventional interpretations: \(|d| = 0.2\) is small, \(|d| = 0.5\) is medium, \(|d| = 0.8\) is large. However, context matters—a small d might be practically important in some fields.

Code
# Calculate Cohen's d
mean_diff <- mean(treatment) - mean(control)
s_pooled <- sqrt((var(treatment) + var(control)) / 2)
cohens_d <- mean_diff / s_pooled

cat("Cohen's d:", round(cohens_d, 2), "\n")
Cohen's d: 0.34 
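With equal group sizes, d is directly related to the t statistic through d = t * sqrt(1/n1 + 1/n2). This identity holds for the pooled-SD definition of d used above; with equal n, the Welch and Student t statistics coincide numerically, so we can recover d from the test itself:

Code
# Cohen's d recovered from the t statistic
t_val <- t.test(treatment, control)$statistic
t_val * sqrt(1/30 + 1/30)   # about 0.34, matching the direct calculation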

17.10 Practical Example

Let’s work through a complete analysis comparing two groups:

Code
# Simulated drug trial data
set.seed(999)
drug <- rnorm(40, mean = 75, sd = 12)
placebo <- rnorm(40, mean = 70, sd = 12)

# Step 1: Visualize
par(mfrow = c(2, 2))
boxplot(drug, placebo, names = c("Drug", "Placebo"),
        col = c("coral", "lightblue"), main = "Response by Group")

# Step 2: Check normality
qqnorm(drug, main = "Q-Q Plot: Drug")
qqline(drug, col = "red")
qqnorm(placebo, main = "Q-Q Plot: Placebo")
qqline(placebo, col = "red")

# Combined histogram
hist(drug, col = rgb(1, 0.5, 0.5, 0.5), xlim = c(40, 110),
     main = "Distribution Comparison", xlab = "Response")
hist(placebo, col = rgb(0.5, 0.5, 1, 0.5), add = TRUE)
Figure 17.4: Complete analysis workflow: visualization and assumption checking before the t-test
Code
# Step 3: Perform t-test
result <- t.test(drug, placebo)
print(result)

    Welch Two Sample t-test

data:  drug and placebo
t = 1.2147, df = 75.923, p-value = 0.2282
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.990367  8.213525
sample estimates:
mean of x mean of y 
 72.26982  69.15824 
Code
# Step 4: Calculate effect size
cohens_d <- (mean(drug) - mean(placebo)) / 
            sqrt((var(drug) + var(placebo)) / 2)
cat("\nCohen's d:", round(cohens_d, 2), "\n")

Cohen's d: 0.27 

The t-test does not show a statistically significant difference (p ≈ 0.23), and Cohen's d of 0.27 indicates a small effect. The drug group's mean response is about 3 points (roughly 0.3 standard deviations) higher than the placebo group's, but with 40 subjects per group this difference cannot be distinguished from chance; a larger trial would be needed to detect an effect of this size reliably.

17.11 Randomization Tests as an Alternative

When normality assumptions are questionable and sample sizes are small, randomization (permutation) tests provide a non-parametric alternative to the t-test. The logic is elegant: if there is no difference between groups, then the group labels are arbitrary and could be shuffled without affecting the distribution of the test statistic.

Code
# Randomization test example
set.seed(42)
group_A <- c(23, 25, 28, 31, 35, 29)
group_B <- c(18, 20, 22, 19, 21, 23)

# Observed difference
obs_diff <- mean(group_A) - mean(group_B)

# Combine all observations
all_data <- c(group_A, group_B)
n_A <- length(group_A)
n_B <- length(group_B)

# Generate null distribution by permutation
n_perms <- 10000
perm_diffs <- numeric(n_perms)

for (i in 1:n_perms) {
  shuffled <- sample(all_data)
  perm_diffs[i] <- mean(shuffled[1:n_A]) - mean(shuffled[(n_A+1):(n_A+n_B)])
}

# Plot null distribution
hist(perm_diffs, breaks = 50, col = "lightblue",
     main = "Randomization Null Distribution",
     xlab = "Difference in Means")
abline(v = obs_diff, col = "red", lwd = 2)
abline(v = -obs_diff, col = "red", lwd = 2, lty = 2)

# Two-tailed p-value
p_value <- mean(abs(perm_diffs) >= abs(obs_diff))
cat("Observed difference:", round(obs_diff, 2), "\n")
Observed difference: 8 
Code
cat("Permutation p-value:", p_value, "\n")
Permutation p-value: 0.0039 
Figure 17.5: Permutation test null distribution with observed difference (red line) for comparison

The randomization test makes no assumptions about the underlying distribution—it only assumes that observations are exchangeable under the null hypothesis. This makes it robust to non-normality and outliers.
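As a sanity check, it is often useful to run the parametric test on the same data and compare conclusions; with groups this well separated, both p-values should be small (exact values depend on the data):

Code
# Compare the permutation p-value with the Welch t-test on the same data
t.test(group_A, group_B)$p.value   # parametric p-value
p_value                            # permutation p-value from above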

When to Use Randomization Tests
  • Sample sizes are small (n < 20 per group)
  • Data are clearly non-normal or contain outliers
  • You want to avoid distributional assumptions
  • As a sensitivity analysis to complement parametric results

17.12 Choosing the Right T-Test

Scenario                                          Test                    R function
Compare sample mean to a known value              One-sample              t.test(x, mu = value)
Compare two independent groups                    Two-sample (Welch's)    t.test(x, y)
Compare two independent groups (equal variances)  Two-sample (Student's)  t.test(x, y, var.equal = TRUE)
Compare paired measurements                       Paired                  t.test(x, y, paired = TRUE)

Decision guidelines:

  1. If comparing to a fixed, known value: one-sample t-test
  2. If observations in groups are naturally paired: paired t-test
  3. If groups are independent with potentially unequal variances: Welch’s t-test (the default)
  4. If groups are independent and you have strong evidence of equal variances: Student’s t-test

When in doubt, use Welch’s t-test—it performs nearly as well as Student’s t-test when variances are equal and much better when they are not.
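This recommendation is easy to verify by simulation. The sketch below (an illustration assuming 2,000 simulated null datasets in which the smaller group has the larger variance) estimates each test's type I error rate; Student's test typically rejects well above the nominal 5%, while Welch's stays close to it:

Code
# Type I error under unequal variances and unequal n (no true difference)
set.seed(1)
n_sim <- 2000
p_welch <- p_student <- numeric(n_sim)
for (i in 1:n_sim) {
  x <- rnorm(10, mean = 0, sd = 4)   # small group, large variance
  y <- rnorm(40, mean = 0, sd = 1)   # large group, small variance
  p_welch[i]   <- t.test(x, y)$p.value
  p_student[i] <- t.test(x, y, var.equal = TRUE)$p.value
}
mean(p_welch < 0.05)     # near the nominal 0.05
mean(p_student < 0.05)   # substantially inflated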

17.13 Summary

The t-test family provides essential tools for comparing means:

  • One-sample tests compare a sample to a hypothesized value
  • Two-sample tests compare independent groups
  • Paired tests compare matched or repeated measurements
  • Welch’s version handles unequal variances (recommended default)
  • Randomization tests provide a distribution-free alternative

Always visualize your data, check assumptions, and report effect sizes alongside p-values. A statistically significant result is only meaningful if the underlying assumptions are reasonable and the effect size is practically relevant.

17.14 Practice Exercises

Exercise H.1: One-Sample t-test
  1. Generate a sample of 30 observations from a normal distribution with mean 105 and SD 15
  2. Test whether the mean differs significantly from 100
  3. Interpret the p-value and confidence interval
  4. What happens to the p-value when you increase the sample size?
Code
set.seed(42)
sample_data <- rnorm(30, mean = 105, sd = 15)
t.test(sample_data, mu = 100)
Exercise H.2: Two-Sample t-test

Create a dummy dataset with one continuous and one categorical variable:

  1. Draw samples of 100 observations from two normal distributions with slightly different means but equal standard deviations
  2. Perform a two-sample t-test
  3. Visualize the data with a boxplot
  4. Repeat with sample sizes of 10, 100, and 1000—how does sample size affect the results?
  5. What happens when you make the means more different?
Code
set.seed(42)
group_a <- rnorm(100, mean = 10, sd = 2)
group_b <- rnorm(100, mean = 11, sd = 2)

# Combine into data frame
data <- data.frame(
  value = c(group_a, group_b),
  group = rep(c("A", "B"), each = 100)
)

# t-test
t.test(value ~ group, data = data)

# Visualization
boxplot(value ~ group, data = data)
Exercise H.3: Chi-Square Test for Hardy-Weinberg Equilibrium

Test whether a population is in Hardy-Weinberg equilibrium:

Code
# Observed genotype counts
AA_counts <- 50
Aa_counts <- 40
aa_counts <- 10

# Calculate allele frequencies
total <- AA_counts + Aa_counts + aa_counts
p <- (2*AA_counts + Aa_counts) / (2*total)
q <- 1 - p

# Expected counts under HWE
expected <- c(p^2, 2*p*q, q^2) * total

# Chi-square test
observed <- c(AA_counts, Aa_counts, aa_counts)
chisq.test(observed, p = c(p^2, 2*p*q, q^2))

    Chi-squared test for given probabilities

data:  observed
X-squared = 0.22676, df = 2, p-value = 0.8928

Note: chisq.test() uses df = 2 (categories minus one), but because the allele frequency p was estimated from the same counts, the proper HWE test has only 1 degree of freedom; the corrected p-value is pchisq(0.22676, df = 1, lower.tail = FALSE), about 0.63, so the printed p-value overstates the support for equilibrium.
  1. Modify the observed counts and see how it affects the test result
  2. What genotype frequencies would indicate strong departure from HWE?
Exercise H.4: Effect Size and Power
  1. Using the two-sample t-test from Exercise H.2, calculate Cohen’s d effect size
  2. How does effect size change when you increase the difference between means?
  3. How does effect size change when you increase the standard deviation?

17.15 Additional Resources

  • Logan (2010) - Detailed coverage of t-tests with biological examples
  • Irizarry (2019) - Excellent treatment of randomization and permutation methods