40  Bayesian Statistics

40.1 A Different Approach to Probability

What does it mean when an election forecaster tells us that a candidate has a 90% chance of winning? In classical (frequentist) statistics, probability refers to long-run frequencies of events. A parameter like the true proportion of voters supporting a candidate is a fixed but unknown number—it doesn’t make sense to assign it a probability.

Bayesian statistics takes a different view: probability represents our degree of belief about uncertain quantities. Under this framework, we can legitimately say “there is a 90% probability that candidate A will win” because we’re expressing our uncertainty about the outcome.

This chapter introduces Bayesian thinking and shows how it provides a principled framework for combining prior knowledge with observed data.

40.2 Bayes’ Theorem: The Foundation

Bayes’ theorem relates conditional probabilities:

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

This simple equation has profound implications. It tells us how to update our beliefs about \(A\) after observing evidence \(B\).

A Diagnostic Test Example

Consider a test for cystic fibrosis with 99% accuracy:

\[ P(+ \mid D=1) = 0.99 \quad \text{and} \quad P(- \mid D=0) = 0.99 \]

where \(+\) denotes a positive test result and \(D\) indicates disease status (1 = has disease, 0 = doesn’t).

If a randomly selected person tests positive, what is the probability they actually have the disease?

The cystic fibrosis rate is approximately 1 in 3,900, so \(P(D=1) \approx 0.00025\).

Applying Bayes’ theorem:

\[ \begin{aligned} P(D=1 \mid +) &= \frac{P(+ \mid D=1) \cdot P(D=1)}{P(+)} \\[1em] &= \frac{P(+ \mid D=1) \cdot P(D=1)}{P(+ \mid D=1) \cdot P(D=1) + P(+ \mid D=0) \cdot P(D=0)} \end{aligned} \]

Plugging in the numbers:

\[ P(D=1 \mid +) = \frac{0.99 \times 0.00025}{0.99 \times 0.00025 + 0.01 \times 0.99975} = 0.024 \]

Despite the test having 99% accuracy, the probability of having the disease given a positive test is only about 2.4%. This counterintuitive result arises because the disease is so rare—most positive tests are false positives from the large healthy population.
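
We can check this arithmetic directly in R before turning to simulation; a quick sketch using only the quantities defined above:

Code
# Direct application of Bayes' theorem
prev <- 0.00025          # P(D = 1), the prevalence
accuracy <- 0.99         # P(+ | D = 1) and P(- | D = 0)

# Law of total probability for P(+)
p_pos <- accuracy * prev + (1 - accuracy) * (1 - prev)

# Posterior probability of disease given a positive test
round(accuracy * prev / p_pos, 3)

This reproduces the 0.024 computed above.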

Simulation to Verify

Code
set.seed(42)
prev <- 0.00025
N <- 100000
accuracy <- 0.99

# Randomly assign disease status
outcome <- sample(c("Disease", "Healthy"), N, replace = TRUE,
                  prob = c(prev, 1 - prev))

N_D <- sum(outcome == "Disease")
N_H <- sum(outcome == "Healthy")

cat("People with disease:", N_D, "\n")
People with disease: 28 
Code
cat("Healthy people:", N_H, "\n")
Healthy people: 99972 
Code
# Administer the test: each group gets the correct result with probability `accuracy`
test <- vector("character", N)
test[outcome == "Disease"] <- sample(c("+", "-"), N_D, replace = TRUE,
                                      prob = c(accuracy, 1 - accuracy))
# Note the reversed labels: for healthy people the correct result is "-"
test[outcome == "Healthy"] <- sample(c("-", "+"), N_H, replace = TRUE,
                                      prob = c(accuracy, 1 - accuracy))

# Results
confusion <- table(outcome, test)
print(confusion)
         test
outcome       -     +
  Disease     0    28
  Healthy 98948  1024
Code
# Probability of disease given positive test
cat("\nP(Disease | Positive test):",
    round(sum(test == "+" & outcome == "Disease") / sum(test == "+"), 3), "\n")

P(Disease | Positive test): 0.027 

The simulation confirms our calculation: despite the high test accuracy, most people who test positive are actually healthy because the disease is so rare.

The Base Rate Fallacy

Ignoring the base rate (prior probability) of a condition leads to dramatically wrong conclusions. This is critical in medical testing, forensic evidence, and any diagnostic situation. Always consider:

  1. How accurate is the test? (sensitivity, specificity)
  2. How common is the condition? (base rate/prevalence)
  3. What population is being tested? (targeted vs. general screening)
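
These three questions interact: with the same 99% accurate test, the probability that a positive result reflects true disease (the positive predictive value) depends heavily on who is being tested. Here is a small sketch; the ppv helper is purely illustrative and simply repeats the Bayes calculation from above:

Code
# Positive predictive value as a function of prevalence
# (illustrative helper; sensitivity and specificity as in the example above)
ppv <- function(prevalence, sensitivity = 0.99, specificity = 0.99) {
  true_pos  <- sensitivity * prevalence
  false_pos <- (1 - specificity) * (1 - prevalence)
  true_pos / (true_pos + false_pos)
}

round(ppv(0.00025), 3)   # general screening for a rare condition: about 0.024
round(ppv(0.10), 3)      # targeted testing where 1 in 10 are affected: about 0.92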

40.3 From Bayes’ Theorem to Bayesian Inference

Bayes’ theorem becomes a framework for statistical inference when we apply it to unknown parameters:

\[ P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})} \]

The components have specific names:

  • \(P(\theta \mid \text{data})\), the posterior: the updated belief after seeing the data
  • \(P(\text{data} \mid \theta)\), the likelihood: the probability of the data given the parameter
  • \(P(\theta)\), the prior: the initial belief before seeing the data
  • \(P(\text{data})\), the evidence: a normalizing constant

The posterior combines what we knew before (prior) with what the data tell us (likelihood).
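
To see the mechanics concretely, here is a minimal grid-approximation sketch for a proportion \(\theta\), using made-up data (8 successes in 20 trials) and a uniform prior; the posterior is simply the normalized product of prior and likelihood:

Code
# Grid approximation: posterior is proportional to likelihood times prior
theta <- seq(0, 1, length.out = 1000)          # grid of candidate parameter values
prior <- rep(1, length(theta))                 # uniform prior on [0, 1]
likelihood <- dbinom(8, size = 20, prob = theta)

posterior <- likelihood * prior
posterior <- posterior / sum(posterior)        # normalize so it sums to 1

# Posterior mean and an approximate 95% credible interval from the grid
sum(theta * posterior)
cum_post <- cumsum(posterior)
theta[c(which.min(abs(cum_post - 0.025)), which.min(abs(cum_post - 0.975)))]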

40.4 Hierarchical Models: Shrinkage and Borrowing Strength

One of Bayesian statistics’ most powerful applications is hierarchical modeling, where we model variability at multiple levels. This allows us to “borrow strength” across related observations.

A Baseball Example

Consider José Iglesias, a baseball player who in April 2013 had the following statistics:

Month      At Bats   Hits    AVG
April           20      9   .450

A batting average (AVG) of .450 is extraordinarily high—no player has finished a season above .400 since Ted Williams in 1941. How should we predict José’s season-ending average?

The frequentist approach uses only José’s data. With \(p = 0.450\) as our estimate and \(n = 20\) at bats:

\[ \text{SE} = \sqrt{\frac{0.450 \times 0.550}{20}} = 0.111 \]

A 95% confidence interval is \(.450 \pm 0.222\), or roughly \(.228\) to \(.672\). This interval is wide and centered at .450, implying our best guess is that José will break Ted Williams’ record.
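
In R, this frequentist calculation is only a few lines (using two standard errors, as above):

Code
# Frequentist estimate using only José's 20 April at-bats
p_hat <- 9 / 20
n <- 20
se <- sqrt(p_hat * (1 - p_hat) / n)
round(se, 3)                          # about 0.111
round(p_hat + c(-2, 2) * se, 3)       # roughly .228 to .672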

The Bayesian approach incorporates what we know about baseball players in general.

The Prior: What Do We Know About Batting Averages?

Let’s examine batting averages for players with substantial at-bats over recent seasons:

Code
# install.packages("Lahman") if the package is not already installed
library(Lahman)      # Sean Lahman's baseball database
library(dplyr)
library(ggplot2)

# Batting averages for players with substantial at-bats, 2010-2012
batting_data <- Batting |>
  filter(yearID %in% 2010:2012) |>
  mutate(AVG = H / AB) |>
  filter(AB > 500)

batting_data |>
  ggplot(aes(AVG)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~yearID) +
  labs(x = "Batting Average", y = "Count",
       title = "MLB Batting Averages by Year")

# Summary statistics
cat("Mean batting average:", round(mean(batting_data$AVG), 3), "\n")
cat("SD of batting averages:", round(sd(batting_data$AVG), 3), "\n")

The average player has an AVG around .275 with a standard deviation of about 0.027. José’s .450 is over six standard deviations above the mean—an extreme anomaly if taken at face value.
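
A quick check of how extreme José's April number is relative to this population:

Code
# José's .450 in standard deviation units of the population of players
(0.450 - 0.275) / 0.027   # about 6.5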

The Hierarchical Model

We model the data at two levels:

Level 1 (Prior): Each player has a “true” batting ability \(p\) drawn from a population distribution: \[ p \sim N(\mu, \tau^2) \]

Level 2 (Likelihood): Given ability \(p\), observed performance \(Y\) varies due to luck: \[ Y \mid p \sim N(p, \sigma^2) \]

From the data:

  • Population mean: \(\mu = 0.275\)
  • Population SD: \(\tau = 0.027\)
  • Individual sampling error: \(\sigma = \sqrt{p(1-p)/n} \approx 0.111\)

The Posterior: Combining Prior and Data

For this model, the posterior distribution has a beautiful closed form. The posterior mean is:

\[ E(p \mid Y = y) = B \cdot \mu + (1-B) \cdot y \]

where the shrinkage factor \(B\) is:

\[ B = \frac{\sigma^2}{\sigma^2 + \tau^2} \]

This is a weighted average of the population mean \(\mu\) and the observed data \(y\). The weights depend on:

  • how much individual variation there is (\(\sigma\))
  • how much population variation there is (\(\tau\))

When individual uncertainty is high relative to population variation, we “shrink” more toward the population mean.
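
The role of \(\sigma\) is easiest to see by varying the number of at-bats. The sketch below recomputes the shrinkage factor \(B\) for hypothetical sample sizes of 20, 100, and 500 at-bats, keeping the population values above and computing \(\sigma\) at the observed .450 as in the text:

Code
# Shrinkage factor B for different numbers of at-bats (hypothetical n values)
mu  <- 0.275                              # population mean
tau <- 0.027                              # population SD
y   <- 0.450                              # observed average

at_bats <- c(20, 100, 500)
sigma_n <- sqrt(y * (1 - y) / at_bats)    # sampling SD shrinks as n grows
B_n <- sigma_n^2 / (sigma_n^2 + tau^2)

data.frame(at_bats, sigma = round(sigma_n, 3), B = round(B_n, 2))

With only 20 at-bats, \(B \approx 0.94\) and the estimate is pulled almost entirely toward the population mean; by 500 at-bats, \(B \approx 0.40\) and the player's own data carry more weight than the prior.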

Applying to José Iglesias

Code
# Parameters
mu <- 0.275      # Population mean
tau <- 0.027     # Population SD
sigma <- 0.111   # José's sampling error (from 20 at-bats)
y <- 0.450       # José's observed average

# Shrinkage factor
B <- sigma^2 / (sigma^2 + tau^2)
cat("Shrinkage factor B:", round(B, 3), "\n")
Shrinkage factor B: 0.944 
Code
# Posterior mean
posterior_mean <- B * mu + (1 - B) * y
cat("Posterior mean (predicted true ability):", round(posterior_mean, 3), "\n")
Posterior mean (predicted true ability): 0.285 
Code
# Posterior standard error
posterior_se <- sqrt(1 / (1/sigma^2 + 1/tau^2))
cat("Posterior SE:", round(posterior_se, 3), "\n")
Posterior SE: 0.026 
Code
# 95% credible interval
ci_lower <- posterior_mean - 1.96 * posterior_se
ci_upper <- posterior_mean + 1.96 * posterior_se
cat("95% credible interval: [", round(ci_lower, 3), ",", round(ci_upper, 3), "]\n")
95% credible interval: [ 0.233 , 0.336 ]

The Bayesian estimate “shrinks” José’s .450 toward the population average, giving a predicted true ability of about .285. The 95% credible interval is much narrower than the frequentist interval.

The Outcome

Here’s how José actually performed through the season:

Month                At Bats   Hits    AVG
April                     20      9   .450
May                       26     11   .423
June                      86     34   .395
July                      83     17   .205
August                    85     25   .294
September                 50     10   .200
Total (w/o April)        330     97   .294

His final average (excluding April) was .294—almost exactly what the Bayesian model predicted! The hierarchical approach correctly recognized that his early performance was partially luck and regressed his prediction toward the population mean.

Code
# Visualize shrinkage
observations <- c(0.450, 0.320, 0.260, 0.200)
labels <- c("José (Apr)", "Good hitter", "Average", "Struggling")

# Calculate posteriors for each
posteriors <- B * mu + (1 - B) * observations

# Plot
plot(1:4, observations, pch = 19, cex = 2, col = "blue",
     ylim = c(0.15, 0.50), xlim = c(0.5, 4.5),
     xaxt = "n", xlab = "", ylab = "Batting Average",
     main = "Bayesian Shrinkage: Observations → Posteriors")
axis(1, at = 1:4, labels = labels, las = 2)

points(1:4, posteriors, pch = 17, cex = 2, col = "red")
arrows(1:4, observations, 1:4, posteriors,
       length = 0.1, col = "gray50", lwd = 2)
abline(h = mu, lty = 2, col = "darkgreen", lwd = 2)

legend("topright",
       c("Observed", "Posterior estimate", "Population mean"),
       pch = c(19, 17, NA), lty = c(NA, NA, 2),
       col = c("blue", "red", "darkgreen"), pt.cex = 1.5)
Figure 40.1: Bayesian shrinkage pulls extreme observations toward the population mean

Notice how the shrinkage is proportional to how extreme the observation is. José’s .450 shrinks dramatically, while the average player’s .260 barely moves.

40.5 Credible Intervals vs. Confidence Intervals

Bayesian inference produces credible intervals rather than confidence intervals. The interpretation differs:

  • Statement: a frequentist 95% confidence interval says "95% of intervals from repeated sampling would contain the true value"; a Bayesian 95% credible interval says "there is a 95% probability the true value lies in this interval."
  • Parameter: frequentists treat the parameter as fixed but unknown; Bayesians treat it as a random variable with a distribution.
  • Interpretation: the confidence statement is about the procedure; the credible statement is about the parameter.

The Bayesian interpretation is often more intuitive: we can directly say “there’s a 95% probability the true batting average is between .233 and .336.”
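
Because the posterior is a full distribution, we can also answer questions an interval alone does not, such as the probability that José's true ability exceeds a threshold. A short sketch using the posterior from the previous section (the .300 and .400 thresholds are just illustrative):

Code
# Direct probability statements from the (approximately normal) posterior
posterior_mean <- 0.285
posterior_se   <- 0.026

1 - pnorm(0.300, mean = posterior_mean, sd = posterior_se)   # P(true AVG > .300)
1 - pnorm(0.400, mean = posterior_mean, sd = posterior_se)   # P(true AVG > .400)

The probability that his true ability exceeds .300 is about 28%, while the probability of a true .400 ability is essentially zero.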

40.6 Choosing Priors

The prior distribution encodes what we believe before seeing data. Prior choice is both a strength (incorporating domain knowledge) and a criticism (subjectivity) of Bayesian methods.

Types of Priors

Priors fall along a spectrum of informativeness. Informative priors incorporate substantial prior knowledge—for example, a prior for batting average based on historical data might be N(0.275, 0.027²), reflecting that most players hit between .220 and .330. Weakly informative priors provide gentle regularization without imposing strong beliefs; a prior of N(0, 10²) for a regression coefficient allows a wide range of values while preferring smaller effects over extremely large ones. Finally, non-informative or diffuse priors attempt to “let the data speak” by imposing minimal prior constraints—for instance, a uniform prior on [0, 1] for a probability parameter, or an improper prior where P(θ) ∝ 1 across the entire real line.

Empirical Bayes

When we estimate prior parameters from data (as we did with batting averages), this is called empirical Bayes. It’s a practical compromise: we use the data twice—once to set the prior, once for inference—but it often works well in practice.
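
In code, the empirical Bayes step is just estimating \(\mu\) and \(\tau\) from the player-level data, mirroring the summary statistics computed earlier (a sketch, assuming the batting_data frame from the earlier block is available):

Code
# Empirical Bayes: estimate the prior's parameters from the data itself
mu_hat  <- mean(batting_data$AVG)   # about .275, used as the prior mean
tau_hat <- sd(batting_data$AVG)     # about .027, used as the prior SD
c(mu = round(mu_hat, 3), tau = round(tau_hat, 3))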

Prior Sensitivity

It’s good practice to check whether conclusions depend strongly on prior choice:

Code
# How does the posterior change with different priors?
tau_values <- seq(0.01, 0.15, length.out = 100)

posterior_means <- sapply(tau_values, function(tau) {
  B <- sigma^2 / (sigma^2 + tau^2)
  B * mu + (1 - B) * y
})

plot(tau_values, posterior_means, type = "l", lwd = 2, col = "blue",
     ylim = c(0.25, 0.47),   # widen the range so the reference lines below are visible
     xlab = expression(paste("Prior SD (", tau, ")")),
     ylab = "Posterior Mean",
     main = "Prior Sensitivity Analysis")
abline(h = y, lty = 2, col = "red")      # Observed value
abline(h = mu, lty = 2, col = "green")   # Prior mean
abline(v = 0.027, lty = 3, col = "gray") # Our chosen tau

legend("right", c("Posterior mean", "Observed (.450)", "Prior mean (.275)", "τ = 0.027"),
       col = c("blue", "red", "green", "gray"), lty = c(1, 2, 2, 3), lwd = c(2, 1, 1, 1))
Figure 40.2: Posterior mean as a function of prior standard deviation: with strong priors (small τ), we trust the population average; with weak priors (large τ), we trust José’s data

40.7 When Bayesian Methods Excel

Bayesian approaches are particularly valuable in several contexts. They excel at combining multiple sources of information, allowing you to incorporate prior knowledge, expert opinion, and data from related studies into a single coherent analysis. When working with hierarchical or multilevel data—such as patients grouped within hospitals or students nested within schools—Bayesian hierarchical models naturally capture this structure and allow appropriate borrowing of strength across groups. Bayesian methods are especially useful when working with small samples, where prior information can stabilize estimates that would otherwise be unreliable. The framework also supports sequential updating, where yesterday’s posterior becomes today’s prior as new data arrive, making it natural for adaptive designs and ongoing monitoring studies. Finally, modern MCMC methods enable fitting complex models that would be analytically or computationally intractable using traditional frequentist approaches, opening doors to realistic models of complicated phenomena.

40.8 Computational Bayesian Statistics

For complex models, posteriors don’t have closed-form solutions. Modern Bayesian statistics relies on Markov Chain Monte Carlo (MCMC) methods to sample from posterior distributions.

Popular tools include:

  • Stan (via the rstan or brms packages in R)
  • JAGS (via the rjags package)
  • PyMC (in Python)

Code
# Example using brms (Bayesian Regression Models using Stan)
library(brms)

# Fit a Bayesian linear regression
model <- brm(
  formula = weight ~ height,
  data = my_data,  # placeholder: a data frame with numeric weight and height columns
  prior = c(
    prior(normal(0, 10), class = "b"),      # Prior on slope
    prior(normal(100, 50), class = "Intercept")  # Prior on intercept
  )
)

# Summarize posterior distributions
summary(model)

40.9 Exercises

Exercise Bay.1: Sally Clark Case - Base Rate Fallacy
  1. In 1999 in England, Sally Clark was convicted of murdering her two infant sons after both were found dead, apparently from Sudden Infant Death Syndrome (SIDS). Expert testimony stated that the chance of two SIDS deaths was 1 in 73 million, calculated as \((1/8500)^2\). What error did this calculation make?

  2. Suppose the probability of a second SIDS case given a first is actually 1/100 (due to genetic factors). What is \(P(\text{two SIDS deaths})\)?

  3. According to Bayes’ theorem, the probability the prosecution wanted was \(P(\text{guilty} \mid \text{two deaths})\), not \(P(\text{two deaths} \mid \text{innocent})\). Assume:

    • \(P(\text{two deaths} \mid \text{guilty}) = 0.50\)
    • \(P(\text{guilty}) = 1/1,000,000\) (rate of child-murdering parents)
    • \(P(\text{two natural deaths}) = 1/8500 \times 1/100 \approx 1.2 \times 10^{-6}\)

    Calculate \(P(\text{guilty} \mid \text{two deaths})\).

Exercise Bay.2: Bayesian Election Polling
  1. Consider a Bayesian model for Florida election polling:
Code
library(dslabs)
library(dplyr)
data(polls_us_election_2016)

polls <- polls_us_election_2016 |>
  filter(state == "Florida" & enddate >= "2016-11-04") |>
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)

# Calculate average spread and SE
results <- polls |>
  summarize(
    avg = mean(spread),
    se = sd(spread) / sqrt(n())
  )

Using a prior of \(N(0, 0.01^2)\) (based on Florida being historically close), calculate the posterior mean and 95% credible interval for the true spread.

  2. How does the posterior change if we use a wider prior (\(\tau = 0.05\))? What about a narrower prior (\(\tau = 0.005\))?

40.10 Summary

  • Bayes’ theorem provides a principled way to update beliefs based on evidence
  • The base rate fallacy occurs when prior probabilities are ignored
  • Bayesian inference treats parameters as random variables with distributions
  • The posterior combines prior beliefs with observed data
  • Hierarchical models borrow strength across related observations through shrinkage
  • Credible intervals have a more intuitive interpretation than confidence intervals
  • Prior sensitivity analysis checks whether conclusions depend on prior assumptions
  • Modern Bayesian computation uses MCMC methods implemented in tools like Stan

40.11 Additional Resources

  • Gelman et al. (2013) - The comprehensive reference for Bayesian data analysis
  • Kruschke (2014) - Accessible introduction with R examples
  • McElreath (2020) - Excellent modern treatment with Stan
  • James et al. (2023) - Connects Bayesian ideas to statistical learning