20.1 The Essence of Modeling
A model is a simplified representation of reality that helps us understand, explain, or predict phenomena. In statistics and data science, models are mathematical relationships between variables that capture patterns in data while ignoring irrelevant details.
“All models are wrong, but some are useful.” — George Box
This famous quote captures the fundamental truth of modeling: no model perfectly represents reality, but a good model can still provide valuable insights and predictions.
20.2 Two Cultures of Statistical Modeling
In his influential 2001 paper, Leo Breiman identified two cultures in statistical modeling:
The Data Modeling Culture (traditional statistics):
Assumes data are generated by a specific stochastic model
Focus on estimating parameters and testing hypotheses
Emphasis on interpretability and understanding mechanisms
Examples: linear regression, ANOVA, generalized linear models
The Algorithmic Modeling Culture (machine learning):
Treats the data-generating mechanism as unknown
Focus on predictive accuracy
Emphasis on performance over interpretability
Examples: random forests, neural networks, boosting
Both approaches have value. Traditional models excel at inference and explanation; algorithmic approaches often produce better predictions. The choice depends on whether your goal is understanding or prediction.
20.3 The General Linear Model Framework
Most statistical models you encounter are special cases of the General Linear Model (GLM), not to be confused with generalized linear models, which extend this framework to non-normal responses:
\[Y = X\beta + \epsilon\]
where:
\(Y\) is the response variable (what we want to predict/explain)
\(X\) is the design matrix of predictor variables
\(\beta\) are coefficients we estimate
\(\epsilon\) is random error
This framework unifies many methods:
| Method | Response Type | Predictors |
|:-------|:--------------|:-----------|
| One-sample t-test | Continuous | None (intercept only) |
| Two-sample t-test | Continuous | One categorical (2 levels) |
| ANOVA | Continuous | One or more categorical |
| Simple regression | Continuous | One continuous |
| Multiple regression | Continuous | Multiple (any type) |
| ANCOVA | Continuous | Mixed categorical and continuous |
The beauty of this unified framework is that once you understand regression, you understand the entire family of linear models.
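To make the unification concrete, here is a minimal sketch using R's built-in sleep dataset (the choice of data is incidental): a two-sample t-test with equal variances and a linear model with one two-level predictor give the same group difference and p-value.

```r
# Two-sample t-test and lm() fit the same underlying linear model
data(sleep)  # built-in data: extra sleep (hours) for two drug groups

t_res  <- t.test(extra ~ group, data = sleep, var.equal = TRUE)
lm_res <- lm(extra ~ group, data = sleep)

t_res$p.value                  # p-value from the t-test
summary(lm_res)$coefficients   # the 'group2' row has the same p-value (t differs only in sign)
diff(t_res$estimate)           # group difference equals the 'group2' coefficient
```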
20.4 Components of a Statistical Model
Every statistical model specifies:
Response variable: What we want to predict or explain
Predictor variables: Information we use to make predictions
Functional form: How predictors relate to the response (linear, polynomial, etc.)
Error structure: Assumptions about variability (normally distributed, etc.)
```r
# Visualizing a simple linear model
set.seed(42)
x <- 1:50
y <- 2 + 0.5*x + rnorm(50, sd = 3)

plot(x, y, pch = 19, col = "steelblue",
     main = "Components of a Linear Model",
     xlab = "Predictor (X)", ylab = "Response (Y)")

# Fitted line (the model)
model <- lm(y ~ x)
abline(model, col = "red", lwd = 2)

# Show residuals for a few points
segments(x[c(10, 25, 40)], y[c(10, 25, 40)],
         x[c(10, 25, 40)], fitted(model)[c(10, 25, 40)],
         col = "darkgreen", lwd = 2)
text(x[25] + 3, (y[25] + fitted(model)[25])/2,
     "ε (residual)", col = "darkgreen")
legend("topleft",
       c("Data points", "Model (E[Y|X])", "Residuals (ε)"),
       pch = c(19, NA, NA), lty = c(NA, 1, 1),
       col = c("steelblue", "red", "darkgreen"), lwd = 2)
```
Figure 20.1: Components of a linear model showing data points, the fitted model line representing E[Y|X], and residuals (ε) as vertical deviations from the line
20.5 Model Fitting: Finding the Best Parameters
Model fitting is the process of finding parameter values that make the model best explain the observed data.
Least Squares
For linear models, least squares minimizes the sum of squared residuals:
\[\text{minimize} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
This produces the best linear unbiased estimators (BLUE) under certain conditions.
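As a minimal sketch of what lm() does under the hood, the least-squares coefficients can be computed directly from the normal equations, \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). The simulated data below are only for illustration.

```r
# Least squares via the normal equations, compared with lm()
set.seed(1)
x <- runif(30)
y <- 1 + 2*x + rnorm(30, sd = 0.4)

X <- cbind(1, x)                           # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # solves (X'X) beta = X'y

beta_hat
coef(lm(y ~ x))                            # same estimates
```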
Maximum Likelihood
Maximum likelihood estimation (MLE) finds parameters that maximize the probability of observing the data:
\[\hat{\theta} = \arg\max_{\theta} L(\theta | \text{data}) = \arg\max_{\theta} \prod_{i=1}^{n} f(y_i | \theta)\]
For normally distributed errors, least squares and MLE give identical results. MLE extends to non-normal distributions and complex models.
```r
# Visualize likelihood for estimating a mean
set.seed(123)
data <- rnorm(20, mean = 5, sd = 2)

# Calculate log-likelihood for different values of mu
mu_values <- seq(3, 7, length.out = 100)
log_lik <- sapply(mu_values, function(mu) {
  sum(dnorm(data, mean = mu, sd = 2, log = TRUE))
})

par(mfrow = c(1, 2))

# Data histogram
hist(data, breaks = 10, main = "Sample Data", xlab = "Value",
     col = "lightblue", border = "white")
abline(v = mean(data), col = "red", lwd = 2)

# Log-likelihood curve
plot(mu_values, log_lik, type = "l", lwd = 2, col = "blue",
     xlab = expression(mu), ylab = "Log-likelihood",
     main = "Maximum Likelihood Estimation")
abline(v = mu_values[which.max(log_lik)], col = "red", lwd = 2, lty = 2)
text(mu_values[which.max(log_lik)], min(log_lik) + 2,
     paste("MLE =", round(mu_values[which.max(log_lik)], 2)))
```
Figure 20.2: Maximum likelihood estimation for the mean of a normal distribution, showing sample data (left) and the log-likelihood curve (right) with maximum at the MLE
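In practice the grid search above is replaced by a numerical optimizer. Here is a minimal sketch using optim() on the same simulated data, estimating both the mean and the standard deviation; the exp() reparameterization is just one way to keep the standard deviation positive.

```r
# Numerical MLE for a normal mean and standard deviation with optim()
set.seed(123)
data <- rnorm(20, mean = 5, sd = 2)

# Negative log-likelihood; par = c(mu, log(sd))
neg_log_lik <- function(par) {
  -sum(dnorm(data, mean = par[1], sd = exp(par[2]), log = TRUE))
}

fit <- optim(par = c(0, 0), fn = neg_log_lik)
c(mean = fit$par[1], sd = exp(fit$par[2]))  # numerical MLEs
mean(data)                                  # analytic MLE of the mean, for comparison
```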
20.6 Overfitting: When Models Learn Too Much
Overfitting occurs when a model captures noise rather than signal—it fits the training data extremely well but fails to generalize to new data.
```r
# Demonstrate overfitting with polynomial regression
set.seed(42)
n <- 20
x <- seq(0, 1, length.out = n)
y_true <- sin(2*pi*x)
y <- y_true + rnorm(n, sd = 0.3)

par(mfrow = c(1, 2))

# Underfitting (too simple)
plot(x, y, pch = 19, main = "Underfitting (degree 1)")
abline(lm(y ~ x), col = "red", lwd = 2)

# Overfitting (too complex)
plot(x, y, pch = 19, main = "Overfitting (degree 15)")
x_new <- seq(0, 1, length.out = 100)
lines(x_new, predict(lm(y ~ poly(x, 15)),
                     newdata = data.frame(x = x_new)),
      col = "red", lwd = 2)
```
Figure 20.3: Demonstration of underfitting (degree 1 polynomial, left) and overfitting (degree 15 polynomial, right) in regression models
Signs of overfitting (see the train/test check after this list):
Model fits training data nearly perfectly
Predictions on new data are poor
Coefficients are extremely large or unstable
Small changes in data produce very different models
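A direct way to check for these signs is to hold out a test set and compare training and test error across model complexities. Below is a minimal sketch using a variant of the sin-curve simulation above, with more points so the data can be split; the 50/50 split and the degree range are arbitrary choices.

```r
# Training error keeps falling with complexity; test error eventually rises
set.seed(42)
n <- 40
x <- seq(0, 1, length.out = n)
y <- sin(2*pi*x) + rnorm(n, sd = 0.3)

train_idx <- sample(n, n/2)                       # random 50/50 train/test split
train <- data.frame(x = x[train_idx], y = y[train_idx])
test  <- data.frame(x = x[-train_idx], y = y[-train_idx])

degrees <- 1:10
errors <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - predict(fit))^2),
    test  = mean((test$y - predict(fit, newdata = test))^2))
})

# Training MSE decreases steadily; test MSE typically turns back up at high degrees
round(data.frame(degree = degrees, t(errors)), 3)
```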
The Bias-Variance Tradeoff
Prediction error has two sources:
Bias: Error from oversimplifying—missing important patterns
Variance: Error from oversensitivity—fitting noise
\[\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]
Simple models have high bias but low variance. Complex models have low bias but high variance. The goal is to find the sweet spot.
```r
# Conceptual bias-variance plot
complexity <- 1:20
bias_sq <- 10 / complexity
variance <- 0.5 * complexity
total <- bias_sq + variance + 2  # irreducible error = 2

plot(complexity, total, type = "l", lwd = 2,
     ylim = c(0, max(total) + 1),
     xlab = "Model Complexity", ylab = "Error",
     main = "Bias-Variance Tradeoff")
lines(complexity, bias_sq, col = "blue", lwd = 2, lty = 2)
lines(complexity, variance, col = "red", lwd = 2, lty = 2)
abline(h = 2, col = "gray", lty = 3)

# Optimal complexity
optimal <- which.min(total)
points(optimal, total[optimal], pch = 19, cex = 1.5, col = "darkgreen")
legend("topright",
       c("Total Error", "Bias²", "Variance", "Irreducible"),
       col = c("black", "blue", "red", "gray"),
       lty = c(1, 2, 2, 3), lwd = 2)
```
Figure 20.4: The bias-variance tradeoff showing how total prediction error decomposes into bias squared, variance, and irreducible error, with optimal model complexity at the minimum total error
20.7 Feature Engineering and Transformations
The raw predictor variables may not capture the true relationship. Feature engineering creates new variables that better represent the underlying patterns.
Common transformations:
Polynomial terms: \(x^2\), \(x^3\) for curved relationships
Log transform: \(\log(x)\) for multiplicative relationships
Interactions: \(x_1 \times x_2\) when effects depend on each other
Categorical encoding: Converting categories to numbers (see the design-matrix sketch after the figure below)
```r
# Example: log transformation
set.seed(42)
x <- runif(50, 1, 100)
y <- 2*log(x) + rnorm(50, sd = 0.5)

par(mfrow = c(1, 2))

# Raw scale - looks nonlinear
plot(x, y, pch = 19, main = "Original Scale",
     xlab = "X", ylab = "Y")
abline(lm(y ~ x), col = "red", lwd = 2)

# Log scale - linear
plot(log(x), y, pch = 19, main = "After Log Transform",
     xlab = "log(X)", ylab = "Y")
abline(lm(y ~ log(x)), col = "red", lwd = 2)
```
Figure 20.5: Effect of log transformation on a nonlinear relationship, showing curved pattern on original scale (left) that becomes linear after log transformation (right)
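The interaction and categorical-encoding transformations listed above are easiest to see in the design matrix itself. Here is a short sketch with the built-in mtcars data; the particular variables are arbitrary.

```r
# Inspect how R encodes categories and interactions in the design matrix
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)   # treat cylinder count as categorical

# Dummy (treatment) coding for cyl plus wt:cyl interaction columns
head(model.matrix(~ wt * cyl, data = mtcars))

# The same columns underlie the fitted model
coef(lm(mpg ~ wt * cyl, data = mtcars))
```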
20.8 Model Selection
When multiple models are possible, how do we choose? Several criteria exist:
Adjusted R²: Penalizes for additional parameters \[R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\]
AIC (Akaike Information Criterion): Balances fit and complexity \[AIC = -2\ln(L) + 2k\]
BIC (Bayesian Information Criterion): Heavier penalty for complexity \[BIC = -2\ln(L) + k\ln(n)\]
Lower AIC/BIC values indicate better models (balancing fit and parsimony).
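The AIC formula can be checked directly against R's AIC() for any fitted model; a quick sketch with an arbitrary mtcars regression:

```r
# AIC = -2*log-likelihood + 2k, where k counts all estimated parameters
data(mtcars)
m  <- lm(mpg ~ wt, data = mtcars)
ll <- logLik(m)              # log-likelihood; df = 3 (intercept, slope, sigma)
k  <- attr(ll, "df")

-2 * as.numeric(ll) + 2 * k  # matches AIC(m)
AIC(m)
```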
```r
# Model selection example
data(mtcars)

# Compare models of increasing complexity
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + disp, data = mtcars)
m4 <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

# Compare using AIC
models <- list(m1, m2, m3, m4)
comparison <- data.frame(
  Model = c("mpg ~ wt", "mpg ~ wt + hp",
            "mpg ~ wt + hp + disp",
            "mpg ~ wt + hp + disp + drat + qsec"),
  R_squared = sapply(models, function(m) summary(m)$r.squared),
  Adj_R_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC = sapply(models, AIC),
  BIC = sapply(models, BIC)
)
knitr::kable(comparison, digits = 2)
```
| Model | R_squared | Adj_R_squared | AIC | BIC |
|:------|----------:|--------------:|-------:|-------:|
| mpg ~ wt | 0.75 | 0.74 | 166.03 | 170.43 |
| mpg ~ wt + hp | 0.83 | 0.81 | 156.65 | 162.52 |
| mpg ~ wt + hp + disp | 0.83 | 0.81 | 158.64 | 165.97 |
| mpg ~ wt + hp + disp + drat + qsec | 0.85 | 0.82 | 158.28 | 168.54 |
20.9 Cross-Validation for Model Assessment
The gold standard for evaluating predictive performance is cross-validation: testing the model on data it hasn’t seen.
K-fold cross-validation:
1. Split data into K parts
2. For each part: train on the other K-1 parts, test on the held-out part
3. Average performance across all K tests
```r
library(boot)

# Compare polynomial degrees using cross-validation
set.seed(42)
n <- 100
x <- seq(0, 4*pi, length.out = n)
y <- sin(x) + rnorm(n, sd = 0.5)
data_cv <- data.frame(x, y)

# Calculate CV error for different polynomial degrees
degrees <- 1:15
cv_errors <- sapply(degrees, function(d) {
  model <- glm(y ~ poly(x, d), data = data_cv)
  cv.glm(data_cv, model, K = 10)$delta[1]
})

plot(degrees, cv_errors, type = "b", pch = 19,
     xlab = "Polynomial Degree", ylab = "CV Error",
     main = "Cross-Validation for Model Selection")
abline(v = which.min(cv_errors), col = "red", lty = 2)
```
Figure 20.6: Cross-validation error for polynomial models of different degrees, showing optimal model complexity at minimum CV error
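The cv.glm() call above hides the mechanics. As a minimal hand-rolled sketch, 10-fold cross-validation for a single model (reusing data_cv from the previous chunk, with a degree-5 polynomial chosen only for illustration) might look like this:

```r
# Hand-rolled K-fold cross-validation for one polynomial model
set.seed(42)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(data_cv)))  # random fold labels

fold_mse <- sapply(1:K, function(k) {
  train <- data_cv[folds != k, ]
  test  <- data_cv[folds == k, ]
  fit <- lm(y ~ poly(x, 5), data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

mean(fold_mse)  # comparable to the degree-5 cv.glm() error above
```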
20.10 Prediction vs. Explanation
Different goals require different approaches:
For Explanation:
Simpler models are often preferable
Focus on coefficient interpretation
Statistical significance matters
Understand which variables drive the relationship
For Prediction:
Model complexity can be higher if it helps
Focus on out-of-sample performance
Accuracy metrics matter most
Understanding why is secondary
In biology and bioengineering, we often want both—models that predict well AND provide mechanistic insight. This tension shapes many modeling decisions.
20.11 Practical Modeling Workflow
Define the question: What are you trying to learn or predict?
Explore the data: Visualize relationships, check distributions, identify issues
Choose candidate models: Based on data type, assumptions, and goals
Fit and evaluate: Use appropriate metrics and validation
Check assumptions: Residual analysis, diagnostic plots
Iterate: Refine based on diagnostics
Report honestly: Including limitations and uncertainty
20.12 Summary
Models are simplified representations of reality that help us understand and predict
The general linear model framework unifies many common statistical methods
Model fitting finds parameters that best explain the data (least squares, MLE)
Overfitting occurs when models learn noise instead of signal
The bias-variance tradeoff governs model complexity choices
Feature engineering can improve model performance
Cross-validation provides honest estimates of predictive performance
Different goals (prediction vs. explanation) may favor different approaches
20.13 Additional Resources
James et al. (2023) - Comprehensive introduction to statistical learning concepts
Crawley (2007) - Practical guide to statistical modeling in R
Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, 16(3), 199-231.
Crawley, Michael J. 2007. The R Book. John Wiley & Sons.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2023. An Introduction to Statistical Learning with Applications in R. 2nd ed. Springer. https://www.statlearning.com.