37  Dimensionality Reduction and Multivariate Methods

37.1 The Challenge of High-Dimensional Data

Modern biology generates datasets with many variables: gene expression across thousands of genes, metabolomic profiles with hundreds of compounds, morphological measurements on many traits. With so many variables, visualization becomes difficult and standard statistical analyses can become unstable or hard to interpret.

A typical machine learning challenge might include hundreds or thousands of predictors. For example, to compare every pair of the 784 features in a digit recognition problem, we would need 306,936 scatterplots, and a single scatterplot of all 784 dimensions is impossible.
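That count is simply the number of distinct feature pairs:

Code
# Number of distinct pairs among 784 features: 784 * 783 / 2
choose(784, 2)
[1] 306936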

Dimensionality reduction creates a smaller set of new variables that capture most of the information in the original data. The general idea is to reduce the dimension of the dataset while preserving important characteristics, such as the distance between features or observations. These techniques help us:

  • Visualize high-dimensional data in 2D or 3D
  • Identify patterns and clusters
  • Remove noise and redundancy
  • Reduce complexity of downstream models
  • Create composite variables for analysis

37.2 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) (Pearson 1901; Hotelling 1933) is the most widely used dimensionality reduction technique. It finds new variables (principal components) that are linear combinations of the originals, chosen to capture maximum variance. The technique behind it—the singular value decomposition—is also useful in many other contexts.

Intuition: Preserving Distance

Before diving into the mathematics, let’s build intuition with a simple example. Consider twin heights data where we have height measurements for 100 pairs of twins—some adults, some children.

Code
set.seed(1988)
library(MASS)   # mvrnorm() for simulating correlated heights
n <- 100
Sigma <- matrix(c(9, 9 * 0.9, 9 * 0.92, 9 * 1), 2, 2)
x <- rbind(mvrnorm(n / 2, c(69, 69), Sigma),
           mvrnorm(n / 2, c(55, 55), Sigma))

lim <- c(48, 78)
rafalib::mypar()
plot(x, xlim = lim, ylim = lim,
     xlab = "Twin 1 Height", ylab = "Twin 2 Height",
     main = "Twin Heights: Adults and Children")
lines(x[c(1, 2), ], col = "blue", lwd = 2)
lines(x[c(2, 51), ], col = "red", lwd = 2)
points(x[c(1, 2, 51), ], pch = 16)
Figure 37.1: Simulated twin heights showing two groups (adults and children) with high correlation between twins

The scatterplot reveals high correlation and two clear groups: adults (upper right) and children (lower left). We can compute distances between observations:

Code
d <- dist(x)
cat("Distance between observations 1 and 2:", round(as.matrix(d)[1, 2], 2), "\n")
Distance between observations 1 and 2: 1.98 
Code
cat("Distance between observations 2 and 51:", round(as.matrix(d)[2, 51], 2), "\n")
Distance between observations 2 and 51: 18.74 

Now suppose we want to reduce from two dimensions to one while preserving these distances. A naive approach is to simply use one of the original variables:

Code
z <- x[, 1]
rafalib::mypar()
plot(dist(x) / sqrt(2), dist(z),
     xlab = "Original distance (scaled)", ylab = "One-dimension distance",
     main = "Distance Approximation Using One Variable")
abline(0, 1, col = "red")
Figure 37.2: Using only one dimension (Twin 1 height) underestimates the true distances

This works reasonably well, but we can do better. Notice that the variation in the data lies mostly along the diagonal. If we transform to the average and difference:

Code
z <- cbind((x[, 2] + x[, 1]) / 2, x[, 2] - x[, 1])
rafalib::mypar()
plot(z, xlim = lim, ylim = lim - mean(lim),
     xlab = "Average Height", ylab = "Difference",
     main = "Transformed Coordinates")
lines(z[c(1, 2), ], col = "blue", lwd = 2)
lines(z[c(2, 51), ], col = "red", lwd = 2)
points(z[c(1, 2, 51), ], pch = 16)
Figure 37.3: After rotating to average and difference coordinates, distances are mostly explained by the first dimension

Now the distances are mostly explained by the first dimension (the average). Using just this one dimension gives better distance approximation:

Code
cat("Typical error with original variable:",
    round(sd(dist(x) - dist(x[, 1]) * sqrt(2)), 2), "\n")
Typical error with original variable: 1.21 
Code
cat("Typical error with average:",
    round(sd(dist(x) - dist(z[, 1]) * sqrt(2)), 2), "\n")
Typical error with average: 0.32 

The average of the twin heights is essentially the first principal component! PCA finds these optimal linear combinations automatically.

Linear Transformations

Each row of the original matrix \(X\) was transformed using a linear transformation to create \(Z\):

\[Z_{i,1} = a_{1,1} X_{i,1} + a_{2,1} X_{i,2}\]

with \(a_{1,1} = 0.5\) and \(a_{2,1} = 0.5\) (the average).

In matrix notation: \[ Z = X A \mbox{ with } A = \begin{pmatrix} 1/2 & -1 \\ 1/2 & 1 \end{pmatrix} \] where the first column of \(A\) produces the average and the second produces the difference \(X_{i,2} - X_{i,1}\) computed in the code above.

Dimension reduction can be described as applying a transformation \(A\) to a matrix \(X\) that moves the information to the first few columns of \(Z = XA\), then keeping just these informative columns.
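As a quick check (a sketch assuming the twin matrix x from above is still in the workspace), the matrix product \(XA\) reproduces the average and difference columns computed earlier:

Code
# Columns of A: the first gives the average, the second the difference x2 - x1
A <- matrix(c(1/2, 1/2, -1, 1), nrow = 2)
Z <- x %*% A

# Both comparisons should return TRUE
all.equal(Z[, 1], (x[, 1] + x[, 2]) / 2)
all.equal(Z[, 2], x[, 2] - x[, 1])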

Orthogonal Transformations

To preserve distances exactly, we need an orthogonal transformation—one where the columns of \(A\) have unit length and are perpendicular:

Code
# Orthogonal transformation
z[, 1] <- (x[, 1] + x[, 2]) / sqrt(2)
z[, 2] <- (x[, 2] - x[, 1]) / sqrt(2)

# This preserves the original distances exactly
cat("Maximum distance difference:", max(abs(dist(z) - dist(x))), "\n")
Maximum distance difference: 3.241851e-14 

An orthogonal rotation preserves all distances while reorganizing the variance. Now most variance is in the first dimension:

Code
library(ggplot2)   # for qplot() and labs()
qplot(z[, 1], bins = 20, color = I("black"), fill = I("steelblue")) +
  labs(x = "First Principal Component", y = "Count",
       title = "PC1 Separates Adults from Children")
Figure 37.4: After orthogonal rotation, the first dimension clearly separates adults from children

The first PC clearly shows the two groups. We’ve reduced two dimensions to one with minimal information loss because the original variables were highly correlated:

Code
cat("Correlation between twin heights:", round(cor(x[, 1], x[, 2]), 3), "\n")
Correlation between twin heights: 0.988 
Code
cat("Correlation between PCs:", round(cor(z[, 1], z[, 2]), 3), "\n")
Correlation between PCs: 0.088 

The Eigenanalysis Foundation

The mathematical foundation involves eigenanalysis: decomposing the covariance (or correlation) matrix to find directions of maximum variation. For a detailed treatment of eigenvalues, eigenvectors, and their interpretation, see Section 52.9.

Given a covariance matrix \(\Sigma\), eigenanalysis finds:

  • Eigenvectors: The directions of the principal components (loadings)
  • Eigenvalues: The variance explained by each component

The first principal component points in the direction of maximum variance. Each subsequent component is orthogonal (uncorrelated) and captures remaining variance in decreasing order.

The total variability can be defined as the sum of variances: \[ v_1 + v_2 + \cdots + v_p \]

An orthogonal transformation preserves total variability but redistributes it. PCA finds the transformation that concentrates variance in the first few components.

Interpreting Eigenvalues and Eigenvectors
  • Eigenvalues tell you how much variance each PC captures—larger means more important
  • Eigenvectors (loadings) tell you how original variables combine to form each PC
  • Variables with large absolute loadings contribute strongly to that PC
  • The sum of all eigenvalues equals the total variance in the data

See Section 52.9.5 for a complete explanation with examples.
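We can verify this connection numerically. The sketch below (assuming the twin-height matrix x from earlier in this chapter) decomposes the covariance matrix with eigen() and compares the result with prcomp(); eigenvector signs may differ between the two.

Code
# Eigenanalysis of the covariance matrix of the twin heights
e <- eigen(cov(x))
e$values    # eigenvalues: variance along each principal direction
e$vectors   # eigenvectors: the loadings

# prcomp() reports the same quantities
pca_twins <- prcomp(x)
pca_twins$sdev^2     # matches e$values
pca_twins$rotation   # matches e$vectors, up to sign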

Figure 37.5: Geometric interpretation of principal components as directions of maximum variance

PCA in R

Code
# PCA on iris data
iris_pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Variance explained
summary(iris_pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
Code
# Scree plot
par(mfrow = c(1, 2))
plot(iris_pca, type = "l", main = "Scree Plot")

# PC scores colored by species
plot(iris_pca$x[, 1:2],
     col = c("red", "green", "blue")[iris$Species],
     pch = 19, xlab = "PC1", ylab = "PC2",
     main = "PCA of Iris Data")
legend("topright", levels(iris$Species),
       col = c("red", "green", "blue"), pch = 19)
Figure 37.6: PCA scree plot and principal component scores for iris data colored by species

Loadings: What Variables Drive Each PC?

Each principal component is defined by its loadings—the coefficients showing how much each original variable contributes:

Code
# Loadings (rotation matrix)
iris_pca$rotation
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

Large absolute loadings indicate that a variable strongly influences that component. The sign indicates the direction of the relationship.

Interpreting PCA Results

Key elements of PCA output:

  • Eigenvalues: Variance explained by each component (shown in scree plot)
  • Proportion of variance: How much of total variance each PC captures
  • Loadings: Coefficients relating original variables to PCs
  • Scores: Values of the new variables for each observation

The first few PCs often capture most of the meaningful variation, allowing you to reduce many variables to just 2-3 for visualization and analysis.

How Many Components to Keep?

Common approaches, with a short code sketch after this list:

  • Keep components with eigenvalues > 1 (Kaiser criterion)
  • Keep enough to explain 80-90% of variance
  • Look for an “elbow” in the scree plot
  • Use cross-validation if using PCs for prediction
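A minimal sketch applying the first three rules to the iris PCA computed above (note that the Kaiser criterion assumes the analysis used scaled variables, as iris_pca did):

Code
eig <- iris_pca$sdev^2                     # eigenvalues (variances of the PCs)
eig > 1                                    # Kaiser criterion
round(cumsum(eig) / sum(eig), 3)           # cumulative proportion of variance
which(cumsum(eig) / sum(eig) >= 0.9)[1]    # smallest number of PCs reaching 90%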

Biplot Visualization

A biplot shows both observations (scores) and variables (loadings) on the same plot:

Code
biplot(iris_pca, col = c("gray50", "red"), cex = 0.7,
       main = "PCA Biplot of Iris Data")
Figure 37.7: PCA biplot showing both observations and variable loadings simultaneously

Arrows show variable loadings—their direction and length indicate how each variable relates to the principal components.

MNIST Example: High-Dimensional Image Data

The real power of PCA becomes apparent with truly high-dimensional data. The MNIST dataset of handwritten digits has 784 features (28×28 pixels). Can we reduce this dimensionality while preserving useful information?

Code
library(dslabs)
if (!exists("mnist")) mnist <- read_mnist()

Because pixels near each other on the image grid are correlated, we expect dimension reduction to work well:

Code
col_means <- colMeans(mnist$test$images)
pca <- prcomp(mnist$train$images)

# Scree plot
pc <- 1:ncol(mnist$test$images)
qplot(pc[1:50], pca$sdev[1:50], geom = "line") +
  labs(x = "Principal Component", y = "Standard Deviation",
       title = "MNIST Scree Plot (first 50 PCs)")
Figure 37.8: Variance explained by principal components of MNIST digit images. The first few PCs capture substantial variance.

The first few PCs capture substantial variance:

Code
summary(pca)$importance[, 1:5]
                             PC1       PC2       PC3       PC4       PC5
Standard deviation     576.82291 493.23822 459.89930 429.85624 408.56680
Proportion of Variance   0.09705   0.07096   0.06169   0.05389   0.04869
Cumulative Proportion    0.09705   0.16801   0.22970   0.28359   0.33228

Even with just two dimensions, we can see structure related to digit class:

Code
library(dplyr)   # for %>% and sample_n()
data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2],
           label = factor(mnist$train$labels)) %>%
  sample_n(2000) %>%
  ggplot(aes(PC1, PC2, fill = label)) +
  geom_point(cex = 3, pch = 21, alpha = 0.7) +
  labs(title = "MNIST Digits in PC Space")
Figure 37.9: First two principal components of MNIST data colored by digit label. Different digits occupy different regions of PC space.

We can visualize what each PC “looks for” by reshaping the loadings back to the 28×28 grid:

Code
library(RColorBrewer)
tmp <- lapply(1:4, function(i) {
  expand.grid(Row = 1:28, Column = 1:28) %>%
    mutate(id = i, label = paste0("PC", i),
           value = pca$rotation[, i])
})
tmp <- Reduce(rbind, tmp)

tmp %>%
  ggplot(aes(Row, Column, fill = value)) +
  geom_raster() +
  scale_y_reverse() +
  scale_fill_gradientn(colors = brewer.pal(9, "RdBu")) +
  facet_wrap(~label, nrow = 1) +
  theme_minimal() +
  labs(title = "PCA Loadings as Images")
Figure 37.10: First four principal component loadings visualized as images. Each PC captures different aspects of digit structure.

Applying PCA to Improve Classification

PCA can reduce model complexity while maintaining predictive performance. Let’s use 36 PCs (explaining ~80% of variance) for kNN classification:

Code
library(caret)
k <- 36
x_train <- pca$x[, 1:k]
y <- factor(mnist$train$labels)
fit <- knn3(x_train, y)

# Transform test set using training PCA
x_test <- sweep(mnist$test$images, 2, col_means) %*% pca$rotation
x_test <- x_test[, 1:k]

# Predict and evaluate
y_hat <- predict(fit, x_test, type = "class")
confusionMatrix(y_hat, factor(mnist$test$labels))$overall["Accuracy"]
Accuracy 
  0.9751 

With just 36 dimensions (instead of 784), we achieve excellent accuracy. This demonstrates the power of PCA for both visualization and model simplification.
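Another way to see how much information 36 components retain is to reconstruct an image from them. A minimal sketch, assuming the pca object from above (prcomp centers the data by default, so pca$center holds the pixel means):

Code
# Approximate the first training image from its first k PC scores
k <- 36
approx_img <- as.vector(pca$x[1, 1:k] %*% t(pca$rotation[, 1:k])) + pca$center

# Compare the original digit with its 36-component approximation
rafalib::mypar(1, 2)
image(matrix(mnist$train$images[1, ], 28, 28)[, 28:1],
      main = "Original", xaxt = "n", yaxt = "n")
image(matrix(approx_img, 28, 28)[, 28:1],
      main = "k = 36 approximation", xaxt = "n", yaxt = "n")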

37.3 Principal Coordinate Analysis (PCoA)

While PCA works from the covariance (or correlation) structure of the variables, Principal Coordinate Analysis (PCoA) (also called Metric Multidimensional Scaling) starts with a dissimilarity matrix among observations. This is valuable when:

  • You have a meaningful distance metric (e.g., genetic distances)
  • Variables are mixed types or non-numeric
  • The data are counts (e.g., microbiome data)
Code
# PCoA example using Euclidean distances
dist_matrix <- dist(iris[, 1:4])
pcoa_result <- cmdscale(dist_matrix, k = 2, eig = TRUE)

# Proportion of variance explained
eig_vals <- pcoa_result$eig[pcoa_result$eig > 0]
var_explained <- eig_vals / sum(eig_vals)
cat("Variance explained by first two axes:",
    round(sum(var_explained[1:2]) * 100, 1), "%\n")
Variance explained by first two axes: 97.8 %
Code
# Plot
plot(pcoa_result$points,
     col = c("red", "green", "blue")[iris$Species],
     pch = 19,
     xlab = paste0("PCoA1 (", round(var_explained[1]*100, 1), "%)"),
     ylab = paste0("PCoA2 (", round(var_explained[2]*100, 1), "%)"),
     main = "PCoA of Iris Data")
legend("topright", levels(iris$Species),
       col = c("red", "green", "blue"), pch = 19)
Figure 37.11: Principal Coordinate Analysis (PCoA) ordination of iris data using Euclidean distances

When to Use PCoA vs. PCA

  • PCA: Variables are measured on a common scale; interested in variable contributions
  • PCoA: Have a distance matrix; want to preserve distances among samples
  • For Euclidean distances on centered, unscaled data, PCA and PCoA give equivalent results (see the check below)
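A quick sketch of that equivalence using the iris measurements: each PCoA axis should correlate perfectly (up to sign) with the corresponding score from an unscaled PCA.

Code
# PCoA on Euclidean distances vs. PCA on the same centered, unscaled data
pca_unscaled <- prcomp(iris[, 1:4])              # centered, not scaled
pcoa_axes <- cmdscale(dist(iris[, 1:4]), k = 2)

# Axis-wise correlations: each should be +1 or -1
round(diag(cor(pca_unscaled$x[, 1:2], pcoa_axes)), 3)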

37.4 Non-Metric Multidimensional Scaling (NMDS)

NMDS is an ordination technique that preserves the rank order of distances rather than the exact distances. It's widely used in ecology because it makes minimal distributional assumptions, requiring only a monotonic relationship between dissimilarity and ordination distance.

Code
# NMDS example
library(vegan)
nmds_result <- metaMDS(iris[, 1:4], k = 2, trymax = 100, trace = FALSE)

# Stress value indicates fit (< 0.1 is good, < 0.2 is acceptable)
cat("Stress:", round(nmds_result$stress, 3), "\n")
Stress: 0.038 
Code
plot(nmds_result$points,
     col = c("red", "green", "blue")[iris$Species],
     pch = 19, xlab = "NMDS1", ylab = "NMDS2",
     main = "NMDS of Iris Data")
legend("topright", levels(iris$Species),
       col = c("red", "green", "blue"), pch = 19)
Figure 37.12: Non-metric multidimensional scaling (NMDS) ordination of iris data

Metric vs. Non-Metric Methods

PCA and metric PCoA produce scores on an interval scale—differences between scores are meaningful. These can be used directly in linear models.

Non-metric multidimensional scaling (NMDS) produces ordinal rankings only. NMDS scores should not be used in parametric analyses like ANOVA or regression.

37.5 MANOVA: Multivariate Analysis of Variance

When you have multiple response variables and want to test for group differences, MANOVA (Multivariate Analysis of Variance) is the appropriate technique. It extends ANOVA to multiple dependent variables simultaneously.

Why Not Multiple ANOVAs?

Running separate ANOVAs on each variable:

  • Ignores correlations among response variables
  • Inflates Type I error rate with multiple tests
  • May miss differences only apparent when variables are considered together

MANOVA tests whether group centroids differ in multivariate space.

The MANOVA Framework

MANOVA decomposes the total multivariate variation:

\[\mathbf{T} = \mathbf{H} + \mathbf{E}\]

where:

  • T: Total sum of squares and cross-products matrix
  • H: Hypothesis (between-groups) matrix
  • E: Error (within-groups) matrix

These are matrices because we have multiple response variables.

Figure 37.13: MANOVA decomposes multivariate variation into between-group and within-group components

Test Statistics

Several test statistics exist for MANOVA, each a function of the eigenvalues of \(\mathbf{HE}^{-1}\):

Statistic                 Description
Wilks’ Lambda (Λ)         Product of 1/(1 + λᵢ); most commonly used
Hotelling-Lawley Trace    Sum of the eigenvalues λᵢ
Pillai’s Trace            Sum of λᵢ/(1 + λᵢ); most robust
Roy’s Largest Root        Largest eigenvalue; most powerful but most sensitive to assumption violations

Pillai’s Trace is generally recommended because it’s most robust to violations of assumptions.
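To connect these formulas to numbers, here is a by-hand sketch for the iris data, using lm() to obtain the group fits; the Pillai and Wilks values computed this way should agree with the manova() output in the next subsection.

Code
# By-hand MANOVA matrices for the iris data
Y <- as.matrix(iris[, 1:4])
fit <- lm(Y ~ Species, data = iris)

grand <- matrix(colMeans(Y), nrow(Y), ncol(Y), byrow = TRUE)
H <- crossprod(fitted(fit) - grand)   # between-groups (hypothesis) SSCP
E <- crossprod(residuals(fit))        # within-groups (error) SSCP

# Check the decomposition T = H + E (difference should be ~0)
max(abs(H + E - crossprod(sweep(Y, 2, colMeans(Y)))))

# Eigenvalues of H E^{-1}; only min(groups - 1, p) of them are nonzero
lam <- Re(eigen(H %*% solve(E))$values)
sum(lam / (1 + lam))   # Pillai's trace
prod(1 / (1 + lam))    # Wilks' lambda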

MANOVA in R

Code
# MANOVA on iris data
manova_model <- manova(cbind(Sepal.Length, Sepal.Width,
                              Petal.Length, Petal.Width) ~ Species,
                       data = iris)

# Summary with different test statistics
summary(manova_model, test = "Pillai")
           Df Pillai approx F num Df den Df    Pr(>F)    
Species     2 1.1919   53.466      8    290 < 2.2e-16 ***
Residuals 147                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
summary(manova_model, test = "Wilks")
           Df    Wilks approx F num Df den Df    Pr(>F)    
Species     2 0.023439   199.15      8    288 < 2.2e-16 ***
Residuals 147                                              
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The significant result tells us that species differ in their multivariate centroid—the combination of all four measurements.

Follow-Up Analyses

A significant MANOVA should be followed by:

  1. Univariate ANOVAs to see which variables differ
  2. Discriminant Function Analysis to understand how groups differ
Code
# Univariate follow-ups
summary.aov(manova_model)
 Response Sepal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals   147 38.956   0.265                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Sepal.Width :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 11.345  5.6725   49.16 < 2.2e-16 ***
Residuals   147 16.962  0.1154                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Petal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 437.10 218.551  1180.2 < 2.2e-16 ***
Residuals   147  27.22   0.185                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Petal.Width :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 80.413  40.207  960.01 < 2.2e-16 ***
Residuals   147  6.157   0.042                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

MANOVA Assumptions

MANOVA assumes:

  1. Multivariate normality within groups
  2. Homogeneity of covariance matrices across groups
  3. Independence of observations
  4. No multicollinearity among response variables

Test homogeneity of covariance matrices with Box’s M test (though it’s sensitive to non-normality):

Code
# Box's M test (requires biotools package)
# library(biotools)
# boxM(iris[, 1:4], iris$Species)

37.6 Discriminant Function Analysis (DFA)

Discriminant Function Analysis (DFA, also called Linear Discriminant Analysis or LDA) finds linear combinations of variables that best separate groups. It complements MANOVA by showing how groups differ.

The Goal of DFA

DFA finds discriminant functions—weighted combinations of original variables—that maximize separation between groups while minimizing variation within groups.

The first discriminant function captures the most separation, the second captures remaining separation orthogonal to the first, and so on.

Figure 37.14: Discriminant function analysis finds linear combinations that maximize group separation

DFA in R

Code
# Linear Discriminant Analysis (lda() comes from the MASS package)
library(MASS)
lda_model <- lda(Species ~ ., data = iris)

# View the model
lda_model
Call:
lda(Species ~ ., data = iris)

Prior probabilities of groups:
    setosa versicolor  virginica 
 0.3333333  0.3333333  0.3333333 

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

Coefficients of linear discriminants:
                    LD1         LD2
Sepal.Length  0.8293776 -0.02410215
Sepal.Width   1.5344731 -2.16452123
Petal.Length -2.2012117  0.93192121
Petal.Width  -2.8104603 -2.83918785

Proportion of trace:
   LD1    LD2 
0.9912 0.0088 
Code
# Discriminant scores
lda_scores <- predict(lda_model)$x

# Plot
plot(lda_scores,
     col = c("red", "green", "blue")[iris$Species],
     pch = 19,
     main = "Discriminant Function Scores",
     xlab = "LD1", ylab = "LD2")
legend("topright", levels(iris$Species),
       col = c("red", "green", "blue"), pch = 19)
Figure 37.15: Linear discriminant analysis showing group separation along discriminant functions

Interpreting DFA Output

Key components:

  • Coefficients of linear discriminants: Weights for creating discriminant scores
  • Proportion of trace: Variance explained by each discriminant function
  • Group means: Average score on each discriminant function for each group
Code
# Coefficients (loadings)
lda_model$scaling
                    LD1         LD2
Sepal.Length  0.8293776 -0.02410215
Sepal.Width   1.5344731 -2.16452123
Petal.Length -2.2012117  0.93192121
Petal.Width  -2.8104603 -2.83918785
Code
# Proportion of separation explained
lda_model$svd^2 / sum(lda_model$svd^2)
[1] 0.991212605 0.008787395

Using DFA for Prediction

DFA can classify new observations into groups based on their discriminant scores:

Code
# Classification accuracy
predictions <- predict(lda_model)$class
table(Predicted = predictions, Actual = iris$Species)
            Actual
Predicted    setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         1
  virginica       0          2        49
Code
# Classification accuracy
mean(predictions == iris$Species)
[1] 0.98

Cross-Validated Classification

For honest estimates of classification accuracy, use leave-one-out cross-validation:

Code
# Cross-validated LDA
lda_cv <- lda(Species ~ ., data = iris, CV = TRUE)

# Cross-validated classification table
table(Predicted = lda_cv$class, Actual = iris$Species)
            Actual
Predicted    setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         1
  virginica       0          2        49
Code
# Cross-validated accuracy
mean(lda_cv$class == iris$Species)
[1] 0.98

DFA for Biomarker Discovery

DFA is valuable for identifying which variables best distinguish groups—useful in biomarker discovery. The raw coefficients below are comparable because the iris variables share a common scale (cm); standardize variables first if their scales differ:

Code
# Which variables contribute most to separation?
scaling_df <- data.frame(
  Variable = rownames(lda_model$scaling),
  LD1 = abs(lda_model$scaling[, 1]),
  LD2 = abs(lda_model$scaling[, 2])
)

barplot(scaling_df$LD1, names.arg = scaling_df$Variable,
        main = "Variable Contributions to LD1",
        ylab = "Absolute Coefficient",
        col = "steelblue")
Figure 37.16: Variable contributions to the first linear discriminant function

37.7 Comparing Methods

Method   Input                Output                Supervision    Best For
PCA      Variables            Continuous scores     None           Reducing correlated variables
PCoA     Distance matrix      Continuous scores     None           Preserving sample distances
NMDS     Distance matrix      Ordinal scores        None           Ecological community data
MANOVA   Variables + groups   Test statistics       Groups known   Testing group differences
DFA      Variables + groups   Discriminant scores   Groups known   Classifying observations

For clustering methods (hierarchical, k-means) that group observations based on similarity, see Chapter 36.

37.8 Using Ordination Scores in Further Analyses

PC scores and discriminant scores are legitimate new variables that can be used in downstream analysis:

  • Regression of scores on other continuous variables
  • ANOVA comparing groups on ordination scores
  • Correlation of scores with environmental gradients

This is valuable when you have many correlated variables and want to reduce dimensionality before hypothesis testing.

Code
# Use PC scores in ANOVA
pc_scores <- data.frame(
  PC1 = iris_pca$x[, 1],
  PC2 = iris_pca$x[, 2],
  Species = iris$Species
)

summary(aov(PC1 ~ Species, data = pc_scores))
             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  406.4  203.21    1051 <2e-16 ***
Residuals   147   28.4    0.19                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

37.9 Practical Workflow

  1. Explore data: Check for outliers, missing values, scaling issues

  2. Standardize if needed: Especially important when variables are on different scales (see the sketch after this list)

  3. Choose appropriate method: Based on your data type and question

  4. Examine output: Scree plots, loadings, clustering diagnostics

  5. Validate: Cross-validation for classification; permutation tests for significance

  6. Interpret biologically: What do the patterns mean in your system?
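To illustrate step 2 with a concrete sketch: the iris measurements have unequal variances, so an unscaled PCA is dominated by the highest-variance variable, while scaling spreads the influence more evenly.

Code
apply(iris[, 1:4], 2, var)    # variances differ across variables

# Proportion of variance on PC1, without and with scaling
summary(prcomp(iris[, 1:4], scale. = FALSE))$importance[2, 1]
summary(prcomp(iris[, 1:4], scale. = TRUE))$importance[2, 1]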

37.10 Exercises

Exercise DR.1: PCA Exploration
  1. We want to explore the tissue_gene_expression predictors by plotting them.
Code
data("tissue_gene_expression")
dim(tissue_gene_expression$x)

We want to get an idea of which observations are close to each other, but the predictors are 500-dimensional so plotting is difficult. Plot the first two principal components with color representing tissue type.

  2. The predictors for each observation are measured on the same device and experimental procedure. This introduces biases that can affect all the predictors from one observation. For each observation, compute the average across all predictors and then plot this against the first PC with color representing tissue. Report the correlation.

  3. We see an association with the first PC and the observation averages. Redo the PCA but only after removing the center (row means).

  4. For the first 10 PCs, make a boxplot showing the values for each tissue.

  5. Plot the percent variance explained by PC number. Hint: use the summary function.

Exercise DR.2: Distance Matrices
  1. Load the following dataset and compute distances:
Code
data("tissue_gene_expression")

Compare the distance between the first two observations (both cerebellums), the 39th and 40th (both colons), and the 73rd and 74th (both endometriums). Are observations of the same tissue type closer to each other?

  2. Make an image plot of all the distances using the image function. Hint: convert the distance object to a matrix first. Does the pattern suggest that observations of the same tissue type are generally closer?

37.11 Non-Linear Methods

The methods covered in this chapter are primarily linear dimensionality reduction techniques. PCA finds linear combinations of variables, and PCoA preserves Euclidean distances. However, biological data often lies on complex, non-linear manifolds.

For visualization of complex, non-linear structure, see Chapter 38 which covers:

  • t-SNE (t-distributed Stochastic Neighbor Embedding): Excellent for visualizing local cluster structure
  • UMAP (Uniform Manifold Approximation and Projection): Faster than t-SNE and may better preserve global structure

These non-linear methods are particularly valuable for single-cell RNA-seq data and other high-dimensional biological datasets with complex structure.

37.12 Summary

  • Dimensionality reduction creates fewer variables that capture most information
  • PCA finds linear combinations that maximize variance
    • Orthogonal transformations preserve distances while concentrating variance
    • Highly correlated variables can be effectively reduced to fewer dimensions
    • Loadings show which original variables contribute to each PC
    • The twin heights example shows how correlated variables compress to one dimension
  • PCoA works from distance matrices; useful for ecological and genetic data
  • NMDS preserves rank-order of distances; robust for non-normal data
  • MANOVA tests whether groups differ on multiple response variables simultaneously
  • DFA/LDA finds combinations that best discriminate known groups
  • These methods can be combined: use PCA to reduce dimensions, then classify
  • PCA can substantially reduce model complexity while maintaining predictive performance
  • For non-linear structure, consider t-SNE or UMAP (see Chapter 38)

37.13 Additional Resources

  • Chapter 52 - Matrix algebra fundamentals and eigenanalysis for PCA
  • Chapter 38 - Non-linear dimensionality reduction with t-SNE and UMAP
  • James et al. (2023) - Modern treatment of dimensionality reduction and clustering
  • Logan (2010) - MANOVA and DFA in biological research contexts
  • Borcard, D., Gillet, F., & Legendre, P. (2018). Numerical Ecology with R - Comprehensive ordination methods