18  Nonparametric Tests

18.1 When Assumptions Fail

Parametric tests like the t-test make assumptions about the underlying data distribution—typically that data are normally distributed with equal variances across groups. When these assumptions are violated, the tests may give misleading results. Nonparametric tests provide alternatives that make fewer assumptions about the data.

Nonparametric methods are sometimes called distribution-free methods because they do not assume a specific probability distribution. Instead, they typically work with ranks or signs of data rather than the raw values. This makes them robust to outliers and applicable to ordinal data where parametric methods would be inappropriate.

18.2 The Mann-Whitney U Test

The Mann-Whitney U test (Mann and Whitney 1947) (also called the Wilcoxon rank-sum test) is the nonparametric equivalent of the two-sample t-test. It tests whether two independent groups tend to have different values, based on comparing the ranks of observations rather than the observations themselves.

The null hypothesis is that the distributions of the two groups are identical. The alternative is that one group tends to have larger values than the other.

Code
# Generate two groups with different means and unequal variances
set.seed(518)
group1 <- sample(rnorm(n = 10000, mean = 2, sd = 0.5), size = 100)
group2 <- sample(rnorm(n = 10000, mean = 5, sd = 1.5), size = 100)

# Mann-Whitney U test
wilcox.test(group1, group2)

    Wilcoxon rank sum test with continuity correction

data:  group1 and group2
W = 440, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

The test works by combining all observations, ranking them, and comparing the sum of ranks in each group. If one group tends to have higher values, its rank sum will be larger than expected by chance.
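To make the mechanics concrete, here is a sketch that reproduces the W statistic reported above from the pooled ranks (using the group1 and group2 vectors generated earlier):

Code
# Pool both groups and rank all observations together
pooled_ranks <- rank(c(group1, group2))
n1 <- length(group1)

# Rank sum of group 1
rank_sum1 <- sum(pooled_ranks[1:n1])

# wilcox.test() reports W: the group 1 rank sum minus its minimum
# possible value, n1 * (n1 + 1) / 2
rank_sum1 - n1 * (n1 + 1) / 2  # matches W = 440 above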

18.3 Wilcoxon Signed-Rank Test

For paired data, the Wilcoxon signed-rank test (Wilcoxon 1945) is the nonparametric alternative to the paired t-test. Its null hypothesis is that the differences between pairs are distributed symmetrically around zero; in practice this is commonly summarized as testing whether the median difference is zero.

Code
# Paired data example
set.seed(123)
before <- rnorm(20, mean = 100, sd = 15)
after <- before + rexp(20, rate = 0.2)  # Skewed improvement

wilcox.test(after, before, paired = TRUE)

    Wilcoxon signed rank exact test

data:  after and before
V = 210, p-value = 1.907e-06
alternative hypothesis: true location shift is not equal to 0

The test calculates the differences between pairs, ranks their absolute values, and considers the signs of the differences. Under the null hypothesis, positive and negative differences should be equally likely and of similar magnitude.
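A sketch of those steps, using the before and after vectors from the example above:

Code
# Differences between pairs
diffs <- after - before

# Rank the absolute differences, keeping track of their signs
abs_ranks <- rank(abs(diffs))

# wilcox.test() reports V: the sum of ranks for positive differences
sum(abs_ranks[diffs > 0])  # matches V = 210 above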

18.4 Kruskal-Wallis Test

The Kruskal-Wallis test extends the Mann-Whitney U test to more than two groups, serving as a nonparametric alternative to one-way ANOVA. It tests whether at least one group tends to have different values from the others.

Code
# Example with three groups
set.seed(42)
# Name the data frame dat to avoid confusion with base R's data()
dat <- data.frame(
  value = c(rexp(30, 0.1), rexp(30, 0.15), rexp(30, 0.2)),
  group = factor(rep(c("A", "B", "C"), each = 30))
)

kruskal.test(value ~ group, data = dat)

    Kruskal-Wallis rank sum test

data:  value by group
Kruskal-Wallis chi-squared = 9.3507, df = 2, p-value = 0.009322

Like ANOVA, a significant Kruskal-Wallis test tells you that groups differ but not which specific groups differ from which others. Post-hoc pairwise comparisons can follow up on a significant result.
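As one way to follow up, base R's pairwise.wilcox.test() runs all pairwise rank-sum tests with a correction for multiple testing (a sketch using Holm's method; Dunn's test, available in add-on packages, is another common choice):

Code
# Pairwise Mann-Whitney tests with Holm-adjusted p-values
pairwise.wilcox.test(dat$value, dat$group, p.adjust.method = "holm")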

18.5 Advantages and Limitations

Advantages of nonparametric tests:

  • They do not require normally distributed data.
  • They are robust to outliers, because they work with ranks rather than raw values.
  • They apply to ordinal data, where the assumption of interval-level measurement would be violated.
  • They often retain good power even when parametric assumptions are met.

Limitations:

  • When parametric assumptions are met, nonparametric tests are slightly less powerful than their parametric counterparts; for the Wilcoxon tests relative to the t-test, the asymptotic relative efficiency under normality is about 0.95.
  • They test hypotheses about distributions or medians rather than means, which may not always align with the research question.
  • They are more difficult to extend to complex designs with multiple factors or covariates.

18.6 Choosing Between Parametric and Nonparametric

The choice depends on your data and research question. If your data are reasonably normal (or your sample is large enough for the Central Limit Theorem to apply) and you care about means, parametric tests are appropriate and efficient. If your data are severely non-normal, contain outliers, or are ordinal in nature, nonparametric tests provide a safer alternative.

With large samples, the Central Limit Theorem makes parametric tests reasonably robust to non-normality, so the choice matters less. With small samples, that protection weakens, and checking assumptions becomes correspondingly more important.
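As a minimal sketch of such a check, applied to the group1 vector from the Mann-Whitney example:

Code
# Shapiro-Wilk test of normality and a normal Q-Q plot
shapiro.test(group1)
qqnorm(group1)
qqline(group1)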

18.7 Frequency Analysis: Chi-Square Tests

When data consist of counts in categories rather than continuous measurements, we need tests designed for categorical data. The chi-square (\(\chi^2\)) test compares observed frequencies to expected frequencies under a null hypothesis.

Goodness-of-Fit Test

The chi-square goodness-of-fit test asks whether observed frequencies match expected proportions. For example, do offspring genotypes follow expected Mendelian ratios?

Code
# Test whether observed counts match expected 3:1 ratio
observed <- c(75, 25)  # Dominant, Recessive phenotypes
expected_ratio <- c(3, 1)
expected <- sum(observed) * expected_ratio / sum(expected_ratio)

# Chi-square test
chisq.test(observed, p = expected_ratio / sum(expected_ratio))

    Chi-squared test for given probabilities

data:  observed
X-squared = 0, df = 1, p-value = 1

The test statistic is:

\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]

where \(O_i\) are observed counts and \(E_i\) are expected counts. Under the null hypothesis (observed = expected), this follows a chi-square distribution with \(k-1\) degrees of freedom, where \(k\) is the number of categories.
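Computed by hand for the 3:1 example above (where the observed counts happen to match the expected counts exactly):

Code
# Chi-square statistic from the observed and expected counts defined above
sum((observed - expected)^2 / expected)  # 0, matching X-squared = 0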

Tests of Independence: Contingency Tables

When we have counts cross-classified by two categorical variables, a contingency table displays the frequencies. The chi-square test of independence asks whether the two variables are associated.

Code
# Example: Is treatment outcome associated with gender?
treatment_data <- matrix(c(
  45, 35,   # Males: Success, Failure
  55, 15    # Females: Success, Failure
), nrow = 2, byrow = TRUE)
rownames(treatment_data) <- c("Male", "Female")
colnames(treatment_data) <- c("Success", "Failure")

treatment_data
       Success Failure
Male        45      35
Female      55      15
Code
# Chi-square test of independence
chisq.test(treatment_data)

    Pearson's Chi-squared test with Yates' continuity correction

data:  treatment_data
X-squared = 7.3962, df = 1, p-value = 0.006536

Expected counts under independence are calculated as:

\[E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total}}\]
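These can be reproduced in R from the margins of the table; they are also returned by chisq.test() in its $expected component:

Code
# Expected counts from row and column totals
outer(rowSums(treatment_data), colSums(treatment_data)) / sum(treatment_data)

# The same matrix, computed by chisq.test()
chisq.test(treatment_data)$expected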

Assumptions of Chi-Square Tests
  • Observations must be independent
  • Expected counts should be at least 5 in each cell (some sources say 80% of cells should have expected counts ≥ 5)
  • For 2×2 tables with small expected counts, use Fisher’s exact test instead

Fisher’s Exact Test

When sample sizes are small, Fisher’s exact test provides exact p-values rather than relying on the chi-square approximation:

Code
# Small sample example
small_table <- matrix(c(3, 1, 1, 3), nrow = 2)
fisher.test(small_table)

    Fisher's Exact Test for Count Data

data:  small_table
p-value = 0.4857
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   0.2117329 621.9337505
sample estimates:
odds ratio 
  6.408309 

G-Test (Likelihood Ratio Test)

The G-test is an alternative to the chi-square test based on the likelihood ratio. Its statistic is additive across nested analyses (G values from subdivided tables sum to the G value for the full table), a property that leads some statisticians to prefer it:

\[G = 2 \sum O_i \ln\left(\frac{O_i}{E_i}\right)\]

Code
# G-test for the treatment data
# Using observed and expected from chi-square
test_result <- chisq.test(treatment_data)
observed_counts <- as.vector(treatment_data)
expected_counts <- as.vector(test_result$expected)

G <- 2 * sum(observed_counts * log(observed_counts / expected_counts))
p_value <- 1 - pchisq(G, df = 1)

cat("G statistic:", round(G, 3), "\n")
G statistic: 8.563 
Code
cat("p-value:", round(p_value, 4), "\n")
p-value: 0.0034 

Odds Ratios

For 2×2 tables, the odds ratio quantifies the strength of association between two binary variables:

\[OR = \frac{a/b}{c/d} = \frac{ad}{bc}\]

where the table is:

          Outcome+ Outcome-
Exposure+        a        b
Exposure-        c        d

An odds ratio of 1 indicates no association. OR > 1 indicates positive association; OR < 1 indicates negative association.

Code
# Calculate odds ratio for treatment data
a <- treatment_data[1, 1]  # Male, Success
b <- treatment_data[1, 2]  # Male, Failure
c <- treatment_data[2, 1]  # Female, Success
d <- treatment_data[2, 2]  # Female, Failure

odds_ratio <- (a * d) / (b * c)
cat("Odds ratio:", round(odds_ratio, 3), "\n")
Odds ratio: 0.351 
Code
# Using fisher.test to get OR with confidence interval
fisher.test(treatment_data)

    Fisher's Exact Test for Count Data

data:  treatment_data
p-value = 0.005273
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.1579843 0.7609357
sample estimates:
odds ratio 
 0.3531441 

An odds ratio of 0.35 indicates that males have about one-third the odds of success compared to females in this example; equivalently, females have roughly 1/0.35 ≈ 2.8 times the odds of success.

McNemar’s Test for Paired Data

When categorical data are paired (e.g., before/after measurements on the same subjects), McNemar’s test is appropriate:

Code
# Before/after treatment: did opinion change?
before_after <- matrix(c(
  40, 10,  # Agree before: Agree after, Disagree after
  25, 25   # Disagree before: Agree after, Disagree after
), nrow = 2, byrow = TRUE)

mcnemar.test(before_after)

    McNemar's Chi-squared test with continuity correction

data:  before_after
McNemar's chi-squared = 5.6, df = 1, p-value = 0.01796

The test focuses on the discordant pairs—cases where the response changed—and asks whether changes in one direction are more common than changes in the other direction.
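A sketch of that calculation from the discordant cells of the table above:

Code
# Discordant cells: responses that changed direction
n_pos <- before_after[1, 2]  # agree before, disagree after (10)
n_neg <- before_after[2, 1]  # disagree before, agree after (25)

# McNemar statistic with continuity correction, as reported above
stat <- (abs(n_pos - n_neg) - 1)^2 / (n_pos + n_neg)  # 5.6
pchisq(stat, df = 1, lower.tail = FALSE)  # p = 0.01796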

18.8 Summary

Nonparametric and frequency-based tests provide alternatives when parametric assumptions fail or data are categorical:

  • Mann-Whitney U and Kruskal-Wallis for comparing groups with non-normal data
  • Wilcoxon signed-rank for paired non-normal data
  • Chi-square tests for categorical data (goodness-of-fit and independence)
  • Fisher’s exact test for small samples
  • Odds ratios to quantify association strength
  • McNemar’s test for paired categorical data

18.9 Exercises

Exercise N.1: Mann-Whitney U Test

You measure enzyme activity (in arbitrary units) in two groups of transgenic plants:

control <- c(23.1, 25.4, 22.8, 26.3, 24.1, 25.9, 23.7, 24.8, 22.5, 26.1)
transgenic <- c(28.4, 31.2, 29.6, 30.1, 27.8, 32.3, 29.9, 28.7, 30.5, 31.8)
  1. Create side-by-side boxplots to visualize the two groups
  2. Assess whether the data appear normally distributed (use Q-Q plots or Shapiro-Wilk test)
  3. Perform a Mann-Whitney U test to compare the two groups
  4. Perform a two-sample t-test for comparison
  5. Which test is more appropriate for these data and why?
  6. Calculate the median and IQR for each group and report these along with your test results
Code
# Your code here
Exercise N.2: Wilcoxon Signed-Rank Test

A researcher measures blood pressure before and after a stress reduction intervention in 12 participants:

before <- c(142, 138, 155, 148, 162, 140, 151, 139, 156, 145, 149, 158)
after <- c(136, 132, 149, 142, 150, 138, 145, 135, 151, 140, 143, 152)
  1. Calculate the paired differences (after - before) and visualize their distribution
  2. Test whether the differences are normally distributed
  3. Perform a Wilcoxon signed-rank test
  4. Perform a paired t-test for comparison
  5. Calculate the median reduction in blood pressure and construct an approximate 95% CI using a bootstrap approach
  6. Report your conclusions about the intervention’s effectiveness
Code
# Your code here
Exercise N.3: Chi-Square Goodness-of-Fit

A genetics experiment produces offspring with four phenotypes. According to Mendelian theory, these should occur in a 9:3:3:1 ratio. You observe the following counts:

observed <- c(315, 108, 101, 32)  # Phenotypes: AB, Ab, aB, ab
  1. Calculate the expected counts based on the 9:3:3:1 ratio
  2. Perform a chi-square goodness-of-fit test
  3. Visualize the observed vs. expected counts with a barplot
  4. Calculate the contribution of each category to the total chi-square statistic
  5. Do the data support the theoretical prediction? Explain your reasoning
Code
# Your code here
Exercise N.4: Chi-Square Test of Independence

A clinical trial tests a new treatment for bacterial infections. Results are classified by treatment group and outcome:

          Cured Not Cured
Treatment    48        12
Control      32        28
  1. Create this 2×2 table in R
  2. Perform a chi-square test of independence
  3. Calculate and interpret the odds ratio
  4. Use Fisher’s exact test and compare the p-value to the chi-square test
  5. Check the expected cell counts—is the chi-square approximation appropriate?
  6. Report your conclusions about treatment effectiveness
Code
# Your code here
Exercise N.5: Choosing the Right Test

For each scenario below, indicate which test is most appropriate (Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis, chi-square, Fisher’s exact, or McNemar’s test) and explain why:

  1. Comparing survival times of patients on three different chemotherapy regimens (n = 15 per group, data are heavily right-skewed)
  2. Testing whether a coin is fair after observing 65 heads in 100 flips
  3. Comparing patient satisfaction ratings (on a scale of 1-5) before and after a hospital redesign
  4. Testing whether smoking status (yes/no) is associated with lung disease (yes/no) in a sample of 30 patients
  5. Comparing memory test scores in four age groups (n = 50 per group, scores range 0-100 but are not normally distributed)
  6. Analyzing a before/after survey where respondents indicate support (yes/no) for a policy

For two of these scenarios, write complete R code to perform the analysis with simulated or provided data.

Code
# Your code here

18.10 Additional Resources

  • Logan (2010) - Comprehensive coverage of nonparametric and categorical data analysis
  • Crawley (2007) - Detailed treatment of chi-square and contingency table methods in R