18.1 When Assumptions Fail
Parametric tests like the t-test make assumptions about the underlying data distribution—typically that data are normally distributed with equal variances across groups. When these assumptions are violated, the tests may give misleading results. Nonparametric tests provide alternatives that make fewer assumptions about the data.
Nonparametric methods are sometimes called distribution-free methods because they do not assume a specific probability distribution. Instead, they typically work with ranks or signs of data rather than the raw values. This makes them robust to outliers and applicable to ordinal data where parametric methods would be inappropriate.
18.2 The Mann-Whitney U Test
The Mann-Whitney U test (Mann and Whitney 1947) (also called the Wilcoxon rank-sum test) is the nonparametric equivalent of the two-sample t-test. It tests whether two independent groups tend to have different values, based on comparing the ranks of observations rather than the observations themselves.
The null hypothesis is that the distributions of the two groups are identical. The alternative is that one group tends to have larger values than the other.
```r
# Generate two independent samples with different locations and spreads
set.seed(518)
group1 <- sample(rnorm(n = 10000, mean = 2, sd = 0.5), size = 100)
group2 <- sample(rnorm(n = 10000, mean = 5, sd = 1.5), size = 100)

# Mann-Whitney U test
wilcox.test(group1, group2)
```
Wilcoxon rank sum test with continuity correction
data: group1 and group2
W = 440, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
The test works by combining all observations, ranking them, and comparing the sum of ranks in each group. If one group tends to have higher values, its rank sum will be larger than expected by chance.
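To see the mechanics, here is a minimal sketch that recomputes the test statistic by hand, reusing group1 and group2 from above. R reports W as the rank sum of the first sample minus its smallest possible value:

```r
# Rank all observations together, then sum the ranks of group 1
ranks <- rank(c(group1, group2))
R1 <- sum(ranks[seq_along(group1)])
n1 <- length(group1)

# W as reported by wilcox.test(): rank sum minus its minimum, n1 * (n1 + 1) / 2
W <- R1 - n1 * (n1 + 1) / 2
W
```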
18.3 Wilcoxon Signed-Rank Test
For paired data, the Wilcoxon signed-rank test (Wilcoxon 1945) is the nonparametric alternative to the paired t-test. It tests whether the median difference between pairs is zero.
```r
# Paired data example
set.seed(123)
before <- rnorm(20, mean = 100, sd = 15)
after <- before + rexp(20, rate = 0.2)  # Skewed improvement
wilcox.test(after, before, paired = TRUE)
```
Wilcoxon signed rank exact test
data: after and before
V = 210, p-value = 1.907e-06
alternative hypothesis: true location shift is not equal to 0
The test calculates the differences between pairs, ranks their absolute values, and considers the signs of the differences. Under the null hypothesis, positive and negative differences should be equally likely and of similar magnitude.
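A minimal sketch of that computation, reusing before and after from above (R's V statistic is the sum of the ranks attached to the positive differences):

```r
# Signed-rank statistic by hand
d <- after - before
d <- d[d != 0]      # zero differences are dropped
r <- rank(abs(d))   # rank the absolute differences
V <- sum(r[d > 0])  # sum the ranks of the positive differences
V                   # matches the V reported by wilcox.test(..., paired = TRUE)
```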
18.4 Kruskal-Wallis Test
The Kruskal-Wallis test extends the Mann-Whitney U test to more than two groups, serving as a nonparametric alternative to one-way ANOVA. It tests whether at least one group tends to have different values from the others.
```r
# Example with three groups
set.seed(42)
data <- data.frame(
  value = c(rexp(30, 0.1), rexp(30, 0.15), rexp(30, 0.2)),
  group = factor(rep(c("A", "B", "C"), each = 30))
)
kruskal.test(value ~ group, data = data)
```
Kruskal-Wallis rank sum test
data: value by group
Kruskal-Wallis chi-squared = 9.3507, df = 2, p-value = 0.009322
Like ANOVA, a significant Kruskal-Wallis test tells you that groups differ but not which specific groups differ from which others. Post-hoc pairwise comparisons can follow up on a significant result.
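One convenient option in base R is pairwise.wilcox.test(), which runs all pairwise rank-sum tests with a p-value adjustment; continuing the example above:

```r
# Pairwise Mann-Whitney tests with Holm correction for multiple comparisons
pairwise.wilcox.test(data$value, data$group, p.adjust.method = "holm")
```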
18.5 Advantages and Limitations
**Advantages of nonparametric tests:**
Nonparametric tests do not require normally distributed data. They are robust to outliers since they work with ranks rather than raw values. They can be applied to ordinal data where the assumption of interval-level measurement would be violated. And they often retain good power even when parametric assumptions hold: the asymptotic relative efficiency of the Mann-Whitney test versus the t-test under normality is about 0.95.
**Limitations:**
When parametric assumptions are met, nonparametric tests are slightly less powerful than their parametric counterparts. They test hypotheses about distributions or medians rather than means, which may not always align with research questions. They can be more difficult to extend to complex designs with multiple factors or covariates.
18.6 Choosing Between Parametric and Nonparametric
The choice depends on your data and research question. If your data are reasonably normal (or your sample is large enough for the Central Limit Theorem to apply) and you care about means, parametric tests are appropriate and efficient. If your data are severely non-normal, contain outliers, or are ordinal in nature, nonparametric tests provide a safer alternative.
With large samples, the Central Limit Theorem makes parametric tests of means fairly robust to non-normality, so the choice matters less. With small samples, checking assumptions becomes more important.
18.7 Frequency Analysis: Chi-Square Tests
When data consist of counts in categories rather than continuous measurements, we need tests designed for categorical data. The chi-square (\(\chi^2\)) test compares observed frequencies to expected frequencies under a null hypothesis.
Goodness-of-Fit Test
The chi-square goodness-of-fit test asks whether observed frequencies match expected proportions. For example, do offspring genotypes follow expected Mendelian ratios?
```r
# Test whether observed counts match an expected 3:1 ratio
observed <- c(75, 25)  # Dominant, Recessive phenotypes
expected_ratio <- c(3, 1)
expected <- sum(observed) * expected_ratio / sum(expected_ratio)

# Chi-square test
chisq.test(observed, p = expected_ratio / sum(expected_ratio))
```
Chi-squared test for given probabilities
data: observed
X-squared = 0, df = 1, p-value = 1
The test statistic is:
\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]
where \(O_i\) are observed counts and \(E_i\) are expected counts. Under the null hypothesis (observed = expected), this follows a chi-square distribution with \(k-1\) degrees of freedom, where \(k\) is the number of categories.
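The computation is easy to verify by hand. This sketch reproduces the statistic and p-value for the 3:1 example above (trivially 0 and 1 here, since the observed counts match the expected counts exactly):

```r
# Chi-square statistic from the formula, using observed and expected from above
chisq_stat <- sum((observed - expected)^2 / expected)
chisq_stat
pchisq(chisq_stat, df = length(observed) - 1, lower.tail = FALSE)
```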
Tests of Independence: Contingency Tables
When we have counts cross-classified by two categorical variables, a contingency table displays the frequencies. The chi-square test of independence asks whether the two variables are associated.
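The following example asks whether treatment outcome is associated with gender:

```r
# Example: Is treatment outcome associated with gender?
treatment_data <- matrix(c(
  45, 35,  # Males: Success, Failure
  55, 15   # Females: Success, Failure
), nrow = 2, byrow = TRUE)
rownames(treatment_data) <- c("Male", "Female")
colnames(treatment_data) <- c("Success", "Failure")
treatment_data

# Chi-square test of independence
chisq.test(treatment_data)
```

Expected counts under independence are calculated as:

\[E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total}}\]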
Assumptions of chi-square tests:

- Observations must be independent
- Expected counts should be at least 5 in each cell (some sources say 80% of cells should have expected counts ≥ 5)
- For 2×2 tables with small expected counts, use Fisher's exact test instead
Fisher’s Exact Test
When sample sizes are small, Fisher’s exact test provides exact p-values rather than relying on the chi-square approximation:
```r
# Small sample example
small_table <- matrix(c(3, 1, 1, 3), nrow = 2)
fisher.test(small_table)
```
Fisher's Exact Test for Count Data
data: small_table
p-value = 0.4857
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.2117329 621.9337505
sample estimates:
odds ratio
6.408309
G-Test (Likelihood Ratio Test)
The G-test is an alternative to the chi-square test based on the likelihood ratio. It has better theoretical properties and is preferred by some statisticians. The statistic is

\[G = 2 \sum O_i \ln\left(\frac{O_i}{E_i}\right)\]

and, under the null hypothesis, it follows approximately the same chi-square distribution as the Pearson statistic.
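For the treatment data above, G can be computed from the observed and expected counts returned by chisq.test():

```r
# G-test for the treatment data, using expected counts from chisq.test()
test_result <- chisq.test(treatment_data)
observed_counts <- as.vector(treatment_data)
expected_counts <- as.vector(test_result$expected)

G <- 2 * sum(observed_counts * log(observed_counts / expected_counts))
p_value <- 1 - pchisq(G, df = 1)
cat("G statistic:", round(G, 3), "\n")
cat("p-value:", round(p_value, 4), "\n")
```

Odds Ratios
For 2×2 tables, the odds ratio quantifies the strength of association between two binary variables:

\[OR = \frac{a/b}{c/d} = \frac{ad}{bc}\]

where the table is:

|           | Outcome+ | Outcome- |
|:----------|:---------|:---------|
| Exposure+ | a        | b        |
| Exposure- | c        | d        |

An odds ratio of 1 indicates no association; OR > 1 indicates a positive association and OR < 1 a negative one.

```r
# Calculate the odds ratio for the treatment data
a <- treatment_data[1, 1]  # Male, Success
b <- treatment_data[1, 2]  # Male, Failure
c <- treatment_data[2, 1]  # Female, Success
d <- treatment_data[2, 2]  # Female, Failure
odds_ratio <- (a * d) / (b * c)
cat("Odds ratio:", round(odds_ratio, 3), "\n")
```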
```r
# Using fisher.test() to get the OR with a confidence interval
fisher.test(treatment_data)
```
Fisher's Exact Test for Count Data
data: treatment_data
p-value = 0.005273
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.1579843 0.7609357
sample estimates:
odds ratio
0.3531441
An odds ratio of 0.35 indicates that males have lower odds of success compared to females in this example.
McNemar’s Test for Paired Data
When categorical data are paired (e.g., before/after measurements on the same subjects), McNemar’s test is appropriate:
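```r
# Before/after treatment: did opinion change?
before_after <- matrix(c(
  40, 10,  # Agree before: Agree after, Disagree after
  25, 25   # Disagree before: Agree after, Disagree after
), nrow = 2, byrow = TRUE)
mcnemar.test(before_after)
```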
McNemar's Chi-squared test with continuity correction
data: before_after
McNemar's chi-squared = 5.6, df = 1, p-value = 0.01796
The test focuses on the discordant pairs—cases where the response changed—and asks whether changes in one direction are more common than changes in the other direction.
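The continuity-corrected statistic is easy to reproduce from the two discordant cells, as a check on the output above:

```r
# McNemar's statistic from the discordant cells, with continuity correction
changed_down <- before_after[1, 2]  # Agree before, Disagree after (10)
changed_up <- before_after[2, 1]    # Disagree before, Agree after (25)
stat <- (abs(changed_down - changed_up) - 1)^2 / (changed_down + changed_up)
stat                                # 5.6, matching mcnemar.test()
pchisq(stat, df = 1, lower.tail = FALSE)
```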
18.8 Summary
Nonparametric and frequency-based tests provide alternatives when parametric assumptions fail or data are categorical:
- Mann-Whitney U and Kruskal-Wallis for comparing groups with non-normal data
- Wilcoxon signed-rank for paired non-normal data
- Chi-square tests for categorical data (goodness-of-fit and independence)
- Fisher's exact test for small samples
- Odds ratios to quantify association strength
- McNemar's test for paired categorical data
18.9 Exercises
Exercise N.1: Mann-Whitney U Test
You measure enzyme activity (in arbitrary units) in two groups of transgenic plants:
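```r
control <- c(23.1, 25.4, 22.8, 26.3, 24.1, 25.9, 23.7, 24.8, 22.5, 26.1)
transgenic <- c(28.4, 31.2, 29.6, 30.1, 27.8, 32.3, 29.9, 28.7, 30.5, 31.8)
```

a) Create side-by-side boxplots to visualize the two groups
b) Assess whether the data appear normally distributed (use Q-Q plots or the Shapiro-Wilk test)
c) Perform a Mann-Whitney U test to compare the two groups
d) Perform a two-sample t-test for comparison
e) Which test is more appropriate for these data and why?
f) Calculate the median and IQR for each group and report these along with your test results

```r
# Your code here
```

Exercise N.2: Wilcoxon Signed-Rank Test
A researcher measures blood pressure before and after a stress reduction intervention in 12 participants:

```r
before <- c(142, 138, 155, 148, 162, 140, 151, 139, 156, 145, 149, 158)
after <- c(136, 132, 149, 142, 150, 138, 145, 135, 151, 140, 143, 152)
```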
a) Calculate the paired differences (after - before) and visualize their distribution
b) Test whether the differences are normally distributed
c) Perform a Wilcoxon signed-rank test
d) Perform a paired t-test for comparison
e) Calculate the median reduction in blood pressure and construct an approximate 95% CI using a bootstrap approach
f) Report your conclusions about the intervention’s effectiveness
```r
# Your code here
```
Exercise N.3: Chi-Square Goodness-of-Fit
A genetics experiment produces offspring with four phenotypes. According to Mendelian theory, these should occur in a 9:3:3:1 ratio. You observe the following counts:
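```r
observed <- c(315, 108, 101, 32)  # Phenotypes: AB, Ab, aB, ab
```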
a) Calculate the expected counts based on the 9:3:3:1 ratio
b) Perform a chi-square goodness-of-fit test
c) Visualize the observed vs. expected counts with a barplot
d) Calculate the contribution of each category to the total chi-square statistic
e) Do the data support the theoretical prediction? Explain your reasoning
```r
# Your code here
```
Exercise N.4: Chi-Square Test of Independence
A clinical trial tests a new treatment for bacterial infections. Results are classified by treatment group and outcome:
|           | Cured | Not Cured |
|:----------|:------|:----------|
| Treatment | 48    | 12        |
| Control   | 32    | 28        |
a) Create this 2×2 table in R
b) Perform a chi-square test of independence
c) Calculate and interpret the odds ratio
d) Use Fisher’s exact test and compare the p-value to the chi-square test
e) Check the expected cell counts—is the chi-square approximation appropriate?
f) Report your conclusions about treatment effectiveness
```r
# Your code here
```
Exercise N.5: Choosing the Right Test
For each scenario below, indicate which test is most appropriate (Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis, chi-square, Fisher’s exact, or McNemar’s test) and explain why:
a) Comparing survival times of patients on three different chemotherapy regimens (n = 15 per group, data are heavily right-skewed)
b) Testing whether a coin is fair after observing 65 heads in 100 flips
c) Comparing patient satisfaction ratings (on a scale of 1-5) before and after a hospital redesign
d) Testing whether smoking status (yes/no) is associated with lung disease (yes/no) in a sample of 30 patients
e) Comparing memory test scores in four age groups (n = 50 per group, scores range 0-100 but are not normally distributed)
f) Analyzing a before/after survey where respondents indicate support (yes/no) for a policy
For two of these scenarios, write complete R code to perform the analysis with simulated or provided data.
```r
# Your code here
```
18.10 Additional Resources
- Logan (2010) - Comprehensive coverage of nonparametric and categorical data analysis
- Crawley (2007) - Detailed treatment of chi-square and contingency table methods in R
Crawley, Michael J. 2007. The R Book. John Wiley & Sons.
Logan, Murray. 2010. Biostatistical Design and Analysis Using R. Wiley-Blackwell.
Mann, Henry B., and Donald R. Whitney. 1947. “On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other.” The Annals of Mathematical Statistics 18 (1): 50–60.
Wilcoxon, Frank. 1945. “Individual Comparisons by Ranking Methods.” Biometrics Bulletin 1 (6): 80–83.