50 Common Probability Distributions

This appendix provides a reference for probability distributions commonly encountered in biological and bioengineering applications. Each distribution is characterized by its probability function, parameters, mean, variance, and typical applications.

50.1 Discrete Distributions

Discrete distributions describe random variables that take on countable values (integers).

Bernoulli Distribution

The Bernoulli distribution describes a single trial with two possible outcomes (success/failure).

Probability mass function: \[P(X = k) = p^k(1-p)^{1-k}, \quad k \in \{0, 1\}\]

Parameters: $p$ = probability of success (0 ≤ p ≤ 1)

Mean: $E[X] = p$

Variance: $\text{Var}(X) = p(1-p)$

Applications:

Single coin flip
Whether a patient responds to treatment
Whether an allele is inherited

Code

# Bernoulli with different p values
p_vals <- c(0.2, 0.5, 0.8)
data.frame(
  outcome = rep(c("Failure", "Success"), 3),
  p = rep(p_vals, each = 2),
  prob = c(1-p_vals[1], p_vals[1], 1-p_vals[2], p_vals[2], 1-p_vals[3], p_vals[3])
) %>%
  ggplot(aes(x = outcome, y = prob, fill = factor(p))) +
  geom_col(position = "dodge") +
  labs(title = "Bernoulli Distribution", y = "Probability", fill = "p")

Figure 50.1: Bernoulli distribution showing probability of success and failure for different values of p

Sampling Process Example

Consider recording whether a neuron fires in response to a stimulus. Each trial represents exposing the neuron to a brief stimulus pulse and observing whether it fires (success = 1) or not (failure = 0). If the neuron has a baseline firing probability of $p = 0.3$ in response to the stimulus, each observation is a Bernoulli trial.

Code

# Simulate 10 neuron stimulation trials
set.seed(123)
p_fire <- 0.3
n_trials <- 10

# Sample from Bernoulli distribution
neuron_responses <- rbinom(n_trials, size = 1, prob = p_fire)
cat("Neuron firing pattern (1=fire, 0=no fire):", neuron_responses, "\n")

Neuron firing pattern (1=fire, 0=no fire): 0 1 0 1 1 0 0 1 0 0

Code

cat("Total fires:", sum(neuron_responses), "out of", n_trials, "trials")

Total fires: 4 out of 10 trials

Binomial Distribution

The binomial distribution describes the number of successes in $n$ independent Bernoulli trials.

Probability mass function: \[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

Parameters:

$n$ = number of trials
$p$ = probability of success on each trial

Mean: $E[X] = np$

Variance: $\text{Var}(X) = np(1-p)$

Applications:

Number of heads in $n$ coin flips
Number of mutant alleles in offspring
Number of patients responding to treatment out of $n$ treated

Code

# Binomial distributions with n=20
par(mfrow = c(1, 3))
for (p in c(0.2, 0.5, 0.8)) {
  x <- 0:20
  barplot(dbinom(x, 20, p), names.arg = x,
          main = paste("Binomial(n=20, p=", p, ")"),
          xlab = "k", ylab = "P(X=k)", col = "steelblue")
}

Figure 50.2: Binomial distribution with n=20 trials for different probabilities of success

R functions: dbinom(), pbinom(), qbinom(), rbinom()

Sampling Process Example

A plant biologist plants 100 seeds of a particular species and wants to know how many will germinate. Based on previous experiments, each seed has a probability $p = 0.75$ of germinating independently. The number of seeds that germinate follows a binomial distribution with $n = 100$ and $p = 0.75$.

Code

# Simulate seed germination experiment
set.seed(456)
n_seeds <- 100
p_germinate <- 0.75

# Sample from binomial distribution
germinated_seeds <- rbinom(1, size = n_seeds, prob = p_germinate)
cat("Number of seeds germinated:", germinated_seeds, "out of", n_seeds, "\n")

Number of seeds germinated: 81 out of 100

Code

# Simulate multiple experiments
n_experiments <- 1000
germination_counts <- rbinom(n_experiments, size = n_seeds, prob = p_germinate)
cat("Mean across", n_experiments, "experiments:", mean(germination_counts), "\n")

Mean across 1000 experiments: 74.723

Code

cat("Expected value (np):", n_seeds * p_germinate)

Expected value (np): 75

Geometric Distribution

The geometric distribution describes the number of trials needed to get the first success.

Probability mass function: \[P(X = k) = (1-p)^{k-1}p, \quad k = 1, 2, 3, \ldots\]

Parameters: $p$ = probability of success

Mean: $E[X] = \frac{1}{p}$

Variance: $\text{Var}(X) = \frac{1-p}{p^2}$

Applications:

Number of trials until first success
Time until extinction of an endangered population
Number of reads until finding a specific sequence

Code

# Geometric distribution example
# If extinction probability is 0.1 per year, expected time to extinction?
p <- 0.1
x <- 1:50
plot(x, dgeom(x-1, p), type = "h", lwd = 2, col = "steelblue",
     xlab = "Years until extinction", ylab = "Probability",
     main = paste("Geometric Distribution (p =", p, ")\nMean =", 1/p, "years"))

Figure 50.3: Geometric distribution showing the probability of years until extinction when p = 0.1

R functions: dgeom(), pgeom(), qgeom(), rgeom()

Sampling Process Example

In a PCR amplification experiment, a researcher is attempting to amplify a rare target sequence. Each PCR cycle has a probability $p = 0.15$ of successfully amplifying the target to detectable levels. The number of cycles until the first successful amplification follows a geometric distribution.

Code

# Simulate PCR amplification trials
set.seed(789)
p_success <- 0.15

# Sample number of cycles until first success
cycles_to_success <- rgeom(1, prob = p_success) + 1  # +1 because rgeom gives failures before success
cat("Cycles until first successful amplification:", cycles_to_success, "\n")

Cycles until first successful amplification: 1

Code

# Simulate multiple PCR experiments
n_experiments <- 1000
cycles_distribution <- rgeom(n_experiments, prob = p_success) + 1
cat("Mean cycles across", n_experiments, "experiments:", mean(cycles_distribution), "\n")

Mean cycles across 1000 experiments: 6.428

Code

cat("Expected value (1/p):", 1/p_success)

Expected value (1/p): 6.666667

Negative Binomial Distribution

The negative binomial extends the geometric distribution to describe the number of trials needed to achieve $r$ successes.

Probability mass function: \[P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \quad k = r, r+1, r+2, \ldots\]

Parameters:

$r$ = number of successes needed
$p$ = probability of success

Mean: $E[X] = \frac{r}{p}$

Variance: $\text{Var}(X) = \frac{r(1-p)}{p^2}$

Applications:

Overdispersed count data (variance > mean)
RNA-seq count modeling
Number of trials until $r$ successes

Code

# Negative binomial example
# Predator must capture 10 prey before reproduction
r <- 10
p <- 0.1
x <- r:100
plot(x, dnbinom(x - r, size = r, prob = p), type = "h", lwd = 2, col = "steelblue",
     xlab = "Days until reproduction", ylab = "Probability",
     main = paste("Negative Binomial (r=10, p=0.1)\nMean =", r/p, "days"))

Figure 50.4: Negative binomial distribution showing days until a predator captures 10 prey (r=10, p=0.1)

R functions: dnbinom(), pnbinom(), qnbinom(), rnbinom()

Sampling Process Example

A genomics researcher is sequencing DNA to identify rare somatic mutations. Each sequencing read has a probability $p = 0.05$ of covering a region containing a mutation of interest. The researcher needs to find $r = 10$ such mutations to have sufficient statistical power. The number of reads required follows a negative binomial distribution.

Code

# Simulate sequencing until finding r mutations
set.seed(101)
r_mutations <- 10
p_mutation <- 0.05

# Sample number of reads until r mutations found
# rnbinom gives number of failures before r successes
reads_needed <- rnbinom(1, size = r_mutations, prob = p_mutation) + r_mutations
cat("Reads needed to find", r_mutations, "mutations:", reads_needed, "\n")

Reads needed to find 10 mutations: 177

Code

# Simulate multiple sequencing experiments
n_simulations <- 1000
reads_distribution <- rnbinom(n_simulations, size = r_mutations, prob = p_mutation) + r_mutations
cat("Mean reads across", n_simulations, "simulations:", mean(reads_distribution), "\n")

Mean reads across 1000 simulations: 199.935

Code

cat("Expected value (r/p):", r_mutations/p_mutation)

Expected value (r/p): 200

Poisson Distribution

The Poisson distribution describes the number of events occurring in a fixed interval when events occur independently at a constant average rate.

Probability mass function: \[P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots\]

Parameters: $\lambda$ = rate parameter (expected count)

Mean: $E[X] = \lambda$

Variance: $\text{Var}(X) = \lambda$

Note: For a Poisson distribution, mean equals variance. When observed variance exceeds the mean, the data are called “overdispersed.”

Applications:

Number of mutations per gene
Number of cells in a microscope field
Number of organisms per sample area
Rare event counts in genomics

Code

# Poisson distributions with different lambda values
par(mfrow = c(1, 3))
for (lambda in c(1, 5, 15)) {
  x <- 0:30
  barplot(dpois(x, lambda), names.arg = x,
          main = paste("Poisson(λ =", lambda, ")"),
          xlab = "k", ylab = "P(X=k)", col = "steelblue")
}

Figure 50.5: Poisson distribution for different rate parameters (λ = 1, 5, and 15)

R functions: dpois(), ppois(), qpois(), rpois()

Sampling Process Example

In evolutionary genomics, mutations occur randomly along a gene sequence. Consider a gene of fixed length where mutations occur at an average rate of $\lambda = 3.5$ mutations per gene per million years. The number of mutations observed in a particular gene follows a Poisson distribution.

Code

# Simulate mutation counts in genes
set.seed(202)
lambda_mutations <- 3.5

# Sample mutation count for a single gene
mutations_in_gene <- rpois(1, lambda = lambda_mutations)
cat("Mutations observed in one gene:", mutations_in_gene, "\n")

Mutations observed in one gene: 1

Code

# Simulate mutation counts across multiple genes
n_genes <- 100
mutation_counts <- rpois(n_genes, lambda = lambda_mutations)
cat("Mean mutations across", n_genes, "genes:", mean(mutation_counts), "\n")

Mean mutations across 100 genes: 3.49

Code

cat("Variance across", n_genes, "genes:", var(mutation_counts), "\n")

Variance across 100 genes: 3.141313

Code

cat("Expected value (λ):", lambda_mutations, "\n")

Expected value (λ): 3.5

Code

cat("Note: For Poisson, mean ≈ variance")

Note: For Poisson, mean ≈ variance

50.2 Continuous Distributions

Continuous distributions describe random variables that can take any value in some range.

Uniform Distribution

The uniform distribution assigns equal probability to all values in a specified range.

Probability density function: \[f(x) = \frac{1}{b-a}, \quad a \leq x \leq b\]

Parameters:

$a$ = minimum value
$b$ = maximum value

Mean: $E[X] = \frac{a+b}{2}$

Variance: $\text{Var}(X) = \frac{(b-a)^2}{12}$

Applications:

Random number generation
Uninformative priors in Bayesian analysis
Equal probability among alternatives

Code

x <- seq(-1, 6, length.out = 200)
plot(x, dunif(x, min = 0, max = 5), type = "l", lwd = 2, col = "steelblue",
     xlab = "x", ylab = "f(x)", main = "Uniform Distribution (a=0, b=5)",
     ylim = c(0, 0.3))
polygon(c(0, 0, 5, 5), c(0, 0.2, 0.2, 0), col = rgb(0, 0, 1, 0.3), border = NA)

Figure 50.6: Uniform distribution with equal probability between a=0 and b=5

R functions: dunif(), punif(), qunif(), runif()

Sampling Process Example

In a cell culture experiment, cells are synchronized to divide during a specific 60-minute window. Within this window, the exact timing of any individual cell’s division is equally likely at any moment. If we measure the time (in minutes from the start of the window) when each cell divides, these times follow a uniform distribution between $a = 0$ and $b = 60$ minutes.

Code

# Simulate cell division times within a synchronized window
set.seed(303)
window_start <- 0
window_end <- 60

# Sample division times for 20 cells
n_cells <- 20
division_times <- runif(n_cells, min = window_start, max = window_end)
cat("Division times (minutes):", round(division_times, 1), "\n")

Division times (minutes): 18.6 50.1 36.6 44.4 39 58.8 31.6 28.1 16.9 16.7 54.3 52.4 6.2 2.2 14 12.8 24.5 14 43.7 39.3

Code

cat("Mean division time:", round(mean(division_times), 1), "minutes\n")

Mean division time: 30.2 minutes

Code

cat("Expected mean:", (window_start + window_end)/2, "minutes")

Expected mean: 30 minutes

Exponential Distribution

The exponential distribution describes the time between events in a Poisson process.

Probability density function: \[f(x) = \lambda e^{-\lambda x}, \quad x \geq 0\]

Parameters: $\lambda$ = rate parameter

Mean: $E[X] = \frac{1}{\lambda}$

Variance: $\text{Var}(X) = \frac{1}{\lambda^2}$

Applications:

Time until next event (radioactive decay, mutations)
Lifespan distributions (constant hazard rate)
Waiting times

Code

# Exponential with different rate parameters
x <- seq(0, 5, length.out = 200)
plot(x, dexp(x, rate = 0.5), type = "l", lwd = 2, col = "blue",
     xlab = "x", ylab = "f(x)", main = "Exponential Distribution",
     ylim = c(0, 2))
lines(x, dexp(x, rate = 1), lwd = 2, col = "red")
lines(x, dexp(x, rate = 2), lwd = 2, col = "darkgreen")
legend("topright", c("λ = 0.5", "λ = 1", "λ = 2"),
       col = c("blue", "red", "darkgreen"), lwd = 2)

Figure 50.7: Exponential distribution for different rate parameters (λ = 0.5, 1, and 2)

R functions: dexp(), pexp(), qexp(), rexp()

Sampling Process Example

Bacterial cells divide asynchronously in exponential growth phase. If cells divide at a constant average rate of $\lambda = 0.5$ divisions per hour (one division every 2 hours on average), the waiting time between successive cell divisions follows an exponential distribution.

Code

# Simulate times between cell divisions
set.seed(404)
lambda_rate <- 0.5  # divisions per hour

# Sample inter-division times for 10 successive divisions
n_divisions <- 10
inter_division_times <- rexp(n_divisions, rate = lambda_rate)
cat("Time between divisions (hours):", round(inter_division_times, 2), "\n")

Time between divisions (hours): 1.12 0.21 0.1 0.72 2.39 2.13 1.83 8.49 0.31 0.56

Code

cat("Mean inter-division time:", round(mean(inter_division_times), 2), "hours\n")

Mean inter-division time: 1.79 hours

Code

cat("Expected mean (1/λ):", 1/lambda_rate, "hours\n")

Expected mean (1/λ): 2 hours

Code

# Total time for all divisions
cat("Total time for", n_divisions, "divisions:", round(sum(inter_division_times), 2), "hours")

Total time for 10 divisions: 17.87 hours

Gamma Distribution

The gamma distribution generalizes the exponential distribution to describe waiting time until the $r$th event.

Probability density function: \[f(x) = \frac{\lambda^r}{\Gamma(r)} x^{r-1} e^{-\lambda x}, \quad x \geq 0\]

Parameters:

$r$ = shape parameter (number of events)
$\lambda$ = rate parameter

Mean: $E[X] = \frac{r}{\lambda}$

Variance: $\text{Var}(X) = \frac{r}{\lambda^2}$

Applications:

Time for multiple events to occur
Duration of processes with multiple stages
Prior distributions in Bayesian analysis

Code

# Gamma distribution with different shape parameters
x <- seq(0, 15, length.out = 200)
plot(x, dgamma(x, shape = 1, rate = 0.5), type = "l", lwd = 2, col = "blue",
     xlab = "x", ylab = "f(x)", main = "Gamma Distribution (rate = 0.5)",
     ylim = c(0, 0.5))
lines(x, dgamma(x, shape = 2, rate = 0.5), lwd = 2, col = "red")
lines(x, dgamma(x, shape = 5, rate = 0.5), lwd = 2, col = "darkgreen")
legend("topright", c("shape = 1", "shape = 2", "shape = 5"),
       col = c("blue", "red", "darkgreen"), lwd = 2)

Figure 50.8: Gamma distribution with rate = 0.5 for different shape parameters (1, 2, and 5)

R functions: dgamma(), pgamma(), qgamma(), rgamma()

Sampling Process Example

In developmental biology, a cell must complete $r = 5$ distinct checkpoints before differentiating into a specialized cell type. Each checkpoint is passed at an average rate of $\lambda = 2$ checkpoints per day. The total time until the cell completes all 5 checkpoints and differentiates follows a gamma distribution.

Code

# Simulate time to complete multiple checkpoints
set.seed(505)
r_checkpoints <- 5
lambda_rate <- 2  # checkpoints per day

# Sample differentiation times for 15 cells
n_cells <- 15
differentiation_times <- rgamma(n_cells, shape = r_checkpoints, rate = lambda_rate)
cat("Time to differentiation (days):", round(differentiation_times, 2), "\n")

Time to differentiation (days): 1.22 1.06 1.37 1.74 1.39 2.26 3.64 3.09 2.42 2.24 2.04 4.06 1.8 3.93 3.21

Code

cat("Mean differentiation time:", round(mean(differentiation_times), 2), "days\n")

Mean differentiation time: 2.36 days

Code

cat("Expected mean (r/λ):", r_checkpoints/lambda_rate, "days")

Expected mean (r/λ): 2.5 days

Normal (Gaussian) Distribution

The normal distribution is the most important continuous distribution in statistics. It arises naturally when many independent factors contribute additively to an outcome.

Probability density function: \[f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]

Parameters:

$\mu$ = mean (location)
$\sigma$ = standard deviation (scale)

Mean: $E[X] = \mu$

Variance: $\text{Var}(X) = \sigma^2$

Notation: $X \sim \mathcal{N}(\mu, \sigma^2)$

Applications:

Heights, weights, and other biological measurements
Measurement errors
Sampling distributions (via Central Limit Theorem)
Basis for many statistical tests

Code

x <- seq(-4, 8, length.out = 200)
plot(x, dnorm(x, mean = 2, sd = 1), type = "l", lwd = 2, col = "blue",
     xlab = "x", ylab = "f(x)", main = "Normal Distribution",
     ylim = c(0, 0.45))
lines(x, dnorm(x, mean = 2, sd = 0.5), lwd = 2, col = "red")
lines(x, dnorm(x, mean = 2, sd = 2), lwd = 2, col = "darkgreen")
legend("topright", c("μ=2, σ=1", "μ=2, σ=0.5", "μ=2, σ=2"),
       col = c("blue", "red", "darkgreen"), lwd = 2)

Figure 50.9: Normal distribution with mean μ=2 for different standard deviations (σ = 0.5, 1, and 2)

Key properties:

68% of values fall within 1 SD of the mean
95% of values fall within 2 SDs of the mean
99.7% of values fall within 3 SDs of the mean

R functions: dnorm(), pnorm(), qnorm(), rnorm()

Sampling Process Example

Human height is influenced by many genetic and environmental factors that contribute additively. In a population of adult males, height follows approximately a normal distribution with mean $\mu = 175$ cm and standard deviation $\sigma = 7$ cm. When we measure heights of individuals, we are sampling from this distribution.

Code

# Simulate measuring heights in a population
set.seed(606)
mu_height <- 175  # cm
sigma_height <- 7  # cm

# Sample heights of 30 individuals
n_individuals <- 30
heights <- rnorm(n_individuals, mean = mu_height, sd = sigma_height)
cat("Sample heights (cm):", round(heights, 1), "\n")

Sample heights (cm): 175.8 168.5 166.2 174 171 172.9 175.2 176.9 181.3 183.9 169.3 173.5 161.7 176.2 183.6 169.5 187.2 187.1 182.9 176.9 172 164.3 192.5 167.8 176.9 170.5 187.4 171.4 177.8 177.5

Code

cat("Sample mean:", round(mean(heights), 1), "cm\n")

Sample mean: 175.7 cm

Code

cat("Sample SD:", round(sd(heights), 1), "cm\n")

Sample SD: 7.5 cm

Code

cat("Population parameters: μ =", mu_height, "cm, σ =", sigma_height, "cm")

Population parameters: μ = 175 cm, σ = 7 cm

Standard Normal Distribution

The standard normal distribution is a normal distribution with $\mu = 0$ and $\sigma = 1$.

Any normal variable $X \sim \mathcal{N}(\mu, \sigma^2)$ can be converted to a standard normal $Z$ using:

\[Z = \frac{X - \mu}{\sigma}\]

This z-score tells us how many standard deviations a value is from the mean.

Code

x <- seq(-4, 4, length.out = 200)
y <- dnorm(x)

plot(x, y, type = "l", lwd = 2,
     xlab = "z", ylab = "f(z)",
     main = "Standard Normal Distribution")

# Shade areas
x_1sd <- seq(-1, 1, length.out = 100)
polygon(c(-1, x_1sd, 1), c(0, dnorm(x_1sd), 0),
        col = rgb(0, 0, 1, 0.3), border = NA)

x_2sd <- c(seq(-2, -1, length.out = 50), seq(1, 2, length.out = 50))
for (region in list(c(-2, -1), c(1, 2))) {
  xx <- seq(region[1], region[2], length.out = 50)
  polygon(c(region[1], xx, region[2]), c(0, dnorm(xx), 0),
          col = rgb(0, 1, 0, 0.3), border = NA)
}

legend("topright", c("68% within 1σ", "95% within 2σ"),
       fill = c(rgb(0, 0, 1, 0.3), rgb(0, 1, 0, 0.3)))

Figure 50.10: Standard normal distribution showing 68% of values within 1 standard deviation and 95% within 2 standard deviations

Log-Normal Distribution

If $\ln(X)$ follows a normal distribution, then $X$ follows a log-normal distribution. This arises when effects are multiplicative rather than additive.

Probability density function: \[f(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}, \quad x > 0\]

Parameters:

$\mu$ = mean of the log-transformed variable
$\sigma$ = SD of the log-transformed variable

Mean: $E[X] = e^{\mu + \sigma^2/2}$

Variance: $\text{Var}(X) = (e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$

Applications:

Gene expression levels
Organism sizes
Concentrations
Income distributions

Code

par(mfrow = c(1, 2))

# Log-normal data
set.seed(42)
x <- rlnorm(1000, meanlog = 1, sdlog = 0.5)

hist(x, breaks = 30, main = "Log-Normal Data", col = "lightblue",
     xlab = "x", probability = TRUE)
curve(dlnorm(x, meanlog = 1, sdlog = 0.5), add = TRUE, col = "red", lwd = 2)

# After log transformation
hist(log(x), breaks = 30, main = "After Log Transform", col = "lightblue",
     xlab = "log(x)", probability = TRUE)
curve(dnorm(x, mean = 1, sd = 0.5), add = TRUE, col = "red", lwd = 2)

Figure 50.11: Log-normal distribution (left) and its transformation to normal distribution after log transformation (right)

R functions: dlnorm(), plnorm(), qlnorm(), rlnorm()

Sampling Process Example

Gene expression levels often follow a log-normal distribution because they result from multiplicative processes (transcription, translation, and degradation rates). When measuring mRNA levels of a particular gene across cells, if the log-transformed expression follows $\mathcal{N}(\mu = 2, \sigma = 0.6)$, the raw expression levels follow a log-normal distribution.

Code

# Simulate gene expression measurements
set.seed(707)
mu_log <- 2
sigma_log <- 0.6

# Sample expression levels from 50 cells
n_cells <- 50
expression_levels <- rlnorm(n_cells, meanlog = mu_log, sdlog = sigma_log)
cat("Expression levels (arbitrary units):", round(expression_levels, 2), "\n")

Expression levels (arbitrary units): 3.42 5.1 11.24 14.64 4.51 3.97 25.53 9.63 3.63 21.63 9.5 14.48 8.16 4.31 4.81 3.68 18.24 9.31 7.71 13.99 5.13 4.18 7.56 7.94 11.77 28.58 6.69 16.92 5.49 9.77 10.61 15.75 12.96 5.43 7.88 11.54 6.1 6.11 8.73 7.17 3.86 10.07 8.18 9.41 4.12 6.45 25.62 8.32 13.5 5.49

Code

cat("Mean expression:", round(mean(expression_levels), 2), "\n")

Mean expression: 9.78

Code

cat("Median expression:", round(median(expression_levels), 2), "\n")

Median expression: 8.17

Code

# Log-transformed data should be approximately normal
log_expression <- log(expression_levels)
cat("Mean of log-expression:", round(mean(log_expression), 2), "\n")

Mean of log-expression: 2.12

Code

cat("SD of log-expression:", round(sd(log_expression), 2))

SD of log-expression: 0.55

Chi-Square Distribution

The chi-square distribution is the distribution of a sum of squared standard normal variables. It is fundamental to many statistical tests.

Parameters: $k$ = degrees of freedom

Mean: $E[X] = k$

Variance: $\text{Var}(X) = 2k$

Applications:

Goodness-of-fit tests
Tests of independence
Variance estimation

Code

x <- seq(0, 20, length.out = 200)
plot(x, dchisq(x, df = 2), type = "l", lwd = 2, col = "blue",
     xlab = "x", ylab = "f(x)", main = "Chi-Square Distribution",
     ylim = c(0, 0.5))
lines(x, dchisq(x, df = 5), lwd = 2, col = "red")
lines(x, dchisq(x, df = 10), lwd = 2, col = "darkgreen")
legend("topright", c("df = 2", "df = 5", "df = 10"),
       col = c("blue", "red", "darkgreen"), lwd = 2)

Figure 50.12: Chi-square distribution for different degrees of freedom (df = 2, 5, and 10)

R functions: dchisq(), pchisq(), qchisq(), rchisq()

Sampling Process Example

In a quality control study, a bioengineering lab measures the precision of a fabrication process by taking repeated measurements of device dimensions. The sample variance $s^2$ from $n = 10$ measurements is related to the chi-square distribution. Specifically, $(n-1)s^2/\sigma^2$ follows a chi-square distribution with $k = n-1 = 9$ degrees of freedom.

Code

# Simulate sampling process to understand variance distribution
set.seed(808)
n_measurements <- 10
true_sigma <- 2.5  # true population standard deviation

# Generate many samples and compute their variances
n_samples <- 1000
sample_variances <- replicate(n_samples, {
  measurements <- rnorm(n_measurements, mean = 100, sd = true_sigma)
  var(measurements)
})

# The scaled variances follow chi-square
scaled_variances <- (n_measurements - 1) * sample_variances / true_sigma^2
cat("Sample variance from one experiment:", round(sample_variances[1], 2), "\n")

Sample variance from one experiment: 11.59

Code

cat("Mean of", n_samples, "sample variances:", round(mean(sample_variances), 2), "\n")

Mean of 1000 sample variances: 6.21

Code

cat("True variance (σ²):", true_sigma^2, "\n")

True variance (σ²): 6.25

Code

cat("Mean of scaled variances:", round(mean(scaled_variances), 2), "\n")

Mean of scaled variances: 8.94

Code

cat("Expected (df = ", n_measurements - 1, "):", n_measurements - 1)

Expected (df =  9 ): 9

Student’s t Distribution

The t-distribution arises when estimating the mean of a normally distributed population when the sample size is small and the population standard deviation is unknown.

Parameters: $\nu$ = degrees of freedom

As $\nu \to \infty$, the t-distribution approaches the standard normal.

Applications:

t-tests
Confidence intervals for means
Robust regression

Code

x <- seq(-4, 4, length.out = 200)
plot(x, dnorm(x), type = "l", lwd = 2, col = "black", lty = 2,
     xlab = "x", ylab = "f(x)", main = "t-Distribution vs Normal")
lines(x, dt(x, df = 2), lwd = 2, col = "blue")
lines(x, dt(x, df = 5), lwd = 2, col = "red")
lines(x, dt(x, df = 30), lwd = 2, col = "darkgreen")
legend("topright", c("Normal", "t (df=2)", "t (df=5)", "t (df=30)"),
       col = c("black", "blue", "red", "darkgreen"), lwd = 2, lty = c(2, 1, 1, 1))

Figure 50.13: Student’s t-distribution compared to normal distribution for different degrees of freedom

R functions: dt(), pt(), qt(), rt()

Sampling Process Example

A pharmacologist is testing the effect of a new drug on blood pressure. She samples $n = 8$ patients and measures their blood pressure after treatment. The population mean $\mu$ and standard deviation $\sigma$ are unknown. To test whether the mean differs from a reference value, she computes a t-statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, which follows a t-distribution with $\nu = n-1 = 7$ degrees of freedom under the null hypothesis.

Code

# Simulate t-statistic sampling distribution
set.seed(909)
n_patients <- 8
true_mu <- 120  # true mean blood pressure
true_sigma <- 12  # true SD
mu_0 <- 120  # null hypothesis value (no change)

# Simulate many experiments
n_experiments <- 1000
t_statistics <- replicate(n_experiments, {
  blood_pressures <- rnorm(n_patients, mean = true_mu, sd = true_sigma)
  sample_mean <- mean(blood_pressures)
  sample_sd <- sd(blood_pressures)
  # Compute t-statistic
  (sample_mean - mu_0) / (sample_sd / sqrt(n_patients))
})

cat("First 10 t-statistics:", round(t_statistics[1:10], 2), "\n")

First 10 t-statistics: -1.22 2.32 2.34 0.1 -2.1 -0.81 1 -2.88 0.9 -0.91

Code

cat("Mean of t-statistics:", round(mean(t_statistics), 2), "\n")

Mean of t-statistics: 0.02

Code

cat("SD of t-statistics:", round(sd(t_statistics), 2), "\n")

SD of t-statistics: 1.19

Code

cat("Expected SD for t with df=7:", round(sqrt(7/(7-2)), 2))

Expected SD for t with df=7: 1.18

F Distribution

The F-distribution is the ratio of two chi-square distributions divided by their degrees of freedom. It is used in ANOVA and comparing variances.

Parameters:

$d_1$ = numerator degrees of freedom
$d_2$ = denominator degrees of freedom

Applications:

ANOVA F-tests
Comparing variances
Regression significance tests

Code

x <- seq(0, 5, length.out = 200)
plot(x, df(x, df1 = 5, df2 = 20), type = "l", lwd = 2, col = "blue",
     xlab = "x", ylab = "f(x)", main = "F Distribution",
     ylim = c(0, 1))
lines(x, df(x, df1 = 10, df2 = 20), lwd = 2, col = "red")
lines(x, df(x, df1 = 20, df2 = 20), lwd = 2, col = "darkgreen")
legend("topright", c("F(5,20)", "F(10,20)", "F(20,20)"),
       col = c("blue", "red", "darkgreen"), lwd = 2)

Figure 50.14: F distribution for different numerator and denominator degrees of freedom

R functions: df(), pf(), qf(), rf()

Sampling Process Example

An ecologist is comparing the variability in body mass between two populations of birds. Population 1 has $n_1 = 12$ individuals and population 2 has $n_2 = 15$ individuals. To test whether the two populations have equal variances, she computes the F-statistic: $F = \frac{s_1^2}{s_2^2}$, which follows an F-distribution with $d_1 = n_1 - 1 = 11$ and $d_2 = n_2 - 1 = 14$ degrees of freedom when the true variances are equal.

Code

# Simulate F-statistic sampling distribution
set.seed(1010)
n1 <- 12
n2 <- 15
true_sigma1 <- 5  # true SD for population 1
true_sigma2 <- 5  # true SD for population 2 (equal variances)

# Simulate many paired samples
n_comparisons <- 1000
F_statistics <- replicate(n_comparisons, {
  pop1_masses <- rnorm(n1, mean = 100, sd = true_sigma1)
  pop2_masses <- rnorm(n2, mean = 100, sd = true_sigma2)
  var1 <- var(pop1_masses)
  var2 <- var(pop2_masses)
  # Compute F-statistic
  var1 / var2
})

cat("First 10 F-statistics:", round(F_statistics[1:10], 2), "\n")

First 10 F-statistics: 2.64 1.26 1.01 1.55 0.55 0.46 0.88 1.67 2.88 1.16

Code

cat("Mean of F-statistics:", round(mean(F_statistics), 2), "\n")

Mean of F-statistics: 1.14

Code

cat("Expected mean for F(11,14):", round(14/(14-2), 2))

Expected mean for F(11,14): 1.17

Beta Distribution

The beta distribution is defined on the interval [0, 1] and is useful for modeling proportions and probabilities.

Parameters:

$\alpha$ = shape parameter 1
$\beta$ = shape parameter 2

Mean: $E[X] = \frac{\alpha}{\alpha + \beta}$

Variance: $\text{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$

Applications:

Prior distributions for probabilities (Bayesian analysis)
Modeling proportions
Allele frequencies

Code

x <- seq(0, 1, length.out = 200)
plot(x, dbeta(x, 2, 5), type = "l", lwd = 2, col = "blue",
     xlab = "x", ylab = "f(x)", main = "Beta Distribution",
     ylim = c(0, 3))
lines(x, dbeta(x, 5, 2), lwd = 2, col = "red")
lines(x, dbeta(x, 2, 2), lwd = 2, col = "darkgreen")
lines(x, dbeta(x, 0.5, 0.5), lwd = 2, col = "purple")
legend("top", c("Beta(2,5)", "Beta(5,2)", "Beta(2,2)", "Beta(0.5,0.5)"),
       col = c("blue", "red", "darkgreen", "purple"), lwd = 2)

Figure 50.15: Beta distribution for different shape parameters showing various distribution shapes

R functions: dbeta(), pbeta(), qbeta(), rbeta()

Sampling Process Example

In population genetics, a researcher is estimating the frequency of a particular allele in a population. Based on prior studies, she has some information about the likely allele frequency, which can be encoded as a beta prior distribution. For example, if previous evidence suggests the allele is moderately common but variable, she might use a Beta(5, 3) distribution to represent her prior beliefs about the allele frequency $p$ before collecting new data.

Code

# Simulate prior distributions for allele frequencies
set.seed(1111)
alpha_param <- 5
beta_param <- 3

# Sample possible allele frequencies from prior
n_simulations <- 1000
prior_frequencies <- rbeta(n_simulations, shape1 = alpha_param, shape2 = beta_param)
cat("Sample of prior allele frequencies:", round(prior_frequencies[1:10], 3), "\n")

Sample of prior allele frequencies: 0.642 0.337 0.371 0.602 0.484 0.384 0.322 0.796 0.873 0.905

Code

cat("Mean prior frequency:", round(mean(prior_frequencies), 3), "\n")

Mean prior frequency: 0.628

Code

cat("Expected mean (α/(α+β)):", round(alpha_param/(alpha_param + beta_param), 3), "\n")

Expected mean (α/(α+β)): 0.625

Code

# In practice, these could be updated with observed data using Bayesian inference
# For example: observing 15 allele A and 5 allele B would update to Beta(5+15, 3+5)

50.3 Summary Table

Distribution	Type	Parameters	Mean	Variance	R prefix
Bernoulli	Discrete	p	p	p(1-p)	binom (n=1)
Binomial	Discrete	n, p	np	np(1-p)	binom
Geometric	Discrete	p	1/p	(1-p)/p²	geom
Neg. Binomial	Discrete	r, p	r/p	r(1-p)/p²	nbinom
Poisson	Discrete	λ	λ	λ	pois
Uniform	Continuous	a, b	(a+b)/2	(b-a)²/12	unif
Exponential	Continuous	λ	1/λ	1/λ²	exp
Gamma	Continuous	r, λ	r/λ	r/λ²	gamma
Normal	Continuous	μ, σ	μ	σ²	norm
Log-normal	Continuous	μ, σ	exp(μ+σ²/2)	(exp(σ²)-1)exp(2μ+σ²)	lnorm
Chi-square	Continuous	k	k	2k	chisq
t	Continuous	ν	0 (ν>1)	ν/(ν-2) (ν>2)	t
F	Continuous	d₁, d₂	d₂/(d₂-2)	…	f
Beta	Continuous	α, β	α/(α+β)	…	beta

50.4 R Function Naming Convention

R uses a consistent naming convention for distribution functions:

dxxx - density/mass function (PDF/PMF)
pxxx - cumulative distribution function (CDF)
qxxx - quantile function (inverse CDF)
rxxx - random number generation

For example, for the normal distribution:

Code

# Density at x = 0 for standard normal
dnorm(0)

[1] 0.3989423

Code

# Probability that X ≤ 1.96
pnorm(1.96)

[1] 0.9750021

Code

# Value where P(X ≤ q) = 0.975
qnorm(0.975)

[1] 1.959964

Code

# Generate 5 random normal values
set.seed(42)
rnorm(5)

[1]  1.3709584 -0.5646982  0.3631284  0.6328626  0.4042683

50.5 Choosing the Right Distribution

Figure 50.16: Decision flowchart for choosing the appropriate probability distribution based on data type and characteristics

Guidelines for selection:

Is your variable discrete or continuous?
- Discrete: Binomial, Poisson, Negative Binomial
- Continuous: Normal, Gamma, Beta, etc.
For counts:
- Fixed number of trials with yes/no: Binomial
- Events per unit time/space, rare: Poisson
- Overdispersed counts (variance > mean): Negative Binomial
For continuous measurements:
- Symmetric, unbounded: Normal
- Positive, right-skewed: Log-normal or Gamma
- Bounded between 0 and 1: Beta
- Waiting times: Exponential or Gamma
For testing:
- Means (unknown variance): t-distribution
- Variances: Chi-square or F-distribution

# Common Probability Distributions {#sec-probability-distributions} ```{r} #| echo: false #| message: false library(tidyverse) theme_set(theme_minimal()) ``` This appendix provides a reference for probability distributions commonly encountered in biological and bioengineering applications. Each distribution is characterized by its probability function, parameters, mean, variance, and typical applications. ## Discrete Distributions Discrete distributions describe random variables that take on countable values (integers). ### Bernoulli Distribution The Bernoulli distribution describes a single trial with two possible outcomes (success/failure). **Probability mass function:** $$P(X = k) = p^k(1-p)^{1-k}, \quad k \in \{0, 1\}$$ **Parameters:** $p$ = probability of success (0 ≤ p ≤ 1) **Mean:** $E[X] = p$ **Variance:** $\text{Var}(X) = p(1-p)$ **Applications:** - Single coin flip - Whether a patient responds to treatment - Whether an allele is inherited ```{r} #| label: fig-appendix-bernoulli #| fig-cap: "Bernoulli distribution showing probability of success and failure for different values of p" #| fig-width: 5 #| fig-height: 3 # Bernoulli with different p values p_vals <- c(0.2, 0.5, 0.8) data.frame( outcome = rep(c("Failure", "Success"), 3), p = rep(p_vals, each = 2), prob = c(1-p_vals[1], p_vals[1], 1-p_vals[2], p_vals[2], 1-p_vals[3], p_vals[3]) ) %>% ggplot(aes(x = outcome, y = prob, fill = factor(p))) + geom_col(position = "dodge") + labs(title = "Bernoulli Distribution", y = "Probability", fill = "p") ``` ### Sampling Process Example Consider recording whether a neuron fires in response to a stimulus. Each trial represents exposing the neuron to a brief stimulus pulse and observing whether it fires (success = 1) or not (failure = 0). If the neuron has a baseline firing probability of $p = 0.3$ in response to the stimulus, each observation is a Bernoulli trial. ```{r} # Simulate 10 neuron stimulation trials set.seed(123) p_fire <- 0.3 n_trials <- 10 # Sample from Bernoulli distribution neuron_responses <- rbinom(n_trials, size = 1, prob = p_fire) cat("Neuron firing pattern (1=fire, 0=no fire):", neuron_responses, "\n") cat("Total fires:", sum(neuron_responses), "out of", n_trials, "trials") ``` ### Binomial Distribution The binomial distribution describes the number of successes in $n$ independent Bernoulli trials. **Probability mass function:** $$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$ **Parameters:** - $n$ = number of trials - $p$ = probability of success on each trial **Mean:** $E[X] = np$ **Variance:** $\text{Var}(X) = np(1-p)$ **Applications:** - Number of heads in $n$ coin flips - Number of mutant alleles in offspring - Number of patients responding to treatment out of $n$ treated ```{r} #| label: fig-appendix-binomial #| fig-cap: "Binomial distribution with n=20 trials for different probabilities of success" #| fig-width: 8 #| fig-height: 4 # Binomial distributions with n=20 par(mfrow = c(1, 3)) for (p in c(0.2, 0.5, 0.8)) { x <- 0:20 barplot(dbinom(x, 20, p), names.arg = x, main = paste("Binomial(n=20, p=", p, ")"), xlab = "k", ylab = "P(X=k)", col = "steelblue") } ``` **R functions:** `dbinom()`, `pbinom()`, `qbinom()`, `rbinom()` ### Sampling Process Example A plant biologist plants 100 seeds of a particular species and wants to know how many will germinate. Based on previous experiments, each seed has a probability $p = 0.75$ of germinating independently. The number of seeds that germinate follows a binomial distribution with $n = 100$ and $p = 0.75$. ```{r} # Simulate seed germination experiment set.seed(456) n_seeds <- 100 p_germinate <- 0.75 # Sample from binomial distribution germinated_seeds <- rbinom(1, size = n_seeds, prob = p_germinate) cat("Number of seeds germinated:", germinated_seeds, "out of", n_seeds, "\n") # Simulate multiple experiments n_experiments <- 1000 germination_counts <- rbinom(n_experiments, size = n_seeds, prob = p_germinate) cat("Mean across", n_experiments, "experiments:", mean(germination_counts), "\n") cat("Expected value (np):", n_seeds * p_germinate) ``` ### Geometric Distribution The geometric distribution describes the number of trials needed to get the first success. **Probability mass function:** $$P(X = k) = (1-p)^{k-1}p, \quad k = 1, 2, 3, \ldots$$ **Parameters:** $p$ = probability of success **Mean:** $E[X] = \frac{1}{p}$ **Variance:** $\text{Var}(X) = \frac{1-p}{p^2}$ **Applications:** - Number of trials until first success - Time until extinction of an endangered population - Number of reads until finding a specific sequence ```{r} #| label: fig-appendix-geometric #| fig-cap: "Geometric distribution showing the probability of years until extinction when p = 0.1" #| fig-width: 7 #| fig-height: 4 # Geometric distribution example # If extinction probability is 0.1 per year, expected time to extinction? p <- 0.1 x <- 1:50 plot(x, dgeom(x-1, p), type = "h", lwd = 2, col = "steelblue", xlab = "Years until extinction", ylab = "Probability", main = paste("Geometric Distribution (p =", p, ")\nMean =", 1/p, "years")) ``` **R functions:** `dgeom()`, `pgeom()`, `qgeom()`, `rgeom()` ### Sampling Process Example In a PCR amplification experiment, a researcher is attempting to amplify a rare target sequence. Each PCR cycle has a probability $p = 0.15$ of successfully amplifying the target to detectable levels. The number of cycles until the first successful amplification follows a geometric distribution. ```{r} # Simulate PCR amplification trials set.seed(789) p_success <- 0.15 # Sample number of cycles until first success cycles_to_success <- rgeom(1, prob = p_success) + 1 # +1 because rgeom gives failures before success cat("Cycles until first successful amplification:", cycles_to_success, "\n") # Simulate multiple PCR experiments n_experiments <- 1000 cycles_distribution <- rgeom(n_experiments, prob = p_success) + 1 cat("Mean cycles across", n_experiments, "experiments:", mean(cycles_distribution), "\n") cat("Expected value (1/p):", 1/p_success) ``` ### Negative Binomial Distribution The negative binomial extends the geometric distribution to describe the number of trials needed to achieve $r$ successes. **Probability mass function:** $$P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \quad k = r, r+1, r+2, \ldots$$ **Parameters:** - $r$ = number of successes needed - $p$ = probability of success **Mean:** $E[X] = \frac{r}{p}$ **Variance:** $\text{Var}(X) = \frac{r(1-p)}{p^2}$ **Applications:** - Overdispersed count data (variance > mean) - RNA-seq count modeling - Number of trials until $r$ successes ```{r} #| label: fig-appendix-negbinom #| fig-cap: "Negative binomial distribution showing days until a predator captures 10 prey (r=10, p=0.1)" #| fig-width: 7 #| fig-height: 4 # Negative binomial example # Predator must capture 10 prey before reproduction r <- 10 p <- 0.1 x <- r:100 plot(x, dnbinom(x - r, size = r, prob = p), type = "h", lwd = 2, col = "steelblue", xlab = "Days until reproduction", ylab = "Probability", main = paste("Negative Binomial (r=10, p=0.1)\nMean =", r/p, "days")) ``` **R functions:** `dnbinom()`, `pnbinom()`, `qnbinom()`, `rnbinom()` ### Sampling Process Example A genomics researcher is sequencing DNA to identify rare somatic mutations. Each sequencing read has a probability $p = 0.05$ of covering a region containing a mutation of interest. The researcher needs to find $r = 10$ such mutations to have sufficient statistical power. The number of reads required follows a negative binomial distribution. ```{r} # Simulate sequencing until finding r mutations set.seed(101) r_mutations <- 10 p_mutation <- 0.05 # Sample number of reads until r mutations found # rnbinom gives number of failures before r successes reads_needed <- rnbinom(1, size = r_mutations, prob = p_mutation) + r_mutations cat("Reads needed to find", r_mutations, "mutations:", reads_needed, "\n") # Simulate multiple sequencing experiments n_simulations <- 1000 reads_distribution <- rnbinom(n_simulations, size = r_mutations, prob = p_mutation) + r_mutations cat("Mean reads across", n_simulations, "simulations:", mean(reads_distribution), "\n") cat("Expected value (r/p):", r_mutations/p_mutation) ``` ### Poisson Distribution The Poisson distribution describes the number of events occurring in a fixed interval when events occur independently at a constant average rate. **Probability mass function:** $$P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$$ **Parameters:** $\lambda$ = rate parameter (expected count) **Mean:** $E[X] = \lambda$ **Variance:** $\text{Var}(X) = \lambda$ Note: For a Poisson distribution, mean equals variance. When observed variance exceeds the mean, the data are called "overdispersed." **Applications:** - Number of mutations per gene - Number of cells in a microscope field - Number of organisms per sample area - Rare event counts in genomics ```{r} #| label: fig-appendix-poisson #| fig-cap: "Poisson distribution for different rate parameters (λ = 1, 5, and 15)" #| fig-width: 8 #| fig-height: 4 # Poisson distributions with different lambda values par(mfrow = c(1, 3)) for (lambda in c(1, 5, 15)) { x <- 0:30 barplot(dpois(x, lambda), names.arg = x, main = paste("Poisson(λ =", lambda, ")"), xlab = "k", ylab = "P(X=k)", col = "steelblue") } ``` **R functions:** `dpois()`, `ppois()`, `qpois()`, `rpois()` ### Sampling Process Example In evolutionary genomics, mutations occur randomly along a gene sequence. Consider a gene of fixed length where mutations occur at an average rate of $\lambda = 3.5$ mutations per gene per million years. The number of mutations observed in a particular gene follows a Poisson distribution. ```{r} # Simulate mutation counts in genes set.seed(202) lambda_mutations <- 3.5 # Sample mutation count for a single gene mutations_in_gene <- rpois(1, lambda = lambda_mutations) cat("Mutations observed in one gene:", mutations_in_gene, "\n") # Simulate mutation counts across multiple genes n_genes <- 100 mutation_counts <- rpois(n_genes, lambda = lambda_mutations) cat("Mean mutations across", n_genes, "genes:", mean(mutation_counts), "\n") cat("Variance across", n_genes, "genes:", var(mutation_counts), "\n") cat("Expected value (λ):", lambda_mutations, "\n") cat("Note: For Poisson, mean ≈ variance") ``` ## Continuous Distributions Continuous distributions describe random variables that can take any value in some range. ### Uniform Distribution The uniform distribution assigns equal probability to all values in a specified range. **Probability density function:** $$f(x) = \frac{1}{b-a}, \quad a \leq x \leq b$$ **Parameters:** - $a$ = minimum value - $b$ = maximum value **Mean:** $E[X] = \frac{a+b}{2}$ **Variance:** $\text{Var}(X) = \frac{(b-a)^2}{12}$ **Applications:** - Random number generation - Uninformative priors in Bayesian analysis - Equal probability among alternatives ```{r} #| label: fig-appendix-uniform #| fig-cap: "Uniform distribution with equal probability between a=0 and b=5" #| fig-width: 6 #| fig-height: 4 x <- seq(-1, 6, length.out = 200) plot(x, dunif(x, min = 0, max = 5), type = "l", lwd = 2, col = "steelblue", xlab = "x", ylab = "f(x)", main = "Uniform Distribution (a=0, b=5)", ylim = c(0, 0.3)) polygon(c(0, 0, 5, 5), c(0, 0.2, 0.2, 0), col = rgb(0, 0, 1, 0.3), border = NA) ``` **R functions:** `dunif()`, `punif()`, `qunif()`, `runif()` ### Sampling Process Example In a cell culture experiment, cells are synchronized to divide during a specific 60-minute window. Within this window, the exact timing of any individual cell's division is equally likely at any moment. If we measure the time (in minutes from the start of the window) when each cell divides, these times follow a uniform distribution between $a = 0$ and $b = 60$ minutes. ```{r} # Simulate cell division times within a synchronized window set.seed(303) window_start <- 0 window_end <- 60 # Sample division times for 20 cells n_cells <- 20 division_times <- runif(n_cells, min = window_start, max = window_end) cat("Division times (minutes):", round(division_times, 1), "\n") cat("Mean division time:", round(mean(division_times), 1), "minutes\n") cat("Expected mean:", (window_start + window_end)/2, "minutes") ``` ### Exponential Distribution The exponential distribution describes the time between events in a Poisson process. **Probability density function:** $$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$ **Parameters:** $\lambda$ = rate parameter **Mean:** $E[X] = \frac{1}{\lambda}$ **Variance:** $\text{Var}(X) = \frac{1}{\lambda^2}$ **Applications:** - Time until next event (radioactive decay, mutations) - Lifespan distributions (constant hazard rate) - Waiting times ```{r} #| label: fig-appendix-exponential #| fig-cap: "Exponential distribution for different rate parameters (λ = 0.5, 1, and 2)" #| fig-width: 7 #| fig-height: 4 # Exponential with different rate parameters x <- seq(0, 5, length.out = 200) plot(x, dexp(x, rate = 0.5), type = "l", lwd = 2, col = "blue", xlab = "x", ylab = "f(x)", main = "Exponential Distribution", ylim = c(0, 2)) lines(x, dexp(x, rate = 1), lwd = 2, col = "red") lines(x, dexp(x, rate = 2), lwd = 2, col = "darkgreen") legend("topright", c("λ = 0.5", "λ = 1", "λ = 2"), col = c("blue", "red", "darkgreen"), lwd = 2) ``` **R functions:** `dexp()`, `pexp()`, `qexp()`, `rexp()` ### Sampling Process Example Bacterial cells divide asynchronously in exponential growth phase. If cells divide at a constant average rate of $\lambda = 0.5$ divisions per hour (one division every 2 hours on average), the waiting time between successive cell divisions follows an exponential distribution. ```{r} # Simulate times between cell divisions set.seed(404) lambda_rate <- 0.5 # divisions per hour # Sample inter-division times for 10 successive divisions n_divisions <- 10 inter_division_times <- rexp(n_divisions, rate = lambda_rate) cat("Time between divisions (hours):", round(inter_division_times, 2), "\n") cat("Mean inter-division time:", round(mean(inter_division_times), 2), "hours\n") cat("Expected mean (1/λ):", 1/lambda_rate, "hours\n") # Total time for all divisions cat("Total time for", n_divisions, "divisions:", round(sum(inter_division_times), 2), "hours") ``` ### Gamma Distribution The gamma distribution generalizes the exponential distribution to describe waiting time until the $r$th event. **Probability density function:** $$f(x) = \frac{\lambda^r}{\Gamma(r)} x^{r-1} e^{-\lambda x}, \quad x \geq 0$$ **Parameters:** - $r$ = shape parameter (number of events) - $\lambda$ = rate parameter **Mean:** $E[X] = \frac{r}{\lambda}$ **Variance:** $\text{Var}(X) = \frac{r}{\lambda^2}$ **Applications:** - Time for multiple events to occur - Duration of processes with multiple stages - Prior distributions in Bayesian analysis ```{r} #| label: fig-appendix-gamma #| fig-cap: "Gamma distribution with rate = 0.5 for different shape parameters (1, 2, and 5)" #| fig-width: 7 #| fig-height: 4 # Gamma distribution with different shape parameters x <- seq(0, 15, length.out = 200) plot(x, dgamma(x, shape = 1, rate = 0.5), type = "l", lwd = 2, col = "blue", xlab = "x", ylab = "f(x)", main = "Gamma Distribution (rate = 0.5)", ylim = c(0, 0.5)) lines(x, dgamma(x, shape = 2, rate = 0.5), lwd = 2, col = "red") lines(x, dgamma(x, shape = 5, rate = 0.5), lwd = 2, col = "darkgreen") legend("topright", c("shape = 1", "shape = 2", "shape = 5"), col = c("blue", "red", "darkgreen"), lwd = 2) ``` **R functions:** `dgamma()`, `pgamma()`, `qgamma()`, `rgamma()` ### Sampling Process Example In developmental biology, a cell must complete $r = 5$ distinct checkpoints before differentiating into a specialized cell type. Each checkpoint is passed at an average rate of $\lambda = 2$ checkpoints per day. The total time until the cell completes all 5 checkpoints and differentiates follows a gamma distribution. ```{r} # Simulate time to complete multiple checkpoints set.seed(505) r_checkpoints <- 5 lambda_rate <- 2 # checkpoints per day # Sample differentiation times for 15 cells n_cells <- 15 differentiation_times <- rgamma(n_cells, shape = r_checkpoints, rate = lambda_rate) cat("Time to differentiation (days):", round(differentiation_times, 2), "\n") cat("Mean differentiation time:", round(mean(differentiation_times), 2), "days\n") cat("Expected mean (r/λ):", r_checkpoints/lambda_rate, "days") ``` ### Normal (Gaussian) Distribution The normal distribution is the most important continuous distribution in statistics. It arises naturally when many independent factors contribute additively to an outcome. **Probability density function:** $$f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$ **Parameters:** - $\mu$ = mean (location) - $\sigma$ = standard deviation (scale) **Mean:** $E[X] = \mu$ **Variance:** $\text{Var}(X) = \sigma^2$ **Notation:** $X \sim \mathcal{N}(\mu, \sigma^2)$ **Applications:** - Heights, weights, and other biological measurements - Measurement errors - Sampling distributions (via Central Limit Theorem) - Basis for many statistical tests ```{r} #| label: fig-appendix-normal #| fig-cap: "Normal distribution with mean μ=2 for different standard deviations (σ = 0.5, 1, and 2)" #| fig-width: 7 #| fig-height: 5 x <- seq(-4, 8, length.out = 200) plot(x, dnorm(x, mean = 2, sd = 1), type = "l", lwd = 2, col = "blue", xlab = "x", ylab = "f(x)", main = "Normal Distribution", ylim = c(0, 0.45)) lines(x, dnorm(x, mean = 2, sd = 0.5), lwd = 2, col = "red") lines(x, dnorm(x, mean = 2, sd = 2), lwd = 2, col = "darkgreen") legend("topright", c("μ=2, σ=1", "μ=2, σ=0.5", "μ=2, σ=2"), col = c("blue", "red", "darkgreen"), lwd = 2) ``` **Key properties:** - 68% of values fall within 1 SD of the mean - 95% of values fall within 2 SDs of the mean - 99.7% of values fall within 3 SDs of the mean **R functions:** `dnorm()`, `pnorm()`, `qnorm()`, `rnorm()` ### Sampling Process Example Human height is influenced by many genetic and environmental factors that contribute additively. In a population of adult males, height follows approximately a normal distribution with mean $\mu = 175$ cm and standard deviation $\sigma = 7$ cm. When we measure heights of individuals, we are sampling from this distribution. ```{r} # Simulate measuring heights in a population set.seed(606) mu_height <- 175 # cm sigma_height <- 7 # cm # Sample heights of 30 individuals n_individuals <- 30 heights <- rnorm(n_individuals, mean = mu_height, sd = sigma_height) cat("Sample heights (cm):", round(heights, 1), "\n") cat("Sample mean:", round(mean(heights), 1), "cm\n") cat("Sample SD:", round(sd(heights), 1), "cm\n") cat("Population parameters: μ =", mu_height, "cm, σ =", sigma_height, "cm") ``` ### Standard Normal Distribution The standard normal distribution is a normal distribution with $\mu = 0$ and $\sigma = 1$. Any normal variable $X \sim \mathcal{N}(\mu, \sigma^2)$ can be converted to a standard normal $Z$ using: $$Z = \frac{X - \mu}{\sigma}$$ This **z-score** tells us how many standard deviations a value is from the mean. ```{r} #| label: fig-appendix-standard-normal #| fig-cap: "Standard normal distribution showing 68% of values within 1 standard deviation and 95% within 2 standard deviations" #| fig-width: 7 #| fig-height: 5 x <- seq(-4, 4, length.out = 200) y <- dnorm(x) plot(x, y, type = "l", lwd = 2, xlab = "z", ylab = "f(z)", main = "Standard Normal Distribution") # Shade areas x_1sd <- seq(-1, 1, length.out = 100) polygon(c(-1, x_1sd, 1), c(0, dnorm(x_1sd), 0), col = rgb(0, 0, 1, 0.3), border = NA) x_2sd <- c(seq(-2, -1, length.out = 50), seq(1, 2, length.out = 50)) for (region in list(c(-2, -1), c(1, 2))) { xx <- seq(region[1], region[2], length.out = 50) polygon(c(region[1], xx, region[2]), c(0, dnorm(xx), 0), col = rgb(0, 1, 0, 0.3), border = NA) } legend("topright", c("68% within 1σ", "95% within 2σ"), fill = c(rgb(0, 0, 1, 0.3), rgb(0, 1, 0, 0.3))) ``` ### Log-Normal Distribution If $\ln(X)$ follows a normal distribution, then $X$ follows a log-normal distribution. This arises when effects are multiplicative rather than additive. **Probability density function:** $$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}, \quad x > 0$$ **Parameters:** - $\mu$ = mean of the log-transformed variable - $\sigma$ = SD of the log-transformed variable **Mean:** $E[X] = e^{\mu + \sigma^2/2}$ **Variance:** $\text{Var}(X) = (e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$ **Applications:** - Gene expression levels - Organism sizes - Concentrations - Income distributions ```{r} #| label: fig-appendix-lognormal #| fig-cap: "Log-normal distribution (left) and its transformation to normal distribution after log transformation (right)" #| fig-width: 8 #| fig-height: 4 par(mfrow = c(1, 2)) # Log-normal data set.seed(42) x <- rlnorm(1000, meanlog = 1, sdlog = 0.5) hist(x, breaks = 30, main = "Log-Normal Data", col = "lightblue", xlab = "x", probability = TRUE) curve(dlnorm(x, meanlog = 1, sdlog = 0.5), add = TRUE, col = "red", lwd = 2) # After log transformation hist(log(x), breaks = 30, main = "After Log Transform", col = "lightblue", xlab = "log(x)", probability = TRUE) curve(dnorm(x, mean = 1, sd = 0.5), add = TRUE, col = "red", lwd = 2) ``` **R functions:** `dlnorm()`, `plnorm()`, `qlnorm()`, `rlnorm()` ### Sampling Process Example Gene expression levels often follow a log-normal distribution because they result from multiplicative processes (transcription, translation, and degradation rates). When measuring mRNA levels of a particular gene across cells, if the log-transformed expression follows $\mathcal{N}(\mu = 2, \sigma = 0.6)$, the raw expression levels follow a log-normal distribution. ```{r} # Simulate gene expression measurements set.seed(707) mu_log <- 2 sigma_log <- 0.6 # Sample expression levels from 50 cells n_cells <- 50 expression_levels <- rlnorm(n_cells, meanlog = mu_log, sdlog = sigma_log) cat("Expression levels (arbitrary units):", round(expression_levels, 2), "\n") cat("Mean expression:", round(mean(expression_levels), 2), "\n") cat("Median expression:", round(median(expression_levels), 2), "\n") # Log-transformed data should be approximately normal log_expression <- log(expression_levels) cat("Mean of log-expression:", round(mean(log_expression), 2), "\n") cat("SD of log-expression:", round(sd(log_expression), 2)) ``` ### Chi-Square Distribution The chi-square distribution is the distribution of a sum of squared standard normal variables. It is fundamental to many statistical tests. **Parameters:** $k$ = degrees of freedom **Mean:** $E[X] = k$ **Variance:** $\text{Var}(X) = 2k$ **Applications:** - Goodness-of-fit tests - Tests of independence - Variance estimation ```{r} #| label: fig-appendix-chisquare #| fig-cap: "Chi-square distribution for different degrees of freedom (df = 2, 5, and 10)" #| fig-width: 7 #| fig-height: 4 x <- seq(0, 20, length.out = 200) plot(x, dchisq(x, df = 2), type = "l", lwd = 2, col = "blue", xlab = "x", ylab = "f(x)", main = "Chi-Square Distribution", ylim = c(0, 0.5)) lines(x, dchisq(x, df = 5), lwd = 2, col = "red") lines(x, dchisq(x, df = 10), lwd = 2, col = "darkgreen") legend("topright", c("df = 2", "df = 5", "df = 10"), col = c("blue", "red", "darkgreen"), lwd = 2) ``` **R functions:** `dchisq()`, `pchisq()`, `qchisq()`, `rchisq()` ### Sampling Process Example In a quality control study, a bioengineering lab measures the precision of a fabrication process by taking repeated measurements of device dimensions. The sample variance $s^2$ from $n = 10$ measurements is related to the chi-square distribution. Specifically, $(n-1)s^2/\sigma^2$ follows a chi-square distribution with $k = n-1 = 9$ degrees of freedom. ```{r} # Simulate sampling process to understand variance distribution set.seed(808) n_measurements <- 10 true_sigma <- 2.5 # true population standard deviation # Generate many samples and compute their variances n_samples <- 1000 sample_variances <- replicate(n_samples, { measurements <- rnorm(n_measurements, mean = 100, sd = true_sigma) var(measurements) }) # The scaled variances follow chi-square scaled_variances <- (n_measurements - 1) * sample_variances / true_sigma^2 cat("Sample variance from one experiment:", round(sample_variances[1], 2), "\n") cat("Mean of", n_samples, "sample variances:", round(mean(sample_variances), 2), "\n") cat("True variance (σ²):", true_sigma^2, "\n") cat("Mean of scaled variances:", round(mean(scaled_variances), 2), "\n") cat("Expected (df = ", n_measurements - 1, "):", n_measurements - 1) ``` ### Student's t Distribution The t-distribution arises when estimating the mean of a normally distributed population when the sample size is small and the population standard deviation is unknown. **Parameters:** $\nu$ = degrees of freedom As $\nu \to \infty$, the t-distribution approaches the standard normal. **Applications:** - t-tests - Confidence intervals for means - Robust regression ```{r} #| label: fig-appendix-t-dist #| fig-cap: "Student's t-distribution compared to normal distribution for different degrees of freedom" #| fig-width: 7 #| fig-height: 4 x <- seq(-4, 4, length.out = 200) plot(x, dnorm(x), type = "l", lwd = 2, col = "black", lty = 2, xlab = "x", ylab = "f(x)", main = "t-Distribution vs Normal") lines(x, dt(x, df = 2), lwd = 2, col = "blue") lines(x, dt(x, df = 5), lwd = 2, col = "red") lines(x, dt(x, df = 30), lwd = 2, col = "darkgreen") legend("topright", c("Normal", "t (df=2)", "t (df=5)", "t (df=30)"), col = c("black", "blue", "red", "darkgreen"), lwd = 2, lty = c(2, 1, 1, 1)) ``` **R functions:** `dt()`, `pt()`, `qt()`, `rt()` ### Sampling Process Example A pharmacologist is testing the effect of a new drug on blood pressure. She samples $n = 8$ patients and measures their blood pressure after treatment. The population mean $\mu$ and standard deviation $\sigma$ are unknown. To test whether the mean differs from a reference value, she computes a t-statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, which follows a t-distribution with $\nu = n-1 = 7$ degrees of freedom under the null hypothesis. ```{r} # Simulate t-statistic sampling distribution set.seed(909) n_patients <- 8 true_mu <- 120 # true mean blood pressure true_sigma <- 12 # true SD mu_0 <- 120 # null hypothesis value (no change) # Simulate many experiments n_experiments <- 1000 t_statistics <- replicate(n_experiments, { blood_pressures <- rnorm(n_patients, mean = true_mu, sd = true_sigma) sample_mean <- mean(blood_pressures) sample_sd <- sd(blood_pressures) # Compute t-statistic (sample_mean - mu_0) / (sample_sd / sqrt(n_patients)) }) cat("First 10 t-statistics:", round(t_statistics[1:10], 2), "\n") cat("Mean of t-statistics:", round(mean(t_statistics), 2), "\n") cat("SD of t-statistics:", round(sd(t_statistics), 2), "\n") cat("Expected SD for t with df=7:", round(sqrt(7/(7-2)), 2)) ``` ### F Distribution The F-distribution is the ratio of two chi-square distributions divided by their degrees of freedom. It is used in ANOVA and comparing variances. **Parameters:** - $d_1$ = numerator degrees of freedom - $d_2$ = denominator degrees of freedom **Applications:** - ANOVA F-tests - Comparing variances - Regression significance tests ```{r} #| label: fig-appendix-f-dist #| fig-cap: "F distribution for different numerator and denominator degrees of freedom" #| fig-width: 7 #| fig-height: 4 x <- seq(0, 5, length.out = 200) plot(x, df(x, df1 = 5, df2 = 20), type = "l", lwd = 2, col = "blue", xlab = "x", ylab = "f(x)", main = "F Distribution", ylim = c(0, 1)) lines(x, df(x, df1 = 10, df2 = 20), lwd = 2, col = "red") lines(x, df(x, df1 = 20, df2 = 20), lwd = 2, col = "darkgreen") legend("topright", c("F(5,20)", "F(10,20)", "F(20,20)"), col = c("blue", "red", "darkgreen"), lwd = 2) ``` **R functions:** `df()`, `pf()`, `qf()`, `rf()` ### Sampling Process Example An ecologist is comparing the variability in body mass between two populations of birds. Population 1 has $n_1 = 12$ individuals and population 2 has $n_2 = 15$ individuals. To test whether the two populations have equal variances, she computes the F-statistic: $F = \frac{s_1^2}{s_2^2}$, which follows an F-distribution with $d_1 = n_1 - 1 = 11$ and $d_2 = n_2 - 1 = 14$ degrees of freedom when the true variances are equal. ```{r} # Simulate F-statistic sampling distribution set.seed(1010) n1 <- 12 n2 <- 15 true_sigma1 <- 5 # true SD for population 1 true_sigma2 <- 5 # true SD for population 2 (equal variances) # Simulate many paired samples n_comparisons <- 1000 F_statistics <- replicate(n_comparisons, { pop1_masses <- rnorm(n1, mean = 100, sd = true_sigma1) pop2_masses <- rnorm(n2, mean = 100, sd = true_sigma2) var1 <- var(pop1_masses) var2 <- var(pop2_masses) # Compute F-statistic var1 / var2 }) cat("First 10 F-statistics:", round(F_statistics[1:10], 2), "\n") cat("Mean of F-statistics:", round(mean(F_statistics), 2), "\n") cat("Expected mean for F(11,14):", round(14/(14-2), 2)) ``` ### Beta Distribution The beta distribution is defined on the interval [0, 1] and is useful for modeling proportions and probabilities. **Parameters:** - $\alpha$ = shape parameter 1 - $\beta$ = shape parameter 2 **Mean:** $E[X] = \frac{\alpha}{\alpha + \beta}$ **Variance:** $\text{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ **Applications:** - Prior distributions for probabilities (Bayesian analysis) - Modeling proportions - Allele frequencies ```{r} #| label: fig-appendix-beta #| fig-cap: "Beta distribution for different shape parameters showing various distribution shapes" #| fig-width: 7 #| fig-height: 4 x <- seq(0, 1, length.out = 200) plot(x, dbeta(x, 2, 5), type = "l", lwd = 2, col = "blue", xlab = "x", ylab = "f(x)", main = "Beta Distribution", ylim = c(0, 3)) lines(x, dbeta(x, 5, 2), lwd = 2, col = "red") lines(x, dbeta(x, 2, 2), lwd = 2, col = "darkgreen") lines(x, dbeta(x, 0.5, 0.5), lwd = 2, col = "purple") legend("top", c("Beta(2,5)", "Beta(5,2)", "Beta(2,2)", "Beta(0.5,0.5)"), col = c("blue", "red", "darkgreen", "purple"), lwd = 2) ``` **R functions:** `dbeta()`, `pbeta()`, `qbeta()`, `rbeta()` ### Sampling Process Example In population genetics, a researcher is estimating the frequency of a particular allele in a population. Based on prior studies, she has some information about the likely allele frequency, which can be encoded as a beta prior distribution. For example, if previous evidence suggests the allele is moderately common but variable, she might use a Beta(5, 3) distribution to represent her prior beliefs about the allele frequency $p$ before collecting new data. ```{r} # Simulate prior distributions for allele frequencies set.seed(1111) alpha_param <- 5 beta_param <- 3 # Sample possible allele frequencies from prior n_simulations <- 1000 prior_frequencies <- rbeta(n_simulations, shape1 = alpha_param, shape2 = beta_param) cat("Sample of prior allele frequencies:", round(prior_frequencies[1:10], 3), "\n") cat("Mean prior frequency:", round(mean(prior_frequencies), 3), "\n") cat("Expected mean (α/(α+β)):", round(alpha_param/(alpha_param + beta_param), 3), "\n") # In practice, these could be updated with observed data using Bayesian inference # For example: observing 15 allele A and 5 allele B would update to Beta(5+15, 3+5) ``` ## Summary Table | Distribution | Type | Parameters | Mean | Variance | R prefix | |:------------|:-----|:-----------|:-----|:---------|:---------| | Bernoulli | Discrete | p | p | p(1-p) | binom (n=1) | | Binomial | Discrete | n, p | np | np(1-p) | binom | | Geometric | Discrete | p | 1/p | (1-p)/p² | geom | | Neg. Binomial | Discrete | r, p | r/p | r(1-p)/p² | nbinom | | Poisson | Discrete | λ | λ | λ | pois | | Uniform | Continuous | a, b | (a+b)/2 | (b-a)²/12 | unif | | Exponential | Continuous | λ | 1/λ | 1/λ² | exp | | Gamma | Continuous | r, λ | r/λ | r/λ² | gamma | | Normal | Continuous | μ, σ | μ | σ² | norm | | Log-normal | Continuous | μ, σ | exp(μ+σ²/2) | (exp(σ²)-1)exp(2μ+σ²) | lnorm | | Chi-square | Continuous | k | k | 2k | chisq | | t | Continuous | ν | 0 (ν>1) | ν/(ν-2) (ν>2) | t | | F | Continuous | d₁, d₂ | d₂/(d₂-2) | ... | f | | Beta | Continuous | α, β | α/(α+β) | ... | beta | ## R Function Naming Convention R uses a consistent naming convention for distribution functions: - **d**xxx - density/mass function (PDF/PMF) - **p**xxx - cumulative distribution function (CDF) - **q**xxx - quantile function (inverse CDF) - **r**xxx - random number generation For example, for the normal distribution: ```{r} # Density at x = 0 for standard normal dnorm(0) # Probability that X ≤ 1.96 pnorm(1.96) # Value where P(X ≤ q) = 0.975 qnorm(0.975) # Generate 5 random normal values set.seed(42) rnorm(5) ``` ## Choosing the Right Distribution ```{r} #| label: fig-appendix-dist-choice #| fig-cap: "Decision flowchart for choosing the appropriate probability distribution based on data type and characteristics" #| echo: false #| fig-width: 9 #| fig-height: 6 # Decision flowchart concept visualization par(mar = c(1, 1, 2, 1)) plot(1, type = "n", xlim = c(0, 10), ylim = c(0, 7), axes = FALSE, xlab = "", ylab = "", main = "Choosing a Distribution") # Boxes rect(3.5, 6, 6.5, 6.8) text(5, 6.4, "Data Type?", font = 2) rect(1, 4.5, 3.5, 5.3) text(2.25, 4.9, "Discrete\n(Counts)") rect(6.5, 4.5, 9, 5.3) text(7.75, 4.9, "Continuous") # Discrete branch rect(0, 2.5, 2, 3.3) text(1, 2.9, "Binary?") rect(2.5, 2.5, 4.5, 3.3) text(3.5, 2.9, "Count events?") # Continuous branch rect(6, 2.5, 8, 3.3) text(7, 2.9, "Bounded?") rect(8.5, 2.5, 10.5, 3.3) text(9.5, 2.9, "Positive?") # Results text(0.5, 1.5, "Bernoulli/\nBinomial", cex = 0.8, col = "blue") text(3.5, 1.5, "Poisson", cex = 0.8, col = "blue") text(6.5, 1.5, "Beta/\nUniform", cex = 0.8, col = "blue") text(9.5, 1.5, "Gamma/\nLog-normal", cex = 0.8, col = "blue") # Arrows arrows(5, 6, 2.25, 5.3, length = 0.1) arrows(5, 6, 7.75, 5.3, length = 0.1) ``` **Guidelines for selection:** 1. **Is your variable discrete or continuous?** - Discrete: Binomial, Poisson, Negative Binomial - Continuous: Normal, Gamma, Beta, etc. 2. **For counts:** - Fixed number of trials with yes/no: **Binomial** - Events per unit time/space, rare: **Poisson** - Overdispersed counts (variance > mean): **Negative Binomial** 3. **For continuous measurements:** - Symmetric, unbounded: **Normal** - Positive, right-skewed: **Log-normal** or **Gamma** - Bounded between 0 and 1: **Beta** - Waiting times: **Exponential** or **Gamma** 4. **For testing:** - Means (unknown variance): **t-distribution** - Variances: **Chi-square** or **F-distribution**